SQL-Generating LLM Traces Flaky Tests in Seconds

Mendral's AI agent traced a flaky test to a dependency bump three weeks prior by writing its own SQL queries, scanning hundreds of millions of log lines across a dozen queries, and following a trail from job metadata to raw log output. The whole investigation took seconds.

Every week, about 1.5 billion CI log lines and 700K jobs flow through their system. All of it lands in ClickHouse, compressed at 35:1. All of it is queryable in milliseconds.

SQL interface for the agent

They expose a SQL interface to the agent, scoped to the organization it's investigating. The agent constructs its own queries based on the question. No predefined query library, no rigid tool API. A constrained tool API like get_failure_rate(workflow, days) would limit the agent to anticipated questions. A SQL interface lets it ask questions never thought of, which matters when debugging novel failures.

The agent queries two main targets:

Job metadata: a materialized view with one row per CI job execution. The agent uses this 63% of the time for questions like "how often does this fail?", "what's the success rate?", "which jobs are slowest?", "when did this start failing?"
Raw log lines: one row per log line. The agent uses this 37% of the time for questions like "show me the error output for this job", "when did this log pattern first appear?", "how often does this error message occur across runs?"

Query patterns and scale

They analyzed 8,534 agent sessions and 52,312 queries from their observability pipeline. The agent doesn't stop at one query. It investigates - starts broad, then drills in.

Total rows scanned across all queries to answer one question:

Typical question: 335K rows across about 3 queries
P75: 5.2 million rows
P95: 940 million rows
Heaviest raw-log sessions scan 4.3 billion rows

The agent averages 4.4 queries per session. A typical investigation begins with job metadata (cheap queries, median 47K rows) against a compact, pre-aggregated materialized view. When it finds something interesting, it drills into raw logs (expensive queries, median 1.1M rows).

Data architecture

For the agent to query this fast, the data needs to be structured for it. Up to 300 million log lines flow through on a busy day. They use ClickHouse.

Every log line carries 48 columns of metadata: the full context of the CI run it belongs to. Commit SHA, author, branch, PR title, workflow name, job name, step name, runner info, timestamps, and more.

In ClickHouse's columnar format, denormalizing 48 columns onto every single log line is essentially free. A column like commit_message has the same value for every log line in a CI run, and a single run can produce thousands of log lines. ClickHouse stores those thousands of identical values in sequence. The compression algorithm sees the repetition and compresses it to almost nothing.

Compression ratios: