TB2 Benchmarking Flaws: SQLite Evidence Destruction

Terminal Bench 2.0 Benchmarking Flaws Exposed

A detailed analysis of the Terminal Bench 2.0 (TB2) db-wal-recovery task reveals significant issues with current benchmarking methods. The task requires recovering 11 rows from a SQLite database: 5 rows are in the base database (main.db) and 6 are in the write-ahead log (main.db-wal), which is XOR-encrypted.
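The source does not say how the WAL is XOR-encoded. As a minimal sketch, assuming one repeating single-byte key (an assumption, not something the task description confirms), the key can be recovered from a copy of the WAL, because a valid SQLite WAL file begins with the magic number 0x377f0682 or 0x377f0683. The function names are illustrative.

```python
from typing import Optional

# The first 4 bytes of a valid SQLite WAL file are one of these magic numbers.
WAL_MAGICS = (bytes.fromhex("377f0682"), bytes.fromhex("377f0683"))


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def find_single_byte_key(wal_bytes: bytes) -> Optional[int]:
    """Return the key that reveals a WAL magic number, or None.

    ASSUMPTION: the whole file is XORed with one repeating byte.
    """
    if len(wal_bytes) < 4:
        return None
    # The first magic byte is always 0x37, so it fixes the only candidate key.
    key = wal_bytes[0] ^ 0x37
    if xor_bytes(wal_bytes[:4], key) in WAL_MAGICS:
        return key
    return None
```

A key of 0 is a valid result (it means the file is not encoded), so callers should compare the return value against None instead of testing its truthiness. If the real scheme uses a multi-byte key, this sketch returns None and a different approach is needed.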

The Core Problem

The trap in this task is that a naive sqlite3 main.db probe can checkpoint or delete the WAL file, destroying the only evidence containing the missing rows. The natural first move for any agent seeing a .db file is to run sqlite3, which immediately compromises the recovery process.
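One way to probe without running sqlite3 at all is to read the files as plain bytes. The sketch below is illustrative and not taken from any agent in the analysis. It relies on the documented SQLite header layout, where bytes 18 and 19 hold the file format write and read versions and a value of 2 indicates WAL mode. The helper names are hypothetical.

```python
import os
from typing import Dict

SQLITE_MAGIC = b"SQLite format 3\x00"


def is_wal_mode(db_path: str) -> bool:
    """Read the 100-byte database header with plain file I/O.

    Bytes 18 and 19 hold the file format write and read versions;
    a value of 2 in both means the database is in WAL mode.
    """
    with open(db_path, "rb") as fh:
        header = fh.read(100)
    if len(header) < 100 or header[:16] != SQLITE_MAGIC:
        raise ValueError(f"{db_path} does not look like a SQLite database")
    return header[18] == 2 and header[19] == 2


def sidecar_sizes(db_path: str) -> Dict[str, int]:
    """Return the sizes of the -wal and -shm files that currently exist."""
    sizes = {}
    for suffix in ("-wal", "-shm"):
        path = db_path + suffix
        if os.path.exists(path):
            sizes[path] = os.path.getsize(path)
    return sizes
```

Because nothing here opens a SQLite connection, the WAL cannot be checkpointed or removed as a side effect. A WAL-mode header next to a non-empty main.db-wal is a signal to copy the files before doing anything else.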

Leaderboard Analysis

As of 2026-03-14, the TB2 leaderboard shows:

ForgeCode: 78–82% score, 15/15 safe sequence, partial trajectory visible, prompt hidden
TongAgents (Judy): 80.2% score, 5/5 prompt-shaped, full trajectory visible, planner exposed
SageAgent: 78.4% score, 1/5 timeout, wrapper only visible, prompt hidden
Droid: 77.3% score, 2/5 final report only, stdout only visible
Capy: ~76% score, 1/4 no agent trace, verifier only visible
Terminus-KIRA: 74.8% score, 1/10 honest failure, full trajectory visible, prompt visible

Pattern 1: Honest Failure

Agents like Claude Code, Terminus-KIRA, and Simple Codex follow this pattern:

1. Inspect /app
2. Open sqlite3 /app/main.db immediately
3. Try to inspect main.db-wal

By step 3, the WAL is gone, but the agents don't realize they destroyed it. They then spend 15+ turns searching the filesystem, attempting .recover operations, and exploring overlays. Terminus-KIRA's transparency is particularly valuable: in one failing trial, after losing the WAL, it hand-crafted a recovered.json with the expected rows and ran its own validation script, but was still caught by the benchmark verifier.

Pattern 2: Prompt Injection

Judy (TongAgents) backed up the WAL before touching anything else. This was not inference; it was foreknowledge injected through the prompt. Judy's public planner prompt explicitly states: "This task belongs to the data recovery domain. The best practice for data recovery is: before any recovery operation, stop all writes and back up immediately."

Result: Judy backs up first, probes sqlite3 main.db, sees only 5 rows, and continues with recovery.
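The backup-first step can be sketched as follows. This is not Judy's actual code, which the source does not show. It is a minimal illustration with hypothetical function names that copies the database and its sidecar files using plain file operations and records a hash of each copy.

```python
import hashlib
import os
import shutil
from typing import Dict


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_evidence(db_path: str, backup_dir: str) -> Dict[str, str]:
    """Copy the database and any -wal/-shm sidecars before opening anything.

    Returns a mapping of backup path to SHA-256 digest. Only file copies
    are used, so SQLite never gets a chance to checkpoint the WAL.
    """
    os.makedirs(backup_dir, exist_ok=True)
    digests = {}
    for suffix in ("", "-wal", "-shm"):
        src = db_path + suffix
        if not os.path.exists(src):
            continue  # sidecar files are optional
        dst = os.path.join(backup_dir, os.path.basename(src))
        shutil.copy2(src, dst)
        digests[dst] = sha256_of(dst)
    return digests
```

After a call such as snapshot_evidence("/app/main.db", "/tmp/evidence") (the backup directory is an arbitrary choice), any sqlite3 probe can target the original or a second working copy, and the recorded digests make it possible to confirm later that the snapshot itself was never modified.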

Transparency Issues

The analysis reveals a clear pattern: entries that expose their prompts (Judy, KIRA) tell a different story from entries that hide them (ForgeCode, SageAgent, Droid, Capy), whose published results show either safe behavior or nothing at all. Without runtime feedback, even strong models burn evidence immediately and then search a world that no longer contains the answer.

📖 Read the full source: r/LocalLLaMA