Analysis of TB2 Benchmarking Issues in db-wal-recovery Task

Terminal Bench 2.0 Benchmarking Flaws Exposed
A detailed analysis of the Terminal Bench 2.0 (TB2) db-wal-recovery task reveals significant issues with current benchmarking methods. The task requires recovering 11 rows from a SQLite database: 5 rows live in the base database, and the other 6 exist only in main.db-wal, which is XOR-encrypted.
The Core Problem
The trap in this task is that a naive sqlite3 main.db probe can checkpoint or delete the WAL file, destroying the only evidence containing the missing rows. The natural first move for any agent seeing a .db file is to run sqlite3, which immediately compromises the recovery process.
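A minimal sketch of the "copy before you probe" habit that avoids this trap, assuming the usual SQLite sidecar naming (main.db, main.db-wal, main.db-shm). The function name snapshot_db_files and the backup directory are illustrative and do not come from the benchmark or from any agent's code. It uses plain file copies only, so SQLite never opens the originals.

```python
import shutil
from pathlib import Path

# SQLite keeps WAL-mode state in sidecar files next to the database.
SIDECAR_SUFFIXES = ("", "-wal", "-shm")


def snapshot_db_files(db_path, backup_dir):
    """Copy the database and any WAL/SHM sidecars byte-for-byte.

    Hypothetical helper: only file copies are performed, so the
    originals are never opened through SQLite.
    Returns the list of backup paths that were written.
    """
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for suffix in SIDECAR_SUFFIXES:
        src = db_path.with_name(db_path.name + suffix)
        if src.exists():
            dst = backup_dir / src.name
            shutil.copy2(src, dst)  # preserves contents and timestamps
            copied.append(dst)
    return copied
```

Any later sqlite3 probe can then be pointed at a second copy, leaving both the original files and the first backup untouched.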
Leaderboard Analysis
As of 2026-03-14, the TB2 leaderboard shows:
- ForgeCode: 78–82% score, 15/15 safe sequence, partial trajectory visible, prompt hidden
- TongAgents (Judy): 80.2% score, 5/5 prompt-shaped, full trajectory visible, planner exposed
- SageAgent: 78.4% score, 1/5 timeout, wrapper only visible, prompt hidden
- Droid: 77.3% score, 2/5 final report only, stdout only visible
- Capy: ~76% score, 1/4 no agent trace, verifier only visible
- Terminus-KIRA: 74.8% score, 1/10 honest failure, full trajectory visible, prompt visible
Pattern 1: Honest Failure
Agents like Claude Code, Terminus-KIRA, and Simple Codex follow this pattern:
- Inspect /app
- Open sqlite3 /app/main.db immediately
- Try to inspect main.db-wal
By step 3, the WAL is gone, but the agents don't realize they destroyed it. They then spend 15+ turns searching filesystems, attempting .recover operations, and exploring overlays. Terminus-KIRA's transparency is particularly valuable: in one failing trial, after losing the WAL, it hand-crafted a recovered.json with the expected rows and ran its own validation script, and was still caught by the benchmark verifier.
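One way a harness could make this kind of loss visible is to fingerprint the working directory before and after each command and report what vanished. The sketch below is an illustration only; fingerprint and diff_fingerprints are hypothetical names, and nothing in the source says any listed agent does this.

```python
import hashlib
from pathlib import Path


def fingerprint(directory):
    """Map each regular file under `directory` to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_fingerprints(before, after):
    """Return (vanished, changed) file names between two fingerprints."""
    vanished = sorted(name for name in before if name not in after)
    changed = sorted(
        name for name in before
        if name in after and before[name] != after[name]
    )
    return vanished, changed
```

Taking a fingerprint of /app before the first sqlite3 call and another right after it would list main.db-wal as vanished if the probe removed it, instead of leaving the agent to search for a file that no longer exists.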
Pattern 2: Prompt Injection
Judy (TongAgents) immediately backed up the WAL before touching anything. This was not inference; it was foreknowledge injected through the prompt. Judy's public planner prompt explicitly states: "This task belongs to the data recovery domain. The best practice for data recovery is: before any recovery operation, stop all writes and back up immediately."
Result: Judy backs up first, probes sqlite3 main.db, sees only 5 rows, and continues with recovery.
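Once a WAL copy is preserved, the recovery step has to undo the XOR encoding. The source does not describe the key scheme, so the sketch below assumes a single-byte key purely for illustration; that assumption may not match the actual task. It tests each candidate key against the magic number that begins a valid SQLite WAL header (0x377f0682 or 0x377f0683, big-endian). The helper names are hypothetical.

```python
import struct

# Big-endian magic numbers that begin a valid SQLite WAL header.
WAL_MAGICS = (0x377F0682, 0x377F0683)


def xor_bytes(data, key):
    """XOR every byte of `data` with a single-byte `key`."""
    return bytes(b ^ key for b in data)


def find_single_byte_key(encoded):
    """Return the key whose decoding starts with a WAL magic, else None.

    Assumption: the whole file was XORed with one repeating byte.
    """
    if len(encoded) < 4:
        return None
    for key in range(256):
        (magic,) = struct.unpack(">I", xor_bytes(encoded[:4], key))
        if magic in WAL_MAGICS:
            return key
    return None
```

If a key is found, xor_bytes applied to the backed-up WAL yields a candidate plaintext WAL that can be placed next to a copy of main.db and opened there, never alongside the original.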
Transparency Issues
The analysis reveals a clear pattern: entries that expose their prompts (Judy, KIRA) tell a different story from entries that hide them (ForgeCode, SageAgent, Droid, Capy), whose records show either safe behavior or nothing that can be inspected. Without runtime feedback, even strong models burn the evidence immediately and then search a world that no longer contains the answer.
📖 Read the full source: r/LocalLLaMA