413K AI Agent Runs: What Makes Them Succeed?

A new analysis of 413,278 AI software engineering agent runs from the CoderForge-Preview dataset examines what separates successful runs from failing ones. The study covered 17 billion tokens of behavioral data and compared passing and failing runs on identical problems.

Key Findings from the Data

The analysis shows that common human software engineering practices can actually reduce AI agent performance. Here are the specific patterns that emerged:

Stop telling agents to "look around first": Forcing agents to grep or view files before editing correlates with lower success in this data. The post's explanation is that, unlike humans with limited working memory, agents can already hold much of the codebase in their context window, so early turns spent searching and exploring indicate flailing rather than learning.
Test-driven approaches are mandatory: The single biggest predictor of a successful run is the fraction of early bash commands dedicated exclusively to running tests. Agents should not edit blindly, so system prompts should enforce running the test suite immediately.
Keep agents on a tight leash: If an agent tries to edit 3 or more files in the first 30% of its run, success rates drop significantly. Scattering edits across multiple files indicates confusion. Force agents to fix one thing at a time.
Perseverance is an illusion: If an agent runs the exact same bash command twice early in the run, it's stuck in a loop rather than "thinking hard" or "trying again." Break the loop or restart the run.
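
The early-run signals described above can be approximated from a run trace. The following is a minimal sketch, not the analysis's actual method: the `Step` record, the `TEST_MARKERS` list, and the `early_signals` function are hypothetical names, and the real dataset schema may differ. The 30% cutoff comes from the finding on file edits and is applied to all three signals here as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical trace format: the dataset's real schema may differ.
@dataclass
class Step:
    kind: str                      # "bash", "edit", or "view"
    command: Optional[str] = None  # shell command, for bash steps
    path: Optional[str] = None     # target file, for edit/view steps


# Assumed substrings that mark a command as "running tests"; adjust per project.
TEST_MARKERS = ("pytest", "unittest", "npm test", "go test", "cargo test")


def early_signals(steps, early_fraction=0.3):
    """Compute three early-run signals over the first `early_fraction` of steps."""
    cutoff = max(1, int(len(steps) * early_fraction))
    early = steps[:cutoff]

    bash = [s.command for s in early if s.kind == "bash" and s.command]
    test_runs = [c for c in bash if any(m in c for m in TEST_MARKERS)]
    edited = {s.path for s in early if s.kind == "edit" and s.path}

    return {
        # Share of early bash commands that run tests.
        "test_fraction": len(test_runs) / len(bash) if bash else 0.0,
        # Number of distinct files edited early (3 or more is the warning sign).
        "files_edited": len(edited),
        # True if any exact bash command appears more than once early.
        "repeated_command": len(bash) != len(set(bash)),
    }
```

A scaffold could call `early_signals` on the steps recorded so far and restart the run when `files_edited` reaches 3 or `repeated_command` is true. Any such thresholds would need to be validated against your own runs.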

Practical Implementation Changes

The analysis recommends specific changes to agent scaffolding:

Stop using prompts like: "Explore the codebase, read the relevant files, and figure out the bug."
Instead, use: "Run the test suite immediately to verify the baseline. Make targeted changes to a maximum of 1 or 2 files. Rerun tests."
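
One way a scaffold might back the recommended prompt with hard limits is sketched below. This is an illustration under stated assumptions, not the post's implementation: `RunGuard`, `check_edit`, and `check_command` are hypothetical names. The guard also assumes that a repeated command only counts as a loop when no edit happened in between, so that rerunning the tests after a change remains allowed.

```python
# Prompt text taken from the recommendation above.
SYSTEM_PROMPT = (
    "Run the test suite immediately to verify the baseline. "
    "Make targeted changes to a maximum of 1 or 2 files. "
    "Rerun tests."
)


class RunGuard:
    """Hypothetical scaffold-side check that mirrors the prompt's limits."""

    def __init__(self, max_files=2):
        self.max_files = max_files
        self.edited_files = set()
        self.seen_commands = set()

    def check_edit(self, path):
        """Return True if editing `path` stays within the file budget."""
        if path in self.edited_files or len(self.edited_files) < self.max_files:
            self.edited_files.add(path)
            # Assumption: after an edit, rerunning an earlier command is legitimate.
            self.seen_commands.clear()
            return True
        return False

    def check_command(self, command):
        """Return False if this exact command already ran since the last edit."""
        if command in self.seen_commands:
            return False
        self.seen_commands.add(command)
        return True
```

When either check returns `False`, the scaffold can reject the action, inject a corrective message, or restart the run. Which response works best is not something the source data settles.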

The key insight is to stop projecting human limitations onto LLMs. Let them use their massive context windows and force them to prove their work with tests.

📖 Read the full source: r/LocalLLaMA
