Agentic Context Engine: 34.2% Accuracy Gain in Automated Agent Fix Loop

Automating the Agent Improvement Loop

A developer has open-sourced a system that automates the entire process of improving AI agents by letting them self-analyze and self-correct. The tool addresses the common problem of manually reading logs, tweaking prompts, and hoping for improvements.

The Five-Step Process

The automated loop follows five distinct steps:

1. Trace analysis: Analyzes traces to determine not just what failed but why, whether it's a one-off or systemic issue, and what category of failure it is. Outputs a structured breakdown of failure modes rather than just error lists.
2. Eval generation: Creates specific evaluations to validate the analysis and measure fixes. Generic evals don't catch specific failures. LLM-as-a-judge serves as a fallback when trace data isn't structured enough for deterministic evals.
3. Baseline measurement: Runs evals against the current agent before making fixes to establish baselines and validate the evals themselves.
4. Fix implementation: A developer examines the analysis and codebase to decide what to change. The key decision is whether the fix belongs in the prompt or in the surrounding code (e.g., when the harness handles tool outputs poorly or doesn't pass the right context).
5. Verification and compounding: After fixes, evals run again to verify improvement, with changes kept, rolled back, or reworked.
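
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: FailureMode, run_evals, decide, and the toy agents are hypothetical names invented for this example, and the evals are plain deterministic checks (the LLM-as-a-judge fallback is omitted).

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical shape for one entry of the trace-analysis output (step 1).
@dataclass
class FailureMode:
    category: str     # e.g. "missing_intent"
    root_cause: str   # why it failed, not just what failed
    systemic: bool    # True if it recurs across traces, False for a one-off

Agent = Callable[[str], str]
# An eval is a deterministic check run against an agent (step 2).
Eval = Callable[[Agent], bool]

def run_evals(agent: Agent, evals: Dict[str, Eval]) -> float:
    """Return the fraction of evals the agent passes (steps 3 and 5)."""
    if not evals:
        raise ValueError("need at least one eval")
    passed = sum(1 for check in evals.values() if check(agent))
    return passed / len(evals)

def decide(baseline: float, after: float, tolerance: float = 0.0) -> str:
    """Keep an improvement, roll back a regression, otherwise rework."""
    if after > baseline + tolerance:
        return "keep"
    if after < baseline - tolerance:
        return "rollback"
    return "rework"

# Toy agents standing in for the agent before and after a fix (step 4).
def agent_before(query: str) -> str:
    return "refund issued" if "refund" in query else "unknown"

def agent_after(query: str) -> str:
    if "refund" in query:
        return "refund issued"
    if "cancel" in query:
        return "order cancelled"
    return "unknown"

finding = FailureMode(
    category="missing_intent",
    root_cause="agent has no handling for cancellation requests",
    systemic=True,
)

# Evals target the specific failure found in the analysis, plus a regression check.
evals: Dict[str, Eval] = {
    "handles_refund": lambda a: a("refund my order") == "refund issued",
    "handles_cancel": lambda a: a("cancel my order") == "order cancelled",
}

baseline = run_evals(agent_before, evals)  # measured before any fix
after = run_evals(agent_after, evals)      # measured after the fix
print(finding.category, baseline, after, decide(baseline, after))
```

Measuring the baseline with the same evals that later verify the fix is what makes the keep, roll back, or rework decision meaningful: an eval that already passes before the fix cannot show that the fix helped.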

Implementation Details

The solution automates this entire loop end-to-end with one command that invokes a self-analyzing agentic system. Trace analysis happens in a REPL environment with agents tuned for this specific use case. The system then passes the analysis to Claude Code through CLI access, and Claude Code handles the remaining steps using a set of skills.

Because Claude Code runs inside the codebase, it can validate the analysis against the actual code and decide, in the fix stage, whether the change belongs in the prompt or in the code.

Results and Operation

On Tau-2 Bench, a single iteration of the loop achieved a 34.2% accuracy gain without manual intervention. The system is designed to compound improvements: new traces reveal new problems, leading to new fixes in each cycle.

You can set it to loop fully autonomously. A human-in-the-loop option exists if you want to approve fixes before step 4, but in testing, the developer "just let it rip."
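
The difference between the two modes can be illustrated with an optional approval gate placed before the fix is applied. This is a hedged sketch only: improvement_loop, propose_fix, apply_fix, and approve are hypothetical names for this example and do not reflect the tool's real interface.

```python
from typing import Callable, List, Optional

def improvement_loop(
    propose_fix: Callable[[int], str],
    apply_fix: Callable[[str], None],
    max_cycles: int,
    approve: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Run up to max_cycles cycles; if `approve` is given, gate each fix on it."""
    applied: List[str] = []
    for cycle in range(max_cycles):
        fix = propose_fix(cycle)
        # Human-in-the-loop mode: skip any fix the reviewer rejects.
        if approve is not None and not approve(fix):
            continue
        apply_fix(fix)
        applied.append(fix)
    return applied

log: List[str] = []

# Fully autonomous: every proposed fix is applied.
autonomous = improvement_loop(lambda i: f"fix-{i}", log.append, max_cycles=3)

# Gated: a stand-in reviewer approves only even-numbered fixes.
gated = improvement_loop(
    lambda i: f"fix-{i}",
    log.append,
    max_cycles=3,
    approve=lambda fix: int(fix.split("-")[1]) % 2 == 0,
)
print(autonomous, gated)
```

Making the gate an optional callback keeps the loop body identical in both modes, so switching between autonomous and supervised operation does not change what the loop does to an approved fix.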

The tool is open source on GitHub: https://github.com/kayba-ai/agentic-context-engine

📖 Read the full source: r/ClaudeAI