Agentic Context Engine: Automated Agent Improvement Loop with 34.2% Accuracy Gain

Automating the Agent Improvement Loop
A developer has open-sourced a system that automates the whole process of improving AI agents by having them analyze and correct their own failures. The tool replaces the common manual routine of reading logs, tweaking prompts, and hoping for improvements.
The Five-Step Process
The automated loop follows five distinct steps:
- Trace analysis: Analyzes traces to determine not just what failed but why, whether it's a one-off or systemic issue, and what category of failure it is. Outputs a structured breakdown of failure modes rather than just error lists.
- Eval generation: Creates specific evaluations to validate the analysis and measure fixes. Generic evals don't catch specific failures. LLM-as-a-judge serves as a fallback when trace data isn't structured enough for deterministic evals.
- Baseline measurement: Runs evals against the current agent before making fixes to establish baselines and validate the evals themselves.
- Fix implementation: A developer examines the analysis and codebase to decide what to change. The key decision is whether the fix belongs in the prompt or in the surrounding code (e.g., when the harness handles tool outputs poorly or doesn't pass the right context).
- Verification and compounding: After fixes, evals run again to verify improvement, with changes kept, rolled back, or reworked.
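The five steps above can be sketched in a few small functions. This is a minimal illustration, not the project's actual code: the `Trace` fields, the category labels, the `systemic_threshold` cutoff, and the comparison rule in `decide` are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Trace:
    """One recorded agent run (hypothetical shape, not the tool's schema)."""
    task_id: str
    passed: bool
    failure_category: Optional[str] = None  # assumed label, e.g. "tool_misuse"
    structured: bool = True  # True if the trace has machine-checkable fields


def analyze(traces: list[Trace], systemic_threshold: int = 2) -> dict[str, str]:
    """Step 1: group failures by category and mark each one-off or systemic."""
    counts = Counter(
        t.failure_category for t in traces if not t.passed and t.failure_category
    )
    return {
        category: "systemic" if n >= systemic_threshold else "one-off"
        for category, n in counts.items()
    }


def eval_kind(trace: Trace) -> str:
    """Step 2: prefer deterministic evals, fall back to an LLM judge."""
    return "deterministic" if trace.structured else "llm_judge"


def pass_rate(agent: Callable[[str], bool], eval_tasks: list[str]) -> float:
    """Steps 3 and 5: fraction of eval tasks the agent passes."""
    if not eval_tasks:
        return 0.0
    return sum(agent(task) for task in eval_tasks) / len(eval_tasks)


def decide(baseline: float, after: float) -> str:
    """Step 5: keep the change, roll it back, or rework it."""
    if after > baseline:
        return "keep"
    if after < baseline:
        return "roll back"
    return "rework"
```

Running `pass_rate` once before the fix and once after gives the two numbers that `decide` compares, which is why the baseline measurement has to come before any change is made.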
Implementation Details
The solution automates this entire loop end-to-end with one command that invokes a self-analyzing agentic system. Trace analysis happens in a REPL environment with agents tuned for this specific use case. The system then exposes that analysis to Claude Code via CLI access, and Claude Code handles the remaining steps with a set of skills.
Because Claude can work inside the codebase, it validates the analysis against the actual code and decides on the best course of action in the fix stage (prompt vs. code).
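The prompt-vs-code decision can be illustrated with a simple rule. In the tool itself this judgment is made by Claude, so the sketch below is a hand-written stand-in, not the system's logic; the `Finding` type and its two flags are hypothetical and mirror the two harness problems named earlier (poorly handled tool outputs, missing context).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One validated failure mode (hypothetical shape for illustration)."""
    summary: str
    tool_output_mishandled: bool = False  # harness mangles or drops tool output
    context_not_passed: bool = False      # harness fails to pass needed context


def fix_target(finding: Finding) -> str:
    """Return "code" when the harness is at fault, otherwise "prompt"."""
    if finding.tool_output_mishandled or finding.context_not_passed:
        return "code"
    return "prompt"
```

The point of the split is that a prompt change cannot repair a failure whose cause is in the surrounding code, so harness-level findings are routed to the code first.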
Results and Operation
On Tau-2 Bench, a single iteration of the loop achieved a 34.2% accuracy gain without manual intervention. The system is designed to compound improvements: new traces reveal new problems, leading to new fixes in each cycle.
You can set it to loop fully autonomously. A human-in-the-loop option lets you approve fixes before step 4 (fix implementation), but in testing, the developer "just let it rip."
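The difference between the autonomous and human-in-the-loop modes comes down to an optional approval gate in front of the fix step. The sketch below is an assumption about how such a gate could look, not the tool's interface: `run_cycles`, `propose_fix`, `apply_fix`, and `approve` are hypothetical names.

```python
from typing import Callable, Optional


def run_cycles(
    propose_fix: Callable[[int], str],
    apply_fix: Callable[[str], None],
    max_cycles: int,
    approve: Optional[Callable[[str], bool]] = None,
) -> list[str]:
    """Run improvement cycles, returning the fixes that were applied.

    With approve=None the loop is fully autonomous. Otherwise each proposed
    fix must be approved before it is applied (the gate before step 4).
    """
    applied: list[str] = []
    for cycle in range(max_cycles):
        fix = propose_fix(cycle)
        if approve is not None and not approve(fix):
            continue  # rejected by the human reviewer, skip this fix
        apply_fix(fix)
        applied.append(fix)
    return applied
```

Passing no `approve` callback corresponds to "letting it rip"; passing one that prompts a reviewer turns the same loop into the supervised mode.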
The tool is open-sourced at GitHub: https://github.com/kayba-ai/agentic-context-engine
📖 Read the full source: r/ClaudeAI