Claude Code Autoresearch: 93% Failure Rate on Production Codebase

Autoresearch Experiment on Production Codebase

A developer tested Karpathy's autoresearch approach on a real production system using Claude Code, running 60 iterations across two rounds while away from the computer. The target was a hybrid search system built with Django, pgvector, and Cohere embeddings.
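As a rough illustration of the keep-or-revert loop behind this kind of run, here is a minimal sketch. It is not the developer's skill or Karpathy's code: `autoresearch_loop`, `RunLog`, and the `evaluate` callback are hypothetical names, and a real run would have the agent edit code and execute a benchmark rather than merge dictionaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


@dataclass
class RunLog:
    """Names of the changes that were kept or reverted (hypothetical structure)."""
    kept: List[str] = field(default_factory=list)
    reverted: List[str] = field(default_factory=list)


def autoresearch_loop(
    baseline: Config,
    proposals: List[Tuple[str, Config]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float, RunLog]:
    """Try each proposed change once; keep it only if the score strictly improves."""
    best = dict(baseline)
    best_score = evaluate(best)
    log = RunLog()
    for name, change in proposals:
        candidate = {**best, **change}  # apply the change on top of the current best
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
            log.kept.append(name)
        else:
            log.reverted.append(name)  # discard the change, keep the previous state
    return best, best_score, log
```

The reverted list is as useful as the kept one: it is the record of what was tried and did not help.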

Key Results and Findings

Out of 60 iterations, only 3 changes were kept while 57 were reverted. The overall score improvement was marginal (+0.03), but the knowledge gained was significant:

Title matching as a search signal proved to be net negative, demonstrated in just 2 iterations
Larger candidate pools had no effect - the problem was ranking, not recall
Hand-built adaptive weighting actually worked - removing it caused regressions
Fiddling with keyword damping formulas barely moved scores
Round 2, which targeted the Haiku metadata prompt, yielded zero improvements because the ranking weights from Round 1 had been tuned against the original prompt's output
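The run also surfaced a Redis caching bug: cache keys were built from the query hash, not the prompt hash, so a changed prompt could be served results cached under the old one. This would have shipped to production unnoticed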
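The caching bug in the last finding can be illustrated with a small sketch. The key layout, the `meta:` prefix, and both function names are assumptions made for illustration; the original post does not show the actual key format or the fix that was applied. Including both hashes in the key is one plausible remedy.

```python
import hashlib


def _digest(text: str) -> str:
    """Short, stable hash of a string for use inside a cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def buggy_cache_key(query: str, prompt: str) -> str:
    # Bug pattern: the key depends only on the query, so editing the prompt
    # still hits entries that were generated with the old prompt.
    return f"meta:{_digest(query)}"


def fixed_cache_key(query: str, prompt: str) -> str:
    # Assumed fix: fold the prompt hash into the key so that a prompt change
    # produces a different key and therefore a cache miss.
    return f"meta:{_digest(prompt)}:{_digest(query)}"
```

With the buggy key, an experiment that changes only the prompt can score identically to the baseline because it reads cached output, which makes prompt experiments look like they have no effect.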

Practical Takeaways

The biggest insight was that autoresearch helps map where the ceiling is, not just find improvements. Having 60 data points saying "You can stop tuning this" provides concrete evidence rather than relying on intuition. The developer notes this approach saved manual experimentation time on optimizations that wouldn't have paid off.

The full writeup is available at the blog link, and the open source Claude Code autoresearch skill is on GitHub. The developer is curious about others trying this on non-ML codebases and what metrics they're using.

📖 Read the full source: r/ClaudeAI

