Autoresearch with Claude Code on Production Codebase: 60 Experiments, 3 Changes Kept

Autoresearch Experiment on Production Codebase
A developer tested Karpathy's autoresearch approach on a real production system using Claude Code, running 60 iterations across two rounds while away from the computer. The target was a hybrid search system built with Django, pgvector, and Cohere embeddings.
Key Results and Findings
Out of 60 iterations, only 3 changes were kept while 57 were reverted. The overall score improvement was marginal (+0.03), but the knowledge gained was significant:
- Title matching as a search signal proved to be net negative, demonstrated in just 2 iterations
- Larger candidate pools had no effect - the problem was ranking, not recall
- Hand-built adaptive weighting actually worked - removing it caused regressions
- Fiddling with keyword damping formulas barely moved scores
- Round 2 targeting the Haiku metadata prompt yielded zero improvements because ranking weights from Round 1 were co-optimized to the original prompt's output
- Discovered a Redis caching bug: keys were on query hash, not prompt hash, which would have shipped to production unnoticed
Practical Takeaways
The biggest insight was that autoresearch helps map where the ceiling is, not just find improvements. Having 60 data points saying "You can stop tuning this" provides concrete evidence rather than relying on intuition. The developer notes this approach saved manual experimentation time on optimizations that wouldn't have paid off.
The full writeup is available at the blog link, and the open source Claude Code autoresearch skill is on GitHub. The developer is curious about others trying this on non-ML codebases and what metrics they're using.
📖 Read the full source: r/ClaudeAI
👀 See Also

Building Persistent Memory for Claude with Four Markdown Files
A developer built a system to overcome Claude's session-based context limitation using four markdown files loaded via project context: Protocol, CONVERGEHERE, Daily Capture, and Continuity. The system maintains context across sessions by having Claude read all files at boot and update Continuity and CONVERGEHERE at session close.

Building a Voice Assistant with OpenClaw, Alexa, and Local LLM
A developer built a voice-first assistant using OpenClaw as the AI agent backbone, Alexa for voice input, and a local LLM (Ollama with Qwen 2.5 3B) to handle general knowledge queries with sub-second response times and reduced API costs.

OpenClaw user builds character chat app with agentic coding approach
A self-described non-technical OpenClaw user developed a working character chat application in 7 days using agentic coding, noting that their role shifted to reviewing AI-generated work rather than traditional programming.

Onboarding AI agents like junior contractors: CLAUDE.md and production lessons
A store run entirely with AI agents treated onboarding like hiring a junior contractor, finding that clear constraints in a CLAUDE.md document consistently outperformed 'smarter' models with vague instructions.