Code Evolution Method Improves LLM Performance on ARC-AGI-2 Benchmark by Up to 2.8x

Code Evolution Boosts LLM Reasoning on ARC-AGI-2
Researchers from Imbue have published results showing how code evolution can significantly improve LLM performance on the ARC-AGI-2 benchmark. Their method combines fitness-based sampling with code mutation driven by a base LLM, and it improved accuracy for all three base models tested, by between 7.0 and 27.4 percentage points.
Performance Results
The evolution method produces different improvements depending on the base model:
- Kimi K2.5 (open-weights): 2.8x performance gain, from 12.1% to 34.0% accuracy on the public evaluation set, at $2.67 per task. This is the highest-performing open-weights solution for ARC-AGI-2 currently available.
- Gemini 3 Flash: 1.8x performance gain, from 34.0% to 61.4% accuracy, at $2.42 per task.
- Gemini 3.1 Pro: 7.0-point gain, from 88.1% to 95.1% accuracy, at $8.71 per task. This result is competitive with the current state of the art (97.9% at $11.77 per task by Confluence Lab).
All runs used the same evolution framework and prompts. The researchers note that these scores were measured on the public evaluation set and are not directly comparable to scores on the semi-private dataset used for the official ARC-AGI-2 leaderboard.
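The reported multipliers can be checked with a few lines of arithmetic. The sketch below recomputes each gain from the accuracy figures above and adds one derived number that is not in the source: the relative reduction in error rate, which makes the Gemini 3.1 Pro result easier to compare with the others. The function name `summarize` is only illustrative.

```python
# Accuracy (percent) before and after evolution, plus cost per task in USD,
# copied from the figures reported above.
results = {
    "Kimi K2.5": (12.1, 34.0, 2.67),
    "Gemini 3 Flash": (34.0, 61.4, 2.42),
    "Gemini 3.1 Pro": (88.1, 95.1, 8.71),
}


def summarize(base: float, evolved: float) -> tuple:
    """Return (multiplier, percentage-point gain, relative error reduction)."""
    multiplier = evolved / base
    point_gain = evolved - base
    # Derived metric (not reported in the source): the fraction of the
    # remaining error that the evolved solution removes.
    error_reduction = (evolved - base) / (100.0 - base)
    return round(multiplier, 2), round(point_gain, 1), round(error_reduction, 3)


for model, (base, evolved, cost) in results.items():
    multiplier, point_gain, error_reduction = summarize(base, evolved)
    print(f"{model}: {multiplier}x, +{point_gain} points, "
          f"{error_reduction:.1%} of errors removed, ${cost:.2f}/task")
```

Seen this way, the Gemini 3.1 Pro run has the smallest multiplier but removes the largest share of the remaining errors, because its baseline is already close to the ceiling.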
How Code Evolution Works
The method iteratively improves upon an initial solution using fitness-based sampling and code mutation. The mutation step is driven by an underlying base LLM but is agnostic to the specific model chosen. This approach can be applied across a wide range of reasoning and optimization tasks beyond ARC-AGI-2.
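The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Imbue's implementation: the source does not specify the sampling scheme, population size, or stopping rule, so fitness-proportional sampling, the `max_population` cap, and every name here (`evolve`, `toy_fitness`, `toy_mutate`) are hypothetical. The `mutate` callable stands in for the LLM call. A toy string-matching problem replaces real program synthesis so that the example runs on its own.

```python
import random
from typing import Callable, List, Tuple


def evolve(
    initial: str,
    fitness: Callable[[str], float],
    mutate: Callable[[str], str],
    generations: int = 20,
    children_per_generation: int = 4,
    max_population: int = 16,
    seed: int = 0,
) -> Tuple[str, float]:
    """Evolve a candidate; fitness must return a value in [0, 1]."""
    rng = random.Random(seed)
    population: List[Tuple[str, float]] = [(initial, fitness(initial))]
    for _ in range(generations):
        # Fitness-based sampling: higher-scoring candidates are chosen as
        # parents more often. The epsilon keeps zero-score candidates selectable.
        weights = [score + 1e-6 for _, score in population]
        parents = rng.choices(population, weights=weights, k=children_per_generation)
        for parent_code, _ in parents:
            child = mutate(parent_code)  # in the real method, an LLM call
            population.append((child, fitness(child)))
        # Keep only the best candidates so the population stays bounded.
        population.sort(key=lambda item: item[1], reverse=True)
        del population[max_population:]
        if population[0][1] >= 1.0:
            break
    return population[0]


# Toy stand-ins so the sketch is runnable without any model access.
TARGET = "rotate"
ALPHABET = "aeort"
_toy_rng = random.Random(1)


def toy_fitness(code: str) -> float:
    """Fraction of characters that match the target string."""
    return sum(a == b for a, b in zip(code, TARGET)) / len(TARGET)


def toy_mutate(code: str) -> str:
    """Replace one randomly chosen character (placeholder for an LLM edit)."""
    i = _toy_rng.randrange(len(code))
    return code[:i] + _toy_rng.choice(ALPHABET) + code[i + 1:]


best_code, best_score = evolve("aaaaaa", toy_fitness, toy_mutate, generations=200)
print(best_code, round(best_score, 2))
```

Because the mutation step is just a callable, swapping one base model for another changes only `mutate`, which is one way to read the claim that the framework is model-agnostic.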
For context, ARC-AGI (Abstraction and Reasoning Corpus) was proposed by François Chollet in 2019 as a way to measure "general fluid intelligence", meaning a system's ability to efficiently learn solutions to novel problems. Each task presents 2-5 input/output examples (rectangular grids of color values), and the solver must deduce the transformation rule and apply it to predict the outputs for the challenge inputs.
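To make the task format concrete, here is a small hand-made example in Python. The `train`/`test` dictionary layout follows the JSON layout used by the public ARC datasets as far as we know, but the grids, the `flip_horizontal` rule, and the `train_accuracy` helper are invented for illustration. Using accuracy on the example pairs as the fitness signal for a candidate program is an assumption; the source does not say how fitness is computed.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each integer is a color value

# A made-up task: every output is the input mirrored left to right.
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate program: reverse every row."""
    return [list(reversed(row)) for row in grid]


def identity(grid: Grid) -> Grid:
    """A deliberately wrong candidate, for comparison."""
    return [list(row) for row in grid]


def train_accuracy(program: Callable[[Grid], Grid], task: Dict[str, list]) -> float:
    """Fraction of example pairs the program reproduces exactly."""
    pairs = task["train"]
    solved = sum(program(pair["input"]) == pair["output"] for pair in pairs)
    return solved / len(pairs)


print(train_accuracy(flip_horizontal, task))
print(train_accuracy(identity, task))
print(flip_horizontal(task["test"][0]["input"]))
```

A candidate that reproduces every example pair is then applied to the challenge input to produce the prediction.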
📖 Read the full source: HN LLM Tools