ATLAS: Adaptive Test-time Learning Framework Outperforms Claude Sonnet on Coding Benchmarks with $500 GPU

What ATLAS Does
ATLAS (Adaptive Test-time Learning and Autonomous Specialization) is a framework that wraps a frozen smaller model in intelligent infrastructure to compete with frontier API models. It uses structured generation, energy-based verification, and self-verified repair without fine-tuning, API calls, or cloud dependencies. The system is fully self-hosted with no data leaving the machine.
Benchmark Results
Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)
- LiveCodeBench v5: 74.6% pass@1-v(k=3) on 599 tasks
- GPQA Diamond: 47.0% on 198 k=5 multiple-choice knowledge reasoning tasks
- SciCode: 14.7% on 341 k=1 cross-domain scientific coding tasks
Note: pass@k-v(k=3) means one solution submitted per task, generated via best-of-3 candidates + Lens selection + iterative repair on failures. Not single-shot generation.
V3 Pipeline Ablation Breakdown
- Baseline (no V3): 54.9%
- +Phase 1 (PlanSearch + BudgetForcing + DivSampling): 67.3% (+12.4pp)
- +Phase 1+2 (Lens routing): 67.3% (+0.0pp)
- +Phase 1+3 (self-verified refinement): 74.6% (+7.3pp)
Phase 3 uses self-generated test cases for internal verification — the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues).
Cost and Performance Comparison
- DeepSeek V3.2 Reasoning: 86.2% LCB pass@1, ~$0.002/task (API, single-shot)
- GPT-5 (high): 84.6%, ~$0.043/task (API, single-shot)
- ATLAS V3 (pass@1-v(k=3)): 74.6%, ~$0.004/task (local electricity only, best-of-3 + repair pipeline)
- Claude 4.5 Sonnet: 71.4%, ~$0.066/task (API, single-shot)
- Claude 4 Sonnet: 65.5%, ~$0.066/task (API, single-shot)
ATLAS cost calculation: electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost — the pipeline takes longer per task than a single API call.
How It Works
The V3 pipeline has three phases:
- Phase 1: Generate — PlanSearch with constraint extraction and diverse plans, Budget Forcing with thinking token control
- Verify — Geometric Lens with energy scoring (5120-dim self-embeddings) and sandbox code execution
- Phase 3: Repair — Self-Test Generation with model-generated I/O pairs and PR-CoT Repair with multi-perspective chain-of-thought
The workflow: PlanSearch → Budget Forcing → k=3 candidates → Geometric Lens → energy-sorted → Sandbox → if all fail → Self-Test Generation → PR-CoT Repair → repaired code → Sandbox.
A single patched llama-server runs on K3s, providing both generation with speculative execution and embedding services.
📖 Read the full source: HN AI Agents
👀 See Also

P2PCLAW: A Peer-to-Peer Network for AI Agents to Publish Formally Verified Science
P2PCLAW is a peer-to-peer network where AI agents and human researchers can publish scientific results validated through formal mathematical proofs in Lean 4. The system uses GUN.js and IPFS, with post-quantum cryptography and privacy features for secure participation.

Claude Code Undocumented Features: Hooks, Memory, YOLO Classifier & More
The Claude Code source reveals hidden configs: YOLO Classifier for auto-permission, hooks that rewrite commands, persistent agent memory, auto-mode rules in plain English, and dream loops.

Sandra: open-source persistent graph memory MCP for Claude
Sandra is a graph + vector memory backend with a native MCP server that gives Claude persistent structured memory across sessions, supporting exact, fuzzy, and semantic search.

OpenClaw Agent Relay Plugin Fixes Telegram Delivery in Multi-Agent Setups
The openclaw-agent-relay plugin addresses the persistent issue where sessions_send responses go to webchat instead of Telegram by using gateway WebSocket RPC to trigger agent turns with deliver:true, eliminating the need for workarounds like explicit message tools or announce steps.