ATLAS: 74.6% on Coding Benchmarks With $500 GPU, Beats Claude Sonnet

What ATLAS Does

ATLAS (Adaptive Test-time Learning and Autonomous Specialization) is a framework that wraps a frozen, smaller model in inference-time infrastructure so that it can compete with frontier API models. It combines structured generation, energy-based verification, and self-verified repair, and requires no fine-tuning, API calls, or cloud dependencies. The system is fully self-hosted, and no data leaves the machine.

Benchmark Results

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

LiveCodeBench v5: 74.6% pass@1-v(k=3) on 599 tasks
GPQA Diamond: 47.0% (k=5) on 198 multiple-choice knowledge reasoning tasks
SciCode: 14.7% (k=1) on 341 cross-domain scientific coding tasks

Note: pass@1-v(k=3) means one solution is submitted per task, produced via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation.
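To make the metric concrete, here is a minimal sketch of how such a score could be tallied. The `TaskOutcome` record and its field names are hypothetical and not taken from ATLAS; the only point is that each task contributes exactly one submitted solution, however many candidates were generated internally.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    candidates_generated: int  # internal work (e.g. 3 candidates plus repairs)
    submitted_passed: bool     # did the single submitted solution pass?


def pass_at_1_v(outcomes: list[TaskOutcome]) -> float:
    """Fraction of tasks whose one submitted solution passed.

    The number of internal candidates does not enter the score:
    a task counts once, as either a pass or a fail.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.submitted_passed for o in outcomes) / len(outcomes)
```

Because only the submitted solution is scored, the number is comparable in form to pass@1, but the work behind each submission is not single-shot.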

V3 Pipeline Ablation Breakdown

Baseline (no V3): 54.9%
+Phase 1 (PlanSearch + BudgetForcing + DivSampling): 67.3% (+12.4pp)
+Phase 1+2 (Lens routing): 67.3% (+0.0pp)
+Phase 1+3 (self-verified refinement): 74.6% (+7.3pp)

Phase 3 uses self-generated test cases for internal verification — the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues).

Cost and Performance Comparison

DeepSeek V3.2 Reasoning: 86.2% LCB pass@1, ~$0.002/task (API, single-shot)
GPT-5 (high): 84.6%, ~$0.043/task (API, single-shot)
ATLAS V3 (pass@1-v(k=3)): 74.6%, ~$0.004/task (local electricity only, best-of-3 + repair pipeline)
Claude 4.5 Sonnet: 71.4%, ~$0.066/task (API, single-shot)
Claude 4 Sonnet: 65.5%, ~$0.066/task (API, single-shot)

ATLAS cost calculation: electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost — the pipeline takes longer per task than a single API call.
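As a rough sketch of how an electricity-only cost per task can be estimated from inputs like these, the helper below multiplies power draw by runtime, converts to kWh, prices it, and divides by the task count. The function name and parameters are illustrative. The source does not break down how it arrives at ~$0.004/task, so this calculation, which counts only GPU draw over the quoted runtime, should not be expected to reproduce that figure; substitute whole-system power or measured wall-clock time as appropriate.

```python
def electricity_cost_per_task(
    power_watts: float,
    runtime_hours: float,
    price_per_kwh: float,
    num_tasks: int,
) -> float:
    """Electricity cost per task in dollars (hypothetical helper)."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    energy_kwh = power_watts / 1000 * runtime_hours
    return energy_kwh * price_per_kwh / num_tasks


# Inputs quoted in the text: ~165 W GPU, ~1h 55m, $0.12/kWh, 599 tasks.
# Assumption: GPU draw only, constant over the whole run.
gpu_only = electricity_cost_per_task(165, 1 + 55 / 60, 0.12, 599)
print(f"GPU-only electricity: ${gpu_only:.6f} per task")
```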

How It Works

The V3 pipeline has three phases:

Phase 1 (Generate): PlanSearch with constraint extraction and diverse plans, plus Budget Forcing with thinking-token control
Phase 2 (Verify): Geometric Lens with energy scoring (5120-dim self-embeddings) and sandbox code execution
Phase 3 (Repair): Self-Test Generation with model-generated I/O pairs, and PR-CoT Repair with multi-perspective chain-of-thought

The workflow: PlanSearch → Budget Forcing → k=3 candidates → Geometric Lens → energy-sorted → Sandbox → if all fail → Self-Test Generation → PR-CoT Repair → repaired code → Sandbox.
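The control flow of that workflow can be sketched roughly as follows. This is not ATLAS's code: every callable passed in (`generate`, `energy`, `run_sandbox`, `make_self_tests`, `repair`) is a hypothetical stand-in for the corresponding stage. The sketch also assumes that lower energy means a more promising candidate and that repair is attempted on each candidate in ranked order; the source states neither.

```python
from typing import Callable, Optional, Sequence


def solve_task(
    problem: str,
    generate: Callable[[str, int], Sequence[str]],   # PlanSearch + Budget Forcing
    energy: Callable[[str], float],                  # Geometric Lens score
    run_sandbox: Callable[[str], bool],              # True if the code passes
    make_self_tests: Callable[[str], list],          # Self-Test Generation
    repair: Callable[[str, str, list], str],         # PR-CoT Repair
    k: int = 3,
) -> Optional[str]:
    """Return the first candidate (or repaired candidate) that passes."""
    candidates = generate(problem, k)

    # Assumption: lower energy = more promising, so try those first.
    ranked = sorted(candidates, key=energy)
    for code in ranked:
        if run_sandbox(code):
            return code

    # All k candidates failed: fall back to self-verified repair.
    self_tests = make_self_tests(problem)
    for code in ranked:
        fixed = repair(problem, code, self_tests)
        if run_sandbox(fixed):
            return fixed

    # What gets submitted when repair also fails is not specified
    # in the source; this sketch simply reports no solution.
    return None
```

Passing the stages in as callables keeps the sketch independent of any particular model server or sandbox, and makes the early-exit behaviour easy to check with stubs.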

A single patched llama-server runs on K3s and provides both generation (with speculative decoding) and embedding services.
