Scaling Karpathy's Autoresearch with 16 GPUs: Results and Methods

What is Autoresearch?
Autoresearch is Andrej Karpathy's project where a coding agent autonomously improves a neural network training script. The agent edits train.py, runs a 5-minute training experiment on a GPU, checks validation loss, and loops - keeping changes that help, discarding those that don't. In Karpathy's first overnight run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.
How Autoresearch Works
The project has three files:
- prepare.py - Downloads data, trains a tokenizer, provides the dataloader and evaluation function. Read-only. The agent cannot touch it.
- train.py - The GPT model, optimizer, and training loop. This is the only file the agent modifies.
- program.md - Instructions for the agent: what it can change, how to evaluate results, when to keep vs. discard changes.
The constraint is a fixed 5-minute wall-clock training budget. The agent's job is to minimize val_bpb (validation bits per byte) within that window. Everything in train.py is fair game - architecture, hyperparameters, optimizer settings, batch size, model depth - as long as the code runs without crashing.
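The keep-or-discard loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `hill_climb`, `toy_evaluate`, and `make_toy_propose` are hypothetical names, and the toy evaluation function stands in for a real 5-minute training run that reports val_bpb.

```python
import random

def hill_climb(evaluate, propose, config, rounds):
    """Greedy keep/discard loop: keep a candidate only if it lowers val_bpb."""
    best_bpb = evaluate(config)
    for _ in range(rounds):
        candidate = propose(config)   # stands in for the agent editing train.py
        bpb = evaluate(candidate)     # stands in for one 5-minute training run
        if bpb < best_bpb:            # keep changes that help...
            config, best_bpb = candidate, bpb
        # ...and silently discard those that don't
    return config, best_bpb

def toy_evaluate(config):
    # Hypothetical objective: pretend val_bpb is lowest at lr = 0.02.
    return 1.0 + (config["lr"] - 0.02) ** 2

def make_toy_propose(seed=0):
    rng = random.Random(seed)
    def propose(config):
        # Hypothetical edit: rescale the learning rate by a random factor.
        return {**config, "lr": config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])}
    return propose

start = {"lr": 0.1}
best_config, best_bpb = hill_climb(toy_evaluate, make_toy_propose(), start, rounds=20)
```

Because a candidate is kept only when it strictly improves the metric, the final score can never be worse than the starting one.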
The Bottleneck: One GPU, One Experiment
Running experiments sequentially means the agent spends most of its time waiting. A typical cycle looks like:
- Agent edits train.py (~30 seconds)
- Training runs (~5 minutes)
- Agent reads the result, plans the next experiment (~30 seconds)
The training run dominates the cycle. While it runs, the agent sits idle, even though it could be preparing the next experiment, or the next ten. With sequential execution, every additional parameter combination costs another 5-minute wait.
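A quick back-of-the-envelope calculation, using the approximate timings listed above, shows how lopsided the cycle is. The variable names are illustrative only.

```python
# Approximate per-step timings from the cycle above, in seconds.
EDIT_S = 30
TRAIN_S = 5 * 60
REVIEW_S = 30

cycle_s = EDIT_S + TRAIN_S + REVIEW_S      # 360 seconds per experiment
idle_fraction = TRAIN_S / cycle_s          # share of the cycle spent waiting on training
experiments_per_hour = 3600 / cycle_s      # sequential throughput on one GPU

print(f"cycle: {cycle_s}s, idle: {idle_fraction:.0%}, per hour: {experiments_per_hour:.0f}")
```

Under these assumptions the agent waits for about five sixths of every cycle and completes roughly ten experiments per hour on a single GPU.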
Giving the Agent Cloud GPUs
The team used SkyPilot, an open-source tool that launches jobs across clouds and Kubernetes from a YAML file. It includes a skill that teaches coding agents to use it. The agent reads the skill, then launches and manages GPU clusters on its own - no manual cloud setup.
Each experiment is defined in a short YAML (experiment.yaml) that specifies the GPU type, installs dependencies, runs train.py, and prints metrics to stdout. The agent checks results with sky logs.
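Since each run prints its metrics to stdout, reading a result comes down to extracting a number from log text. The sketch below assumes a log line of the form `val_bpb: 0.9912`; that format, the regex, and the `parse_val_bpb` helper are assumptions for illustration, and the real train.py output may look different.

```python
import re

# Assumed log format: a line such as "val_bpb: 0.9912" or "val_bpb=0.9912".
VAL_BPB_RE = re.compile(r"val_bpb\s*[:=]\s*([0-9]*\.?[0-9]+)")

def parse_val_bpb(log_text):
    """Return the last val_bpb value in the log text, or None if none was printed."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None

# Hypothetical log excerpt, such as text captured from a job's stdout.
sample_log = """step 100 | loss 3.21
val_bpb: 0.9912
"""

result = parse_val_bpb(sample_log)
crashed = parse_val_bpb("Traceback (most recent call last): ...")
```

Returning None for a log without the metric gives the caller a simple way to treat crashed runs as discarded experiments.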
Results: ~910 Experiments, ~8 Hours, 16 GPUs
Claude Code used the SkyPilot skill to launch and manage GPU experiments across 16 GPUs. Over 8 hours it submitted ~910 experiments and drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline.
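The relative improvement can be checked from the two rounded endpoints quoted above. Note that the rounded values give about 2.89%, slightly above the reported 2.87%; presumably the reported figure was computed from unrounded losses, though that is an assumption.

```python
baseline_bpb = 1.003   # rounded starting val_bpb from the text
best_bpb = 0.974       # rounded final val_bpb from the text

# Relative reduction in val_bpb compared with the baseline.
relative_improvement = (baseline_bpb - best_bpb) / baseline_bpb

print(f"{relative_improvement:.2%}")
```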
How Parallelism Changed the Agent's Research Strategy
With one GPU, the agent does greedy hill-climbing - try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss.
For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one - one round instead of six.
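A factorial grid of the kind described above can be built with `itertools.product` and split into waves no larger than the GPU count. This is only a sketch: the parameter names and values are hypothetical, not the ones the agent actually swept.

```python
from itertools import product

def factorial_grid(param_values):
    """Return one config dict per combination of the given parameter values."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]

def waves(experiments, num_gpus):
    """Split experiments into consecutive batches of at most num_gpus each."""
    return [experiments[i:i + num_gpus] for i in range(0, len(experiments), num_gpus)]

# Hypothetical sweep: 3 widths x 4 learning rates = 12 experiments.
grid = factorial_grid({"width": [512, 768, 1024], "lr": [0.01, 0.02, 0.04, 0.08]})
batches = waves(grid, num_gpus=16)
```

With 16 GPUs this 12-experiment grid fits in a single wave, so every width and learning rate pairing is measured in one round rather than twelve sequential ones.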
The agent also discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference across heterogeneous hardware: screen ideas on cheaper H100s, promote winners to H200 for validation.
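The screen-then-promote strategy might look roughly like the following. Everything here is a hypothetical stand-in: `screen_and_promote` and `toy_run_on` are illustrative names, and the toy scoring function simply returns made-up val_bpb values rather than training anything.

```python
def screen_and_promote(candidates, run_on, top_k=2):
    """Screen every candidate on H100, then re-run only the top_k on H200."""
    # Lower val_bpb is better, so an ascending sort puts the winners first.
    screened = sorted(candidates, key=lambda cfg: run_on("H100", cfg))
    validated = [(cfg, run_on("H200", cfg)) for cfg in screened[:top_k]]
    return min(validated, key=lambda pair: pair[1])

def toy_run_on(gpu, cfg):
    # Hypothetical scores: wider is better here, and the faster GPU gets a
    # small bonus, standing in for more training steps in the same 5 minutes.
    bonus = 0.005 if gpu == "H200" else 0.0
    return 1.0 - cfg["width"] / 100_000 - bonus

candidates = [{"width": 512}, {"width": 768}, {"width": 1024}]
best_cfg, best_bpb = screen_and_promote(candidates, toy_run_on, top_k=2)
```

Only `top_k` candidates ever reach the more expensive GPU type, which is the point of the two-tier design.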
Performance Comparison
With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
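The headline numbers can be put in context with some simple arithmetic. The inputs below are the approximate figures quoted in this article; the derived parallel efficiency is a rough estimate, not a reported result.

```python
parallel_hours = 8       # approximate wall-clock time with 16 GPUs
sequential_hours = 72    # approximate simulated sequential baseline
num_gpus = 16
experiments = 910        # approximate number of submitted experiments

speedup = sequential_hours / parallel_hours          # 9.0
efficiency = speedup / num_gpus                      # fraction of ideal 16x scaling
experiments_per_hour = experiments / parallel_hours  # cluster-wide throughput

print(f"speedup {speedup:.0f}x, efficiency {efficiency:.0%}, "
      f"{experiments_per_hour:.0f} experiments/hour")
```

A 9x speedup on 16 GPUs corresponds to roughly 56% of ideal linear scaling under these figures.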
Experiment Phases
- Phase 1: Hyperparameter sweeps (~first 200 experiments)
- Phase 2: Architecture discovery (~experiments 200-420)
- Phase 3: Fine-tuning the wider model (~experiments 420-560)
- Phase 4: Optimizer tuning (~experiments 560-700)
- Phase 5: Diminishing returns (~experiments 700-910)
The agent found that scaling model width mattered more than any single hyperparameter.
📖 Read the full source: HN AI Agents