Scaling Karpathy's Autoresearch with 16 GPUs: Results and Methods

✍️ OpenClawRadar📅 Published: March 19, 2026🔗 Source
Scaling Karpathy's Autoresearch with 16 GPUs: Results and Methods
Ad

What is Autoresearch?

Autoresearch is Andrej Karpathy's project where a coding agent autonomously improves a neural network training script. The agent edits train.py, runs a 5-minute training experiment on a GPU, checks validation loss, and loops - keeping changes that help, discarding those that don't. In Karpathy's first overnight run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.

How Autoresearch Works

The project has three files:

  • prepare.py - Downloads data, trains a tokenizer, provides the dataloader and evaluation function. Read-only. The agent cannot touch it.
  • train.py - The GPT model, optimizer, and training loop. This is the only file the agent modifies.
  • program.md - Instructions for the agent: what it can change, how to evaluate results, when to keep vs. discard changes.

The constraint is a fixed 5-minute wall-clock training budget. The agent's job is to minimize val_bpb (validation bits per byte) within that window. Everything in train.py is fair game - architecture, hyperparameters, optimizer settings, batch size, model depth - as long as the code runs without crashing.

The Bottleneck: One GPU, One Experiment

Running experiments sequentially means the agent spends most of its time waiting. A typical cycle looks like:

  1. Agent edits train.py (~30 seconds)
  2. Training runs (~5 minutes)
  3. Agent reads the result, plans the next experiment (~30 seconds)

Step 2 dominates. During step 2, the agent is idle - it could be preparing the next experiment, or the next ten. With sequential execution, testing combinations of parameters means waiting another 5 minutes for each test.

Ad

Giving the Agent Cloud GPUs

The team used SkyPilot, an open-source tool that launches jobs across clouds and Kubernetes from a YAML file. It includes a skill that teaches coding agents to use it. The agent reads the skill, then launches and manages GPU clusters on its own - no manual cloud setup.

Each experiment is defined in a short YAML (experiment.yaml) that specifies the GPU type, installs dependencies, runs train.py, and prints metrics to stdout. The agent checks results with sky logs.

Results: ~910 Experiments, ~8 Hours, 16 GPUs

Claude Code used the SkyPilot skill to launch and manage GPU experiments across 16 GPUs. Over 8 hours it submitted ~910 experiments and drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline.

How Parallelism Changed the Agent's Research Strategy

With one GPU, the agent does greedy hill-climbing - try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss.

For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one - one round instead of six.

The agent also discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference across heterogeneous hardware: screen ideas on cheaper H100s, promote winners to H200 for validation.

Performance Comparison

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).

Experiment Phases

  • Phase 1: Hyperparameter sweeps (~first 200 experiments)
  • Phase 2: Architecture discovery (~experiments 200-420)
  • Phase 3: Fine-tuning the wider model (~experiments 420-560)
  • Phase 4: Optimizer tuning (~experiments 560-700)
  • Phase 5: Diminishing returns (~experiments 700-910)

The agent found that scaling model width mattered more than any single hyperparameter.

📖 Read the full source: HN AI Agents

Ad

👀 See Also