Self-Supervised Fine-Tuning on Own Mistakes Boosts Small Models to 80% on HumanEval

A developer on r/LocalLLaMA implemented a self-supervised training loop in which a small language model invents its own coding problems, attempts solutions, and fine-tunes on broken/working pairs where the Python interpreter confirms the working attempt. The loop applies the key insight from the DeepSeek-R1 paper, that models can improve through verifiable rewards, without any human-labeled data.
Method
The base model (starting with Qwen 2.5 7B) was prompted to invent a coding problem and a few small tests. It then solved the same problem multiple times. The Python interpreter acted as the sole judge: pairs of (broken attempt, working attempt) were saved. Fine-tuning was performed on these self-mined corrections. No human-written code was used in training.
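The mining step can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names `passes_tests` and `mine_pairs`, the pairing rule (every failing attempt is paired with the first passing one), the timeout, and the sample attempts are all assumptions made for the example.

```python
import subprocess
import sys


def passes_tests(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 counts as a pass."""
    program = solution + "\n\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hanging attempt is treated as broken.
        return False
    return result.returncode == 0


def mine_pairs(attempts: list[str], tests: str) -> list[dict[str, str]]:
    """Pair each failing attempt with a passing one (assumed pairing rule)."""
    graded = [(attempt, passes_tests(attempt, tests)) for attempt in attempts]
    working = [attempt for attempt, ok in graded if ok]
    broken = [attempt for attempt, ok in graded if not ok]
    if not working:
        return []  # nothing verified, so nothing to learn from
    return [{"broken": b, "working": working[0]} for b in broken]


# Hypothetical model outputs for a self-invented problem.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
attempts = [
    "def add(a, b):\n    return a - b",  # fails the tests
    "def add(a, b):\n    return a + b",  # passes the tests
]
pairs = mine_pairs(attempts, tests)
```

Running each attempt in a separate process keeps a crashing or looping solution from taking down the mining loop, which matters when the interpreter is the only judge.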
Results
- Qwen 2.5 7B base: 25 → 112 of 164 HumanEval problems (+87), measured after fixing a grader bug that truncated model outputs.
- Qwen 2.5 14B: Mined 100 pairs, trained in 95 minutes on an H100 ($3.50 in credits). Scored within 4 points of the vendor's own RLHF-tuned version of the model.
- Llama 3.2 3B: Mined 32 pairs; HumanEval 39 → 43. This suggests the method transfers across architectures.
- Qwen 2.5 Coder 7B: Already code-specialized, yet still improved: HumanEval 83 → 87, MBPP 122 → 124.
- Qwen 3 4B: HumanEval 79 → 106 (+27), MBPP 135 → 148.
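As a quick check on the headline Qwen 2.5 7B figure, the counts can be converted to pass rates. The sketch below assumes the HumanEval numbers are solved-problem counts out of 164, which is how the control result (25/164) is reported; the helper name `pass_rate` is made up for the example.

```python
HUMANEVAL_TOTAL = 164  # denominator taken from the reported 25/164 control result


def pass_rate(solved: int, total: int = HUMANEVAL_TOTAL) -> float:
    """Percentage of benchmark problems solved."""
    return 100.0 * solved / total


before, after = 25, 112  # Qwen 2.5 7B base, before and after fine-tuning
gain = after - before
summary = f"{pass_rate(before):.1f}% -> {pass_rate(after):.1f}% (+{gain} problems)"
print(summary)  # prints "15.2% -> 68.3% (+87 problems)"
```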
Control Experiment
To check that the gain did not come simply from running more training, the author built fake pairs out of random garbage code that did not pass any tests. Training on those produced zero lift (25/164, the same as the base model). The improvement therefore comes specifically from learning on self-generated mistakes and their verified corrections.
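One way such a control set could be assembled is sketched below. The post does not describe how the garbage was produced, so everything here is an assumption for illustration: the random-character generator, the choice to make both sides of each pair garbage, the in-process `exec` check, and the names `random_garbage`, `fails_tests`, and `make_control_pairs`.

```python
import random
import string


def random_garbage(rng: random.Random, length: int = 40) -> str:
    """Random characters standing in for 'code' that carries no real signal."""
    alphabet = string.ascii_letters + string.digits + " ()=+:"
    return "".join(rng.choice(alphabet) for _ in range(length))


def fails_tests(code: str, tests: str) -> bool:
    """True if running code followed by tests raises any exception."""
    try:
        exec(code + "\n" + tests, {})
    except Exception:
        return True
    return False


def make_control_pairs(n: int, tests: str, seed: int = 0) -> list[dict[str, str]]:
    """Build n fake pairs whose 'working' side does not actually pass the tests."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n:
        broken, fake_fix = random_garbage(rng), random_garbage(rng)
        # Keep the pair only if neither side passes, matching the control's premise.
        if fails_tests(broken, tests) and fails_tests(fake_fix, tests):
            pairs.append({"broken": broken, "working": fake_fix})
    return pairs


tests = "assert add(2, 3) == 5"  # hypothetical tests for an invented problem
control_pairs = make_control_pairs(5, tests)
```

Keeping the same pair format as the real data means the only difference between the two training runs is whether the "working" side was verified by the interpreter.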
Practical Details
The initial attempt failed because the grader stopped early, cutting model outputs in half. Fixing the grader was critical. The entire setup ran on a 24GB MacBook and a RunPod account. The code and training scripts are presumably shared in the Reddit post.
Who It's For
Developers and researchers working with small language models who want to bootstrap code reasoning without human annotations.
📖 Read the full source: r/LocalLLaMA