Self-Supervised Fine-Tuning Boosts 7B Models to 80% on HumanEval

A developer on r/LocalLLaMA implemented a self-supervised training loop in which a small language model generates its own coding problems, attempts solutions, and fine-tunes on the pairs where the interpreter confirms correctness. The approach applies the key insight from the DeepSeek-R1 paper, that models can improve through verifiable rewards, without any human-labeled data.

Method

The base model (starting with Qwen 2.5 7B) was prompted to invent a coding problem and a few small tests. It then solved the same problem multiple times. The Python interpreter acted as the sole judge: when one attempt failed the tests and another passed, the (broken attempt, working attempt) pair was saved. The model was then fine-tuned on these self-mined corrections. No human-written code was used in training.
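
The mining step described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names `passes_tests` and `mine_pairs`, the 5-second timeout, and the choice to pair every failing attempt with the first passing one are all assumptions, and the model calls that produce the problem, tests, and attempts are left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 counts as a pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,  # assumed limit; guards against infinite loops
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def mine_pairs(attempts: list[str], tests: str) -> list[tuple[str, str]]:
    """Return (broken, working) pairs for one self-generated problem."""
    verdicts = [(attempt, passes_tests(attempt, tests)) for attempt in attempts]
    working = [attempt for attempt, ok in verdicts if ok]
    broken = [attempt for attempt, ok in verdicts if not ok]
    if not working:
        return []  # nothing verified as correct, so nothing to learn from
    # Assumption: pair each failure with the first verified solution.
    return [(bad, working[0]) for bad in broken]


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    attempts = [
        "def add(a, b):\n    return a - b",
        "def add(a, b):\n    return a + b",
    ]
    for bad, good in mine_pairs(attempts, tests):
        print("broken:", repr(bad))
        print("working:", repr(good))
```

Running each attempt in a separate process keeps a crashing or hanging candidate from taking down the mining loop, which matters when the only judge is the interpreter.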

Results

Qwen 2.5 7B base: 25 → 112 on HumanEval (+87 problems) after fixing a grader bug that truncated function outputs.
Qwen 2.5 14B: Mined 100 pairs and trained in 95 minutes on an H100 ($3.50 in credits). The result scored within 4 points of the same company's RLHF version.
Llama 3.2 3B: Mined 32 pairs; HumanEval 39 → 43, indicating that the method transfers across architectures.
Qwen 2.5 Coder 7B: Already code-specialized, yet still improved: HumanEval 83 → 87, MBPP 122 → 124.
Qwen 3 4B: HumanEval 79 → 106 (+27), MBPP 135 → 148.

Control Experiment

To verify that the gain did not come from additional training alone, the author built fake pairs from random garbage code that passed no tests. Training on those produced zero lift (25/164, same as base), so the improvement comes specifically from learning on self-generated mistakes and their corrections.

Practical Details

The initial attempt failed because the grader stopped early, cutting model outputs in half. Fixing the grader was critical. The entire setup ran on a 24GB MacBook and a RunPod account. The code and training scripts are presumably shared in the Reddit post.
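
To illustrate how a grader that stops early can hide correct answers, here is a hypothetical sketch. The source does not say what the actual bug was; the blank-line cutoff, the `STOP_SEQUENCES` markers, and all function names below are assumptions chosen only to show the failure mode.

```python
# Assumed stop markers: places where new top-level code starts after the function.
STOP_SEQUENCES = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def cut_at_first_blank_line(completion: str) -> str:
    """Hypothetical buggy rule: stop at the first blank line in the output."""
    return completion.split("\n\n", 1)[0]


def cut_at_next_top_level(completion: str) -> str:
    """Keep the whole function body; cut only where new top-level code begins."""
    hits = [completion.find(stop) for stop in STOP_SEQUENCES]
    hits = [index for index in hits if index != -1]
    return completion[: min(hits)] if hits else completion


def run_candidate(prompt: str, body: str, func_name: str, arg):
    """Define the function from prompt + body and call it on one argument."""
    namespace = {}
    exec(prompt + body, namespace)
    return namespace[func_name](arg)


PROMPT = "def sum_list(xs):\n"
COMPLETION = (
    "    acc = 0\n"
    "\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
    "\n"
    "def unrelated():\n"
    "    pass\n"
)

if __name__ == "__main__":
    for cut in (cut_at_first_blank_line, cut_at_next_top_level):
        result = run_candidate(PROMPT, cut(COMPLETION), "sum_list", [1, 2, 3])
        print(cut.__name__, "->", result)
```

With the blank-line rule the function loses its loop and its `return`, so a correct completion is graded as wrong; cutting only at the next top-level statement keeps the full body.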

Who It's For

Developers and researchers working with small language models who want to bootstrap code reasoning without human annotations.

📖 Read the full source: r/LocalLLaMA
