Qwen 3 8B Beats 4x Larger Models in 6 of 13 Hard Tasks

Evaluation Results

A blind peer evaluation system called The Multivac tested 10 small language models on 13 hard, frontier-level questions, pitched at the same difficulty level used for GPT-5.4 and Claude Opus 4.6. The judging models did not know which model had written which response, and rankings were computed from peer consensus.

Key Findings

Qwen 3 8B (8B parameters) achieved:

6 first-place wins out of 13 evaluations
Top-3 finishes in 12 of 13 tasks
Average score of 9.40
Worst finish: 5th place

This performance exceeded models with significantly larger parameter counts, including:

Gemma 3 27B (27B parameters): 3 wins, 11 top-3 finishes, average 9.33
Kimi K2.5 (MoE, 32B active / 1T total parameters): 3 wins, 5 top-3 finishes, average 8.78
Qwen 3 32B (32B parameters): 2 wins, 5 top-3 finishes, average 8.40

Task-Specific Performance

On code tasks, Qwen 3 8B placed:

1st on Go concurrency debugging (9.65)
1st on distributed lock analysis (9.33)
Tied 1st on SQL optimization (9.66)

On reasoning tasks, it placed:

1st on Simpson's Paradox (9.51)
1st on investment decision theory (9.63)
2nd on Bayesian diagnosis (9.53)

Notable Observations

Qwen 3 32B collapsed on the distributed lock debugging task (EVAL-20260315-043330), scoring only 1.00 out of 10 while every other model scored above 5.5. The 8B model scored 9.33 on the identical task. The cause is unclear; it could be OpenRouter routing, quantization artifacts, or a genuine failure mode.

Kimi K2.5, technically a MoE model with 32B active and 1T total parameters, won 3 evaluations: the 502 debugging task (9.57), Arrow's voting theorem (9.18), and survivorship bias (9.63).

Llama 3.1 8B finished last or second-to-last in 10 of 13 evaluations with an average score of 7.51. That is 1.89 points below Qwen 3 8B (9.40), despite the two models having the same parameter count.

Methodology Notes

The evaluation used a blind peer system: all 10 models answer the same question, then each model judges all 10 responses. That gives 10 × 10 = 100 judgments per evaluation, or 90 once the 10 self-judgments are excluded. The author notes genuine limitations: AI judging AI has a circularity problem, and scores measure peer consensus rather than ground truth. A human baseline study is being developed to measure how well peer scores correlate with human judgment.
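
The scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not The Multivac's actual code: the function name peer_consensus, the model names, and the scores are all hypothetical, and averaging the peer scores is an assumption, since the source says only that rankings come from peer consensus. With 10 models, the same loop would count 10 × 9 = 90 judgments.

```python
from statistics import mean

def peer_consensus(scores):
    """Rank models by the mean score they receive from their peers.

    scores[judge][author] is the score `judge` gave to `author`'s response.
    Self-judgments (judge == author) are ignored.
    Averaging is an assumption; the source only says "peer consensus".
    """
    models = sorted(scores)
    consensus = {}
    judgments = 0
    for author in models:
        received = [scores[judge][author] for judge in models if judge != author]
        judgments += len(received)
        consensus[author] = mean(received)
    ranking = sorted(models, key=lambda m: consensus[m], reverse=True)
    return ranking, consensus, judgments

# Hypothetical three-model example (not data from the evaluation).
scores = {
    "model_a": {"model_a": 10.0, "model_b": 8.0, "model_c": 6.0},
    "model_b": {"model_a": 9.0, "model_b": 10.0, "model_c": 7.0},
    "model_c": {"model_a": 9.5, "model_b": 8.5, "model_c": 10.0},
}

ranking, consensus, judgments = peer_consensus(scores)
print(ranking)    # models ordered from highest to lowest peer score
print(judgments)  # number of non-self judgments used
```

Dropping self-judgments keeps a model from inflating its own rank, but it does not address the circularity the author mentions: the consensus still reflects what the judging models prefer, not an external ground truth.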

📖 Read the full source: r/LocalLLaMA