Qwen 3 8B outperforms larger models in blind peer evaluations on hard tasks

Evaluation Results
A blind peer evaluation system called The Multivac tested 10 small language models on 13 hard, frontier-level questions, pitched at the same difficulty level used for GPT-5.4 and Claude Opus 4.6. Judging models did not know which model had written each response, and rankings were computed from peer consensus.
Key Findings
Qwen 3 8B (8B parameters) achieved:
- 6 first-place wins out of 13 evaluations
- Top-3 finishes in 12 of 13 tasks
- Average score of 9.40
- Worst finish: 5th place
This performance exceeded models with significantly larger parameter counts, including:
- Gemma 3 27B (27B parameters): 3 wins, 11 top-3 finishes, average 9.33
- Kimi K2.5 (32B/1T MoE): 3 wins, 5 top-3 finishes, average 8.78
- Qwen 3 32B (32B parameters): 2 wins, 5 top-3 finishes, average 8.40
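To put the summary figures above on a common footing, here is a minimal sketch that sorts the four models by reported average and converts top-3 counts into rates over the 13 evaluations. The numbers are copied from the list above; the dictionary layout and the `top3_rate` helper are illustrative assumptions, not part of the original evaluation tooling.

```python
# Summary statistics as reported in this article (13 evaluations total).
TOTAL_TASKS = 13

results = {
    "Qwen 3 8B": {"wins": 6, "top3": 12, "avg": 9.40},
    "Gemma 3 27B": {"wins": 3, "top3": 11, "avg": 9.33},
    "Kimi K2.5": {"wins": 3, "top3": 5, "avg": 8.78},
    "Qwen 3 32B": {"wins": 2, "top3": 5, "avg": 8.40},
}


def top3_rate(stats):
    """Fraction of the 13 evaluations in which the model finished in the top 3."""
    return stats["top3"] / TOTAL_TASKS


# Order models from highest to lowest reported average score.
by_average = sorted(results, key=lambda name: results[name]["avg"], reverse=True)

for name in by_average:
    stats = results[name]
    print(f"{name}: avg {stats['avg']:.2f}, wins {stats['wins']}, "
          f"top-3 rate {top3_rate(stats):.1%}")
```

Seen this way, the two leaders are close on average score (9.40 versus 9.33) and on top-3 rate (12 versus 11 of 13), while the wider gap is in first-place finishes (6 versus 3).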
Task-Specific Performance
On code tasks, Qwen 3 8B placed:
- 1st on Go concurrency debugging (9.65)
- 1st on distributed lock analysis (9.33)
- Tied 1st on SQL optimization (9.66)
On reasoning tasks, it placed:
- 1st on Simpson's Paradox (9.51)
- 1st on investment decision theory (9.63)
- 2nd on Bayesian diagnosis (9.53)
Notable Observations
Qwen 3 32B showed a significant performance drop on the distributed lock debugging task (EVAL-20260315-043330), scoring only 1.00 out of 10 while every other model scored above 5.5. The 8B model scored 9.33 on the identical task. The cause is unclear but could be related to OpenRouter routing, quantization artifacts, or a genuine failure mode.
Kimi K2.5, technically a 32B active/1T MoE model, won 3 evaluations: the 502 debugging task (9.57), Arrow's voting theorem (9.18), and survivorship bias (9.63).
Llama 3.1 8B finished last or second-to-last in 10 of 13 evaluations with an average score of 7.51, a gap of 1.89 points behind Qwen 3 8B (9.40) despite having the same parameter count.
Methodology Notes
The evaluation used a blind peer system in which 10 models respond to the same question, then each model judges all 10 responses (100 judgments per evaluation, or 90 once the 10 self-judgments are excluded). The author notes genuine limitations: AI judging AI has a circularity problem, and scores measure peer consensus rather than ground truth. A human baseline study is being developed to measure how well peer scores correlate with human judgment.
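The peer-scoring step described above can be sketched roughly as follows. This is a minimal illustration, not The Multivac's actual code: the article does not say how individual judgments are aggregated, so averaging the peer scores is an assumption, and the model names `A`, `B`, `C`, the scores, and the function names are all hypothetical.

```python
from statistics import mean


def count_peer_judgments(n_models):
    """Judgments per evaluation once self-judgments are dropped: n*n - n."""
    return n_models * n_models - n_models


def peer_scores(judgments):
    """Average each author's scores over all judges except the author itself.

    judgments[judge][author] is the score `judge` gave to `author`'s response.
    Aggregating by mean is an assumption made for this sketch.
    """
    authors = {author for row in judgments.values() for author in row}
    return {
        author: mean(
            row[author]
            for judge, row in judgments.items()
            if judge != author and author in row
        )
        for author in authors
    }


def rank(scores):
    """Model names ordered from highest to lowest peer score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical 3-model example; the real setup uses 10 models.
judgments = {
    "A": {"A": 10.0, "B": 8.0, "C": 6.0},
    "B": {"A": 9.0, "B": 10.0, "C": 5.0},
    "C": {"A": 9.0, "B": 8.0, "C": 10.0},
}

scores = peer_scores(judgments)
print(rank(scores))
print(count_peer_judgments(10))
```

Dropping the diagonal keeps a model from inflating its own score, but it does not address the circularity the author raises: the resulting ranking still reflects only what the other models agree on.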
📖 Read the full source: r/LocalLLaMA