Distilled Qwen3 Models Beat Frontier LLMs at 10x Lower Cost

Benchmark Results: Distilled vs. Frontier Models

Researchers compared small distilled models against frontier LLMs on 9 datasets covering classification, function calling, QA, and open-book QA. All distilled models are from the Qwen3 family (0.6B to 8B parameters) and were trained from as few as 50 examples using open-weight teacher models; no frontier API outputs were used for training.

Key Performance Findings

Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a 7th
Text2SQL: distilled Qwen3-4B reaches 98.0% vs 98.7% for Claude Haiku and 96.0% for GPT-5 nano, at a cost of $3 per million requests vs $378 and $24, respectively
Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
HotpotQA: Distilled models score 92.0% vs Haiku's 98.0%; open-ended reasoning with world knowledge remains frontier territory
Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option

Inference Performance

All models were served with vLLM on a single H100. The Text2SQL 4B model performed as follows:

222 RPS sustained
Latency: p50 390 ms, p95 640 ms, p99 870 ms
7.6 GiB VRAM (BF16, no quantization)
FP8 gave +15% throughput, -44% memory, no accuracy loss in brief experiments
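
As a rough check on the cost figures, the formula given under Methodology (H100 hourly price ÷ measured sustained RPS) can be applied to the 222 RPS reported here. The sketch below is only that arithmetic; the function and variable names are illustrative, and it assumes the GPU is kept fully busy at the sustained rate, which real traffic may not achieve.

```python
H100_USD_PER_HOUR = 2.40   # rental price quoted in the methodology
SUSTAINED_RPS = 222        # measured for the Text2SQL 4B model


def usd_per_million_requests(gpu_usd_per_hour: float, rps: float) -> float:
    """Cost of one million requests on a GPU kept busy at `rps`."""
    requests_per_hour = rps * 3600
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


distilled_cost = usd_per_million_requests(H100_USD_PER_HOUR, SUSTAINED_RPS)
print(f"distilled: ${distilled_cost:.2f} per million requests")

# Ratios against the frontier costs reported for Text2SQL.
for name, frontier_cost in {"Claude Haiku": 378.0, "GPT-5 nano": 24.0}.items():
    print(f"{name}: {frontier_cost / distilled_cost:.0f}x the distilled cost")
```

The result is about $3 per million requests, consistent with the Text2SQL figure above. At lower utilization the per-request cost rises proportionally, since the GPU is billed by the hour regardless of load.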

Methodology

Same test sets, same prompts, same eval criteria across all models
Frontier models were run 3 times per dataset (mean ± std reported); distilled models were run at temperature 0
Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, Claude Sonnet 4.6 as LLM-as-a-judge for generation
Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 at $2.40/hr ÷ measured sustained RPS
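
The write-up names tool_call_equivalence but does not show how it is implemented. One plausible reading of "JSON comparison with default param normalization" is sketched below: omitted parameters are filled in from the tool's defaults before the calls are compared. The function names, the call layout (`name` / `arguments`), and the `set_light` tool are all assumptions made for illustration, not the researchers' code.

```python
import json
from typing import Any


def normalize_call(call: dict[str, Any],
                   defaults: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Return the call with omitted parameters filled in from tool defaults."""
    name = call["name"]
    args = call.get("arguments", {})
    if isinstance(args, str):
        # Some APIs return arguments as a JSON-encoded string.
        args = json.loads(args)
    # Explicit arguments override defaults.
    merged = {**defaults.get(name, {}), **args}
    return {"name": name, "arguments": merged}


def tool_call_equivalent(predicted: dict[str, Any],
                         expected: dict[str, Any],
                         defaults: dict[str, dict[str, Any]]) -> bool:
    """Compare two tool calls after normalization; key order is irrelevant."""
    return normalize_call(predicted, defaults) == normalize_call(expected, defaults)


# Hypothetical smart-home tool with one default parameter.
DEFAULTS = {"set_light": {"brightness": 100}}

predicted = {"name": "set_light", "arguments": '{"room": "kitchen"}'}
expected = {"name": "set_light", "arguments": {"brightness": 100, "room": "kitchen"}}
wrong = {"name": "set_light", "arguments": {"room": "kitchen", "brightness": 40}}

print(tool_call_equivalent(predicted, expected, DEFAULTS))
print(tool_call_equivalent(wrong, expected, DEFAULTS))
```

Under this reading, a model that omits a parameter equal to its default is not penalized, while a call with a different explicit value still counts as a mismatch.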

Practical Recommendations

Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
Frontier API: broad world knowledge, freeform generation, low volume
Best setup: route between both
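
To make the routing recommendation concrete, here is a minimal rule-based sketch. The source does not describe a router, so everything here is hypothetical: the `Request` fields, the task names, and the rules are one way to encode the criteria listed above, not a tested design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    task: str                    # hypothetical task label, e.g. "text2sql"
    has_fixed_schema: bool       # output follows a well-defined schema
    needs_world_knowledge: bool  # answer depends on broad external knowledge


# Task types for which a distilled model is assumed to be deployed.
DISTILLED_TASKS = {"text2sql", "function_calling", "classification"}


def route(req: Request) -> str:
    """Return "distilled" or "frontier" for a request."""
    if req.needs_world_knowledge:
        return "frontier"
    if req.task in DISTILLED_TASKS and req.has_fixed_schema:
        return "distilled"
    # Freeform or unrecognized work falls back to the frontier API.
    return "frontier"


print(route(Request("text2sql", True, False)))
print(route(Request("open_qa", False, True)))
```

A production router would likely also weigh request volume and data sovereignty constraints, which the recommendations above list but this sketch leaves out.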