Benchmarks Show Distilled Models Match Frontier LLMs on Structured Tasks at 10x Lower Cost

OpenClawRadar | Published: March 7, 2026

Benchmark Results: Distilled vs. Frontier Models

Researchers compared small distilled models against frontier LLMs across 9 datasets spanning classification, function calling, QA, and open-book QA. All distilled models come from the Qwen3 family (0.6B to 8B parameters) and were trained with as few as 50 examples, using open-weight teacher models and no frontier API outputs.

Key Performance Findings

  • Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a seventh
  • Text2SQL: Qwen3-4B distilled hits 98.0% vs. Claude Haiku's 98.7% and GPT-5 nano's 96.0%, at a cost of $3 per million requests vs. $378 and $24, respectively
  • Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
  • HotpotQA: Distilled models score 92.0% vs. Haiku's 98.0%; open-ended reasoning with world knowledge remains frontier territory
  • Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option

Inference Performance

Models were served via vLLM on a single H100. The distilled Text2SQL 4B model measured:

  • 222 RPS sustained
  • p50: 390ms, p95: 640ms, p99: 870ms
  • 7.6 GiB VRAM (BF16, no quantization)
  • FP8 gave +15% throughput, -44% memory, no accuracy loss in brief experiments
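
The sustained throughput above is what drives the per-request cost of the distilled model. As a rough sketch of that arithmetic, using the $2.40/hr H100 rate quoted in the article's methodology and assuming the GPU is fully utilized at the sustained rate (the function name is illustrative, not taken from the benchmark repository):

```python
def cost_per_million_requests(gpu_usd_per_hour: float, sustained_rps: float) -> float:
    """Hourly GPU price divided by requests served per hour, scaled to 1M requests."""
    requests_per_hour = sustained_rps * 3600
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


# Figures from the article: H100 at $2.40/hr, 222 RPS sustained (BF16).
cost = cost_per_million_requests(gpu_usd_per_hour=2.40, sustained_rps=222)
print(f"${cost:.2f} per million requests")  # prints "$3.00 per million requests"
```

This reproduces the $3 per million requests reported for Text2SQL. Real deployments that run below the sustained rate would see a proportionally higher cost per request.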

Methodology

  • Same test sets, same prompts, same eval criteria across all models
  • Frontier models were run 3 times per dataset (mean ± std reported); distilled models were run at temperature 0
  • Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, Claude Sonnet 4.6 as LLM-as-a-judge for generation
  • Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 at $2.40/hr ÷ measured sustained RPS
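
To make "JSON comparison with default param normalization" concrete, here is one way such a check could look. This is a hypothetical sketch, not the benchmark's actual `tool_call_equivalence` implementation (that lives in the linked repository); the call format, the `defaults` mapping, and the function names are all assumptions made for illustration.

```python
import json
from typing import Any


def normalize_call(call: dict[str, Any], defaults: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Parse string-encoded arguments and fill in omitted default parameters."""
    name = call["name"]
    args = call.get("arguments", {})
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        args = json.loads(args)
    # Explicit arguments override the tool's declared defaults.
    merged = {**defaults.get(name, {}), **args}
    return {"name": name, "arguments": merged}


def tool_calls_equivalent(predicted: dict[str, Any], expected: dict[str, Any],
                          defaults: dict[str, dict[str, Any]]) -> bool:
    """Two calls match if name and normalized arguments are equal (key order ignored)."""
    return normalize_call(predicted, defaults) == normalize_call(expected, defaults)


# Hypothetical smart-home tool whose `brightness` parameter defaults to 100.
defaults = {"set_light": {"brightness": 100}}
predicted = {"name": "set_light", "arguments": {"room": "kitchen"}}
expected = {"name": "set_light", "arguments": '{"brightness": 100, "room": "kitchen"}'}
print(tool_calls_equivalent(predicted, expected, defaults))  # prints True
```

The point of the normalization step is that a model omitting a parameter and a model spelling out its default value are treated as making the same call.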

Practical Recommendations

  • Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
  • Frontier API: broad world knowledge, freeform generation, low volume
  • Best setup: route between both
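
The routing recommendation can be sketched as a simple decision rule over the criteria listed above. Everything here is illustrative: the `Task` fields, the volume threshold, and the backend labels are assumptions, and the article does not describe a specific router.

```python
from dataclasses import dataclass


@dataclass
class Task:
    has_fixed_schema: bool            # structured output with a well-defined schema
    needs_world_knowledge: bool       # broad knowledge or freeform generation
    requests_per_day: int
    data_must_stay_local: bool = False  # data sovereignty requirement


def choose_backend(task: Task, volume_threshold: int = 10_000) -> str:
    """Return "distilled" or "frontier" following the article's rules of thumb."""
    if task.data_must_stay_local:
        return "distilled"  # data cannot be sent to an external API
    if task.needs_world_knowledge or not task.has_fixed_schema:
        return "frontier"
    # Structured task: distillation pays off once volume is high enough.
    # The threshold is a placeholder, not a figure from the benchmark.
    return "distilled" if task.requests_per_day >= volume_threshold else "frontier"


text2sql = Task(has_fixed_schema=True, needs_world_knowledge=False, requests_per_day=500_000)
open_qa = Task(has_fixed_schema=False, needs_world_knowledge=True, requests_per_day=200)
print(choose_backend(text2sql), choose_backend(open_qa))  # prints "distilled frontier"
```

In practice the threshold would come from comparing measured API cost against the fixed cost of keeping a GPU running.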

Available Resources

All code, models, data, and eval scripts are open source at https://github.com/distil-labs/inference-efficiency-benchmarks/

Full blog post with charts and per-dataset breakdowns: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay

Read the full source: r/LocalLLaMA
