Benchmarks Show Distilled Models Match Frontier LLMs on Structured Tasks at 10x Lower Cost

Benchmark Results: Distilled vs. Frontier Models
Researchers compared small distilled models against frontier LLMs on 9 datasets covering classification, function calling, QA, and open-book QA. All distilled models come from the Qwen3 family (0.6B to 8B parameters) and were trained on as few as 50 examples using open-weight teacher models; no frontier API outputs were used for training.
Key Performance Findings
- Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a seventh
- Text2SQL: distilled Qwen3-4B reaches 98.0%, against 98.7% for Claude Haiku and 96.0% for GPT-5 nano, at a cost of $3 per million requests versus $378 and $24 respectively
- Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
- HotpotQA: distilled models score 92.0% vs Haiku's 98.0%; open-ended reasoning with world knowledge remains frontier territory
- Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option
Inference Performance
The models were served with vLLM on a single H100. The distilled Text2SQL 4B model performed as follows:
- 222 RPS sustained
- p50: 390ms, p95: 640ms, p99: 870ms
- 7.6 GiB VRAM (BF16, no quantization)
- In brief experiments, FP8 quantization gave 15% higher throughput and 44% lower memory use, with no accuracy loss
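The FP8 deltas are relative to the BF16 baseline. As a rough illustration, the sketch below applies them to the reported 222 RPS and 7.6 GiB to get the implied absolute figures. These are derived arithmetic, not measured results, and the helper name `apply_relative_change` is made up for this example.

```python
# Reported BF16 baseline for the Text2SQL 4B model on one H100.
BF16_RPS = 222.0
BF16_VRAM_GIB = 7.6


def apply_relative_change(baseline: float, percent_change: float) -> float:
    """Return the baseline scaled by a signed percentage change."""
    return baseline * (1 + percent_change / 100)


# Implied FP8 figures, assuming the percentages apply directly to the baseline.
fp8_rps = apply_relative_change(BF16_RPS, 15)             # +15% throughput
fp8_vram_gib = apply_relative_change(BF16_VRAM_GIB, -44)  # -44% memory

print(f"{fp8_rps:.1f} RPS, {fp8_vram_gib:.2f} GiB")  # 255.3 RPS, 4.26 GiB
```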
Methodology
- Same test sets, same prompts, same eval criteria across all models
- Frontier models were run 3 times per dataset (mean ± std reported); distilled models were run at temperature 0
- Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, and Claude Sonnet 4.6 as LLM-as-a-judge for generation
- Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 at $2.40/hr ÷ measured sustained RPS
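The two cost formulas can be sketched in a few lines of Python. The function names are illustrative and do not come from the benchmark repository. The token counts and prices in the frontier example are placeholders, not figures from the benchmark. With the reported $2.40/hr and 222 RPS, the distilled formula reproduces the roughly $3 per million requests quoted for Text2SQL.

```python
def distilled_cost_per_million(gpu_dollars_per_hour: float, sustained_rps: float) -> float:
    """Dollars per one million requests on a self-hosted GPU."""
    dollars_per_second = gpu_dollars_per_hour / 3600.0
    dollars_per_request = dollars_per_second / sustained_rps
    return dollars_per_request * 1_000_000


def frontier_cost_per_million(
    input_tokens: float,
    output_tokens: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollars per one million requests for an API billed per million tokens.

    input_tokens and output_tokens are per-request averages.
    """
    dollars_per_request = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return dollars_per_request * 1_000_000


# Figures from the article: H100 at $2.40/hr, 222 sustained RPS.
print(round(distilled_cost_per_million(2.40, 222), 2))  # 3.0

# Placeholder inputs only (NOT benchmark data): 1000 input and 100 output
# tokens per request, priced at $1 and $5 per million tokens.
print(round(frontier_cost_per_million(1000, 100, 1.0, 5.0), 2))  # 1500.0
```

Because the distilled cost depends only on the GPU rate and sustained throughput, it assumes the GPU is kept busy; at low utilization the effective cost per request rises.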
Practical Recommendations
- Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
- Frontier API: broad world knowledge, freeform generation, low volume
- Best setup: route between both
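A minimal sketch of what routing between the two could look like, assuming each request carries a task label and two flags. The `Request` fields, the `DISTILLED_TASKS` set, and the `route` function are all hypothetical and are not part of the published benchmark code.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    DISTILLED = "distilled"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Request:
    task: str                    # hypothetical task label, e.g. "text2sql"
    has_fixed_schema: bool       # output follows a well-defined schema
    needs_world_knowledge: bool  # answer depends on broad world knowledge


# Hypothetical: tasks for which a distilled model has been trained.
DISTILLED_TASKS = frozenset({"text2sql", "function_calling", "classification"})


def route(request: Request, distilled_tasks: frozenset = DISTILLED_TASKS) -> Backend:
    """Send structured, known tasks to the distilled model; everything else to the API."""
    if request.needs_world_knowledge:
        return Backend.FRONTIER
    if request.task in distilled_tasks and request.has_fixed_schema:
        return Backend.DISTILLED
    return Backend.FRONTIER


print(route(Request("text2sql", True, False)).value)  # distilled
print(route(Request("open_qa", False, True)).value)   # frontier
```

The router defaults to the frontier API so that unfamiliar tasks never reach a model that was not trained for them.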
Available Resources
All code, models, data, and eval scripts are open source at https://github.com/distil-labs/inference-efficiency-benchmarks/
Full blog post with charts and per-dataset breakdowns: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay
📖 Read the full source: r/LocalLLaMA