Fine-tuned Qwen3 Small Models Outperform Frontier LLMs on Specific Tasks at Lower Cost

A systematic comparison of small distilled Qwen3 models against frontier API models shows that fine-tuned small language models can outperform larger, more expensive models on specific structured tasks.
Benchmark Results
The study compared Qwen3 models (0.6B to 8B parameters) against frontier APIs including GPT-5 nano/mini/5.2, Gemini 2.5 Flash Lite/Flash, Claude Haiku 4.5/Sonnet 4.6/Opus 4.6, and Grok 4.1 Fast/Grok 4 across 9 datasets. All distilled models were trained using open-weight teachers only, with as few as 50 examples. Inference was run on vLLM on a single H100.
Key Performance Findings
- Smart Home function calling: Qwen3-0.6B achieved 98.7% accuracy vs. Gemini Flash at 92.0%
- Text2SQL: distilled Qwen3-4B scored 98.0%, just behind Claude Haiku at 98.7% and ahead of GPT-5 nano at 96.0%
- Text2SQL cost per million requests: Qwen3-4B ~$3 vs. Claude Haiku $378 and GPT-5 nano $24
- Classification tasks: Distilled models performed within 0–1.5 percentage points of the best frontier option on Banking77, E-commerce, and TREC datasets
- Frontier advantage: on HotpotQA (open-ended reasoning plus world knowledge), the distilled model scored 92.0% vs. Claude Haiku's 98.0%
Performance Metrics
For Text2SQL with Qwen3-4B on H100:
- 222 RPS sustained
- p50: 390ms | p95: 640ms | p99: 870ms
- 7.6 GiB VRAM (BF16, no quantization)
- FP8 gave +15% throughput, −44% VRAM, no measurable accuracy loss in brief experiments
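As a rough check, the throughput figures above can be plugged into the cost formula given in the Methodology section (H100 hourly price ÷ sustained RPS). The sketch below assumes the FP8 gains (+15% throughput, −44% VRAM) apply directly to the BF16 numbers; the source reports them only from brief experiments, so the FP8 values are extrapolations, not measured results.

```python
# Figures taken from the article.
H100_USD_PER_HOUR = 2.40   # H100 rental price used in the cost calculation
BF16_RPS = 222             # sustained requests per second, Qwen3-4B Text2SQL
BF16_VRAM_GIB = 7.6        # VRAM at BF16, no quantization

def cost_per_million_requests(rps: float, usd_per_hour: float = H100_USD_PER_HOUR) -> float:
    """Cost in USD to serve one million requests at a sustained rate."""
    requests_per_hour = rps * 3600
    return usd_per_hour / requests_per_hour * 1_000_000

bf16_cost = cost_per_million_requests(BF16_RPS)

# Assumption: FP8 deltas scale the BF16 baseline linearly.
fp8_rps = BF16_RPS * 1.15
fp8_cost = cost_per_million_requests(fp8_rps)
fp8_vram_gib = BF16_VRAM_GIB * (1 - 0.44)

print(f"BF16: ${bf16_cost:.2f} per million requests")   # about $3.00
print(f"FP8:  ${fp8_cost:.2f} per million requests")    # about $2.61
print(f"FP8 VRAM: {fp8_vram_gib:.2f} GiB")              # about 4.26 GiB
```

The BF16 result reproduces the ~$3 per million requests quoted for Qwen3-4B in the findings.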
Methodology
- Same test sets, prompts, and evaluation criteria for all models
- Frontier models were run 3× per dataset (reporting mean ± std); distilled models were run at temperature=0
- Evaluation: exact-match for classification, tool_call_equivalence (JSON comparison with default parameter normalization) for function calling, Claude Sonnet 4.6 as LLM-judge for generation tasks
- Cost calculation: frontier = measured token usage × published pricing (Feb 2026); distilled = H100 at $2.40/hr ÷ sustained RPS
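The `tool_call_equivalence` check is described only as a JSON comparison with default parameter normalization. A minimal sketch of that idea, not the benchmark's actual implementation, might look like the following. The call format (`name` / `arguments`), the `DEFAULTS` table, and the `set_light` tool are all hypothetical and chosen for illustration.

```python
import json

# Hypothetical per-tool defaults; a real schema would supply these.
DEFAULTS = {"set_light": {"brightness": 100}}

def normalize_call(call: dict, defaults: dict) -> tuple:
    """Return (name, args) with omitted parameters filled in from defaults."""
    name = call["name"]
    args = call.get("arguments", {})
    if isinstance(args, str):          # arguments may arrive as a JSON string
        args = json.loads(args)
    merged = {**defaults.get(name, {}), **args}
    return name, merged

def tool_calls_equivalent(predicted: dict, expected: dict, defaults: dict = DEFAULTS) -> bool:
    """Two calls match if the tool name and normalized arguments are equal."""
    return normalize_call(predicted, defaults) == normalize_call(expected, defaults)

# The prediction omits the default brightness; the reference spells it out.
predicted = {"name": "set_light", "arguments": {"room": "kitchen"}}
expected = {"name": "set_light", "arguments": '{"brightness": 100, "room": "kitchen"}'}
wrong = {"name": "set_light", "arguments": {"room": "bedroom"}}

print(tool_calls_equivalent(predicted, expected))  # True
print(tool_calls_equivalent(wrong, expected))      # False
```

Normalizing before comparing means a model is not penalized for leaving out a parameter whose default matches the reference, or for emitting keys in a different order.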
Practical Recommendations
- Use distilled models when you have structured tasks, well-defined schemas, high volume, or data sovereignty needs
- Use frontier APIs when you need broad world knowledge or freeform generation, or when volume is low enough that cost doesn't matter
- Hybrid approach: Route between the two based on task requirements
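One way to picture the hybrid approach is a small routing function that applies the criteria above. This is an illustrative sketch only: the `Task` fields, the rule ordering, and the `volume_threshold` value are assumptions, not something the source specifies.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    structured: bool              # well-defined schema or fixed label set
    needs_world_knowledge: bool   # open-ended reasoning or freeform generation
    requests_per_day: int
    data_sovereignty: bool = False

def route(task: Task, volume_threshold: int = 10_000) -> str:
    """Return "distilled" or "frontier" for a task (hypothetical rules)."""
    if task.data_sovereignty:
        return "distilled"        # data must stay on infrastructure you control
    if task.needs_world_knowledge or not task.structured:
        return "frontier"
    if task.requests_per_day >= volume_threshold:
        return "distilled"        # structured and high volume
    return "frontier"             # low volume: cost matters less

print(route(Task("text2sql", True, False, 1_000_000)))   # distilled
print(route(Task("open_qa", False, True, 1_000_000)))    # frontier
print(route(Task("intent_cls", True, False, 200)))       # frontier
```

In practice the threshold would come from your own break-even point between GPU hosting cost and API spend.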
Availability
All code, models, data, and evaluation scripts are open source on GitHub: https://github.com/distil-labs/inference-efficiency-benchmarks/
Full analysis with charts available on the blog: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay
📖 Read the full source: r/LocalLLaMA