Fine-Tuned Qwen3-0.6B Beats 120B Teacher in Function Calling

What this is

Distil Labs released a complete pipeline that fine-tunes a small 0.6B-parameter Qwen3 model to outperform its 120B-parameter teacher model on a structured function-calling task. The pipeline extracts production traces, generates synthetic training data, and trains a specialist model that is 200x smaller than the teacher.

Performance results

Teacher (GPT-OSS-120B): 50.0% tool call equivalence
Base Qwen3-0.6B (no fine-tuning): 10.3% tool call equivalence
Fine-tuned Qwen3-0.6B: 79.5% tool call equivalence

The task is IoT smart home function calling: routing natural language commands like "turn on the kitchen lights" or "make me a coffee at 7am" to the correct function with the right parameters. A prediction counts as correct only if it is an exact structured match for the reference call; there is no fuzzy or partial credit.
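The source does not spell out how tool call equivalence is computed. A minimal sketch of one plausible reading follows, assuming each call is a JSON object with a `name` and an `arguments` mapping (both field names are assumptions) and that argument order does not matter while every value must match exactly.

```python
import json


def normalize(call):
    """Return a canonical dict for a tool call given as a dict or a JSON string."""
    if isinstance(call, str):
        call = json.loads(call)
    # Field names "name" and "arguments" are assumed, not taken from the source.
    return {"name": call["name"], "arguments": dict(call.get("arguments", {}))}


def tool_calls_equivalent(predicted, expected):
    """True only when the function name and every argument match exactly."""
    try:
        # Dict equality ignores key order but requires identical keys and values.
        return normalize(predicted) == normalize(expected)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Verbose or off-format output cannot be parsed and scores as a miss.
        return False


def equivalence_rate(pairs):
    """Fraction of (predicted, expected) pairs that are equivalent."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(tool_calls_equivalent(p, e) for p, e in pairs) / len(pairs)
```

Under this kind of scoring, a reply that wraps a correct call in extra explanatory text earns no credit, which fits the article's point that verbose or slightly off-format output costs the teacher.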

Why the small model wins

The 120B teacher is a general-purpose model that has never seen these specific function schemas or user phrasing patterns. It often produces verbose or slightly off-format responses. The 0.6B student is a specialist trained exclusively on this task, so it nails the exact output format consistently.

Pipeline architecture

The three-stage pipeline:

Data extraction: dlt extracts production traces from databases, APIs, cloud storage, or log aggregators and writes them to Hugging Face as clean Parquet datasets
Automatic curation: An LLM judge scores and filters traces to select high-quality seed examples (no manual annotation required)
Synthetic data generation and training: Distil Labs uses the traces as domain context, generates ~10,000 synthetic training examples with a large teacher, validates and filters them, then fine-tunes the student model
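The curation stage can be sketched roughly as follows. This is not the Distil Labs implementation: the `judge` callable stands in for an LLM judge that returns a quality score between 0 and 1, and the 0.8 cutoff is an arbitrary assumption. Only the cap of 75 seeds echoes a figure from the write-up.

```python
def curate_seeds(traces, judge, min_score=0.8, max_seeds=75):
    """Keep the highest-scoring traces as seed examples.

    traces:    iterable of trace records (any type the judge understands)
    judge:     hypothetical callable mapping a trace to a score in [0, 1]
    min_score: assumed quality cutoff; tune for your own data
    max_seeds: upper bound on the number of seeds returned
    """
    scored = [(judge(trace), trace) for trace in traces]
    kept = [(score, trace) for score, trace in scored if score >= min_score]
    # Sort on the score only, so trace records never need to be comparable.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [trace for _, trace in kept[:max_seeds]]
```

In practice `judge` would wrap a call to whatever model does the scoring; passing it in as a parameter keeps the filtering logic testable without network access.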

The key insight: the raw traces are not used as training data directly. Instead, they serve as context for the synthetic data generator, so the examples it produces match the real vocabulary, function schemas, and phrasing patterns of actual users.
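To make the traces-as-context idea concrete, here is a hedged sketch of how a generation prompt might be assembled. The prompt wording, the trace fields `utterance` and `call`, and the function name `build_generation_prompt` are all illustrative assumptions; the source does not show the actual prompts.

```python
import json
import random


def build_generation_prompt(seed_traces, function_schemas, n_context=5, seed=0):
    """Assemble a teacher prompt that shows real traces as context only."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(seed_traces, k=min(n_context, len(seed_traces)))

    lines = [
        "You generate training data for a function-calling model.",
        "Available functions (JSON schemas):",
        json.dumps(function_schemas, indent=2),
        "Real user traces, shown for vocabulary and phrasing only:",
    ]
    for trace in sample:
        # Assumed trace shape: {"utterance": str, "call": dict}
        lines.append(f"- {trace['utterance']} -> {json.dumps(trace['call'])}")
    lines.append(
        "Write new, different utterances in the same style, each paired with "
        "one valid call to the functions above."
    )
    return "\n".join(lines)
```

The traces appear in the prompt but never in the training set itself, so the student learns from validated synthetic examples that merely resemble production traffic.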

Dataset and practical details

Used the Amazon MASSIVE dataset (16k+ utterances, 60 intents) as a stand-in for production traffic
Filtered to the IoT scenario, which has 9 smart home functions
~75 labeled seed examples were enough (automatic curation, zero manual annotation)
Training completed in under 12 hours
Model inference: under 50 ms locally vs. 400-700 ms for cloud API calls
Model available in safetensors and GGUF formats on Hugging Face

Production considerations

The model scores 79.5% exact match, so roughly 1 in 5 queries (20.5%) gets a call that does not match the reference and may need a fallback. For production use, you'd want a confidence threshold that routes low-confidence predictions to a larger model.
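One way to sketch that routing, assuming the small model exposes per-token log-probabilities for its output: use the geometric mean token probability as a confidence score and fall back below a cutoff. The 0.9 threshold and both model callables are hypothetical placeholders, and this particular confidence measure is an assumption, not something the source prescribes.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean token probability, a value in (0, 1]; 0.0 if empty."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(query, small_model, large_model, threshold=0.9):
    """Answer with the small model when it is confident, else fall back.

    small_model: hypothetical callable, query -> (output, token_logprobs)
    large_model: hypothetical callable, query -> output
    threshold:   assumed cutoff; calibrate it on held-out data
    """
    output, token_logprobs = small_model(query)
    if sequence_confidence(token_logprobs) >= threshold:
        return output, "small"
    return large_model(query), "large"
```

The threshold trades latency and cost against accuracy: a higher value sends more traffic to the larger model, so it is worth calibrating against a labeled validation set rather than guessing.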

📖 Read the full source: r/LocalLLaMA