Fine-tuned Qwen3-0.6B model outperforms 120B teacher on structured function calling

What this is
Distil Labs released a complete pipeline that fine-tunes a small 0.6B parameter Qwen3 model to outperform a 120B parameter teacher model on structured function calling tasks. The pipeline extracts production traces, generates synthetic training data, and trains a specialist model that's 200x smaller than the teacher.
Performance results
- Teacher (GPT-OSS-120B): 50.0% tool call equivalence
- Base Qwen3-0.6B (no fine-tuning): 10.3% tool call equivalence
- Fine-tuned Qwen3-0.6B: 79.5% tool call equivalence
The task is IoT smart home function calling: routing natural language commands like "turn on the kitchen lights" or "make me a coffee at 7am" to the correct function with the right parameters. Scoring is based on exact structured match, not fuzzy scoring.
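To make "exact structured match" concrete, here is a minimal sketch of how such a check could work. It is not the scoring code from the source: the call shape (`name` plus `arguments`) and the `set_light` function are assumptions for illustration only.

```python
import json
from typing import Any


def canonical(call: dict[str, Any]) -> str:
    """Serialize a call with sorted keys so argument order cannot affect the result."""
    return json.dumps(
        {"name": call["name"], "arguments": call.get("arguments", {})},
        sort_keys=True,
    )


def tool_calls_equivalent(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Exact structured match: same function name and identical arguments."""
    return canonical(predicted) == canonical(expected)


def equivalence_rate(pairs: list[tuple[dict[str, Any], dict[str, Any]]]) -> float:
    """Fraction of (predicted, expected) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(tool_calls_equivalent(p, e) for p, e in pairs) / len(pairs)


# Hypothetical calls: "set_light" and its parameters are illustrative, not from the source.
expected = {"name": "set_light", "arguments": {"room": "kitchen", "state": "on"}}
same_call = {"arguments": {"state": "on", "room": "kitchen"}, "name": "set_light"}
near_miss = {"name": "set_light", "arguments": {"room": "kitchen", "state": "ON"}}
```

Under this kind of metric a near miss such as `"ON"` instead of `"on"` counts as a failure, which is why a model that is slightly off-format can score poorly even when its intent is right.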
Why the small model wins
The 120B teacher is a general-purpose model that has never seen these specific function schemas or user phrasing patterns. It often produces verbose or slightly off-format responses. The 0.6B student is a specialist trained exclusively on this task, so it nails the exact output format consistently.
Pipeline architecture
The three-stage pipeline:
- Data extraction: dlt extracts production traces from databases, APIs, cloud storage, or log aggregators and writes them to Hugging Face as clean Parquet datasets
- Automatic curation: An LLM judge scores and filters traces to select high-quality seed examples (no manual annotation required)
- Synthetic data generation and training: Distil Labs uses the traces as domain context, generates ~10,000 synthetic training examples with a large teacher, validates and filters them, then fine-tunes the student model
The key insight: instead of training on the raw traces directly, the pipeline uses them as context, so the synthetic data generator produces examples that match the real vocabulary, function schemas, and phrasing patterns of actual users.
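The curation and "traces as context" steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the Distil Labs or dlt API: the trace shape, the `judge` callable (standing in for an LLM judge that returns a score between 0 and 1), the 0.8 cutoff, and the prompt wording are all hypothetical.

```python
import json
from typing import Any, Callable

Trace = dict[str, Any]  # assumed shape: {"utterance": str, "call": dict}


def select_seeds(
    traces: list[Trace],
    judge: Callable[[Trace], float],
    min_score: float = 0.8,  # hypothetical quality cutoff
    limit: int = 75,         # the source reports ~75 seeds were enough
) -> list[Trace]:
    """Keep the highest-scoring traces that clear the judge's threshold."""
    scored = [(judge(trace), trace) for trace in traces]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [trace for _, trace in kept[:limit]]


def build_generation_prompt(seeds: list[Trace], n_examples: int) -> str:
    """Use seed traces as context for the generator rather than as training rows."""
    context = "\n".join(json.dumps(seed, sort_keys=True) for seed in seeds)
    return (
        "Here are real user requests with their correct function calls:\n"
        f"{context}\n"
        f"Write {n_examples} new requests in the same style, each with its call."
    )
```

The point of the split is that the seeds only shape what the teacher generates; the student is then trained on the validated synthetic examples, not on the seeds themselves.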
Dataset and practical details
- Used the Amazon MASSIVE dataset (16k+ utterances, 60 intents) as a stand-in for production traffic
- Filtered to IoT scenario with 9 smart home functions
- ~75 labeled seed examples were enough (automatic curation, zero manual annotation)
- Training completed in under 12 hours
- Model inference: under 50ms locally vs. 400-700ms for cloud API calls
- Model available in safetensors and GGUF formats on Hugging Face
Production considerations
The model scores 79.5% exact match, meaning roughly 1 in 5 queries may need a fallback. For production use, you'd want a confidence threshold routing low-confidence predictions to a larger model.
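A minimal sketch of that routing idea, assuming the small model can return a confidence value alongside its call. The source does not say how confidence would be computed (mean token probability is one possibility), and the function names and the 0.9 threshold are hypothetical.

```python
from typing import Any, Callable

ToolCall = dict[str, Any]


def route(
    query: str,
    small_model: Callable[[str], tuple[ToolCall, float]],
    large_model: Callable[[str], ToolCall],
    threshold: float = 0.9,  # hypothetical cutoff; tune on held-out data
) -> tuple[ToolCall, str]:
    """Answer with the small model when it is confident, otherwise fall back."""
    call, confidence = small_model(query)
    if confidence >= threshold:
        return call, "small"
    # Low confidence: pay the extra latency of the larger model.
    return large_model(query), "large"
```

The threshold trades latency against accuracy: raising it sends more traffic to the slower fallback, lowering it lets more of the small model's mistakes through.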
📖 Read the full source: r/LocalLLaMA