Qwen 3.6 27B hits 2.5x speed with MTP speculative decoding on llama.cpp

A Reddit user has compiled llama.cpp with a pending PR (#22673) that enables Multi-Token Prediction (MTP) for Qwen 3.6 27B. MTP uses the model's own built-in MTP layers to draft tokens for speculative decoding, rather than relying on a separate draft model. The user reports a roughly 2.5x speedup, from ~11 tok/s to 28 tok/s on a Mac M2 Max 96GB.
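To make the draft-and-verify idea concrete, here is a minimal greedy sketch in Python. It is not the code from the PR: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names invented for illustration, the drafter is modeled as a separate function rather than the model's own MTP layers, and the acceptance rule used when sampling with a temperature is not shown.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[Token], n_draft: int) -> List[Token]:
    """Run one greedy draft-and-verify step and return the tokens it commits."""
    # 1. The cheap drafter proposes n_draft tokens one after another.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft_next(context + proposed))

    # 2. The target checks each proposal in order
    #    (a real engine scores all positions in a single batched pass).
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the verification pass yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[Token], n_new: int,
             n_draft: int = 5) -> Tuple[List[Token], int]:
    """Generate n_new tokens; return the sequence and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target_next, draft_next, out, n_draft))
        steps += 1
    return out[:len(prompt) + n_new], steps


# Toy stand-ins (assumptions, not real models): the target counts modulo 10,
# and the drafter agrees with it except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate(toy_target, toy_draft, [0], n_new=12, n_draft=5)
print(tokens, steps)
```

The output is identical to what the target would produce on its own; the gain is that each verification step can commit several tokens at once, which is where a speedup can come from when most drafts are accepted.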
Key Details
- Model: Qwen 3.6 27B (Qwen2.5-3.0 architecture variant)
- Hardware tested: Mac M2 Max 96GB
- Results: 28 tok/s with MTP (vs ~11 tok/s without)
- Context support: Up to 262K tokens with turbo4 KV cache on 48GB Mac
- Quantizations: Pre-converted GGUF quants uploaded by the user at froggeric/Qwen3.6-27B-MTP-GGUF
Compilation Instructions
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
Server Command
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type mtp --spec-draft-n-max 5 \
--cache-type-k turbo4 --cache-type-v turbo4 \
-c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
Three optimizations combined:
- --spec-type mtp --spec-draft-n-max 5: enables MTP speculative decoding (2.5x faster)
- --cache-type-k turbo4 --cache-type-v turbo4: 4.25-bit KV cache (quarter memory vs 16-bit)
- -c 262144: 262K context window (fits 48GB with turbo4)
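The "quarter memory" figure follows directly from the bit widths: 4.25 / 16 ≈ 0.27. The Python sketch below works through that arithmetic with the textbook KV cache size formula. The layer count, KV head count, and head dimension are placeholder values chosen only to make the numbers easy to read; they are not Qwen 3.6 27B's real dimensions, and the formula assumes every layer keeps a full attention cache, which may not hold for this model.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Estimate KV cache size in bytes for a plain attention stack."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    n_values = 2 * n_tokens * n_layers * n_kv_heads * head_dim
    return n_values * bits_per_value / 8


# HYPOTHETICAL shape, for illustration only (not the real model config).
HYPOTHETICAL_LAYERS = 64
HYPOTHETICAL_KV_HEADS = 8
HYPOTHETICAL_HEAD_DIM = 128
CONTEXT = 262144  # matches -c 262144 in the server command

GIB = 2 ** 30
f16 = kv_cache_bytes(CONTEXT, HYPOTHETICAL_LAYERS, HYPOTHETICAL_KV_HEADS,
                     HYPOTHETICAL_HEAD_DIM, bits_per_value=16)
turbo4 = kv_cache_bytes(CONTEXT, HYPOTHETICAL_LAYERS, HYPOTHETICAL_KV_HEADS,
                        HYPOTHETICAL_HEAD_DIM, bits_per_value=4.25)

print(f"16-bit cache:   {f16 / GIB:.1f} GiB")
print(f"4.25-bit cache: {turbo4 / GIB:.1f} GiB")
print(f"ratio:          {turbo4 / f16:.3f}")
```

Whatever the real dimensions are, the ratio between the two cache types stays the same, since only `bits_per_value` changes between the two calls.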
Hardware Recommendations
Apple Silicon and NVIDIA GPU quantization/KV cache tables are provided in the source for RAM-constrained setups (e.g., IQ2_M on 16GB Apple Silicon with 48K context). Vision (mmproj) support is available on 32GB+ configurations.
Additional Fixes
The user also published 7 fixes to the Qwen jinja chat template, which was broken due to vLLM-specific formatting. The fixed template is now compatible with llama.cpp and other tools.
Note: Existing GGUF files on Hugging Face do not include MTP support — they require re-conversion with the PR applied. The user warns that initial uploads are incomplete; check the Hugging Face repo status.
📖 Read the full source: r/LocalLLaMA