Qwen 3.6 27B hits 2.5x speed with MTP speculative decoding on llama.cpp

OpenClawRadar | Published: May 6, 2026

A Reddit user has compiled llama.cpp with a pending PR (#22673) that enables Multi-Token Prediction (MTP) for Qwen 3.6 27B. MTP uses the model's built-in prediction layers as the draft stage for speculative decoding, so no separate draft model is needed. The user reports a 2.5x speedup, from ~11 tok/s to 28 tok/s, on a Mac M2 Max with 96GB.

Key Details

  • Model: Qwen 3.6 27B (Qwen2.5-3.0 architecture variant)
  • Hardware tested: Mac M2 Max 96GB
  • Results: 28 tok/s with MTP (vs ~11 tok/s without)
  • Context support: Up to 262K tokens with the turbo4 KV cache on a 48GB Mac
  • Quantizations: Pre-converted GGUF quants uploaded by the user at froggeric/Qwen3.6-27B-MTP-GGUF

Compilation Instructions

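# Shallow-clone llama.cpp, then fetch and check out the pending MTP PR
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
# Configure a Release build with the Metal backend (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
# Build only the CLI and server binaries
cmake --build build --target llama-cli llama-server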
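
Before moving on, it may help to confirm that both requested targets were actually produced. The sketch below assumes the binaries land in build/bin, which is the usual output directory for a CMake build of llama.cpp but is not stated in the source; `check_built` is a hypothetical helper name, not part of llama.cpp.

```shell
# check_built [DIR]: report whether llama-cli and llama-server exist and are
# executable under DIR. Returns 0 if both are present, 1 otherwise.
check_built() {
  bin_dir="${1:-build/bin}"   # assumption: default CMake output directory
  missing=0
  for name in llama-cli llama-server; do
    if [ -x "$bin_dir/$name" ]; then
      echo "ok: $bin_dir/$name"
    else
      echo "missing: $bin_dir/$name"
      missing=1
    fi
  done
  return "$missing"
}

# Run from the llama.cpp checkout after the build step.
check_built build/bin || echo "build incomplete; re-run the cmake --build step"
```

If either binary is reported missing, the PR branch may have failed to build or may place its outputs elsewhere.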

Server Command

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k turbo4 --cache-type-v turbo4 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081

Three optimizations combined:

  • --spec-type mtp --spec-draft-n-max 5: enables MTP speculative decoding (the source of the reported 2.5x speedup)
  • --cache-type-k turbo4 --cache-type-v turbo4: 4.25-bit KV cache (roughly a quarter of the memory of a 16-bit cache)
  • -c 262144: 262K context window (fits in 48GB with turbo4)
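
As a quick sanity check on the figures quoted above, the following shell sketch recomputes the speedup from the two reported throughput numbers and the size of a 4.25-bit cache relative to a 16-bit one. The helper names `ratio` and `percent` are illustrative, not part of llama.cpp, and the calculation assumes no storage overhead beyond the 4.25 bits per value stated in the article.

```shell
# ratio A B: print B/A to two decimals (e.g. tok/s after vs. before).
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# percent A B: print A as a percentage of B to one decimal.
percent() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

echo "speedup: $(ratio 11 28)x"                      # reported ~11 tok/s -> 28 tok/s
echo "KV cache size: $(percent 4.25 16)% of 16-bit"  # 4.25 bits vs 16 bits per value
```

The speedup comes out at roughly 2.5x and the cache at a little over a quarter of the 16-bit size, consistent with the rounded claims.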

Hardware Recommendations

Apple Silicon and NVIDIA GPU quantization/KV cache tables are provided in the source for RAM-constrained setups (e.g., IQ2_M on 16GB Apple Silicon with 48K context). Vision (mmproj) support is available on 32GB+ configurations.

Additional Fixes

The user also published 7 fixes for the Qwen Jinja chat template, which was broken by vLLM-specific formatting. The fixed template is now compatible with llama.cpp and other tools.

Note: Existing GGUF files on Hugging Face do not include MTP support; the model must be re-converted with the PR applied. The user warns that the initial uploads are incomplete, so check the Hugging Face repo status before downloading.

📖 Read the full source: r/LocalLLaMA
