MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads

✍️ OpenClawRadar📅 Published: May 5, 2026🔗 Source
MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads
Ad

MTPLX is an inference engine for Apple Silicon that exploits a model's built-in Multi-Token Prediction (MTP) heads as speculative drafters. The key result: Qwen 3.6 27B 4-bit MLX goes from 28 tok/s to 63 tok/s (2.24× faster) on a MacBook Pro M5 Max at temperature 0.6, top_p 0.95, top_k 20 — the exact settings Qwen recommends for coding.

How It Works

Unlike DFlash or DDTree (which require an external drafter model and are greedy-only), MTPLX uses the model's own MTP heads. Each MTP head drafts sequentially, producing per-token probability distributions. This enables exact rejection sampling with temperature and residual correction. No external drafter means no extra memory usage.

For Qwen 3.6 27B (which ships MTP heads up to depth 5), the optimal depth was found to be D3 after sweeping D2–D5. Deeper depths (D4/D5) had good early acceptance but deeper positions cost more verify time than tokens saved.

Status vs. DFlash / DDTree

DFlash MLX achieves higher raw speed but is restricted to greedy (temperature 0) sampling only, severely limiting real-world use. DDTree inherits the same limitations. Both require an external drafter. MTPLX works with any model that retains its MTP heads and supports full temperature-sampled inference.

Ad

Installation & Usage

MTPLX ships as a full CLI with the following commands:

  • mtplx start wizard — guided setup
  • Model download and inspection with four-tier MTP compatibility detection
  • Configurable depth 2–7+
  • OpenAI/Anthropic compatible API server, browser chat UI, terminal chat
  • Benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore
  • A 562-test suite included

The engine is built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.

Who It's For

Developers running local LLMs on Apple Silicon who need high-throughput, temperature-sampled inference for coding or creative writing without sacrificing output quality.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also