MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads

MTPLX is an inference engine for Apple Silicon that exploits a model's built-in Multi-Token Prediction (MTP) heads as speculative drafters. The key result: Qwen 3.6 27B 4-bit MLX goes from 28 tok/s to 63 tok/s (2.24× faster) on a MacBook Pro M5 Max at temperature 0.6, top_p 0.95, top_k 20 — the exact settings Qwen recommends for coding.
How It Works
Unlike DFlash or DDTree (which require an external drafter model and are greedy-only), MTPLX uses the model's own MTP heads. The heads draft sequentially, each producing a full per-token probability distribution. Because the verifier has those distributions, it can apply exact rejection sampling with residual correction at non-zero temperature. With no external drafter, there is no second model to hold in memory.
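The accept-or-resample step described above can be sketched as follows. This is a minimal illustration of standard speculative rejection sampling, not MTPLX's actual code. The function name `speculative_accept` and its arguments are hypothetical, and both distributions are assumed to have already had temperature, top_p, and top_k applied.

```python
import random


def speculative_accept(p_target, q_draft, drafted_token, rng=random):
    """Accept a drafted token or resample a replacement.

    p_target and q_draft are probability lists over the same vocabulary
    (hypothetical inputs; assumed already temperature/top-p/top-k adjusted).
    Returns (token, accepted).
    """
    p = p_target[drafted_token]
    q = q_draft[drafted_token]

    # Accept the draft with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return drafted_token, True

    # Rejected: sample from the residual max(p - q, 0), renormalised.
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # Degenerate case: no residual mass, fall back to the target itself.
        residual, total = list(p_target), sum(p_target)

    r = rng.random() * total
    acc = 0.0
    last = 0
    for tok, w in enumerate(residual):
        if w <= 0.0:
            continue
        acc += w
        last = tok
        if r < acc:
            return tok, False
    # Floating-point edge case: return the last token with positive weight.
    return last, False
```

When the draft token is sampled from `q_draft` and then passed through this step, the emitted token is distributed according to `p_target`, which is what makes sampled (non-greedy) speculation exact rather than approximate.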
For Qwen 3.6 27B (which ships MTP heads up to depth 5), a sweep over D2–D5 found D3 to be the optimal depth. The deeper settings (D4/D5) still showed good acceptance at the early positions, but verifying the extra positions cost more time than the additional accepted tokens saved.
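The depth trade-off can be illustrated with a simple cost model. This is a back-of-the-envelope sketch under stated assumptions, not the method used for the sweep: it assumes independent per-position acceptance rates, one corrected or bonus token per verify cycle, and costs expressed relative to one ordinary decode step. All function names, parameters, and numbers below are hypothetical and are not measurements from MTPLX.

```python
def expected_tokens_per_cycle(accept_rates):
    """Expected tokens emitted per draft+verify cycle.

    accept_rates[k] is the (assumed) probability that draft position k
    is accepted given all earlier positions were accepted. The leading
    1.0 is the token the verifier always contributes.
    """
    expected = 1.0
    survive = 1.0
    for a in accept_rates:
        survive *= a          # probability the prefix up to here survives
        expected += survive
    return expected


def estimated_speedup(accept_rates, draft_cost, verify_base, verify_per_pos):
    """Tokens per cycle divided by cycle cost (baseline decode step = 1.0)."""
    depth = len(accept_rates)
    cycle_cost = depth * draft_cost + verify_base + depth * verify_per_pos
    return expected_tokens_per_cycle(accept_rates) / cycle_cost


if __name__ == "__main__":
    # Made-up acceptance rates that decay with depth, for illustration only.
    rates = [0.85, 0.70, 0.55, 0.40, 0.30]
    for depth in range(2, 6):
        s = estimated_speedup(rates[:depth], 0.05, 1.0, 0.12)
        print(f"D{depth}: estimated speedup {s:.2f}x")
```

In a model like this, each extra depth adds a fixed cost to every cycle but contributes a shrinking expected number of tokens, so the estimated speedup peaks at some intermediate depth and then declines.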
Status vs. DFlash / DDTree
DFlash MLX achieves higher raw speed but is restricted to greedy (temperature 0) sampling only, severely limiting real-world use. DDTree inherits the same limitations. Both require an external drafter. MTPLX works with any model that retains its MTP heads and supports full temperature-sampled inference.
Installation & Usage
MTPLX ships as a full CLI with the following features:
- mtplx start wizard: guided setup
- Model download and inspection with four-tier MTP compatibility detection
- Configurable depth 2–7+
- OpenAI/Anthropic compatible API server, browser chat UI, terminal chat
- Benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore
- An included suite of 562 tests
The engine is built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.
Who It's For
Developers running local LLMs on Apple Silicon who need high-throughput, temperature-sampled inference for coding or creative writing without sacrificing output quality.
📖 Read the full source: r/LocalLLaMA