MLX Inference Performance Update: April 2026 Benchmarks and Features

Performance Benchmarks on M2 Ultra
The source post benchmarks MLX inference on a Mac Studio M2 Ultra with 128GB of unified memory, running large models locally for coding-agent workloads. Decode throughput (tokens/second) was measured for four models at several KV cache depths, with 256 output tokens generated per run.
Model Performance Data
- Qwen3.5-27B (dense, 8-bit): 20.2 tok/s at 4K, 16.4 tok/s at 64K, 13.1 tok/s at 128K
- Qwen3.5-35B-A3B (MoE, 8-bit): 71.8 tok/s at 4K, 53.5 tok/s at 64K, 41.9 tok/s at 128K
- Nemotron Super 120B (5-bit): 36.4 tok/s at 4K, 31.2 tok/s at 64K, 28.4 tok/s at 128K
- Qwen3.5-122B-A10B (MoE, 5-bit): 40.6 tok/s at 4K, 29.4 tok/s at 64K, 23.1 tok/s at 128K
The 35B MoE achieves high throughput because only 3B of its 35B parameters are active per token. Nemotron Super 120B shows minimal degradation with context (14% drop from 4K to 64K) because 80 of its 88 layers use Mamba-2, which has constant cost per token.
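The degradation figures quoted in this article can be recomputed from the list above. The short sketch below does only that arithmetic; the dictionary layout and the `drop_pct` helper are illustrative names, not part of any benchmark tool.

```python
# Decode throughput (tok/s) copied from the list above, keyed by KV cache depth.
throughput = {
    "Qwen3.5-27B": {"4K": 20.2, "64K": 16.4, "128K": 13.1},
    "Qwen3.5-35B-A3B": {"4K": 71.8, "64K": 53.5, "128K": 41.9},
    "Nemotron Super 120B": {"4K": 36.4, "64K": 31.2, "128K": 28.4},
    "Qwen3.5-122B-A10B": {"4K": 40.6, "64K": 29.4, "128K": 23.1},
}


def drop_pct(speeds, start, end):
    """Percentage of throughput lost going from depth `start` to depth `end`."""
    return 100.0 * (speeds[start] - speeds[end]) / speeds[start]


# (drop at 64K, drop at 128K), both relative to the 4K figure.
drops = {
    name: (drop_pct(s, "4K", "64K"), drop_pct(s, "4K", "128K"))
    for name, s in throughput.items()
}

for name, (d64, d128) in drops.items():
    print(f"{name}: {d64:.1f}% drop at 64K, {d128:.1f}% drop at 128K")
```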
Feature Speedups
Multi-Token Prediction (MTP): Qwen 3.5 models have a built-in draft head that predicts the next token in parallel. With probabilistic acceptance at a 90% acceptance rate, the 122B goes from ~17 tok/s to 38.8 tok/s (a 2.3x speedup). Server overhead is minimal: a short-prompt request through vllm-mlx generates at 39 tok/s, matching the baseline.
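To see how acceptance rate relates to throughput, here is an idealized model of speculative decoding, not the MTP implementation itself. It assumes each drafted token is accepted independently with the same probability and that verification stops at the first rejection. The source does not state how many tokens are drafted per step, so `num_draft` is a free parameter, and the result is a ceiling that ignores the cost of running the draft head.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Idealized assumptions (not taken from the source):
    - each drafted token is accepted independently with `accept_rate`;
    - the first rejection discards all later drafted tokens;
    - the target model always contributes one token of its own.
    """
    # Term i is the probability that the first i drafted tokens were all accepted.
    return sum(accept_rate ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, expected_tokens_per_step(0.9, k))
```

The measured 2.3x figure will not follow directly from this formula, because real speedups also depend on draft overhead and on how the baseline was measured.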
SpecPrefill: For long prompts, a 2B draft model scores token importance via attention, then the target only prefills the top 20%. On the 122B at 128K context, Time To First Token (TTFT) drops from 19.3 minutes to 3.5 minutes (5.5x speedup). This feature only activates for prompts above 8K tokens.
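The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SpecPrefill code: the importance scores are taken as a given list (in the real system they come from the draft model's attention), "8K" is assumed to mean 8192 tokens, and the function names are hypothetical.

```python
MIN_PROMPT_TOKENS = 8 * 1024  # assumption: the "8K" threshold means 8192 tokens


def select_prefill_tokens(importance, keep_fraction=0.2):
    """Return indices of the highest-scoring tokens, restored to prompt order."""
    if not importance:
        return []
    keep = max(1, int(len(importance) * keep_fraction))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])


def maybe_sparse_prefill(token_ids, importance, threshold=MIN_PROMPT_TOKENS):
    """Keep only the most important tokens, but only for long prompts."""
    if len(token_ids) <= threshold:
        # Short prompt: prefill everything, as the feature is inactive here.
        return list(token_ids)
    kept = select_prefill_tokens(importance)
    return [token_ids[i] for i in kept]
```

Sorting the kept indices back into prompt order matters because the target model still needs the surviving tokens in their original sequence.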
MLX vs. llama.cpp Comparison
Benchmarking Qwen3.5-35B-A3B on both stacks (512 tokens generated after filling KV cache):
- 32K context: MLX 8-bit: 60.8 tok/s, llama.cpp FA ON (5-bit): 54.85 tok/s, llama.cpp FA OFF: 36.45 tok/s
- 64K context: MLX 8-bit: 53.2 tok/s, llama.cpp FA ON (5-bit): 45.84 tok/s, llama.cpp FA OFF: 24.47 tok/s
- 128K context: MLX 8-bit: 42.7 tok/s, llama.cpp FA ON (5-bit): 34.48 tok/s, llama.cpp FA OFF: 13.73 tok/s
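MLX uses a 2-pass split-K decode kernel (sdpa_vector_2pass) that dispatches up to 1024 threadgroups at 128K context. In this comparison, MLX at 8-bit is faster than llama.cpp at 5-bit with flash attention on at every depth tested, and the gap widens with context: roughly 11% at 32K, 16% at 64K, and 24% at 128K. The two stacks use different quantization levels, so the figures are not a like-for-like kernel comparison.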
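The general idea behind a 2-pass split-K decode can be shown on the CPU. The sketch below is plain Python for a single query vector, not MLX's Metal kernel: pass 1 computes an independent partial softmax over each chunk of keys (the work one threadgroup would do), and pass 2 rescales the partials to a common maximum and merges them. Function names and the chunking scheme are illustrative assumptions.

```python
import math


def attention_reference(q, keys, values):
    """Single-pass softmax attention for one query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_split_k(q, keys, values, num_splits):
    """Two-pass split-K attention: per-chunk partials, then a merge."""
    n = len(keys)
    chunk = math.ceil(n / num_splits)
    dim = len(values[0])

    # Pass 1: each chunk is processed independently (parallel on a GPU).
    partials = []
    for start in range(0, n, chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
        m = max(scores)  # local max keeps exp() numerically stable
        weights = [math.exp(s - m) for s in scores]
        acc = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(dim)]
        partials.append((m, sum(weights), acc))

    # Pass 2: rescale every partial to the global max and combine.
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * dim
    for m, s, acc in partials:
        scale = math.exp(m - global_max)
        total += s * scale
        for d in range(dim):
            out[d] += acc[d] * scale
    return [o / total for o in out]
```

Splitting along the key dimension is what lets a single decode step keep many threadgroups busy at long context, since one query token otherwise offers little parallelism.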
Hybrid Architecture Impact
The models tested use hybrid architectures with fewer attention layers:
- Qwen3.5-35B-A3B: 25% attention layers (10 of 40), 71.8 tok/s at 4K, 25% drop at 64K
- Nemotron Super 120B: 9% attention layers (8 of 88), 36.4 tok/s at 4K, 14% drop at 64K
Qwen 3.5 uses GatedDeltaNet layers (linear recurrence) for most of the network, with standard attention in only 25% of layers. Fewer attention layers mean less KV cache to scan per token and less degradation at long context.
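A rough sizing sketch makes the KV cache argument concrete. The layer counts (10 of 40) come from the list above; `kv_heads`, `head_dim`, and `bytes_per_value` are hypothetical placeholders rather than Qwen3.5's real configuration. The ratio between the hybrid and an all-attention design does not depend on those placeholders.

```python
def kv_cache_bytes(attn_layers, context_len, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of keys plus values cached across all attention layers.

    kv_heads, head_dim and bytes_per_value are illustrative defaults,
    not values taken from the source or from any model card.
    """
    return 2 * attn_layers * kv_heads * head_dim * context_len * bytes_per_value


ctx = 128 * 1024  # assuming "128K" means 131072 tokens

hybrid = kv_cache_bytes(attn_layers=10, context_len=ctx)  # 10 of 40 layers use attention
full = kv_cache_bytes(attn_layers=40, context_len=ctx)    # hypothetical all-attention variant
ratio = hybrid / full

print(f"hybrid: {hybrid / 2**30:.1f} GiB, full attention: {full / 2**30:.1f} GiB, ratio: {ratio:.2f}")
```

Only attention layers contribute here because recurrent layers keep a fixed-size state that does not grow with context length.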
Recent Improvements
The MLX ecosystem has three layers that have seen rapid development. MLX core received a thread safety overhaul (per-thread M... [source text truncated]. Combined with continuous batching and prefix cache, the 122B now serves coding agents interactively at context lengths that were previously impractical.
📖 Read the full source: r/LocalLLaMA