Speculative Decoding Benefit: MTP Acceptance Rate >50%

A Reddit user tested MTP (Multi-Token Prediction) speculative decoding using mlx-vlm on Gemma-4 (26B, 4-bit) and found that whether it helps depends heavily on the draft token acceptance rate. Measurements on an M4 Max Studio across three workloads suggest a rough break-even point.

Workload Results

Code generation: 75 tok/s → 114.8 tok/s (1.53× faster); acceptance rate: 66% of slots
Long-form prose: 75 tok/s → 71.1 tok/s (0.95×, essentially a wash); acceptance rate: 31% of slots
JSON output: 51.3 tok/s → 25.6 tok/s (0.50×, half the baseline speed); acceptance rate: 8% of slots

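The break-even point appears to be around 50% acceptance, though these measurements only bracket it between 31% (slightly slower) and 66% (clearly faster). Below break-even, the overhead of drafting and verifying tokens that end up rejected outweighs the gains.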
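The ratios above are simple arithmetic on the reported throughput, and a short Python sketch makes that easy to re-check. The figures are copied from the post; the `RESULTS` layout, the function names, and the 0.50 cutoff are illustrative assumptions. The cutoff in particular is the post's rough estimate, not a measured value.

```python
# Throughput (tokens per second) and acceptance rates as reported in the post.
RESULTS = {
    "code generation": {"baseline": 75.0, "mtp": 114.8, "acceptance": 0.66},
    "long-form prose": {"baseline": 75.0, "mtp": 71.1, "acceptance": 0.31},
    "JSON output": {"baseline": 51.3, "mtp": 25.6, "acceptance": 0.08},
}

# Assumed break-even acceptance rate; the post only suggests "about 50%".
BREAK_EVEN = 0.50


def speedup(baseline: float, mtp: float) -> float:
    """Ratio of MTP throughput to baseline throughput (>1.0 means faster)."""
    return mtp / baseline


def summarize(results=RESULTS, break_even=BREAK_EVEN):
    """Return (workload, rounded speedup, acceptance >= break_even) tuples."""
    rows = []
    for name, r in results.items():
        ratio = round(speedup(r["baseline"], r["mtp"]), 2)
        rows.append((name, ratio, r["acceptance"] >= break_even))
    return rows


if __name__ == "__main__":
    for name, ratio, above in summarize():
        side = "above" if above else "below"
        print(f"{name}: {ratio:.2f}x, {side} the assumed break-even")
```

Only the code-generation workload lands above the assumed cutoff, which matches the single case where MTP was faster.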

Test prompts: the code task was "write some python functions to do X"; the prose task was "write an 800 word essay on paper money in the Tang Dynasty"; the JSON task involved grouping items by similarity into structured output.

Bonus tip: The user notes that Gemma follows JSON structure instructions reasonably well, but that enabling structured output (json_schema) adds ~20% overhead. They recommend accepting slightly sloppy JSON and fixing it at runtime. According to the user, mlx-vlm does not support json_schema with speculative decoding anyway.
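The post does not say how the user repairs the output, so the following is only a minimal sketch of what "fixing it at runtime" could look like with Python's standard library. The function name `parse_sloppy_json` and the specific repairs (dropping Markdown fences, trimming surrounding chatter, removing trailing commas) are assumptions chosen for illustration, not the user's method.

```python
import json
import re


def parse_sloppy_json(text: str):
    """Best-effort parse of model output that is almost valid JSON."""
    # Drop Markdown code-fence markers (three backticks, optional "json" tag).
    text = re.sub(r"`{3}(?:json)?", "", text)

    # Keep only the span from the first opening bracket to the last closing one,
    # which discards any chatter before or after the payload.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    candidate = text[min(starts):end + 1]

    # Remove trailing commas before a closing bracket. Caveat: this regex is
    # naive and would also alter a string value that contains ",}" or ",]".
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    # json.loads raises json.JSONDecodeError if the text is still invalid.
    return json.loads(candidate)
```

A caller would wrap this in a `try`/`except` and retry the generation or fall back when parsing still fails. The trade-off is that repairs like these handle common slips only; they do not enforce a schema the way json_schema would.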

Bottom line: MTP pays off for local code generation, but it can slow down structured-output or prose tasks where acceptance rates are low.

📖 Read the full source: r/LocalLLaMA
