RTX 5080 16GB: Qwen3.6 35B MoE at 128k Context — 56 tok/s, and Why MTP Doesn't Help

By OpenClawRadar | Published: May 20, 2026

Mainline llama.cpp merged MTP (Multi-Token Prediction) in build b9190. Benchmarks on an RTX 5080 16GB with Qwen3.6 35B MoE at 128k context point to a clear finding: MTP hurts performance when the model doesn't fully fit on the GPU.

The Best Config (No MTP)

Qwen3.6-35B-A3B Q4_K_XL with --fit-target 1536 at 128k context (131,072 tokens) yields:

  • 56 tok/s generation
  • 1,584 tok/s prompt processing at 128k context

No MTP flags needed.

Why MTP Slows Down 35B MoE on 16GB

Three configs tested at coding-agent context lengths:

  • 27B IQ3+MTP: 12.45 GB, fully on GPU — avg 73 tok/s (MTP helps)
  • 35B Q4_K_XL+MTP: ~22 GB, partial offload — avg 74 tok/s (MTP hurts)
  • 35B Q8_0+MTP: ~36 GB, heavy offload — avg 46 tok/s

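Without MTP, the 35B Q4_K_XL achieves 97 tok/s at --fit-target 0 (15,815 MiB VRAM) and 86 tok/s at --fit-target 1536 (14,269 MiB). With MTP enabled at --fit-target 1536, speed drops to 74 tok/s (14,623 MiB). That is roughly 23% slower than the 97 tok/s no-MTP run at --fit-target 0, and about 14% slower than the 86 tok/s no-MTP run at the same fit target.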
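The slowdown percentages follow directly from the throughput figures quoted above, and they depend on which no-MTP run is used as the baseline. A small sketch of the arithmetic (the function and constant names are illustrative, not from the source):

```python
def slowdown_pct(baseline_tps: float, measured_tps: float) -> float:
    """Percentage drop in throughput relative to a baseline."""
    return (baseline_tps - measured_tps) / baseline_tps * 100.0


# Throughput figures quoted in the text (tokens per second).
NO_MTP_FIT_0 = 97.0
NO_MTP_FIT_1536 = 86.0
MTP_FIT_1536 = 74.0

vs_fit_0 = slowdown_pct(NO_MTP_FIT_0, MTP_FIT_1536)
vs_fit_1536 = slowdown_pct(NO_MTP_FIT_1536, MTP_FIT_1536)

print(f"MTP vs no-MTP at --fit-target 0:    {vs_fit_0:.1f}% slower")
print(f"MTP vs no-MTP at --fit-target 1536: {vs_fit_1536:.1f}% slower")
```

The quoted 23% figure matches the comparison against the 97 tok/s run; against the 86 tok/s run at the same fit target the gap is smaller.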

The root cause: MTP's compute buffer reserves ~1.5 GB (--fit-target 1536), pushing ~3 more MoE expert layers from GPU to CPU. Since MoE inference is bottlenecked by CPU-bound expert layers, MTP's 79% token acceptance rate can't compensate for the slower per-step speed.
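One way to see why a 79% acceptance rate may not be enough is a simplified model of speculative decoding. The sketch below assumes each drafted token is accepted independently with probability p and that drafting stops at the first rejection; neither assumption comes from the source, and the function names are illustrative. Under that model, a step with up to n drafted tokens emits 1 + p + ... + p^n tokens on average, so MTP only wins if one MTP step costs less than that factor times a plain decoding step.

```python
def expected_tokens_per_step(p_accept: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes (as a simplification) independent per-token acceptance and
    that drafting stops at the first rejected token. One token is always
    emitted per step, hence the k = 0 term.
    """
    return sum(p_accept ** k for k in range(n_draft + 1))


def mtp_speedup(p_accept: float, n_draft: int, step_cost_ratio: float) -> float:
    """Throughput with MTP divided by throughput without it.

    step_cost_ratio is (time per MTP step) / (time per plain step).
    It is a hypothetical input here, not a measured value.
    """
    return expected_tokens_per_step(p_accept, n_draft) / step_cost_ratio


# 0.79 is the acceptance rate from the text; 2 is --spec-draft-n-max.
tokens = expected_tokens_per_step(0.79, 2)
print(f"expected tokens per step: {tokens:.2f}")
print(f"speedup if steps cost 2.0x: {mtp_speedup(0.79, 2, 2.0):.2f}")
print(f"speedup if steps cost 3.0x: {mtp_speedup(0.79, 2, 3.0):.2f}")
```

In this model the break-even point sits at a step cost of roughly 2.4x. Anything that makes each step slower, such as extra expert layers running on the CPU, eats into that margin.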

The 27B model fits entirely on the GPU, so --fit-target 0 works with or without MTP and there is no VRAM penalty. In that case MTP boosts speed from ~56 to 73 tok/s.


Rule of Thumb

MTP helps when your model fits on GPU. It hurts when the MTP compute buffer forces more layers to CPU. On 16GB cards with 35B MoE, skip MTP.

Full test system: RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, llama.cpp b9204 (mainline). Common MTP flags: -np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2.
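To compare runs with and without MTP, the flag list above can be assembled programmatically. This is a minimal sketch: the flags are copied verbatim from the text, but the binary name (`llama-server`), the model file name, and the split between shared flags and MTP-only flags (treating only the two `--spec-*` options as MTP-specific) are assumptions, since the source does not spell them out.

```python
import shlex

# Flags copied verbatim from the "Common MTP flags" list in the text.
# Treating only the --spec-* options as MTP-specific is an assumption.
COMMON_FLAGS = "-np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0"
MTP_FLAGS = "--spec-type draft-mtp --spec-draft-n-max 2"


def build_command(model_path: str, fit_target: int, use_mtp: bool,
                  binary: str = "llama-server") -> list[str]:
    """Build an argv list for one benchmark run.

    binary and model_path are assumptions; the source does not say
    which llama.cpp executable or file names were used.
    """
    argv = [binary, "-m", model_path]
    argv += shlex.split(COMMON_FLAGS)
    argv += ["--fit-target", str(fit_target)]
    if use_mtp:
        argv += shlex.split(MTP_FLAGS)
    return argv


# Hypothetical model file name, for illustration only.
cmd = build_command("model.gguf", fit_target=1536, use_mtp=False)
print(shlex.join(cmd))
```

Passing `use_mtp=True` appends the two speculative-decoding options, which keeps the two runs identical apart from the setting under test.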

📖 Read the full source: r/LocalLLaMA
