Qwen 3.6 27B with MTP on V100 32GB: 54 t/s via llama.cpp Branch

✍️ OpenClawRadar | 📅 Published: May 6, 2026 | 🔗 Source

A user on r/LocalLLaMA reports running Qwen 3.6 27B with Multi-Token Prediction (MTP) on a V100 32GB SXM module attached through a PCIe adapter. The setup uses am17an's MTP branch of llama.cpp and the corresponding MTP GGUF quant. Key specs: Q8_0 KV cache with a 200k cache limit, served by llama-server as a VS Code Copilot backend.

Performance Numbers

  • Without MTP: 29-30 tokens/second
  • With MTP: 54-55 tokens/second (at 150W power limit)
  • After 50k tokens context: drops to 40-45 t/s
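The reported ranges can be turned into speedup ratios with simple arithmetic. The sketch below is only illustrative: it takes the midpoint of each range (an assumption, since the post gives ranges, not averages), and the variable and function names are hypothetical. The post gives no non-MTP figure at 50k context, so the long-context ratio is measured against the short-context baseline.

```python
# Throughput ranges in tokens/second, as reported in the post.
BASELINE = (29.0, 30.0)    # without MTP
MTP_SHORT = (54.0, 55.0)   # with MTP, at the 150W power limit
MTP_LONG = (40.0, 45.0)    # with MTP, after ~50k tokens of context


def midpoint(rng):
    """Return the midpoint of a (low, high) range."""
    low, high = rng
    return (low + high) / 2.0


def speedup(mtp_range, base_range=BASELINE):
    """Ratio of midpoint MTP throughput to midpoint baseline throughput."""
    return midpoint(mtp_range) / midpoint(base_range)


short_speedup = speedup(MTP_SHORT)  # 54.5 / 29.5
# Assumption: compared against the short-context baseline, because the
# post does not report non-MTP throughput at 50k context.
long_speedup = speedup(MTP_LONG)    # 42.5 / 29.5

print(f"short context: {short_speedup:.2f}x")
print(f"after 50k tokens: {long_speedup:.2f}x")
```

On these midpoints the gain is about 1.85x at short context and about 1.44x after 50k tokens.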

Build and run were straightforward: the am17an branch was 'pulled and built in one shot', and llama-server ran without issues. The setup handles tool calls and sub-agents well, and delivered 'very insightful code reviews and refactors' despite the 32GB VRAM limit.

This is particularly relevant for developers running LLMs on older datacenter hardware like V100s. MTP nearly doubles short-context throughput for this model (roughly 1.8x, from 29-30 to 54-55 t/s), and it still reaches 40-45 t/s after 50k tokens of context, a practical gain for coding assistant workloads.

📖 Read the full source: r/LocalLLaMA

