Qwen 3.6 27B at 54 t/s on V100 32GB via MTP Branch

A user on r/LocalLLaMA reports impressive results running Qwen 3.6 27B with Multi-Token Prediction (MTP) on a V100 32GB SXM module using a PCIe adapter. The setup uses am17an's MTP branch of llama.cpp and the corresponding MTP GGUF quant. Key specs: Q8_0 KV cache with 200k cache limit, running as a VS Code Copilot backend via llama-server.

Performance Numbers

Without MTP: 29-30 tokens/second
With MTP: 54-55 tokens/second (at 150W power limit)
After 50k tokens context: drops to 40-45 t/s

Branch: am17an's MTP fork. Build and run were straightforward — 'pulled and built in one shot' with llama-server running without issues. The setup handles tool calls and sub-agents well, and delivered 'very insightful code reviews and refactors' despite the VRAM limitation (32GB).

This is particularly relevant for developers running LLMs on older datacenter hardware like V100s. MTP effectively doubles throughput for this model, demonstrating practical gains for coding assistant workloads.

📖 Read the full source: r/LocalLLaMA

Qwen 3.6 27B with MTP on V100 32GB: 54 t/s via llama.cpp Branch

Performance Numbers

👀 See Also

OmniCoder-9B fine-tune shows strong performance for agentic coding on 8GB VRAM systems

Introducing operate.txt: A YAML spec for AI agents navigating SaaS products

Rift: A Better Alternative to Git Worktrees with Instant Copy-on-Write Snapshots

Custom WhatsApp Channel Plugin for Claude Code Using Baileys