Qwen 3.5 122B MoE at 35 t/s on a Single 3090 with ik_llama.cpp MTP

✍️ OpenClawRadar | 📅 Published: June 6, 2026

A developer running a fully local inference stack on a single desktop reports 35 tokens/s decode on Qwen 3.5 122B MoE from a single 3090. The key enabler is ik_llama.cpp, a fork of llama.cpp that fixes MTP (Multi-Token Prediction) for experts offloaded to system RAM.

Hardware Config

  • AMD 9900X CPU
  • 192GB DDR5-5200 RAM (called “the secret weapon”)
  • Two 3090s (Ti + standard), no NVLink

Card 1 runs the worker: Qwen3.5-122B-A10B, using the Unsloth IQ3_S MTP GGUF with 204K context. 75% of its expert layers are offloaded to CPU via targeted -ot (tensor override) flags. Card 2 runs the reasoner: Qwen3.6-35B-A3B Q4_K_XL with MTP, at 135 t/s with 262K context.
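
The exact -ot flags are not given in the source, so the following is only a sketch of how such a flag could be generated. It assumes the GGUF uses llama.cpp's usual MoE tensor names (`blk.N.ffn_gate_exps`, `ffn_up_exps`, `ffn_down_exps`), that the offloaded layers are the last 75% of the stack, and a placeholder layer count of 48. The helper name `expert_offload_pattern` and the layer count are hypothetical; check the real model's metadata before using a pattern like this.

```python
import re


def expert_offload_pattern(n_layers: int, cpu_fraction: float) -> str:
    """Build a regex matching expert tensors in the last `cpu_fraction` of layers."""
    n_cpu = round(n_layers * cpu_fraction)
    first_cpu = n_layers - n_cpu
    # Enumerate layer indices explicitly so the regex stays easy to audit.
    layers = "|".join(str(i) for i in range(first_cpu, n_layers))
    return rf"blk\.({layers})\.ffn_(gate|up|down)_exps\."


N_LAYERS = 48  # ASSUMPTION: placeholder, not the confirmed layer count of the 122B model
pattern = expert_offload_pattern(N_LAYERS, 0.75)

# Sanity-check the pattern against tensor names before passing it to the CLI.
assert re.search(pattern, "blk.47.ffn_down_exps.weight")
assert not re.search(pattern, "blk.47.attn_q.weight")

# Prints a flag of the form: -ot "<regex>=CPU"
print(f'-ot "{pattern}=CPU"')
```

Attention and shared tensors are not matched, so they stay on the GPU while only the selected expert tensors are pinned to system RAM.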

Additional CPU-only instances handle background processing: Dialectic (35B heretic Q8), Scribe-Logos (Gemma4 19B), and Moonshot (Gemma4 2B), totalling ~19GB RAM.

The ik_llama.cpp Finding

Stock llama.cpp’s MTP evaluates each speculated token’s experts sequentially through DDR5. On reasoning content this can actually regress performance, because the draft overhead outweighs the acceptance speedup. The ik fork implements fused MoE ops that batch expert reads across speculated tokens, which the developer says turns MTP from a +4% gain into a +20% gain. The reported result is 35 t/s decode on a 122B model from a single 3090.
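
The trade-off between draft overhead and acceptance can be illustrated with a toy cost model. This is not how either codebase measures anything: it assumes each drafted token is accepted independently with probability `accept_prob`, and that each drafted token adds a fixed fraction `extra_cost_per_draft` of one plain decode step. The function names and all numeric values are illustrative assumptions, not measurements from the source.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step: 1 + p + p^2 + ... + p^n."""
    return sum(accept_prob ** i for i in range(n_draft + 1))


def mtp_speedup(accept_prob: float, n_draft: int, extra_cost_per_draft: float) -> float:
    """Throughput relative to plain decoding under the toy cost model.

    One plain decode step costs 1.0; each drafted token adds
    `extra_cost_per_draft` to the cost of the step.
    """
    step_cost = 1.0 + n_draft * extra_cost_per_draft
    return expected_tokens_per_step(accept_prob, n_draft) / step_cost


# ASSUMPTION: illustrative values only.
# Sequential expert reads: a drafted token costs almost a full extra step.
sequential = mtp_speedup(accept_prob=0.6, n_draft=1, extra_cost_per_draft=0.9)
# Batched expert reads: the drafted token shares most of the memory traffic.
batched = mtp_speedup(accept_prob=0.6, n_draft=1, extra_cost_per_draft=0.35)

print(f"sequential: {sequential:.2f}x, batched: {batched:.2f}x")
```

With these made-up inputs the sequential case comes out at roughly 0.84x (a regression) and the batched case at roughly 1.19x, which shows why lowering the per-draft cost matters more than the acceptance rate alone when experts live in system RAM.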

If you’re offloading experts to RAM on any MoE model, try ik_llama.cpp before giving up on MTP.

Total Build Cost

  • ~$1600 for RAM
  • ~$1600 for two 3090s
  • ~$400 for everything else
  • Total: ~$3600
  • Running cost: electricity only

📖 Read the full source: r/openclaw
