Qwen3.5-122B on Blackwell SM120: fp8 KV Cache Corruption Issue and Performance Findings

OpenClawRadar | Published: March 1, 2026

Key Findings from Qwen3.5-122B Testing on Blackwell SM120

A detailed test of Qwen3.5-122B on 8x RTX PRO 6000 Blackwell GPUs (AWS g7e.48xlarge, SM120) with SGLang revealed critical configuration issues and performance characteristics. The most significant finding: with an fp8_e4m3 KV cache the server does not crash, but it silently produces corrupt output. No errors or warnings are raised; the responses consist of exclamation marks and repetition instead of proper answers. The only fix found was to use a bf16 KV cache instead.
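Because this failure raises no error, it has to be caught by inspecting the generated text. The following is a minimal sketch of such a check, not something taken from the original test: the function name `looks_degenerate` and both thresholds are assumptions chosen for illustration, and they would need tuning against real outputs.

```python
from collections import Counter


def looks_degenerate(text, max_char_share=0.5, min_distinct_ngrams=0.3, n=3):
    """Heuristic check for output made of one repeated character or a looping phrase.

    Thresholds are illustrative assumptions, not values from the original test.
    """
    stripped = "".join(text.split())
    if not stripped:
        # An empty response is treated as a failure too.
        return True

    # Signal 1: a single character (for example "!") dominates the output.
    top_char_count = Counter(stripped).most_common(1)[0][1]
    if top_char_count / len(stripped) > max_char_share:
        return True

    # Signal 2: very few distinct word n-grams, which indicates a repetition loop.
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if ngrams and len(set(ngrams)) / len(ngrams) < min_distinct_ngrams:
        return True

    return False


if __name__ == "__main__":
    for sample in ("!!!!!!!!!!!!!!!!", "the cat sat " * 10, "Paris is the capital of France."):
        print(repr(sample[:24]), looks_degenerate(sample))
```

Running a check like this over a handful of known prompts after every change to the KV cache settings would turn a silent corruption into a visible test failure.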

Configuration Requirements

DeltaNet layers in Qwen3.5-122B add constraints that standard MoE models don't have. The setup required 6 specific Triton backend flags on SM120 hardware, which enforce the following constraints:

  • Attention backend forced to Triton (for DeltaNet layers)
  • KV cache forced to bf16 (fp8 corrupts output)
  • No CUDA graphs (due to Triton SMEM overflow)
  • No HiCache (DeltaNet incompatible)

This contrasts with M2.5 testing on the same hardware, which only needed 2 Triton backend flags.
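As a rough illustration, the four constraints listed above could map onto an SGLang launch command like the one built below. This is a sketch under assumptions, not the exact command from the test: it covers only the four listed constraints rather than the full set of six flags, the model path is a placeholder, and the flag spellings (in particular the `bf16` value for `--kv-cache-dtype`) should be checked against `python -m sglang.launch_server --help` for the installed build.

```python
import shlex


def build_launch_command(model_path, tp_size=8):
    """Build an SGLang launch command reflecting the constraints listed above.

    Flag names and values are assumptions to verify against the installed SGLang version.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,         # placeholder path, supplied by the caller
        "--tp", str(tp_size),               # tensor parallelism across the GPUs
        "--attention-backend", "triton",    # Triton attention for the DeltaNet layers
        "--kv-cache-dtype", "bf16",         # ASSUMPTION: exact value string; avoids fp8 corruption
        "--disable-cuda-graph",             # CUDA graphs off (Triton SMEM overflow)
        # HiCache stays off by simply not passing --enable-hierarchical-cache.
    ]


if __name__ == "__main__":
    # "/path/to/Qwen3.5-122B" is a hypothetical location, not a real model ID.
    print(shlex.join(build_launch_command("/path/to/Qwen3.5-122B")))
```

Building the command as a list keeps each constraint on its own commented line, which makes it easier to see which flag exists for which reason when the configuration is revisited.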

Performance Benchmarks

All tests used the same hardware and methodology with SGLang nightly (cu13 20260219), TP=8. Each result below lists Qwen3.5-122B first and M2.5 second:

  • Burst tok/s: 1,985 vs 1,818 (Qwen3.5-122B vs M2.5)
  • Online 4 rps: 310 vs 404
  • Online 8 rps: 514 vs 744
  • Single-request tok/s: ~25 (with MTP) vs 72
  • Arena-Hard quality: 6.99/10 vs 4.94/10 (judged by Claude Opus 4.6, not comparable to leaderboard results)
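To make the gaps easier to compare, the figures above can be expressed as ratios of Qwen3.5-122B to M2.5. The short sketch below only does arithmetic on the numbers as reported; the dictionary keys and the helper name `relative_ratios` are illustrative, and the single-request figure for Qwen3.5-122B is the approximate ~25 tok/s value.

```python
# (Qwen3.5-122B, M2.5) pairs copied from the results above.
RESULTS = {
    "burst tok/s": (1985, 1818),
    "online 4 rps": (310, 404),
    "online 8 rps": (514, 744),
    "single-request tok/s": (25, 72),   # ~25 with MTP, approximate
    "Arena-Hard (/10)": (6.99, 4.94),
}


def relative_ratios(results):
    """Return the Qwen3.5-122B / M2.5 ratio for each metric."""
    return {name: qwen / m25 for name, (qwen, m25) in results.items()}


ratios = relative_ratios(RESULTS)
for name, ratio in ratios.items():
    print(f"{name}: Qwen3.5-122B is {ratio:.2f}x M2.5")
```

On these numbers Qwen3.5-122B is about 9% ahead on burst throughput and about 41% ahead on the Arena-Hard score, while it reaches roughly 77% and 69% of M2.5's figures at 4 and 8 rps and about 35% of its single-request speed.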

Optimization Results

Of the optimization paths tested, MTP (Multi-Token Prediction) was the only one that materially improved performance, providing a 2.75x single-request speedup (~9 to ~25 tok/s). Other optimizations available on SM120 hardware - FP8 KV cache, CUDA graphs, and HiCache - were blocked by DeltaNet constraints in Qwen3.5-122B.

Qwen3.5-122B wins on burst throughput and quality, while M2.5 still wins on every sustained serving metric because it can use the optimizations that Qwen3.5-122B's DeltaNet layers block.

Full results, compatibility matrix, exact reproduction commands, and all JSONL artifacts are available in the GitHub issue linked below.

📖 Read the full source: r/LocalLLaMA
