Qwen3.5-122B fp8 KV Cache Corruption on Blackwell SM120 Fix

Key Findings from Qwen3.5-122B Testing on Blackwell SM120

A detailed test of Qwen3.5-122B on 8x RTX PRO 6000 Blackwell GPUs (AWS g7e.48xlarge, SM120) with SGLang revealed critical configuration issues and performance characteristics. The most significant finding: an fp8_e4m3 KV cache does not crash. Instead it silently produces corrupt output (exclamation marks and repetition in place of proper answers) with no errors or warnings. The only fix found was to use a bf16 KV cache instead.
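Because the failure raises no error, it only shows up if something inspects the generated text. A minimal sketch of such a check follows, assuming a simple heuristic: flag a completion when exclamation marks or a single repeated token dominate it. The function name `looks_corrupt` and both thresholds are illustrative assumptions, not part of the original test.

```python
from collections import Counter


def looks_corrupt(text: str,
                  max_bang_ratio: float = 0.2,
                  max_repeat_ratio: float = 0.5) -> bool:
    """Heuristically flag output dominated by '!' or by one repeated token.

    Both thresholds are arbitrary illustrative defaults (assumptions).
    """
    stripped = text.strip()
    if not stripped:
        return True  # an empty completion is treated as a failure too

    # Share of non-whitespace characters that are exclamation marks.
    visible = [c for c in stripped if not c.isspace()]
    bang_ratio = visible.count("!") / len(visible)

    # Share of whitespace-separated tokens taken by the most common token.
    # Very short outputs are skipped, since repetition there means little.
    tokens = stripped.split()
    top_count = Counter(tokens).most_common(1)[0][1]
    repeat_ratio = top_count / len(tokens) if len(tokens) >= 8 else 0.0

    return bang_ratio > max_bang_ratio or repeat_ratio > max_repeat_ratio
```

Running a check like this over a handful of known-answer prompts after each configuration change would catch this class of failure. It is a coarse filter and will not detect subtler quality regressions.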

Configuration Requirements

DeltaNet layers in Qwen3.5-122B add constraints that standard MoE models don't have. The setup required 6 specific Triton backend flags on SM120 hardware. The main constraints behind them:

Attention backend forced to Triton (for DeltaNet layers)
KV cache forced to bf16 (fp8 corrupts output)
No CUDA graphs (due to Triton SMEM overflow)
No HiCache (DeltaNet incompatible)

This contrasts with M2.5 testing on the same hardware, which only needed 2 Triton backend flags.
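The constraints above can be sketched as an SGLang launch command. This is an illustration under stated assumptions, not the exact six-flag command from the original test. The model path is a placeholder, and the `bf16` spelling for `--kv-cache-dtype` is assumed and may differ between SGLang versions. The remaining options are SGLang server arguments as I understand them, so check them against `python -m sglang.launch_server --help` for the build in use.

```python
import shlex

# Placeholder, not the real checkpoint name: substitute your own model path.
MODEL_PATH = "path/to/Qwen3.5-122B"


def build_launch_args(model_path: str, tp: int = 8) -> list[str]:
    """Build an argv list reflecting the four constraints listed above."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),                    # tensor parallelism across 8 GPUs
        "--attention-backend", "triton",    # Triton attention for DeltaNet layers
        "--kv-cache-dtype", "bf16",         # assumed value spelling; avoids fp8 corruption
        "--disable-cuda-graph",             # CUDA graphs hit Triton SMEM overflow
        # HiCache stays off by simply not passing --enable-hierarchical-cache.
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_launch_args(MODEL_PATH)))
```

Building the command as a list keeps each constraint on its own commented line, which makes it easy to remove one flag at a time when a later SGLang build lifts a restriction.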

Performance Benchmarks

All tests used the same hardware and methodology with SGLang nightly (cu13 20260219), TP=8. Each line gives the Qwen3.5-122B result first, then the M2.5 result:

Burst tok/s: 1,985 vs 1,818
Online 4 rps: 310 vs 404
Online 8 rps: 514 vs 744
Single-request tok/s: ~25 (with MTP) vs 72
Arena-Hard quality: 6.99/10 vs 4.94/10 (judged by Claude Opus 4.6, not comparable to leaderboard results)
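To make the gaps easier to compare, the figures above can be turned into Qwen3.5-122B / M2.5 ratios. The short sketch below only does arithmetic on the numbers already listed. The dictionary keys are informal labels, and the single-request Qwen figure is the approximate MTP value.

```python
# (Qwen3.5-122B, M2.5) pairs copied from the benchmark figures above.
RESULTS = {
    "burst tok/s": (1985, 1818),
    "online 4 rps": (310, 404),
    "online 8 rps": (514, 744),
    "single-request tok/s": (25, 72),   # Qwen figure is approximate, with MTP
    "Arena-Hard score": (6.99, 4.94),
}


def relative(results: dict) -> dict:
    """Return Qwen3.5-122B / M2.5 for each metric (>1 means Qwen is ahead)."""
    return {name: qwen / m25 for name, (qwen, m25) in results.items()}


if __name__ == "__main__":
    for name, ratio in relative(RESULTS).items():
        print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 favors Qwen3.5-122B and a ratio below 1 favors M2.5.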

Optimization Results

Of the optimization paths tested, MTP (Multi-Token Prediction) was the only one that materially improved performance, providing a 2.75x single-request speedup (~9 to ~25 tok/s). Other optimizations available on SM120 hardware - FP8 KV cache, CUDA graphs, and HiCache - were blocked by DeltaNet constraints in Qwen3.5-122B.

Qwen3.5-122B wins on burst throughput and quality, while M2.5 wins on every sustained serving metric because it can use the optimizations that DeltaNet blocks for Qwen3.5-122B.

Full results, compatibility matrix, exact reproduction commands, and all JSONL artifacts are available in the GitHub issue linked from the source post.

📖 Read the full source: r/LocalLLaMA