Qwen3.5-122B on Blackwell SM120: fp8 KV Cache Corruption Issue and Performance Findings

Key Findings from Qwen3.5-122B Testing on Blackwell SM120
A detailed test of Qwen3.5-122B on 8x RTX PRO 6000 Blackwell hardware (AWS g7e.48xlarge, SM120) with SGLang revealed critical configuration issues and performance characteristics. The most significant finding: the fp8_e4m3 KV cache does not crash. Instead it silently produces corrupt output, with no errors or warnings: exclamation marks and repetition in place of proper answers. The only fix found was to use a bf16 KV cache instead.
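Because the failure raises no error, one way to catch it is to sanity-check the generated text itself. The sketch below is a minimal heuristic for the two symptoms described above (output dominated by exclamation marks, or one token repeated over and over). The function name `looks_corrupt` and both thresholds are illustrative assumptions, not part of the original test setup.

```python
from collections import Counter


def looks_corrupt(text: str,
                  max_bang_ratio: float = 0.3,
                  max_repeat_ratio: float = 0.5) -> bool:
    """Heuristic check for degenerate output.

    Thresholds are illustrative assumptions and would need tuning
    against real prompts and completions.
    """
    stripped = text.strip()
    if not stripped:
        return True  # an empty completion is treated as a failure

    # Symptom 1: output dominated by exclamation marks.
    non_space = [ch for ch in stripped if not ch.isspace()]
    bang_ratio = non_space.count("!") / len(non_space)
    if bang_ratio > max_bang_ratio:
        return True

    # Symptom 2: a single whitespace-separated token makes up most of the text.
    tokens = stripped.split()
    if len(tokens) >= 8:
        most_common_count = Counter(tokens).most_common(1)[0][1]
        if most_common_count / len(tokens) > max_repeat_ratio:
            return True

    return False
```

Running a handful of known-answer prompts through a check like this after every configuration change would make a silently broken KV cache setting visible before benchmarking starts.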
Configuration Requirements
DeltaNet layers in Qwen3.5-122B add constraints that standard MoE models don't have. The setup required 6 specific Triton backend flags on SM120 hardware, which together enforce the following constraints:
- Attention backend forced to Triton (for DeltaNet layers)
- KV cache forced to bf16 (fp8 corrupts output)
- No CUDA graphs (due to Triton SMEM overflow)
- No HiCache (DeltaNet incompatible)
This contrasts with M2.5 testing on the same hardware, which only needed 2 Triton backend flags.
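The four constraints listed above can be written down as a small pre-launch check. This is a sketch only: `ServingConfig` and its field names are hypothetical and do not correspond to SGLang's actual flags, which are given in the reproduction commands of the full write-up.

```python
from dataclasses import dataclass


@dataclass
class ServingConfig:
    """Hypothetical settings container; field names are NOT SGLang's."""
    attention_backend: str
    kv_cache_dtype: str
    cuda_graphs: bool
    hicache: bool


def deltanet_sm120_violations(cfg: ServingConfig) -> list[str]:
    """Return the constraints from the list above that cfg breaks."""
    problems = []
    if cfg.attention_backend != "triton":
        problems.append("attention backend must be Triton for DeltaNet layers")
    if cfg.kv_cache_dtype != "bf16":
        problems.append("KV cache must be bf16; fp8 corrupts output")
    if cfg.cuda_graphs:
        problems.append("CUDA graphs must be off (Triton SMEM overflow)")
    if cfg.hicache:
        problems.append("HiCache must be off (incompatible with DeltaNet)")
    return problems


# A configuration that satisfies every constraint in the list.
safe = ServingConfig("triton", "bf16", cuda_graphs=False, hicache=False)
# The silent-corruption case: everything else is fine, but the KV cache is fp8.
risky = ServingConfig("triton", "fp8_e4m3", cuda_graphs=False, hicache=False)

print(deltanet_sm120_violations(safe))
print(deltanet_sm120_violations(risky))
```

Encoding the rules as data-driven checks is useful here mainly because the fp8 case fails without any error, so nothing else would flag it.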
Performance Benchmarks
All tests used the same hardware and methodology, with SGLang nightly (cu13 20260219) and TP=8. Each pair of figures is Qwen3.5-122B vs M2.5:
- Burst tok/s: 1,985 vs 1,818
- Online 4 rps: 310 vs 404
- Online 8 rps: 514 vs 744
- Single-request tok/s: ~25 (with MTP) vs 72
- Arena-Hard quality: 6.99/10 vs 4.94/10 (judged by Claude Opus 4.6, not comparable to leaderboard results)
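To make the gaps easier to compare, the figures in the list above can be turned into ratios of Qwen3.5-122B to M2.5. The sketch below copies the numbers from the list and assumes higher is better for every metric; the dictionary keys and the helper name `relative_to_m25` are illustrative.

```python
# (Qwen3.5-122B, M2.5) pairs copied from the benchmark list.
results = {
    "burst tok/s": (1985, 1818),
    "online 4 rps": (310, 404),
    "online 8 rps": (514, 744),
    "single-request tok/s": (25, 72),  # ~25 is approximate in the source
    "Arena-Hard": (6.99, 4.94),
}


def relative_to_m25(pairs):
    """Ratio of the Qwen3.5-122B figure to the M2.5 figure for each metric."""
    return {name: qwen / m25 for name, (qwen, m25) in pairs.items()}


ratios = relative_to_m25(results)
for name, ratio in ratios.items():
    # Above 1.0 means Qwen3.5-122B is ahead, assuming higher is better.
    print(f"{name}: {ratio:.2f}x of M2.5")
```

On these numbers Qwen3.5-122B is about 9% ahead on burst throughput, while its single-request rate is roughly a third of M2.5's.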
Optimization Results
Of the optimization paths tested, MTP (Multi-Token Prediction) was the only one that materially improved performance, giving a 2.75x single-request speedup (~9 to ~25 tok/s). The other optimizations available on SM120 hardware (FP8 KV cache, CUDA graphs, and HiCache) were blocked by DeltaNet constraints in Qwen3.5-122B.
Qwen3.5-122B wins on burst throughput and quality, while M2.5 still wins on every sustained serving metric because it can use the optimizations that Qwen3.5-122B's DeltaNet layers block.
Full results, compatibility matrix, exact reproduction commands, and all JSONL artifacts are available in the GitHub issue linked below.
📖 Read the full source: r/LocalLLaMA