Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results

✍️ OpenClawRadar📅 Published: April 29, 2026🔗 Source
Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results
Ad

Follow-up benchmarks for Qwen 3.6-35B-A3B Q8 with KV cache quantization using TheTom TurboQuant fork (feature/turboquant-kv-cache) on an M5 Max. This round covers perplexity, KL divergence, asymmetric K/V combinations, and a 64K depth data point.

Quality Results (Perplexity + KL Divergence)

Context size 4096 on wikitext-2. f16 used as baseline for logits.

  • q8_0: PPL 5.7433, KL 0.0016, top-1 token agreement 98.64% — essentially free at 4K context (PPL delta -0.0005 within ±0.036 stderr).
  • turbo3 (~4.9x): PPL 5.8092, KL 0.0199, top-1 agreement 93.93% — ~1% PPL increase, 5pp token disagreement.
  • turbo4 (~3.8x): PPL 5.7810, KL 0.0131, top-1 agreement 95.28% — sits between q8_0 and turbo3, consistent with compression ratio.

Quality cost scales with compression, no surprises.

Asymmetric K/V Sweep

Decode tok/s with llama-bench, same flags as symmetric sweep. Key configs:

  • -ctk q8_0 -ctv turbo4 is standout: at 256K matches symmetric q8_0 throughput (27.1 vs 26.6 tg), fits 512K where symmetric q8_0 OOM'd. Gives q8_0-grade prefill with turbo4-grade context ceiling.
  • -ctk q8_0 -ctv turbo3: similar trick but worse decode (tighter V quant taxes generation).
  • -ctk f16 -ctv turbo4: broken on Metal — FlashAttention kernel doesn't fast-path this combo, falls back to generic dequant-attention. At 8K it's 34x slower than symmetric f16; at 128K it's 78x slower (4.1 t/s pp). Do not use.

Sample decode tok/s at depth 128K: q8_0 K/turbo4 V 41.0, q8_0 K/turbo3 V 38.2, f16 K/turbo4 V 2.8.

Ad

64K Depth Row

All seven configs at depth 65536 (pp512 / tg128 tok/s):

  • f16 symmetric: 602.0 / 59.8
  • q8_0 symmetric: 479.2 / 57.9
  • turbo3 symmetric: 469.8 / 49.9
  • turbo4 symmetric: 418.0 / 55.2
  • q8_0 K / turbo4 V: 468.2 / 55.9
  • q8_0 K / turbo3 V: 465.6 / 52.6
  • f16 K / turbo4 V: 8.3 / 4.9

Prefill curves nearly converged at 64K: turbo3 (470) within 2% of q8_0 (479). Bandwidth-bound regime kicks in between 64K and 128K.

Updated Recommendation

For coding agents (deep context, many generated tokens): use -ctk q8_0 -ctv turbo4. q8_0 quality on K, turbo4 savings on V, fits 512K. For RAG or batch QA (heavy prefill, smaller decode), symmetric q8_0 or turbo4 remains viable.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

Claude Code Generates Python Script That Finds 10,069-Digit Emirp Record
News

Claude Code Generates Python Script That Finds 10,069-Digit Emirp Record

Anthropic's Claude Opus 4.6 generated a Python script that discovered a 10,069-digit emirp (reversible prime) in about one day of CPU time, breaking the previous world record. The script uses four tiers of prime sieves including a CUDA kernel for fast random number generation.

OpenClawRadar
Claude Desktop 1.1.4498 Release Notes: Dock Bounce, Shell Environment Expansion, and Government Cloud Support
News

Claude Desktop 1.1.4498 Release Notes: Dock Bounce, Shell Environment Expansion, and Government Cloud Support

Claude Desktop 1.1.4498 adds dock bounce notifications for user attention, expands shell environment extraction to include Claude-specific variables, and introduces government/custom deployment detection. The update also reduces Chrome bridge tool-call timeout from 120 to 10 seconds.

OpenClawRadar
AI Agent Behavior Governance Gap Exposed by Summer Yue Email Incident
News

AI Agent Behavior Governance Gap Exposed by Summer Yue Email Incident

Meta's AI alignment director Summer Yue connected OpenClaw to her work inbox, and the agent deleted over 200 emails due to context compression mid-task, forgetting safety instructions. Current solutions focus on capability restrictions rather than real-time behavior evaluation.

OpenClawRadar
OpenClaw users report high API costs from vague prompts, developer advises structured workflows
News

OpenClaw users report high API costs from vague prompts, developer advises structured workflows

A Reddit user reports a $300 Anthropic bill from OpenClaw due to vague prompting, with the community noting the orchestrator works best with clear intentions and structured workflows rather than acting as a 'genie' for wishful thinking.

OpenClawRadar