Qwen 3.6-35B-A3B KV Cache Bench: f16 vs q8_0 vs Turbo3 vs Turbo4 on M5 Max Up to 1M Context

✍️ OpenClawRadar · 📅 Published: April 28, 2026

A Reddit user ran a depth sweep on Qwen 3.6-35B-A3B Q8 using TheTom's TurboQuant Metal fork of llama.cpp (GitHub: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) on a MacBook Pro M5 Max with 128 GB unified memory. The sweep covered four KV cache types, f16, q8_0, turbo3 (3-bit), and turbo4 (4-bit), each applied symmetrically to K and V, with flash-attn and mlock enabled, at depths from 0 to 1M context tokens.

Hardware & Build

The machine was an M5 Max with 128 GB unified memory. The fork was built with cmake -B build -DGGML_METAL=ON, and measurements were taken with llama-bench at 3 reps per cell, with flash-attn and mlock on. The full sweep took 8 hours of wall-clock time overnight.

Generation Throughput (tok/s)

Depth   f16     q8_0    turbo3  turbo4
0       89.4    87.4    79.5    79.7
8K      84.2    79.2    72.2    71.2
32K     72.6    67.8    61.5    61.8
128K    44.4    40.7    36.0    37.7
256K    OOM     26.6    22.9    25.5
512K    OOM     OOM     13.3    16.0
1M      OOM     OOM     6.5     OOM

Prompt Processing Throughput (tok/s)

Depth   f16     q8_0    turbo3  turbo4
0       2962    2948    2904    2854
8K      2098    1623    1653    1439
32K     1063    802     784     678
128K    321     245     253     206
256K    OOM     124     128     101
512K    OOM     OOM     66      56
1M      OOM     OOM     30      OOM

Key Takeaways

  • At depth 0, f16 leads by a hair on prefill (2962 vs 2948 tok/s for q8_0); turbo3 is ~11% slower than f16 on decode (79.5 vs 89.4 tok/s).
  • At 128K, turbo3 prefill (253 tok/s) matches or slightly beats q8_0 (245 tok/s): the smaller cache reduces bandwidth pressure.
  • At 256K, turbo3 wins prefill +27% over turbo4 (128 vs 101), but turbo4 wins decode +11% (25.5 vs 22.9). At 512K, decode gap widens to +20% (turbo4 16.0 vs turbo3 13.3).
  • turbo3 is the only cache type that fits 1M context (6.5 tok/s decode). Memory at 1M: ~89 GB (37 GB weights, ~52 GB KV cache).
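The percentages and the memory budget in these takeaways can be re-derived from the published figures. The following is a minimal Python sketch, not part of the original post: the helper name pct_gain is hypothetical, and the per-token figure assumes that "1M" means 1,000,000 tokens and that GB means 10^9 bytes, neither of which the source states.

```python
def pct_gain(a: float, b: float) -> float:
    """Percentage by which throughput a exceeds throughput b."""
    return (a / b - 1.0) * 100.0


# Figures copied from the benchmark tables (tok/s).
prefill_256k = pct_gain(128, 101)    # turbo3 vs turbo4, prefill at 256K
decode_256k = pct_gain(25.5, 22.9)   # turbo4 vs turbo3, decode at 256K
decode_512k = pct_gain(16.0, 13.3)   # turbo4 vs turbo3, decode at 512K
decode_0 = pct_gain(79.5, 89.4)      # turbo3 vs f16, decode at depth 0

print(f"256K prefill, turbo3 over turbo4: {prefill_256k:+.0f}%")  # +27%
print(f"256K decode,  turbo4 over turbo3: {decode_256k:+.0f}%")   # +11%
print(f"512K decode,  turbo4 over turbo3: {decode_512k:+.0f}%")   # +20%
print(f"depth 0 decode, turbo3 vs f16:    {decode_0:+.0f}%")      # -11%

# Memory budget at 1M context, as reported in the post (GB).
weights_gb = 37
kv_cache_gb = 52
total_gb = weights_gb + kv_cache_gb

# Assumption: 1M = 1_000_000 tokens and 1 GB = 1e9 bytes.
kv_bytes_per_token = kv_cache_gb * 1e9 / 1_000_000

print(f"total at 1M: ~{total_gb} GB")                           # ~89 GB
print(f"turbo3 KV cost: ~{kv_bytes_per_token / 1e3:.0f} KB/token")  # ~52 KB/token
```

Under those assumptions the turbo3 cache costs roughly 52 KB per token of context, which is why a 1M-token cache still leaves headroom on a 128 GB machine once the 37 GB of weights are loaded.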

Workload Recommendations

  • Coding agents (deep context, many generated tokens): turbo4
  • RAG / batch QA (heavy prefill, short answers): turbo3
  • 1M context: turbo3 only
  • Short interactive (<32K): f16 if it fits, else q8_0
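These recommendations can be expressed as a small lookup. The Python sketch below is illustrative only: the function name, the workload labels, and the f16_fits parameter are hypothetical, and the thresholds assume "K" means 1024 tokens. Depths between 512K and 1M were not benchmarked, so routing everything above 512K to turbo3 is an assumption based on turbo4 running out of memory at 1M.

```python
SHORT_CONTEXT = 32 * 1024        # "<32K" threshold from the recommendations
TURBO4_MAX_TESTED = 512 * 1024   # deepest tested depth where turbo4 did not OOM


def recommend_kv_cache(workload: str, context_tokens: int, f16_fits: bool = True) -> str:
    """Map a workload and context depth to the KV cache type suggested above.

    workload is one of the hypothetical labels "coding_agent", "rag",
    or "interactive".
    """
    # Only turbo3 fit at 1M; anything past turbo4's deepest passing depth
    # is routed to turbo3 (an assumption, since 512K..1M was not tested).
    if context_tokens > TURBO4_MAX_TESTED:
        return "turbo3"
    if workload == "interactive" and context_tokens < SHORT_CONTEXT:
        return "f16" if f16_fits else "q8_0"
    if workload == "coding_agent":
        return "turbo4"   # deep context, many generated tokens: decode-bound
    if workload == "rag":
        return "turbo3"   # heavy prefill, short answers: prefill-bound
    raise ValueError(f"no recommendation for {workload!r} at {context_tokens} tokens")


print(recommend_kv_cache("coding_agent", 200_000))   # turbo4
print(recommend_kv_cache("rag", 100_000))            # turbo3
print(recommend_kv_cache("coding_agent", 1_000_000)) # turbo3
print(recommend_kv_cache("interactive", 8_000))      # f16
```

The function raises for combinations the article does not cover, such as interactive use at 32K or deeper, instead of guessing.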

Caveats

These results come from a single M5 Max, and the crossover points likely shift with memory bandwidth and GPU core count. Only symmetric K/V settings were tested; asymmetric combos (e.g., -ctk q8_0 -ctv turbo4) were not benched. TheTom's fork is research-grade and has not been merged upstream into llama.cpp main.

📖 Read the full source: r/LocalLLaMA
