Qwen 3.6-35B-A3B KV Cache Bench: Turbo3 Hits 1M at 6.5 tok/s

A Reddit user ran a depth sweep on Qwen 3.6-35B-A3B Q8 using TheTom's TurboQuant Metal fork of llama.cpp (GitHub: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) on a MacBook Pro M5 Max with 128 GB unified memory. They tested four KV cache types: f16, q8_0, turbo3 (3-bit), and turbo4 (4-bit). Each type was applied symmetrically to K and V, with flash-attn on and mlock on, at context depths from 0 to 1M tokens.

Hardware & Build

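M5 Max, 128 GB unified memory. The fork was built with cmake -B build -DGGML_METAL=ON. Benchmarks ran through llama-bench with 3 repetitions per cell, flash-attn on, and mlock on. The full sweep took 8 hours of wall-clock time overnight. OOM in the tables below marks a configuration that ran out of memory.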

Generation Throughput (tok/s)

Depth	f16	q8_0	turbo3	turbo4
0	89.4	87.4	79.5	79.7
8K	84.2	79.2	72.2	71.2
32K	72.6	67.8	61.5	61.8
128K	44.4	40.7	36.0	37.7
256K	OOM	26.6	22.9	25.5
512K	OOM	OOM	13.3	16.0
1M	OOM	OOM	6.5	OOM

Prompt Processing Throughput (tok/s)

Depth	f16	q8_0	turbo3	turbo4
0	2962	2948	2904	2854
8K	2098	1623	1653	1439
32K	1063	802	784	678
128K	321	245	253	206
256K	OOM	124	128	101
512K	OOM	OOM	66	56
1M	OOM	OOM	30	OOM
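
The relative gaps between cache types can be read straight off the two tables. A minimal Python sketch, with the table values copied in by hand; the names DECODE, PREFILL, and pct_faster are illustrative and do not come from the original post:

```python
# Throughput in tok/s, copied from the two tables above; None marks an OOM cell.
CACHE_TYPES = ("f16", "q8_0", "turbo3", "turbo4")

DECODE = {
    "0":    (89.4, 87.4, 79.5, 79.7),
    "8K":   (84.2, 79.2, 72.2, 71.2),
    "32K":  (72.6, 67.8, 61.5, 61.8),
    "128K": (44.4, 40.7, 36.0, 37.7),
    "256K": (None, 26.6, 22.9, 25.5),
    "512K": (None, None, 13.3, 16.0),
    "1M":   (None, None, 6.5, None),
}

PREFILL = {
    "0":    (2962, 2948, 2904, 2854),
    "8K":   (2098, 1623, 1653, 1439),
    "32K":  (1063, 802, 784, 678),
    "128K": (321, 245, 253, 206),
    "256K": (None, 124, 128, 101),
    "512K": (None, None, 66, 56),
    "1M":   (None, None, 30, None),
}


def pct_faster(table, depth, a, b):
    """Percent by which cache type `a` outruns `b` at `depth`.

    Returns None when either configuration ran out of memory.
    """
    row = dict(zip(CACHE_TYPES, table[depth]))
    if row[a] is None or row[b] is None:
        return None
    return 100.0 * (row[a] / row[b] - 1.0)


if __name__ == "__main__":
    for depth in ("256K", "512K"):
        prefill_gap = pct_faster(PREFILL, depth, "turbo3", "turbo4")
        decode_gap = pct_faster(DECODE, depth, "turbo4", "turbo3")
        print(f"{depth}: turbo3 prefill lead {prefill_gap:.1f}%, "
              f"turbo4 decode lead {decode_gap:.1f}%")
```

Keeping OOM cells as None rather than 0 avoids reporting a misleading gap for configurations that never ran.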

Key Takeaways

At depth 0, f16 leads by a hair on prefill (2962 vs 2948 tok/s for q8_0); turbo3 is ~11% slower than f16 on decode (79.5 vs 89.4 tok/s).
At 128K, turbo3 prefill (253 tok/s) is on par with q8_0 (245 tok/s), consistent with the smaller cache reducing bandwidth pressure. Both still trail f16 (321 tok/s).
At 256K, turbo3 beats turbo4 on prefill by 27% (128 vs 101 tok/s), but turbo4 beats turbo3 on decode by 11% (25.5 vs 22.9 tok/s). At 512K, turbo4's decode lead widens to 20% (16.0 vs 13.3 tok/s).
turbo3 is the only cache type that fits 1M context (6.5 tok/s decode, 30 tok/s prefill); the other three ran out of memory. Memory at 1M: ~89 GB (37 GB weights, ~52 GB KV cache).
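
The 1M memory figure allows a rough back-of-envelope check of the OOM pattern. The sketch below is an estimate under stated assumptions, not a measurement. It assumes the KV cache grows linearly with context depth and with nominal bits per cached value, and that the turbo types cost exactly 3 and 4 bits with no per-block overhead. Only the 37 GB and ~52 GB figures come from the post; the q8_0 width of 8.5 bits follows from ggml's block layout.

```python
WEIGHTS_GB = 37.0        # from the post
KV_TURBO3_1M_GB = 52.0   # from the post: turbo3 KV cache at 1M tokens

# ASSUMPTION: nominal bits per cached value. q8_0 is 8.5 bits in ggml
# (32 int8 values plus one fp16 scale per block). The turbo figures ignore
# any per-block metadata the fork may store.
BITS = {"f16": 16.0, "q8_0": 8.5, "turbo3": 3.0, "turbo4": 4.0}

# ASSUMPTION: depths are powers of two (1K = 1024 tokens).
K = 1024
DEPTHS = {"128K": 128 * K, "256K": 256 * K, "512K": 512 * K, "1M": 1024 * K}

# Outcomes at each type's limit, read from the tables: True = ran, False = OOM.
RAN = {
    ("f16", "128K"): True, ("f16", "256K"): False,
    ("q8_0", "256K"): True, ("q8_0", "512K"): False,
    ("turbo4", "512K"): True, ("turbo4", "1M"): False,
    ("turbo3", "1M"): True,
}


def estimate_total_gb(cache, depth):
    """Weights plus KV cache, scaled linearly from the turbo3 1M figure."""
    bit_ratio = BITS[cache] / BITS["turbo3"]
    depth_ratio = DEPTHS[depth] / DEPTHS["1M"]
    return WEIGHTS_GB + KV_TURBO3_1M_GB * bit_ratio * depth_ratio


if __name__ == "__main__":
    for (cache, depth), ran in RAN.items():
        total = estimate_total_gb(cache, depth)
        print(f"{cache:7s} {depth:5s} ~{total:6.1f} GB  {'ran' if ran else 'OOM'}")
```

Under these assumptions, every configuration that ran is estimated at roughly 89 GB or less and every OOM cell at roughly 106 GB or more, which is at least consistent with the tables. The real per-value cost of turbo3 and turbo4 may differ from the nominal bit widths.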

Workload Recommendations

Coding agents (deep context, many generated tokens): turbo4
RAG / batch QA (heavy prefill, short answers): turbo3
1M context: turbo3 only
Short interactive (<32K): f16 if it fits, else q8_0
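
The recommendations above can be written as a small decision rule. This is a hypothetical helper, not something from the original post. The function name, its parameters, and the exact thresholds are assumptions. In particular, treating anything above 512K as turbo3-only reflects that turbo4 ran at 512K and hit OOM at 1M, with nothing measured in between.

```python
K = 1024  # ASSUMPTION: 1K = 1024 tokens


def recommend_cache(context_tokens, prefill_heavy, f16_fits=True):
    """Pick a KV cache type following the workload recommendations above.

    context_tokens: expected maximum context depth.
    prefill_heavy:  True for RAG / batch QA style workloads (long prompts,
                    short answers); False for decode-heavy workloads such
                    as coding agents.
    f16_fits:       whether an f16 cache fits in memory at this depth.
    """
    if context_tokens > 512 * K:
        # turbo4 hit OOM at 1M; turbo3 was the only type that ran there.
        return "turbo3"
    if context_tokens < 32 * K:
        # Short interactive use: prefer the unquantized cache when possible.
        return "f16" if f16_fits else "q8_0"
    # Deep context: turbo3 led on prefill, turbo4 on decode at 256K and 512K.
    return "turbo3" if prefill_heavy else "turbo4"


if __name__ == "__main__":
    print(recommend_cache(8 * K, prefill_heavy=False))     # short chat
    print(recommend_cache(256 * K, prefill_heavy=True))    # RAG
    print(recommend_cache(256 * K, prefill_heavy=False))   # coding agent
    print(recommend_cache(1024 * K, prefill_heavy=False))  # 1M context
```

The thresholds come from a single machine, so they are better treated as starting points than as fixed cutoffs.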

Caveats

These results come from a single M5 Max. The crossover points will likely shift with memory bandwidth and GPU core count. Only symmetric K/V settings were tested; asymmetric combinations (e.g., -ctk q8_0 -ctv turbo4) were not benchmarked. TheTom's fork is research-grade and has not been merged into upstream llama.cpp.

📖 Read the full source: r/LocalLLaMA