Qwen 3.6-35B-A3B KV Cache Bench: f16 vs q8_0 vs Turbo3 vs Turbo4 on M5 Max Up to 1M Context

A Reddit user ran a depth sweep on Qwen 3.6-35B-A3B Q8 using TheTom's TurboQuant Metal fork of llama.cpp (GitHub: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) on a MacBook Pro M5 Max with 128 GB unified memory. They tested four KV cache types: f16, q8_0, turbo3 (3-bit), and turbo4 (4-bit), symmetric K and V, with flash-attn on and mlock on, from 0 to 1M context tokens.
Hardware & Build
M5 Max, 128 GB unified memory. Built with cmake -B build -DGGML_METAL=ON. Benchmarks ran with llama-bench, 3 reps per cell, flash-attn on, mlock on. The full sweep took 8 hours of wall-clock time overnight.
Generation Throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 89.4 | 87.4 | 79.5 | 79.7 |
| 8K | 84.2 | 79.2 | 72.2 | 71.2 |
| 32K | 72.6 | 67.8 | 61.5 | 61.8 |
| 128K | 44.4 | 40.7 | 36.0 | 37.7 |
| 256K | OOM | 26.6 | 22.9 | 25.5 |
| 512K | OOM | OOM | 13.3 | 16.0 |
| 1M | OOM | OOM | 6.5 | OOM |
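
To make the decay with depth easier to compare across cache types, the sketch below expresses each decode figure as a percentage of that cache type's own depth-0 throughput. The numbers are copied from the table above; the names DEPTHS, DECODE, and retained are illustrative and do not come from the original post.

```python
# Decode throughput in tok/s, copied from the table above; None marks an OOM cell.
DEPTHS = ["0", "8K", "32K", "128K", "256K", "512K", "1M"]
DECODE = {
    "f16":    [89.4, 84.2, 72.6, 44.4, None, None, None],
    "q8_0":   [87.4, 79.2, 67.8, 40.7, 26.6, None, None],
    "turbo3": [79.5, 72.2, 61.5, 36.0, 22.9, 13.3, 6.5],
    "turbo4": [79.7, 71.2, 61.8, 37.7, 25.5, 16.0, None],
}


def retained(cache_type):
    """Percent of depth-0 decode throughput kept at each depth that fit in memory."""
    rates = DECODE[cache_type]
    base = rates[0]
    return {d: round(100 * r / base, 1) for d, r in zip(DEPTHS, rates) if r is not None}


for name in DECODE:
    print(name, retained(name))
```

By this measure f16 keeps roughly half of its depth-0 decode speed at 128K, while turbo3 keeps about 8% at 1M.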
Prompt Processing Throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 2962 | 2948 | 2904 | 2854 |
| 8K | 2098 | 1623 | 1653 | 1439 |
| 32K | 1063 | 802 | 784 | 678 |
| 128K | 321 | 245 | 253 | 206 |
| 256K | OOM | 124 | 128 | 101 |
| 512K | OOM | OOM | 66 | 56 |
| 1M | OOM | OOM | 30 | OOM |
Key Takeaways
- At depth 0, f16 leads prefill by a hair (2962 vs 2948 tok/s for q8_0); turbo3 is ~11% slower than f16 on decode (79.5 vs 89.4 tok/s).
- At 128K, turbo3 prefill (253 tok/s) matches or slightly beats q8_0 (245 tok/s), though f16 still leads at 321 tok/s. The smaller cache reduces bandwidth pressure.
- At 256K, turbo3 leads turbo4 on prefill by 27% (128 vs 101 tok/s), while turbo4 leads on decode by 11% (25.5 vs 22.9 tok/s). At 512K, the decode gap widens to 20% (turbo4 16.0 vs turbo3 13.3 tok/s).
- turbo3 is the only cache type that fits 1M context (6.5 tok/s decode). Memory at 1M: ~89 GB (37 GB weights, ~52 GB KV cache).
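
The percentages and the memory figure in these takeaways can be checked with a few lines of arithmetic. This is a minimal sketch using only numbers quoted above; the helper name pct_gain is made up for illustration, and the per-token figure assumes that 1M means 1,000,000 tokens and that GB means 10^9 bytes, neither of which the source states.

```python
def pct_gain(a, b):
    """Whole-number percent by which throughput a exceeds throughput b."""
    return round(100 * (a / b - 1))


# turbo3 vs turbo4 at 256K and 512K, values from the tables above.
print(pct_gain(128, 101))    # prefill at 256K, turbo3 over turbo4 -> 27
print(pct_gain(25.5, 22.9))  # decode at 256K, turbo4 over turbo3 -> 11
print(pct_gain(16.0, 13.3))  # decode at 512K, turbo4 over turbo3 -> 20

# Memory at 1M context: the reported weights plus the reported KV cache.
weights_gb, kv_gb = 37, 52
total_gb = weights_gb + kv_gb  # 89, matching the ~89 GB quoted above

# Assumption: 1M = 1,000,000 tokens and 1 GB = 1e9 bytes.
kv_kb_per_token = kv_gb * 1e9 / 1_000_000 / 1e3
print(total_gb, kv_kb_per_token)  # 89 52.0
```

Under those assumptions the turbo3 cache works out to roughly 52 KB per token of context.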
Workload Recommendations
- Coding agents (deep context, many generated tokens): turbo4
- RAG / batch QA (heavy prefill, short answers): turbo3
- 1M context: turbo3 only
- Short interactive (<32K): f16 if it fits, else q8_0
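
The turbo3 vs turbo4 split above follows from how a request divides between prefill and decode. The sketch below estimates per-request time at 256K depth from the table rates. It is a rough model with assumptions that are not from the source: throughput is treated as constant for the whole request (in practice depth grows as tokens are added), and the names RATES_256K, est_seconds, and faster are illustrative.

```python
# Prefill (pp) and decode (tg) throughput in tok/s at 256K depth, from the tables above.
RATES_256K = {
    "turbo3": {"pp": 128, "tg": 22.9},
    "turbo4": {"pp": 101, "tg": 25.5},
}


def est_seconds(cache_type, new_prompt_tokens, gen_tokens):
    """Rough request time: prefill time plus decode time at fixed 256K-depth rates."""
    r = RATES_256K[cache_type]
    return new_prompt_tokens / r["pp"] + gen_tokens / r["tg"]


def faster(new_prompt_tokens, gen_tokens):
    """Cache type with the lower estimated time for this prompt/generation mix."""
    return min(RATES_256K, key=lambda c: est_seconds(c, new_prompt_tokens, gen_tokens))


# RAG-style request: many new prompt tokens, short answer.
print(faster(4000, 200))   # turbo3
# Agent-style request: few new prompt tokens, long generation.
print(faster(500, 4000))   # turbo4
```

With these rates the break-even sits near 0.47 generated tokens per new prompt token; beyond that ratio turbo4 comes out ahead in this model.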
Caveats
These results come from a single M5 Max. Crossover points will likely shift with memory bandwidth and GPU core count. Only symmetric K/V was tested; asymmetric combos (e.g., -ctk q8_0 -ctv turbo4) were not benched. TheTom's fork is research-grade and is not in upstream llama.cpp main.
📖 Read the full source: r/LocalLLaMA