Running GLM 5.1 & Kimi K2.6 on Mac Studio: Quantized Local LLMs

Over on r/LocalLLaMA, user ezyz posted their Mac Studio local LLM loadout as of May 2026, running on an M3 Ultra with 512GB unified memory. The post is a day-to-day vibe check rather than a rigorous benchmark, but it's full of practical observations for anyone running large models locally for coding with Claude Code.

Current active models and performance

GLM 5.1 is the biggest winner. Quantized, it fits in ~380GB with max context, leaving room for other tasks. Decode speed is ~17 t/s, prefill ~190 t/s. For coding via Claude Code, the author trusts it up to a 6/10 on task complexity (10 being 'brownfield legacy codebase + vague spec'). It handles self-contained, semi-scoped problems consistently, with occasional help from API-hosted Claude for planning or cleanup.

Kimi K2.6 is in the same tier — not obviously better or worse — but is larger. Even aggressively quantized, it uses ~460GB, leaving little for other experiments. It's faster: prefill ~220 t/s, decode ~21 t/s. The friction is needing to unload it for memory-heavy experiments.
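To put the two sets of numbers side by side, here is a rough back-of-the-envelope sketch in Python. It only does arithmetic on the approximate figures quoted above. The workload (a 20,000-token prompt and a 1,000-token reply) is a hypothetical example, not something from the post, and the estimate assumes prefill and decode run at constant speed and ignores memory the OS and other apps need.

```python
from dataclasses import dataclass

TOTAL_MEMORY_GB = 512  # M3 Ultra unified memory, as stated in the post


@dataclass
class ModelProfile:
    name: str
    footprint_gb: float  # approximate quantized footprint quoted in the post
    prefill_tps: float   # prompt-processing speed, tokens per second
    decode_tps: float    # generation speed, tokens per second

    def headroom_gb(self, total_gb: float = TOTAL_MEMORY_GB) -> float:
        """Memory left over for other experiments (ignores OS overhead)."""
        return total_gb - self.footprint_gb

    def request_seconds(self, prompt_tokens: int, output_tokens: int) -> float:
        """Naive latency estimate: prefill time plus decode time."""
        return prompt_tokens / self.prefill_tps + output_tokens / self.decode_tps


glm = ModelProfile("GLM 5.1", footprint_gb=380, prefill_tps=190, decode_tps=17)
kimi = ModelProfile("Kimi K2.6", footprint_gb=460, prefill_tps=220, decode_tps=21)

# Hypothetical workload, chosen only for illustration.
PROMPT_TOKENS = 20_000
OUTPUT_TOKENS = 1_000

for model in (glm, kimi):
    seconds = model.request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS)
    print(
        f"{model.name}: ~{model.headroom_gb():.0f} GB headroom, "
        f"~{seconds:.0f} s for this request"
    )
```

Under these assumptions, GLM 5.1 leaves about 132GB free against roughly 52GB for Kimi K2.6, while Kimi finishes the example request in about 139 seconds versus about 164 for GLM. That is the trade-off described above: somewhat faster responses in exchange for much less room to run anything else.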

Minimax 2.7 is impressive for its size and speed, but the author rates it only 3-4/10 for dev work. It's an awkward size — GLM and Kimi win on shipping usable code, while smaller models win on assistant tasks like 'summarize this web search'. It does quickly bail out of reasoning for simple requests.

Gemma 4 31B disappointed: MLX support is still messy a month post-release. The 31B dense model isn't much faster than the big MoEs, the official chat template has multiple unaddressed bugs, and patches are still trickling in. The author plans to revisit once MTP/draft support stabilizes.

Qwen 3.6 35B was replaced with Qwen 3.5 9B for multimodal tasks like translating screenshots — it's good enough and fast enough, and handles Claude Code's Haiku background tasks with no noticeable difference, while saving ~14GB memory.

Pending support and future watch

Neither Deepseek 4 Flash nor Mimo 2.5 has officially landed in llama.cpp or mlx-lm yet. The author will try the PRs when time permits. They guess the pro versions of both will be too large and slow for the M3 Ultra: GLM's 40B active parameters are roughly their patience limit.

Eagerly watched projects:

Exo and tinygrad for Mac + NVIDIA clustering and disaggregated prefill
Stable Dflash / DDtree / MTP support
Novel quantization formats (paroquant, JANGTQ) — see llama.cpp PR #21038
Local music generation: Ace Step 1.5 is 'almost good', but the voices aren't there yet

📖 Read the full source: r/LocalLLaMA

