Optimizing GLM-4.7-Flash on M4 Mac Mini with 24GB RAM

Practical Configuration for GLM-4.7-Flash on M4 Hardware
A developer testing OpenClaw and Ollama on an M4 Mac Mini with 24GB RAM has shared the settings that made the GLM-4.7-Flash model usable on that machine. The write-up gives measured memory figures and the configuration parameters that fit within the hardware constraints.
Memory Reality and Model Selection
In testing, the effective GPU memory budget on the M4 Mini was approximately 17.8GB of Metal (GPU-wired) memory, not the full 24GB. The rest is consumed by macOS, applications, and CPU compute. This limit constrains both the choice of quantization and the context size.
- Q4_K_XL quantization (17.5GB GGUF) does not fit at 32k context: Model (14.4GB) + KV (2.8GB) + compute (1.4GB) = 18.6GB, which exceeds the 17.8GB budget → Out of Memory
- Q3_K_XL quantization (13.8GB GGUF) works at 32k context: Model (12.7GB) + KV (3.2GB) + compute (1.4GB), reported as 16.1GB total with 1.7GB headroom. The listed components sum to 17.3GB rather than 16.1GB, so one of the source's figures appears to be off; either total stays under the 17.8GB budget
- Context ceiling is approximately 34k before OOM occurs
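The budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the developer's tooling: the function name `fits_budget` and the dictionary layout are illustrative, and the 17.8GB budget and component sizes are the figures quoted in this section. Totals are computed from the listed components rather than copied, so the computed Q3_K_XL total may differ from the quoted one.

```python
# Rough memory-fit check using the figures quoted in this section.
METAL_BUDGET_GB = 17.8  # effective GPU-wired budget reported for the 24GB M4 Mini


def fits_budget(model_gb, kv_gb, compute_gb, budget_gb=METAL_BUDGET_GB):
    """Return (total_gb, headroom_gb, fits) for one configuration."""
    total = round(model_gb + kv_gb + compute_gb, 1)
    headroom = round(budget_gb - total, 1)
    return total, headroom, headroom >= 0


# Component sizes as listed in the bullets above.
configs = {
    "Q4_K_XL @ 32k": dict(model_gb=14.4, kv_gb=2.8, compute_gb=1.4),
    "Q3_K_XL @ 32k": dict(model_gb=12.7, kv_gb=3.2, compute_gb=1.4),
}

for name, parts in configs.items():
    total, headroom, ok = fits_budget(**parts)
    status = "fits" if ok else "out of memory"
    print(f"{name}: {total:.1f}GB total, {headroom:+.1f}GB headroom -> {status}")
```

The same function can be reused for other quantizations by substituting measured model, KV, and compute-buffer sizes.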
Configuration Details
The successful setup uses:
- Model: unsloth/GLM-4.7-Flash-GGUF from Hugging Face
- Quantization: Q3_K_XL
- Context size: 32k with MLA (Multi-Head Latent Attention)
- KV cache implementation: llama.cpp's v-less KV cache (PR #19067, Jan 2026) triggered by GGUF metadata (key_length_mla, kv_lora_rank)
- Build requirement: llama.cpp b7860+
The MLA implementation reduces KV memory usage significantly: the KV cache at 32k context is only 3.2GB instead of 13GB.
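To put the two KV cache figures in proportion, the short sketch below derives the relative saving and an average per-token cost. It assumes that "32k" means 32,768 tokens and that the sizes are decimal gigabytes; neither is stated in the source, and the helper names are illustrative.

```python
def kv_reduction(mla_gb, baseline_gb):
    """Fraction of KV-cache memory saved relative to the baseline."""
    return 1.0 - mla_gb / baseline_gb


def kv_bytes_per_token(kv_gb, ctx_tokens):
    """Average KV-cache bytes per context token (decimal GB assumed)."""
    return kv_gb * 1e9 / ctx_tokens


CTX_32K = 32 * 1024  # assumption: "32k" means 32,768 tokens

saved = kv_reduction(mla_gb=3.2, baseline_gb=13.0)
per_token = kv_bytes_per_token(3.2, CTX_32K)

print(f"KV memory saved: {saved:.0%}")
print(f"KV cost per token: {per_token / 1e3:.0f} KB")
```

By these figures the v-less MLA cache uses roughly a quarter of the memory of the 13GB baseline.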
Framework-Specific Considerations
Agentic frameworks like OpenClaw have internal context thresholds that affect performance:
- OpenClaw triggers aggressive compaction below 32k context
- Increasing context from 20k to 32k reduced startup time from 5 minutes to 2 minutes 17 seconds
- Compaction passes dropped from 2 to 1 when matching num_ctx to framework thresholds
- num_ctx must be baked into the Ollama Modelfile, because it is ignored at the request level when OpenClaw and other orchestrators go through Ollama's OpenAI-compatible API
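One way to bake the context size in is to generate a Modelfile with a `PARAMETER num_ctx` line. The sketch below builds that text in Python. `FROM` and `PARAMETER num_ctx` are standard Modelfile instructions; the helper name `build_modelfile` and the GGUF file name are hypothetical placeholders, not taken from the source.

```python
def build_modelfile(gguf_path: str, num_ctx: int = 32768) -> str:
    """Return Modelfile text that pins the context window for a local GGUF."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    # FROM points at the weights; PARAMETER num_ctx fixes the context size
    # so clients that cannot set it per request still get the intended value.
    return f"FROM {gguf_path}\nPARAMETER num_ctx {num_ctx}\n"


# Hypothetical local file name; use the path of the GGUF you downloaded.
modelfile_text = build_modelfile("./GLM-4.7-Flash-Q3_K_XL.gguf")
print(modelfile_text, end="")
```

Saving the output as `Modelfile` and running `ollama create glm-flash-32k -f Modelfile` (the model name is an arbitrary example) registers a model whose context size no longer depends on the client.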
Performance Testing Data
The developer provided specific timing data for various tasks:
| Task | Time | Input Tokens | Compactions | Result |
|---|---|---|---|---|
| Personality intro | 119s | ~13,900 | 2 | ✅ |
| Profile recall | 60s | 13,247 | 2 | ✅ w/ caveat |
| Task creation | 61s | 13,375 | 2 | ✅ |
| Memory write | 165s | 14,448 | 2 | ✅ |
| Memory recall | 89s | 14,085 | 2 | ✅ |
| Web search + synthesis | 273s | 18,668 | 2 | ✅ |
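The timing data can be condensed into an end-to-end rate per task. The sketch below is illustrative only: each time covers the whole task (prompt processing, compaction, and generation), so the result is a coarse comparison between tasks, not a prefill speed. The variable and function names are assumptions.

```python
# (task, seconds, input_tokens) copied from the timing data above.
runs = [
    ("Personality intro", 119, 13900),  # token count is approximate in the source
    ("Profile recall", 60, 13247),
    ("Task creation", 61, 13375),
    ("Memory write", 165, 14448),
    ("Memory recall", 89, 14085),
    ("Web search + synthesis", 273, 18668),
]


def tokens_per_second(seconds, input_tokens):
    """End-to-end input tokens handled per second of wall-clock time."""
    return input_tokens / seconds


total_seconds = sum(seconds for _, seconds, _ in runs)
slowest = max(runs, key=lambda run: run[1])

for task, seconds, tokens in runs:
    print(f"{task:<24} {tokens_per_second(seconds, tokens):6.1f} tok/s")
print(f"Total wall-clock time: {total_seconds}s, slowest task: {slowest[0]}")
```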
MLX Considerations
The developer notes that MLX and GGUF are different formats, so Unsloth/bartowski GGUF files cannot run with mlx-lm. Currently, no 3-bit Flash model exists in the mlx-community repository; only 4-bit models are available.
📖 Read the full source: r/openclaw