Qwen 35B-A3B as always-on agent on 16GB M4 Mac: disk I/O fails before RAM

Running a Qwen 35B-A3B MoE model as an always-on agent on a 16GB M4 Mac Mini (base spec) seemed plausible on paper: with llama.cpp --mmap and --flash-attn, the IQ3_XXS quant (12GB on disk) keeps resident RAM at 4–6GB via expert paging and delivers ~17 tok/s with --threads 8 --ctx-size 4096. As a batch tool, it works on this box. But scaling it to a continuous agentic loop alongside Claude Code (Opus/Sonnet) and Codex CLI collapsed, and the bottleneck was disk, not RAM.
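The figures above already hint at why disk becomes the pressure point: if the file is 12GB and only 4–6GB stays resident, the rest has to come off the SSD whenever a non-resident expert is selected. A rough sketch of that arithmetic in Python, using only the numbers quoted here (the function name is illustrative, not from the original post):

```python
def paged_range_gb(file_gb, resident_low_gb, resident_high_gb):
    """Return (min, max) GB of the model file that is NOT resident in RAM."""
    # More resident RAM means less left on disk, so the bounds swap.
    return (file_gb - resident_high_gb, file_gb - resident_low_gb)


# Figures from the post: 12GB quant on disk, 4-6GB resident.
low, high = paged_range_gb(12, 4, 6)
print(f"{low}-{high} GB of the quant must be paged in from disk on demand")
```

For a one-off batch run that paging is tolerable; in a continuous loop it competes with everything else on the box that touches the SSD.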
The setup that broke
- Ollama daemon serving qwen3.5:9b + qwen3.5:4b (config: OLLAMA_MAX_LOADED_MODELS=2, OLLAMA_KEEP_ALIVE=10m, OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0)
- llama-server for the 35B on its own port
- LiteLLM bridge proxying everything as a Claude-compatible endpoint on :4000
- One or two Claude Code sessions
- Codex CLI session
- Usual home-server cron, watchers, mail queue
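The Ollama settings in the list above are plain environment variables, so one way to reproduce that part of the setup is to launch the daemon with them set. A minimal sketch, assuming `ollama` is on the PATH; the helper names are hypothetical and only the four variable values come from the post:

```python
import os
import subprocess

# Values copied from the setup described above.
OLLAMA_SETTINGS = {
    "OLLAMA_MAX_LOADED_MODELS": "2",
    "OLLAMA_KEEP_ALIVE": "10m",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def ollama_env(base=None):
    """Return a copy of the environment with the Ollama settings merged in."""
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_SETTINGS)
    return env


def start_ollama():
    """Start `ollama serve` with the settings applied (assumes ollama is installed)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

Calling `start_ollama()` returns the process handle; how the original author actually started the daemon is not stated.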
What failed
Continuous mmap paging from the 35B, Claude Code's file-watcher/indexer, and Codex holding context added up to constant SSD contention. The Mac started rebooting spontaneously (with no crash logs turning up in log show --predicate 'eventMessage CONTAINS "panic"'), and background cron jobs missed their windows by 5+ minutes, then quietly failed. Known issues compound this: the Claude Code and Codex CLIs have open bugs for memory growth in long sessions (#22968), idle CPU pegging (#19393), and accumulating processes (#11122). With one harness the problem is invisible; with two plus a paging 35B doing real loops, disk dies first.
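The panic check mentioned above can be scripted so it runs after every unexpected reboot. A small sketch that wraps the same `log show` predicate; the `--last` time window is an addition not found in the post, and the function names are hypothetical:

```python
import subprocess

# Predicate quoted in the post.
PANIC_PREDICATE = 'eventMessage CONTAINS "panic"'


def panic_log_command(last="1d"):
    """Build the macOS `log show` argv; `last` limits the window (e.g. "2h", "1d")."""
    return ["log", "show", "--predicate", PANIC_PREDICATE, "--last", last]


def panic_log_output(last="1d"):
    """Run the query (macOS only) and return whatever `log show` prints."""
    result = subprocess.run(
        panic_log_command(last), capture_output=True, text=True, check=False
    )
    return result.stdout
```

An empty result here matches what the post reports: the reboots left no panic entries behind, which is part of what made the failure hard to attribute.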
Stable workaround
- 35B llama-server LaunchDaemon disabled (plist renamed with a .disabled suffix)
- 24GB reclaimed by deleting the 35B GGUF and an old 26B Gemma
- All Anthropic-shaped routes go to Ollama: qwen3.5:9b for opus/sonnet, qwen3.5:4b for haiku
- Both Metal-resident via Ollama (~3GB GPU + 0.5GB CPU each), evict cleanly on idle
- LiteLLM moved to a proper user LaunchAgent (KeepAlive=true, ThrottleInterval=30); it had been a bare python -m litellm process for 7 days
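For the LiteLLM step, a user LaunchAgent is just a property list with the two keys named above. The sketch below generates one with Python's `plistlib`; the label `com.example.litellm` is a placeholder, and the bare `python` in `ProgramArguments` mirrors the command quoted in the post (in practice launchd will probably need an absolute interpreter path):

```python
import plistlib


def litellm_agent_plist(label="com.example.litellm"):
    """Return XML plist bytes for a LaunchAgent that keeps LiteLLM running."""
    job = {
        "Label": label,  # hypothetical label, not from the post
        "ProgramArguments": ["python", "-m", "litellm"],  # command as quoted in the post
        "KeepAlive": True,       # restart the process if it exits
        "ThrottleInterval": 30,  # wait at least 30 seconds between restarts
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    print(litellm_agent_plist().decode("utf-8"))
```

User agents conventionally live under ~/Library/LaunchAgents/; the exact file name and load command the author used are not given.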
The takeaway
The 35B-A3B-as-agent-loop dream is alive on a different class of box. On unified 16GB, it's a single-purpose batch tool, not an always-on layer. The author estimates 32GB unified memory minimum for sustained MoE agent inference without swap pain or daemon contention.
If you've got a trick for running it sustainably on 16GB without disk contention, the thread on r/LocalLLaMA is still active.
📖 Read the full source: r/LocalLLaMA