Qwen3.5-397B MoE Runs on 14GB RAM via Paged Expert Loading on M1 Ultra

A Reddit post by u/ur_dad_matt (via Claude) demonstrates a custom paged MoE engine that runs Qwen3.5-397B-A17B (209GB on disk, 512 experts, top-10 routing) on an M1 Ultra Mac Studio with 64GB of RAM, peaking at 14GB of RAM and generating 1.59 tok/s. The model is too large to load naively, so the engine keeps only K=20 experts resident in RAM, lazily pages the rest in from SSD when the router requests them, and evicts experts under cache pressure. The engine is MLX-based and native to Apple Silicon; compute runs in Float16, which the author reports is faster than ternary on MPS.
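The paging scheme described above can be sketched as a small cache in front of a loader. This is a minimal illustration, not the author's engine: the `ExpertCache` class, the `load_fn` callback, and the least-recently-used eviction policy are all assumptions, since the post (as summarized here) only says experts are evicted "under cache pressure".

```python
from collections import OrderedDict
from typing import Callable, List


class ExpertCache:
    """Keeps at most `capacity` experts resident and loads the rest on demand.

    Hypothetical sketch: eviction is least-recently-used, which is an
    assumption and not something the source specifies.
    """

    def __init__(self, capacity: int, load_fn: Callable[[int], object]):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical: reads one expert's weights from SSD
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> object:
        if expert_id in self.resident:
            # Cache hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Cache miss: page the expert in, then evict if over capacity.
        self.misses += 1
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # drop the least recently used
        return weights

    def fetch(self, routed_ids: List[int]) -> List[object]:
        """Return weights for every expert the router selected."""
        return [self.get(i) for i in routed_ids]


def fake_load(expert_id: int) -> str:
    # Stand-in for a disk read; a real loader would return tensors.
    return f"weights-{expert_id}"


cache = ExpertCache(capacity=20, load_fn=fake_load)
cache.fetch(list(range(0, 10)))    # cold start: 10 misses
cache.fetch(list(range(5, 15)))    # overlaps the previous step: 5 hits, 5 misses
cache.fetch(list(range(15, 25)))   # 10 misses, evicts the 5 oldest experts
print(cache.hits, cache.misses, len(cache.resident))
```

With a capacity of 20 and 10 experts routed per step, consecutive steps that share experts avoid disk reads, while steps that route to entirely new experts pay a load for each one.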
Benchmark results from a 5-prompt sweep on M1 Ultra 64GB:
- Speed: 1.59 tok/s (mean across 5 coherent generations, K=20)
- Cache RSS peak (generation): 7.91 GB
- Total RSS peak: 14.04 GB
- Coherent outputs: 5/5
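To put the memory figures in proportion, the arithmetic below relates the reported peak RSS to the 209GB on-disk size. Only the three input numbers come from the post; the derived ratios are plain arithmetic, and this summary does not break down what the non-cache remainder contains.

```python
disk_gb = 209.0        # model size on disk, from the post
total_rss_gb = 14.04   # total RSS peak, from the benchmark
cache_rss_gb = 7.91    # expert cache RSS peak, from the benchmark

disk_to_ram = disk_gb / total_rss_gb          # how many times larger the model is than peak RAM
resident_fraction = total_rss_gb / disk_gb    # share of the model's bytes resident at peak
non_cache_gb = total_rss_gb - cache_rss_gb    # everything in RSS that is not expert cache

print(f"disk-to-RAM ratio: {disk_to_ram:.1f}x")
print(f"resident fraction: {resident_fraction:.1%}")
print(f"non-cache RSS:     {non_cache_gb:.2f} GB")
```

Under these numbers the model on disk is roughly 15 times larger than the peak resident set, and the expert cache accounts for a little over half of that resident set.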
Best-performing engine config: K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, lazy_load=True. Initial attempts with all experts on disk failed with command-buffer allocation errors until the cache size was tuned.
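One way to keep such settings together is a small config object with sanity checks. The sketch below is hypothetical: the `EngineConfig` class and `config_from_env` helper are not from the post, treating `OUTLIER_MMAP_EXPERTS` as an environment variable is a guess based on its naming, and the rule that K should be at least the router's top-k is an assumption about how K is counted.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical container for the settings reported in the post."""

    k_override: int = 20        # experts kept resident
    cache_gb: float = 8.0       # expert cache budget in GB
    mmap_experts: bool = False  # mirrors OUTLIER_MMAP_EXPERTS=0
    lazy_load: bool = True      # load experts on first use

    def validate(self, router_top_k: int = 10) -> None:
        # Assumption: one routing step's experts should all fit in the cache.
        if self.k_override < router_top_k:
            raise ValueError("k_override is smaller than the router's top-k")
        if self.cache_gb <= 0:
            raise ValueError("cache_gb must be positive")


def config_from_env(env: Optional[Mapping[str, str]] = None) -> EngineConfig:
    """Build a config, reading the mmap flag from the environment (assumed)."""
    env = os.environ if env is None else env
    return EngineConfig(mmap_experts=env.get("OUTLIER_MMAP_EXPERTS", "0") == "1")


cfg = config_from_env({"OUTLIER_MMAP_EXPERTS": "0"})
cfg.validate()
print(cfg)
```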
The author argues that raw benchmark scores miss the point for local LLMs on 64GB hardware; the key metric is MMLU per GB of RAM. At 1.59 tok/s the model runs at "thinking pace", not chat pace, and serves to demonstrate the upper bound of the model-to-memory ratio.
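To make "thinking pace" concrete, the snippet below converts throughput into wall-clock time for a single response. The 300-token response length is an assumed example, not a figure from the post; the two speeds are the reported 397B figure and the reported 27B Core figure.

```python
def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to generate num_tokens at a steady decode rate."""
    return num_tokens / tok_per_s


response_tokens = 300  # assumed response length, for illustration only

slow = generation_seconds(response_tokens, 1.59)   # 397B paged engine
fast = generation_seconds(response_tokens, 40.7)   # 27B Core, MLX-4bit

print(f"397B: {slow:.0f} s ({slow / 60:.1f} min)")
print(f"27B:  {fast:.1f} s")
```

At the reported rate, a response of that length takes about three minutes on the 397B model, compared with a few seconds on the 27B model.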
Speeds for the smaller quantized models on the same hardware (MLX-4bit), with the 397B result repeated for comparison:
- 4B Nano: 71.7 tok/s
- 9B Lite: 53.4 tok/s
- 26B-A4B Quick: 14.6 tok/s
- 27B Core: 40.7 tok/s (MMLU 0.851 n=14042 σ=0.003, HumanEval 0.866 n=164 σ=0.027)
- 35B-A3B Vision: 64.1 tok/s
- 397B Plus: 1.59 tok/s
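The σ values quoted for 27B Core are consistent with a binomial standard error, sqrt(p(1-p)/n), computed from the score and the number of questions. That reading is an inference from the numbers; the source does not say how σ was calculated. A quick check:

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured over n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


mmlu_se = binomial_se(0.851, 14042)
humaneval_se = binomial_se(0.866, 164)

print(f"MMLU      σ ≈ {mmlu_se:.3f}")
print(f"HumanEval σ ≈ {humaneval_se:.3f}")
```

Both values round to the reported figures (0.003 and 0.027), which also shows why the HumanEval score is much less precise: it is measured on 164 problems rather than 14,042 questions.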
The runtime is built with Tauri + Rust + MLX for macOS. Free tiers (Nano and Lite) are available forever at outlier.host. A video demo is included in the Reddit post.
📖 Read the full source: r/LocalLLaMA