Qwen3.5-397B MoE Runs on 14GB RAM via Paged Expert Loading on M1 Ultra

✍️ OpenClawRadar · 📅 Published: May 7, 2026 · 🔗 Source

A Reddit post by u/ur_dad_matt (via Claude) demonstrates a custom paged MoE engine that runs Qwen3.5-397B-A17B (209GB on disk, 512 experts, top-10 routing) on an M1 Ultra Mac Studio with 64GB of RAM, at a peak of about 14GB resident memory and 1.59 tok/s. The model is too large to load in full on this machine, so the engine keeps only K=20 experts resident in RAM, lazily pages the rest from SSD as the router requests them, and evicts experts under cache pressure. The engine is MLX-based and runs natively on Apple Silicon; compute uses Float16, which is reported to be faster than ternary on MPS.
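The paging idea can be sketched in a few lines of Python. This is only an illustration, not the author's engine: the post does not say which eviction policy is used, so least-recently-used (LRU) is an assumption here, and `PagedExpertCache` and `load_fn` are hypothetical names standing in for whatever reads one expert's weights from SSD.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class PagedExpertCache:
    """Keeps at most `capacity` experts resident and loads the rest on demand.

    Assumption: eviction is least-recently-used. The source only says experts
    are evicted "under cache pressure".
    """

    def __init__(self, capacity: int, load_fn: Callable[[int], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical: reads one expert's weights from disk
        self._resident: "OrderedDict[int, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> object:
        """Return one expert's weights, paging them in on a cache miss."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # evict the least recently used
        return weights

    def fetch(self, routed_ids: Iterable[int]) -> List[object]:
        """Return weights for every expert the router selected for a token."""
        return [self.get(i) for i in routed_ids]

    def resident_ids(self) -> List[int]:
        """Expert ids currently in RAM, least recently used first."""
        return list(self._resident)
```

With `capacity=20` and top-10 routing, a token whose routed experts overlap heavily with the previous token's is served mostly from RAM, while a token that routes to ten unseen experts pays ten SSD reads. The `hits` and `misses` counters make that trade-off visible when tuning the capacity.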

Benchmark results from a 5-prompt sweep on M1 Ultra 64GB:

  • Speed: 1.59 tok/s (mean across 5 coherent generations, K=20)
  • Cache RSS peak (generation): 7.91 GB
  • Total RSS peak: 14.04 GB
  • Coherent outputs: 5/5

The optimal engine configuration was K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, lazy_load=True. Initial attempts with all experts on disk caused command-buffer allocation failures until the cache size was tuned.

The author argues that raw benchmark scores miss the point for local LLMs on 64GB hardware; the key metric is MMLU per GB of RAM. At 1.59 tok/s the model runs at "thinking pace," not chat pace, and the run is presented as a demonstration of the upper bound of the model-to-memory ratio.
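The ratios behind that argument can be worked out from the figures reported in the post. The sketch below uses only those numbers; `score_per_gb` is a hypothetical helper that writes down the "score per GB of RAM" metric, and it is not applied to any model here because the post does not pair an MMLU score with a RAM figure for the same model.

```python
def score_per_gb(score: float, peak_ram_gb: float) -> float:
    """Benchmark score divided by peak RAM in GB (the 'MMLU per GB RAM' idea)."""
    if peak_ram_gb <= 0:
        raise ValueError("peak_ram_gb must be positive")
    return score / peak_ram_gb


# Figures reported for the 397B run.
MODEL_ON_DISK_GB = 209.0
TOTAL_RSS_PEAK_GB = 14.04
CACHE_RSS_PEAK_GB = 7.91

# How many times larger the model on disk is than the peak resident memory.
disk_to_ram_ratio = MODEL_ON_DISK_GB / TOTAL_RSS_PEAK_GB

# Share of peak resident memory taken up by the expert cache.
cache_share = CACHE_RSS_PEAK_GB / TOTAL_RSS_PEAK_GB

print(f"disk-to-RAM ratio: {disk_to_ram_ratio:.1f}x")
print(f"expert cache share of peak RSS: {cache_share:.0%}")
```

On these numbers the model on disk is roughly fifteen times larger than the memory it occupied at peak, and the expert cache accounts for a little over half of that peak.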


Speeds for smaller quantized models on the same hardware (MLX-4bit), with the 397B result for comparison:

  • 4B Nano: 71.7 tok/s
  • 9B Lite: 53.4 tok/s
  • 26B-A4B Quick: 14.6 tok/s
  • 27B Core: 40.7 tok/s (MMLU 0.851 n=14042 σ=0.003, HumanEval 0.866 n=164 σ=0.027)
  • 35B-A3B Vision: 64.1 tok/s
  • 397B Plus: 1.59 tok/s
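Two quick checks help in reading this list. The first assumes the reported σ values are binomial standard errors, sqrt(p(1-p)/n); the post does not say how σ was computed, so this is only a consistency check. The second converts tok/s into wall-clock time for a reply of 200 tokens, a length chosen purely for illustration.

```python
import math


def binomial_std_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tok_per_s


# Reported for 27B Core: MMLU 0.851 (n=14042), HumanEval 0.866 (n=164).
mmlu_se = binomial_std_error(0.851, 14042)
humaneval_se = binomial_std_error(0.866, 164)
print(f"MMLU standard error:      {mmlu_se:.3f}")
print(f"HumanEval standard error: {humaneval_se:.3f}")

# Assumption: a 200-token reply, used only to compare pacing.
REPLY_TOKENS = 200
for name, speed in [("4B Nano", 71.7), ("27B Core", 40.7), ("397B Plus", 1.59)]:
    print(f"{name}: {seconds_for(REPLY_TOKENS, speed):.1f} s for {REPLY_TOKENS} tokens")
```

Under the binomial assumption both values round to the σ figures quoted in the list (0.003 and 0.027). The timing loop shows the gap between tiers: a reply that takes a few seconds on the smaller models takes about two minutes at 1.59 tok/s.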

The runtime is built with Tauri + Rust + MLX for macOS. Free tiers (Nano and Lite) are available forever at outlier.host. A video demo is included in the Reddit post.

📖 Read the full source: r/LocalLLaMA

