Flash-MoE: Running a 397B-Parameter Qwen Model on a MacBook Pro with Pure C/Metal

Technical Implementation
Flash-MoE runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model with 60 transformer layers: 45 GatedDeltaNet (linear attention) layers and 15 standard full-attention layers. Each layer has 512 experts, of which K=4 are activated per token, plus one shared expert. The hidden dimension is 4096.
Performance Benchmarks
- 4-bit experts, FMA kernel: 4.36 tokens/second, excellent quality, full tool calling, 209GB on disk (current best)
- 4-bit experts, baseline: 3.90 tokens/second, excellent quality
- 2-bit experts, trust OS: 5.74 tokens/second, good quality, 120GB on disk (breaks JSON/tool calling)
- 2-bit peak single token: 7.05 tokens/second, good quality (not suitable for tool use)
Note: 2-bit quantization produces \name\ instead of "name" in JSON output (backslashes in place of the double quotes), which makes tool calling unreliable. 4-bit is the production configuration.
Hardware Requirements
- Machine: MacBook Pro with Apple M3 Max
- Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
- Memory: 48 GB unified (~400 GB/s bandwidth)
- SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
- macOS: 26.2 (Darwin 25.2.0)
Key Techniques
SSD Expert Streaming
Expert weights (209GB at 4-bit) are read from the NVMe SSD on demand via parallel pread() calls issued through GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). Caching is left entirely to the OS page cache (the "Trust the OS" principle); with no custom cache, it reaches a ~71% hit rate.
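As a rough illustration of on-demand expert loading with pread(), here is a minimal sketch. The file layout (experts packed back to back at a fixed size) and the names expert_offset, load_expert, and load_active_experts are assumptions made for this example, not Flash-MoE's actual format or API. The reads are issued serially here; the GCD dispatch-group parallelism described above is left out so the sketch stays portable.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout: experts of one layer are packed back to back,
 * each occupying exactly expert_bytes. The real format may differ. */
off_t expert_offset(int expert_id, size_t expert_bytes) {
    return (off_t)expert_id * (off_t)expert_bytes;
}

/* Read one expert's weights into buf. Returns 0 on success, -1 on failure.
 * pread() takes an explicit offset and does not move the file position,
 * which is why several readers can safely share one descriptor. */
int load_expert(int fd, int expert_id, size_t expert_bytes, void *buf) {
    assert(fd >= 0 && expert_id >= 0 && buf != NULL);
    off_t base = expert_offset(expert_id, expert_bytes);
    size_t done = 0;
    while (done < expert_bytes) {
        ssize_t n = pread(fd, (char *)buf + done, expert_bytes - done,
                          base + (off_t)done);
        if (n <= 0) return -1; /* I/O error or unexpected end of file */
        done += (size_t)n;
    }
    return 0;
}

/* Load the k routed experts of one layer, one after another. */
int load_active_experts(int fd, const int *ids, int k,
                        size_t expert_bytes, void **bufs) {
    for (int i = 0; i < k; i++) {
        if (load_expert(fd, ids[i], expert_bytes, bufs[i]) != 0) return -1;
    }
    return 0;
}
```

Because every call carries its own offset, the same load_expert could be dispatched from several threads at once without any locking around the file descriptor.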
FMA-Optimized Dequant Kernel
The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU's fused multiply-add unit perform dequantization and multiplication in a single instruction, which is 12% faster than the naive formulation.
Metal Compute Shaders
Hand-written Metal kernels include:
- 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
- Fused SwiGLU activation
- RMS normalization (two-pass: sum-of-squares reduction + apply)
- Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
- GPU RoPE (fused with Q deinterleave and K normalization)
- MoE combine + residual + sigmoid gate (fused kernel)
Deferred GPU Expert Compute
CMD3 (the expert forward pass) is submitted without waiting for it to complete. The GPU executes it while the CPU prepares the next layer. The combine, residual, and norm steps also run on the GPU and feed directly into the next layer's attention projections.
Accelerate BLAS for Linear Attention
The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger to update the 128×128 state matrix of each of the 64 heads, which is 64% faster than scalar code.
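The text does not spell out the recurrence, so the following is only a scalar reference sketch of the commonly published gated delta rule for one head and one token, with comments marking where each BLAS call would plausibly fit. The function name, the gate and beta parameters, and the mapping of steps to cblas_sscal, cblas_sgemv, and cblas_sger are assumptions, not taken from the Flash-MoE source.

```c
#include <assert.h>
#include <stddef.h>

/* One head, one token. S is a d x d row-major state matrix.
 *   1. S <- g * S                       (a scaling step, as cblas_sscal does)
 *   2. u <- S * k                       (a matrix-vector product, cblas_sgemv)
 *   3. S <- S + beta * (v - u) * k^T    (a rank-1 update, cblas_sger)
 * tmp is scratch space for d floats. */
void delta_state_update(float *S, const float *k, const float *v,
                        float g, float beta, size_t d, float *tmp) {
    assert(d > 0);
    for (size_t i = 0; i < d * d; i++) S[i] *= g;

    for (size_t r = 0; r < d; r++) {
        float u = 0.0f;
        for (size_t c = 0; c < d; c++) u += S[r * d + c] * k[c];
        tmp[r] = beta * (v[r] - u);
    }

    for (size_t r = 0; r < d; r++) {
        for (size_t c = 0; c < d; c++) S[r * d + c] += tmp[r] * k[c];
    }
}
```

With d = 128, each of the three steps touches the whole 128×128 matrix, which is the kind of dense, regular work that a tuned BLAS handles better than hand-written scalar loops.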
Pipeline Performance
Per layer average at 4-bit: 4.28ms (60 layers × 4.28ms ≈ 257ms per token, or about 3.9 tokens/second)
- CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
- CPU: flush results [0.01ms CPU]
- CMD2: o_proj + norm + routing + shared [0.55ms GPU]
- CPU: softmax + topK routing [0.003ms]
- I/O: parallel pread K=4 experts [2.41ms SSD]
- CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Architecture Constraints
On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration, requiring a serial pipeline.
📖 Read the full source: HN AI Agents