Flash-MoE: Running 397B Parameter Qwen Model on MacBook Pro with Pure C/Metal

✍️ OpenClawRadar📅 Published: March 22, 2026🔗 Source
Flash-MoE: Running 397B Parameter Qwen Model on MacBook Pro with Pure C/Metal
Ad

Technical Implementation

Flash-MoE runs Qwen3.5-397B-A17B, a 397 billion parameter Mixture-of-Experts model with 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, with K=4 activated per token plus one shared expert. Hidden dimension is 4096.

Performance Benchmarks

  • 4-bit experts, FMA kernel: 4.36 tokens/second, excellent quality, full tool calling, 209GB on disk (current best)
  • 4-bit experts, baseline: 3.90 tokens/second, excellent quality
  • 2-bit experts, trust OS: 5.74 tokens/second, good quality, 120GB on disk (breaks JSON/tool calling)
  • 2-bit peak single token: 7.05 tokens/second, good quality (not suitable for tool use)

Note: 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Hardware Requirements

  • Machine: MacBook Pro with Apple M3 Max
  • Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
  • Memory: 48 GB unified (~400 GB/s bandwidth)
  • SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
  • macOS: 26.2 (Darwin 25.2.0)
Ad

Key Techniques

SSD Expert Streaming

Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching with no custom cache needed ("Trust the OS" principle), achieving ~71% hit rate naturally.

FMA-Optimized Dequant Kernel

The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction, resulting in 12% faster performance than the naive formulation.

Metal Compute Shaders

Hand-written Metal kernels include:

  • 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
  • Fused SwiGLU activation
  • RMS normalization (two-pass: sum-of-squares reduction + apply)
  • Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
  • GPU RoPE (fused with Q deinterleave and K normalization)
  • MoE combine + residual + sigmoid gate (fused kernel)

Deferred GPU Expert Compute

CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer's attention projections.

Accelerate BLAS for Linear Attention

The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update, achieving 64% faster performance than scalar code.

Pipeline Performance

Per layer average at 4-bit: 4.28ms

  • CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
  • CPU: flush results [0.01ms CPU]
  • CMD2: o_proj + norm + routing + shared [0.55ms GPU]
  • CPU: softmax + topK routing [0.003ms]
  • I/O: parallel pread K=4 experts [2.41ms SSD]
  • CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]

Architecture Constraints

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration, requiring a serial pipeline.

📖 Read the full source: HN AI Agents

Ad

👀 See Also