FOMOE Enables 397B Qwen3.5 Model Inference on $2,100 Desktop Hardware

What FOMOE Solves
Large Mixture-of-Experts (MoE) models need hundreds of gigabytes of weight storage, which on consumer machines usually means flash storage such as an NVMe drive. Each token uses only a small fraction of those weights, but which experts it will need cannot be predicted ahead of time. The resulting random reads make flash latency too high for practical inference on consumer hardware.
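To make the storage problem concrete, here is a back-of-the-envelope estimate of the weight footprint. The parameter count and RAM size come from the article; the figure of about 4.8 bits per weight for a 4-bit K-quant is an assumption used only for illustration, since the real average depends on how the model's tensors are quantized.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 397e9        # total parameters, from the article
BITS_PER_WEIGHT = 4.8   # ASSUMPTION: rough average for a 4-bit K-quant
DRAM_GB = 32            # system RAM, from the article

total_gb = weight_footprint_gb(N_PARAMS, BITS_PER_WEIGHT)
print(f"approximate weights: {total_gb:.0f} GB")
print(f"multiple of system RAM: {total_gb / DRAM_GB:.1f}x")
```

Under that assumption the weights come to roughly 238 GB, several times the 32 GB of system RAM, which is why most of them have to live on the NVMe drive.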
How FOMOE Works
The system makes most expert weight reads unnecessary through several techniques:
- Keeps the most frequently used experts in GPU memory (VRAM) in a rolling expert cache that is updated as inference runs
- With a warm start, reaches a 60% VRAM hit rate; a further 12% of expert reads are served from DRAM, leaving 28% to go to NVMe
- Uses a dual-GPU ping-pong architecture to overlap weight loading with compute
- Implements Cache-Aware Routing (CAR): when an expert already in the VRAM or DRAM cache scores close enough to the router's pick (within an acceptable threshold), the model uses the cached expert instead
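The Cache-Aware Routing idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not FOMOE's actual implementation (which is written in C/HIP): the function name `cache_aware_top_k`, the absolute-gap `threshold` rule, and the choice to consider only the best-scoring cached candidate are all assumptions.

```python
from typing import List, Sequence, Set


def cache_aware_top_k(scores: Sequence[float], cached: Set[int],
                      k: int, threshold: float) -> List[int]:
    """Pick k experts, preferring cached ones when the score gap is small.

    Hypothetical rule: start from the ordinary top-k selection. For each
    selected expert that is not cached, look at the best-scoring cached
    expert not yet chosen and swap it in only if its score is within
    `threshold` of the expert it would replace.
    """
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    chosen = ranked[:k]
    taken = set(chosen)
    for i, expert in enumerate(chosen):
        if expert in cached:
            continue  # already cheap to load, keep the router's pick
        for cand in ranked:
            if cand in taken or cand not in cached:
                continue
            # cand is the best remaining cached expert
            if scores[expert] - scores[cand] <= threshold:
                taken.discard(expert)
                taken.add(cand)
                chosen[i] = cand
            break  # weaker cached candidates are never considered
    return chosen


router_scores = [0.30, 0.28, 0.27, 0.10, 0.05]  # made-up router outputs
in_cache = {2, 3}                               # experts resident in VRAM/DRAM
picked = cache_aware_top_k(router_scores, in_cache, k=2, threshold=0.02)
print(picked)
```

In this toy example expert 1 is swapped for cached expert 2 because their scores differ by only 0.01, while expert 0 is kept because no cached expert is within the threshold. A larger threshold trades more quality for fewer NVMe reads.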
Performance Results
- 5 to 9 tokens/second inference speed on Qwen3.5's 397B-parameter model
- NVMe reads drop to 7% with CAR enabled, compared with 28% from caching alone
- Only a 3.5% degradation in perplexity, measured on WikiText
- Hardware requirements: two $500 GPUs, 32GB RAM, one NVMe drive
- Uses Q4_K_M quantization
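Percentages such as the VRAM hit rate and the share of NVMe reads come from counting which tier serves each expert read. The sketch below shows one way to do that accounting with a two-tier cache in front of NVMe. The class name `TieredExpertCache`, the LRU eviction policy, and the slot counts are assumptions for illustration; the article does not describe FOMOE's actual eviction policy.

```python
from collections import Counter, OrderedDict


class TieredExpertCache:
    """Two LRU tiers (VRAM, then DRAM) in front of NVMe, with read counters."""

    def __init__(self, vram_slots: int, dram_slots: int) -> None:
        self.vram = OrderedDict()   # expert id -> None, oldest first
        self.dram = OrderedDict()
        self.vram_slots = vram_slots
        self.dram_slots = dram_slots
        self.served = Counter()     # tier name -> number of reads served

    def fetch(self, expert: int) -> str:
        """Record one expert read and return the tier that served it."""
        if expert in self.vram:
            tier = "vram"
            self.vram.move_to_end(expert)   # mark as most recently used
        else:
            if expert in self.dram:
                tier = "dram"
                del self.dram[expert]
            else:
                tier = "nvme"
            self._promote(expert)
        self.served[tier] += 1
        return tier

    def _promote(self, expert: int) -> None:
        """Place the expert in VRAM, demoting the oldest VRAM entry to DRAM."""
        self.vram[expert] = None
        if len(self.vram) > self.vram_slots:
            evicted, _ = self.vram.popitem(last=False)
            self.dram[evicted] = None
            if len(self.dram) > self.dram_slots:
                self.dram.popitem(last=False)   # falls back to NVMe only

    def fractions(self) -> dict:
        """Share of reads served by each tier so far."""
        total = sum(self.served.values())
        return {t: self.served[t] / total for t in ("vram", "dram", "nvme")}


cache = TieredExpertCache(vram_slots=2, dram_slots=1)
trace = [1, 2, 1, 3, 2, 4, 1]               # made-up expert access trace
tiers = [cache.fetch(e) for e in trace]
print(tiers)
print(cache.fractions())
```

Running a real routing trace through a counter like this is how tier shares such as the VRAM, DRAM, and NVMe percentages quoted in the article could be measured, with or without a routing change like CAR applied to the trace.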
The implementation consists of approximately 15,000 lines of Claude-driven C/HIP code with heavy human guidance.
📖 Read the full source: r/LocalLLaMA