FOMOE Enables 397B Qwen3.5 Model Inference on $2,100 Desktop Hardware

By OpenClawRadar | Published: March 29, 2026

What FOMOE Solves

Large Mixture of Experts (MoE) models need hundreds of gigabytes of weight storage, which on consumer hardware usually means flash storage such as an NVMe drive. During inference only a small fraction of the weights is needed for each token, but there is no way to predict which ones ahead of time. The resulting random access pattern makes flash latency too high for practical inference on consumer hardware.

How FOMOE Works

The system makes most expert weight reads unnecessary through several techniques:

Keeps the most frequently used experts in GPU memory (VRAM) in a rolling expert cache that is continuously updated
Achieves a 60% VRAM hit rate with a warm start; a further 12% of reads are served from DRAM, leaving 28% for NVMe
Uses a dual-GPU ping-pong architecture to overlap weight loading with compute
Implements Cache-Aware Routing (CAR): when an expert that is already in the VRAM or DRAM cache scores within an acceptable threshold of the top-scoring expert, the model picks the cached expert instead of reading the top-scoring one from NVMe
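The Cache-Aware Routing idea in the list above can be illustrated with a short sketch. This is not FOMOE's code (the project is written in C/HIP); the function name `cache_aware_top_k`, the `threshold` parameter, and the exact substitution rule are assumptions made for illustration. The sketch takes the usual top-k experts by router score, then swaps any uncached pick for the best-scoring cached expert whose score is within `threshold` of it.

```python
from typing import List, Sequence, Set


def cache_aware_top_k(
    scores: Sequence[float],
    cached: Set[int],
    k: int,
    threshold: float,
) -> List[int]:
    """Select k experts, preferring cached ones that score almost as well.

    scores    -- router score per expert (higher is better)
    cached    -- hypothetical set of expert ids already in VRAM or DRAM
    threshold -- hypothetical maximum score gap tolerated for a substitution
    """
    # Rank all experts from best to worst router score.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    selected = set(ranked[:k])
    result = []

    for expert in ranked[:k]:
        if expert in cached:
            # Already resident: no NVMe read needed, keep it.
            result.append(expert)
            continue

        substitute = None
        for candidate in ranked[k:]:
            # Candidates are in descending score order, so once the gap
            # exceeds the threshold no later candidate can qualify.
            if scores[expert] - scores[candidate] > threshold:
                break
            if candidate in cached and candidate not in selected:
                substitute = candidate
                break

        if substitute is None:
            # No close-enough cached expert: fall back to the original pick.
            result.append(expert)
        else:
            selected.discard(expert)
            selected.add(substitute)
            result.append(substitute)

    return result


# Example: expert 1 is not cached, expert 2 is cached and scores 0.03 lower.
example_scores = [0.30, 0.28, 0.25, 0.10, 0.07]
example_cached = {0, 2}
relaxed = cache_aware_top_k(example_scores, example_cached, k=2, threshold=0.05)
strict = cache_aware_top_k(example_scores, example_cached, k=2, threshold=0.01)
```

With the looser threshold the uncached expert 1 is replaced by the cached expert 2; with the stricter threshold the original top-2 selection is kept. A larger threshold trades routing fidelity for fewer flash reads, which is presumably the source of the perplexity cost the article reports.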

Performance Results

5 to 9 tokens/second inference speed on Qwen3.5's 397B-parameter model
NVMe reads fall to 7% with CAR enabled
Only a 3.5% degradation in perplexity, measured on wikitext
Hardware requirements: two $500 GPUs, 32 GB of RAM, one NVMe drive
Uses Q4_K_M quantization
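To make the reported read shares concrete, the sketch below works through the arithmetic. The percentages (60% VRAM, 12% DRAM, 28% NVMe, and 7% NVMe with CAR) come from the article; it is an assumption here that the 28% and 7% figures measure the same quantity. The per-tier costs are hypothetical placeholders in arbitrary relative units, not measurements, and the function names are made up for illustration.

```python
from typing import Dict

# Read shares reported in the article (percent of expert reads, warm start).
SHARE_PERCENT: Dict[str, int] = {"vram": 60, "dram": 12, "nvme": 28}
NVME_PERCENT_WITH_CAR = 7  # also reported in the article

# Hypothetical relative cost of fetching one expert from each tier.
# These are placeholders for illustration, NOT measured latencies.
PLACEHOLDER_COST: Dict[str, float] = {"vram": 0.0, "dram": 1.0, "nvme": 20.0}


def expected_fetch_cost(share_percent: Dict[str, int],
                        cost_per_tier: Dict[str, float]) -> float:
    """Share-weighted average cost of fetching one expert."""
    if sum(share_percent.values()) != 100:
        raise ValueError("tier shares must sum to 100 percent")
    weighted = sum(share_percent[t] * cost_per_tier[t] for t in share_percent)
    return weighted / 100


def nvme_reduction_factor(before_percent: float, after_percent: float) -> float:
    """How many times fewer reads reach NVMe after a change."""
    return before_percent / after_percent


average_cost = expected_fetch_cost(SHARE_PERCENT, PLACEHOLDER_COST)
car_factor = nvme_reduction_factor(SHARE_PERCENT["nvme"], NVME_PERCENT_WITH_CAR)
```

Under these placeholder costs the NVMe tier dominates the average even though it serves the smallest share of reads after VRAM, which is why cutting its share from 28% to 7% (a factor of four, if the two figures are comparable) matters more than the raw percentages suggest.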

The implementation consists of approximately 15,000 lines of Claude-driven C/HIP code with heavy human guidance.

📖 Read the full source: r/LocalLLaMA
