V100 MoE Cluster: 50 tok/s on 122B Model with 4 GPUs

A lawyer running a 12x V100 32GB SXM2 cluster on a Threadripper Pro reports that on Volta GPUs (compute capability 7.0), only MoE models deliver usable decode speeds. Dense models are a trap: even a 27-32B dense model manages only 20-28 tok/s, well below the 40 tok/s he treats as the floor. In contrast, Qwen3.5-122B-A10B (122B total, 10B active) achieves ~50 tok/s on a single 4-GPU NVLink board, and Gemma-4-26B-A4B hits ~113 tok/s. All benchmarks use Q8 GGUF with Q4 KV cache and flash-attention enabled.
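A rough way to see why active parameter count matters so much here: token-by-token decode is typically limited by how fast the weights can be read from GPU memory. The sketch below is a back-of-envelope ceiling, not a measurement from the source. It assumes decode is purely memory-bandwidth bound, that Q8_0 costs about 8.5 bits per weight, that one V100 SXM2 reads HBM2 at roughly 900 GB/s, and that layers are split across the board so only one card's bandwidth is in play at a time. The function name and the 27B dense example are illustrative.

```python
# Back-of-envelope decode ceiling, assuming decode is memory-bandwidth bound.
V100_BANDWIDTH_GBPS = 900.0      # assumed peak HBM2 bandwidth of one V100 SXM2
Q8_0_BYTES_PER_WEIGHT = 34 / 32  # Q8_0: 32 int8 weights plus one fp16 scale per block


def decode_ceiling_tok_s(active_params_billion: float,
                         bandwidth_gbps: float = V100_BANDWIDTH_GBPS,
                         bytes_per_weight: float = Q8_0_BYTES_PER_WEIGHT) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gbps / gb_per_token


if __name__ == "__main__":
    for name, active_b in [("dense 27B", 27.0),
                           ("Qwen3.5-122B-A10B", 10.0),
                           ("Gemma-4-26B-A4B", 4.0)]:
        print(f"{name}: ceiling ~{decode_ceiling_tok_s(active_b):.0f} tok/s")
```

Under these assumptions the ceilings come out near 31, 85, and 212 tok/s, which sit above the reported 20-28, ~50, and ~113 tok/s in the same order. The estimate ignores KV cache reads, kernel overhead, and inter-GPU transfers, so real numbers will land below it.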

Hardware Configuration

The final build: twelve V100-SXM2 32GB on a Threadripper Pro, arranged as two NVLink boards (4 GPUs each) plus two pairs. Board A occupies GPUs {4,5,8,9} and Board B {6,7,10,11}. An NVLink pair sits on {0,1}, and a mixed pair on {2,3}, where one card is 16GB. Cross-board hops go over PCIe/NUMA instead of NVLink, which kills throughput, so every model is kept inside a single board.
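Keeping a model inside one board comes down to controlling which GPUs its process can see. The following is a minimal sketch of that idea, assuming the GPU indices above follow PCI bus order (hence setting CUDA_DEVICE_ORDER, which is an assumption about this setup, not something the source states). The dictionary keys and function names are hypothetical.

```python
import os

# GPU index sets as described in the build; the key names are illustrative.
BOARDS = {
    "board_a": (4, 5, 8, 9),
    "board_b": (6, 7, 10, 11),
    "nvlink_pair": (0, 1),
    "mixed_pair": (2, 3),
}


def board_for(gpus):
    """Return the name of the single group containing all requested GPUs, else None."""
    wanted = set(gpus)
    for name, members in BOARDS.items():
        if wanted and wanted <= set(members):
            return name
    return None


def env_for_board(name, base_env=None):
    """Copy an environment and restrict CUDA to one group's GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: indices match nvidia-smi ordering, so pin the enumeration order.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in BOARDS[name])
    return env


if __name__ == "__main__":
    print(board_for([4, 5]))   # both GPUs are on Board A
    print(board_for([5, 6]))   # spans two boards, so no single group matches
    print(env_for_board("board_b", {})["CUDA_VISIBLE_DEVICES"])
```

The returned environment can be passed as the env argument when launching a model server with subprocess, so a request for GPUs that straddle two boards is caught before anything starts.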

A second box was added: EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe, running Ollama for smaller models.

Stack Switch: vLLM → llama.cpp

The operator abandoned vLLM because the models he actually wants are MoE GGUFs, and vLLM on Volta is a dead end for them — FP8/AWQ/Marlin kernels require SM75+, and GPTQ kernels are broken on compute 7.0. He moved to mainline llama.cpp, which recently fixed a Gemma chat-parser bug that was mangling long prompts.

Orchestration with Claude Code

The system is not a single model answering a chat — an orchestrator (driven by Claude Code) routes legal tasks across several local models, each pinned to its own board to avoid GPU contention. For the heaviest job (full affidavit or motion, intake-to-document), all 16 GPUs across both boxes are active:

Workhorse drafting: Qwen3.6-35B-A3B on Board A
Heavy reasoning + high-stakes drafting: Qwen3.5-122B-A10B on Board B
Gate model: small model on the {0,1} pair checks if there are grounds
Adversarial reviewer: attacks the draft on the {2,3} pair
Financial/extraction: Gemma-4-26B on the 3090s via Ollama

This is a sequential pipeline: the models don't all run at once, but each one stays resident in GPU memory across the 16 GPUs.
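The hand-off between roles can be sketched as a gate followed by an ordered list of stages. This is an illustration of the sequential pattern only, not the operator's orchestrator: the Stage type, the role names, and the stub callables are all hypothetical, and in a real setup each run callable would be a request to the model server pinned to that board.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    role: str                  # e.g. "draft", "review"
    location: str              # where the model is pinned, e.g. "Board A"
    run: Callable[[str], str]  # stand-in for a request to that model's server


def run_pipeline(task: str,
                 gate: Callable[[str], bool],
                 stages: List[Stage]) -> Dict[str, str]:
    """Run the gate first, then each stage in order, one at a time."""
    if not gate(task):
        return {"gate": "rejected"}
    outputs: Dict[str, str] = {"gate": "accepted"}
    text = task
    for stage in stages:
        text = stage.run(text)  # each stage sees the previous stage's output
        outputs[stage.role] = text
    return outputs


if __name__ == "__main__":
    stages = [
        Stage("draft", "Board A", lambda t: f"DRAFT({t})"),
        Stage("review", "pair {2,3}", lambda t: f"REVIEWED({t})"),
    ]
    print(run_pipeline("motion to compel", lambda t: bool(t.strip()), stages))
```

Because only one stage runs at a time, the models never compete for compute, while keeping each one loaded on its own GPUs avoids reload time between steps.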

Practical Lessons

Hallucination: Local models confidently fabricate citations and dates. A verifier checks every cite, date, and Bates number against source material and blocks ungrounded content. An adversarial reviewer runs on top.
Pipeline poisoning: The evidence bundle builder was scooping up its own prior outputs as client evidence, causing the models to "ground" on slop they'd written earlier — one draft cited an RTX 3060 as a Bates number. Fixed by scrubbing the builder's input history.
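Both lessons reduce to simple checks that run outside the model. The sketch below shows one possible shape for each, under stated assumptions: drafts mark citations with a hypothetical "[Bates: ...]" tag, the set of valid Bates numbers comes from an exhibit index, and the pipeline writes its drafts under a known output directory. None of these details come from the source.

```python
import re
from pathlib import PurePosixPath

# Hypothetical citation marker: "[Bates: SMITH000123]".
CITE_RE = re.compile(r"\[Bates:\s*([^\]]+?)\s*\]")


def ungrounded_bates(draft: str, known_bates: set) -> list:
    """Return cited Bates numbers that do not appear in the exhibit index."""
    cited = CITE_RE.findall(draft)
    return [c for c in cited if c not in known_bates]


def exclude_generated(paths, output_dir: str) -> list:
    """Drop candidate evidence files that live under the pipeline's own output directory."""
    root = PurePosixPath(output_dir)
    kept = []
    for p in paths:
        path = PurePosixPath(p)
        if path == root or root in path.parents:
            continue  # this file was written by the pipeline, not supplied by the client
        kept.append(p)
    return kept


if __name__ == "__main__":
    draft = "See [Bates: SMITH000123] and [Bates: RTX 3060]."
    print(ungrounded_bates(draft, {"SMITH000123"}))
    print(exclude_generated(["/case/in/a.pdf", "/case/out/draft1.txt"], "/case/out"))
```

A draft with a non-empty ungrounded list would be blocked before review, and filtering the bundle's inputs by path keeps earlier drafts from being treated as client evidence.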

Lighter tasks use far less of the cluster: combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract), and plain summaries hit only Gemma and the router.

📖 Read the full source: r/LocalLLaMA