V100 Cluster vs. MoE: 12x SXM2 32GB Build with Claude Code Orchestration

A lawyer running a 12x V100 32GB SXM2 cluster on a Threadripper Pro reports that on Volta GPUs (compute capability 7.0), only MoE models deliver usable decode speeds. Dense models are a trap: even a 27-32B dense model manages only 20-28 tok/s, well below the operator's 40 tok/s floor. In contrast, Qwen3.5-122B-A10B (122B total, 10B active) achieves ~50 tok/s on a single 4-GPU NVLink board, and Gemma-4-26B-A4B hits ~113 tok/s. All benchmarks use Q8 GGUF with Q4 KV cache and flash-attention enabled.
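One way to see why active parameter count dominates is a back-of-envelope bandwidth estimate. The sketch below is not from the source. It assumes decode is memory-bandwidth bound, roughly 900 GB/s of HBM2 bandwidth per V100 SXM2, about 1 byte per weight at Q8, and a layer split in which only one GPU streams weights at a time. Under those assumptions the ceilings sit above the reported numbers in the same order.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound model.
# Assumptions (not from the source): ~900 GB/s HBM2 per V100 SXM2,
# ~1 byte per weight at Q8, one GPU streaming weights at a time.
V100_BANDWIDTH_GB_S = 900.0
BYTES_PER_PARAM_Q8 = 1.0


def decode_ceiling_tok_s(active_params_billion: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * BYTES_PER_PARAM_Q8
    return V100_BANDWIDTH_GB_S / gb_read_per_token


for name, active_b in [
    ("dense 27B", 27),
    ("Qwen3.5-122B-A10B (10B active)", 10),
    ("Gemma-4-26B-A4B (4B active)", 4),
]:
    print(f"{name}: ceiling about {decode_ceiling_tok_s(active_b):.0f} tok/s")
```

The estimate ignores KV-cache reads, attention compute, and inter-GPU transfers, so real throughput should land well under these ceilings.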
Hardware Configuration
The final build is twelve V100-SXM2 32GB cards on a Threadripper Pro, arranged as two NVLink boards (4 GPUs each) plus two pairs. Board A occupies GPUs {4,5,8,9} and Board B {6,7,10,11}. An NVLink pair sits on {0,1}, and a mixed pair on {2,3}, where one card is 16GB. Cross-board hops go over PCIe/NUMA instead of NVLink, which kills throughput, so every model is kept inside a single board.
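Keeping a model inside one board usually comes down to restricting which devices a process can see. The sketch below shows one possible way to do that with the standard CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER environment variables. The group names and helper functions are hypothetical, and the post does not say how the operator actually pins his servers.

```python
import os

# GPU index sets as listed in the text; the group names are illustrative.
BOARDS = {
    "board_a": [4, 5, 8, 9],
    "board_b": [6, 7, 10, 11],
    "nvlink_pair": [0, 1],
    "mixed_pair": [2, 3],
}


def env_for(board: str) -> dict:
    """Copy the current environment and expose only one group's GPUs."""
    env = dict(os.environ)
    # PCI_BUS_ID makes CUDA indices follow the same order nvidia-smi reports.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in BOARDS[board])
    return env


def overlapping(boards: dict) -> set:
    """Return GPU indices assigned to more than one group (should be empty)."""
    seen, dupes = set(), set()
    for gpus in boards.values():
        for gpu in gpus:
            (dupes if gpu in seen else seen).add(gpu)
    return dupes
```

A server process could then be started with something like subprocess.Popen(cmd, env=env_for("board_b")), where cmd is whatever launch command the inference server needs.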
A second box was added: EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe, running Ollama for smaller models.
Stack Switch: vLLM → llama.cpp
The operator abandoned vLLM because the models he actually wants are MoE GGUFs, and vLLM on Volta is a dead end for them: the FP8/AWQ/Marlin kernels require SM75+, and the GPTQ kernels are broken on compute capability 7.0. He moved to mainline llama.cpp, which recently fixed a Gemma chat-parser bug that was mangling long prompts.
Orchestration with Claude Code
The system is not a single model answering a chat. An orchestrator (driven by Claude Code) routes legal tasks across several local models, each pinned to its own board to avoid GPU contention. For the heaviest job (full affidavit or motion, intake-to-document), all 16 GPUs across both boxes are active:
- Workhorse drafting: Qwen3.6-35B-A3B on Board A
- Heavy reasoning + high-stakes drafting: Qwen3.5-122B-A10B on Board B
- Gate model: small model on the {0,1} pair checks if there are grounds
- Adversarial reviewer: attacks the draft on the {2,3} pair
- Financial/extraction: Gemma-4-26B on the 3090s via Ollama
This is a sequential pipeline, so the models don't all hammer the hardware at once, but every model stays loaded and all 16 GPUs keep weights resident in memory.
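The role-to-board assignment above can be pictured as a small routing table walked one stage at a time. This is a minimal sketch, not the operator's code: the Stage type, the stage order, the early exit on the gate, and the call_model callback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    role: str
    model: str
    location: str


# Roles, models, and locations come from the list above.
# The order of the stages is an assumption; the source does not give it.
PIPELINE = [
    Stage("gate", "small model", "pair {0,1}"),
    Stage("draft", "Qwen3.6-35B-A3B", "Board A"),
    Stage("reason", "Qwen3.5-122B-A10B", "Board B"),
    Stage("review", "adversarial reviewer", "pair {2,3}"),
    Stage("extract", "Gemma-4-26B", "3090s via Ollama"),
]


def run_pipeline(task, call_model):
    """Run stages sequentially; stop early if the gate finds no grounds.

    call_model is a hypothetical callable (stage, task, results) -> output.
    """
    results = {}
    for stage in PIPELINE:
        results[stage.role] = call_model(stage, task, results)
        if stage.role == "gate" and not results["gate"]:
            break
    return results
```

Because each stage has its own location, running them one after another never puts two models on the same GPUs, which matches the contention-avoidance goal described above.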
Practical Lessons
- Hallucination: Local models confidently fabricate citations and dates. A verifier checks every cite, date, and Bates number against source material and blocks ungrounded content. An adversarial reviewer runs on top.
- Pipeline poisoning: The evidence bundle builder was scooping up its own prior outputs as client evidence, causing the models to "ground" on slop they'd written earlier — one draft cited an RTX 3060 as a Bates number. Fixed by scrubbing the builder's input history.
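The grounding check described in the first lesson can be approximated with a set difference between identifiers found in the draft and identifiers found in the source material. The sketch below covers only Bates numbers and assumes a hypothetical format of 2 to 6 uppercase letters followed by 4 to 8 digits. The operator's actual verifier and Bates format are not described in the post.

```python
import re

# Hypothetical Bates format: 2-6 uppercase letters then 4-8 digits.
BATES_RE = re.compile(r"\b[A-Z]{2,6}\d{4,8}\b")


def ungrounded_bates(draft: str, sources: list[str]) -> list[str]:
    """Return Bates numbers cited in the draft that appear in no source text."""
    known = set()
    for text in sources:
        known.update(BATES_RE.findall(text))
    cited = set(BATES_RE.findall(draft))
    return sorted(cited - known)


draft = "See SMITH000123 and SMITH000999."
sources = ["Exhibit page stamped SMITH000123"]
print(ungrounded_bates(draft, sources))
```

A non-empty result would be the signal to block the draft. The check only works if the sources list contains real client evidence, which is why the bundle builder must not ingest the pipeline's own earlier outputs.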
Lighter tasks use far less hardware: combining and Bates-stamping exhibits is pure CPU work (PyMuPDF + Tesseract), and plain summaries hit only Gemma and the router.
📖 Read the full source: r/LocalLLaMA