Running MiniMax M2.7 Q8_0 128K on 2x3090 with CPU Offloading – Real-World Benchmarks and Config

In a recent r/LocalLLaMA post, a user shares their experience pushing the MiniMax M2.7 model (at Q8_0 quantization) to 128K context on a 2x3090 setup with 256GB DDR4 and a secondhand 10900X CPU. The key challenge: running a large MoE model with unquantized KV cache on relatively low-end hardware for its class.
Performance Numbers
The user reports:
- Prompt processing: ~50 tokens per second
- Token generation: ~10 tokens per second
- Described as “very slow but usable for coding agent workflows”
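To put those rates in perspective, here is a rough back-of-the-envelope sketch of what they imply for wall-clock time. It assumes the reported throughput stays constant at every context depth (it probably does not) and that "128K" means 131,072 tokens; the 1,000-token reply length is a hypothetical value chosen for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time needed to process `tokens` at a steady throughput."""
    return tokens / tokens_per_second


PROMPT_TPS = 50.0  # reported prompt-processing speed (approximate)
GEN_TPS = 10.0     # reported generation speed (approximate)

full_context = 128 * 1024  # assumption: "128K" = 131,072 tokens
prefill_s = seconds_for(full_context, PROMPT_TPS)
reply_s = seconds_for(1_000, GEN_TPS)  # hypothetical 1,000-token reply

print(f"Cold prefill of {full_context} tokens: {prefill_s / 60:.1f} min")
print(f"1,000-token reply: {reply_s:.0f} s")
```

Under these assumptions, filling the whole context from scratch takes roughly 44 minutes, while a 1,000-token reply takes about 100 seconds, which is consistent with the "very slow but usable" description.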
Configuration
They use ik-llama-cuda (a llama.cpp fork) with the following flags (from their NixOS config):
${ik-llama-cuda}/bin/llama-server \
-m ${modelPath} \
--host 0.0.0.0 \
--port ${toString cfg.port} \
-c ${toString cfg.contextLength} \
-ngl 999 \
--cpu-moe \
-sm graph \
-fa on \
-t 16 \
-tb 16 \
-b 4096 \
-ub 4096 \
-np 1 \
-muge \
-ger \
--jinja \
--metrics \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
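--min-p 0.01
Notable flags:
- --cpu-moe – keeps the MoE expert weights in system RAM, so expert computation runs on the CPU
- -sm graph – sets the multi-GPU split mode to "graph" (the source describes this as graph-based scheduling)
- -fa on – enables flash attention
- -t 16 / -tb 16 – 16 threads for generation and for batch (prompt) processing, respectively
- -b 4096 / -ub 4096 – batch and ubatch (micro-batch) size
- -muge – glossed in the source as "memory-usage-guided expert loading (probably)"; in ik_llama.cpp this is more likely short for --merge-up-gate-experts, which merges the experts' up and gate tensors
- -ger – glossed in the source as "GPU expert routing"; in ik_llama.cpp this is more likely short for --grouped-expert-routing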
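For readers who do not use NixOS, the same invocation can be assembled outside a Nix expression. The sketch below is only an illustration: the function name, binary name, model path, and port are hypothetical placeholders, and the flags are copied from the post rather than verified against any particular build.

```python
import shlex
from typing import List


def build_server_args(binary: str, model_path: str, port: int,
                      context_length: int) -> List[str]:
    """Assemble the llama-server argument list quoted in the post."""
    return [
        binary,
        "-m", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
        "-c", str(context_length),
        "-ngl", "999",
        "--cpu-moe",
        "-sm", "graph",
        "-fa", "on",
        "-t", "16", "-tb", "16",
        "-b", "4096", "-ub", "4096",
        "-np", "1",
        "-muge", "-ger",
        "--jinja", "--metrics",
        "--temp", "1.0", "--top-p", "0.95",
        "--top-k", "40", "--min-p", "0.01",
    ]


# Hypothetical values: substitute your own binary, model file, and port.
args = build_server_args("llama-server", "/models/model.gguf", 8080, 128 * 1024)
print(shlex.join(args))  # shell-quoted command line, ready to copy
```

Keeping the arguments as a list (rather than one long string) means the result can be passed directly to `subprocess.run(args)` without shell-quoting concerns.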
Context & Motivation
The user reports choosing Q8_0 to mitigate “weird behavior” seen at lower quants. They also note that no draft model for speculative decoding was released for M2.7; one could have improved generation speed. They care primarily about accuracy rather than speed, as long as generation doesn’t take “literally all day.”
Takeaway for Developers
This is a practical datapoint for anyone running large MoE models on multi-GPU setups backed by system RAM. With --cpu-moe, the expert weights stay in system RAM, which leaves the 48GB of VRAM across the two 3090s for the rest of the model and the KV cache and makes a 128K context feasible, albeit at reduced speed. For coding agent workflows where latency is less critical, this tradeoff may be acceptable.
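To see why an unquantized KV cache at 128K context competes for VRAM, the standard size estimate for a grouped-query-attention cache can be sketched as follows. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not MiniMax M2.7's actual architecture; substitute the values from the model's own config before drawing conclusions.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_element: int = 2) -> int:
    """Estimate KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_element


# Placeholder architecture values (NOT taken from the model card).
size = kv_cache_bytes(
    n_layers=60,
    n_kv_heads=8,
    head_dim=128,
    context_length=128 * 1024,
    bytes_per_element=2,  # f16, i.e. an unquantized cache
)
print(f"KV cache: {size / 2**30:.1f} GiB")
```

With these placeholder numbers the cache alone comes to 30 GiB, which illustrates why the expert weights have to live somewhere other than the GPUs if the full context is to fit.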
📖 Read the full source: r/LocalLLaMA