DeepSeek-V4-Flash W4A16+FP8 with MTP Self-Speculation: 85 tok/s on 2x RTX PRO 6000 Max-Q

DeepSeek-V4-Flash runs at 85.52 tok/s at 524k context (2 streams) and ~111 tok/s at 128k single-stream on 2× RTX PRO 6000 Max-Q (96 GB each, no NVLink). The quant builds on pasta-paul's W4A16-FP8 base but adds a retrofitted MTP head, because the original quant silently strips MTP at load time. Key details below.
Benchmarks
- pasta-paul base, no MTP, 524k: 52.85 tok/s, 91 ms TTFT (reference)
- This model, 524k 2-stream: 85.52 tok/s, 155 ms TTFT (+62%)
- This model, 128k single-stream: ~111 tok/s, ~310 ms TTFT (+110%)
- Sanity benchmarks (small samples): GSM8K 93%, MMLU 53%, HumanEval (syntactic) 90%
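The percentage gains in the list above follow directly from the reported throughput figures. As a quick arithmetic check, a minimal sketch (the function name `speedup_percent` is made up for illustration, and the 128k figure is only approximate in the post):

```python
# Throughput figures (tok/s) as reported in the benchmark list.
BASELINE_TOK_S = 52.85  # pasta-paul base, no MTP, 524k


def speedup_percent(tok_s: float, baseline: float = BASELINE_TOK_S) -> int:
    """Relative throughput gain over the baseline, rounded to a whole percent."""
    return round((tok_s / baseline - 1.0) * 100)


reported = {
    "524k, 2-stream": 85.52,
    "128k, single-stream": 111.0,  # given as "~111" in the post
}

for label, tok_s in reported.items():
    print(f"{label}: +{speedup_percent(tok_s)}%")
# prints:
# 524k, 2-stream: +62%
# 128k, single-stream: +110%
```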
Quantization Details
- 768 routed-expert tensors (256 experts × {w1, w2, w3}): W4A16, INT4 weights, group size 128, symmetric, quantized with GPTQ (Frantar, with Cholesky H⁻¹). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens, giving 17,701 MTP forward dumps (473k tokens).
- 5 attention projections: FP8_BLOCK (upstream FP8 weights, with scale renamed to weight_scale for compressed-tensors compatibility).
- Shared experts, e_proj, h_proj, norms, gate, attn_sink: BF16 / FP32.
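To make "INT4 group=128 sym" concrete, here is a minimal sketch of symmetric group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest, not the GPTQ procedure the post describes, and the function names and the max-abs scale choice are assumptions for illustration only. GPTQ would pick the integer codes differently (using calibration data), but the storage layout is the same idea: one scale per group of 128 weights and no zero point.

```python
import math

NUM_EXPERTS = 256
PROJECTIONS = ("w1", "w2", "w3")
GROUP_SIZE = 128        # "group=128" from the post
QMIN, QMAX = -8, 7      # signed 4-bit integer range


def quantize_group(weights):
    """Symmetric quantization of one group: a single scale, no zero point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0  # assumed max-abs scaling
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Rebuild approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


# 256 experts x 3 projections gives the 768 routed-expert tensors.
print(NUM_EXPERTS * len(PROJECTIONS), "routed-expert tensors")

# Synthetic weights, purely for demonstration.
row = [0.05 * math.sin(i) for i in range(2 * GROUP_SIZE)]
groups = quantize_row(row)
restored = dequantize_row(groups)
worst = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, worst absolute error {worst:.6f}")
```

With round-to-nearest and a max-abs scale, the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups (and therefore tighter scales) cost more metadata but lose less precision.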
Max-Q Specific Fixes
Pass --disable-custom-all-reduce on Max-Q workstation cards (no NVLink). vLLM's CustomAllreduce uses CUDA P2P and deadlocks on PCIe-only topology. NCCL tuning for lower TTFT (~91 ms vs ~155 ms):
NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512
How to Run
Requires the patched vLLM fork from pasta-paul's workspace, with the MTP patches applied. Example command:
vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
--tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
--max-model-len 524288 --max-num-seqs 2 \
--gpu-memory-utilization 0.93 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code \
--disable-custom-all-reduce \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--host 0.0.0.0 --port 8000
The model also includes an AGENTS.md runbook for setting up via AI coding agents (Claude/Codex/Cursor).
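The NCCL settings and the serve command from the post can be combined in a small launcher script. The following is a sketch only: the helper names `build_env` and `build_command` are hypothetical, the flags are copied verbatim from the command in the post, and it assumes the patched vLLM fork is installed so that `vllm` is on the PATH. The script only prints the command; it does not start the server.

```python
import json
import os
import shlex

MODEL = "LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8"

# NCCL tuning values quoted in the post for PCIe-only Max-Q cards.
NCCL_ENV = {
    "NCCL_PROTO": "LL",
    "NCCL_ALGO": "Ring",
    "NCCL_MIN_NCHANNELS": "8",
    "NCCL_NTHREADS": "512",
}


def build_env(base=None):
    """Copy the given (or current) environment and add the NCCL settings."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_ENV)
    return env


def build_command(max_model_len=524288, max_num_seqs=2, port=8000):
    """Assemble the `vllm serve` argument list from the post's example."""
    spec = {"method": "mtp", "num_speculative_tokens": 1}
    return [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "2",
        "--kv-cache-dtype", "fp8",
        "--block-size", "256",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", "0.93",
        "--tokenizer-mode", "deepseek_v4",
        "--tool-call-parser", "deepseek_v4",
        "--enable-auto-tool-choice",
        "--reasoning-parser", "deepseek_v4",
        "--trust-remote-code",
        "--disable-custom-all-reduce",  # avoids the P2P deadlock without NVLink
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


# Show the shell-quoted command without launching anything.
print(shlex.join(build_command()))
```

To actually launch, the two helpers can be handed to `subprocess.run(build_command(), env=build_env(), check=True)`. Building the speculative config with `json.dumps` avoids hand-quoting JSON inside a shell string.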
📖 Read the full source: r/LocalLLaMA