DeepSeek-V4-Flash W4A16+FP8: 85 tok/s on 2x RTX PRO 6000

DeepSeek-V4-Flash running at 85.52 tok/s @ 524k context and ~111 tok/s @ 128k single-stream on 2× RTX PRO 6000 Max-Q (96 GB each, no NVLink). The quant uses pasta-paul's W4A16-FP8 base but with a retrofitted MTP head (original quant silently strips MTP at load time). Key details below.

Benchmarks

pasta-paul base, no MTP, 524k: 52.85 tok/s, 91 ms TTFT (reference)
This model, 524k 2-stream: 85.52 tok/s, 155 ms TTFT (+62%)
This model, 128k single-stream: ~111 tok/s, ~310 ms TTFT (+110%)
Sanity benchmarks (small samples): GSM8K 93%, MMLU 53%, HumanEval (syntactic) 90%

Quantization Details

768 routed-expert tensors (256 experts × {w1, w2, w3}): W4A16 INT4 group=128 sym, GPTQ (Frantar with Cholesky H⁻¹). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens – 17,701 MTP forward dumps, 473k tokens.
5 attention projections: FP8_BLOCK (upstream FP8 weights, renamed scale → weight_scale for compressed-tensors compat).
Shared experts, e_proj, h_proj, norms, gate, attn_sink: BF16 / FP32.

Max-Q Specific Fixes

Pass --disable-custom-all-reduce on Max-Q workstation cards (no NVLink). vLLM's CustomAllreduce uses CUDA P2P and deadlocks on PCIe-only topology. NCCL tuning for lower TTFT (~91 ms vs ~155 ms):

NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512

How to Run

Needs the patched vLLM fork from pasta-paul's workspace with MTP patches. Example command:

vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
--tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
--max-model-len 524288 --max-num-seqs 2 \
--gpu-memory-utilization 0.93 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code \
--disable-custom-all-reduce \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--host 0.0.0.0 --port 8000

The model also includes an AGENTS.md runbook for setting up via AI coding agents (Claude/Codex/Cursor).

📖 Read the full source: r/LocalLLaMA

DeepSeek-V4-Flash W4A16+FP8 with MTP Self-Speculation: 85 tok/s on 2x RTX PRO 6000 Max-Q

Benchmarks

Quantization Details

Max-Q Specific Fixes

How to Run

👀 See Also

How 40 Prompt Revisions Turned Claude AI Summaries Into a Product: A Tutoring Platform Case Study ($19K MRR)

Cron Jobs vs Heartbeat: Optimizing OpenClaw Token Usage and Execution Consistency

Local Claude Code Setup with Qwen3.5 27B via llama.cpp

Understanding AI Agent Architecture: Deterministic vs Probabilistic Layers