Qwen3.6 27B FP8 Runs 200k Tokens BF16 KV Cache at 80 TPS on RTX 5000 PRO 48GB

✍️ OpenClawRadar📅 Published: May 5, 2026🔗 Source
Qwen3.6 27B FP8 Runs 200k Tokens BF16 KV Cache at 80 TPS on RTX 5000 PRO 48GB
Ad

A Reddit user on r/LocalLLaMA reports running Qwen3.6-27B-FP8 with a BF16 KV cache of 200k tokens at 60–90 TPS on a single RTX 5000 PRO 48GB GPU. The setup uses vLLM 0.20.1, CUDA 12.9, and Qwen's official FP8 quant, preserving multi-modality and MTP speculative decoding.

Setup Details

The environment uses FlashInfer FP8 MoE, FP8 Marlin, and async scheduling. Key environment variables and launch command:

export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve Qwen/Qwen3.6-27B-FP8
--host 0.0.0.0 --port 8080
--performance-mode interactivity
--trust-remote-code
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--mm-encoder-tp-mode data
--mm-processor-cache-type shm
--gpu-memory-utilization 0.975
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}'
--async-scheduling
--attention-backend flashinfer
--max-model-len 196608
--kv-cache-dtype bfloat16
--enable-prefix-caching

Ad

Performance Observations

With MTP=2 speculative decoding, the system produces 60–90 TPS during code generation. The BF16 KV cache avoids compaction issues seen in quantized KV, making long coding sessions more reliable. The user notes that the setup runs on a single RTX 5000 PRO 48GB with a 64GB system RAM and a decent CPU, calling it a strong candidate for a $10k workstation for local LLM development.

Who It's For

Developers needing a local, low-compression agentic coding setup with minimal quantization artifacts and long context windows.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also