Qwen3.6 27B FP8 Runs 200k Tokens BF16 KV Cache at 80 TPS on RTX 5000 PRO 48GB

A Reddit user on r/LocalLLaMA reports running Qwen3.6-27B-FP8 with a BF16 KV cache of 200k tokens at 60–90 TPS on a single RTX 5000 PRO 48GB GPU. The setup uses vLLM 0.20.1, CUDA 12.9, and Qwen's official FP8 quant, preserving multi-modality and MTP speculative decoding.
Setup Details
The environment uses FlashInfer FP8 MoE, FP8 Marlin, and async scheduling. Key environment variables and launch command:
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True
vllm serve Qwen/Qwen3.6-27B-FP8
--host 0.0.0.0 --port 8080
--performance-mode interactivity
--trust-remote-code
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--mm-encoder-tp-mode data
--mm-processor-cache-type shm
--gpu-memory-utilization 0.975
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}'
--async-scheduling
--attention-backend flashinfer
--max-model-len 196608
--kv-cache-dtype bfloat16
--enable-prefix-caching
Performance Observations
With MTP=2 speculative decoding, the system produces 60–90 TPS during code generation. The BF16 KV cache avoids compaction issues seen in quantized KV, making long coding sessions more reliable. The user notes that the setup runs on a single RTX 5000 PRO 48GB with a 64GB system RAM and a decent CPU, calling it a strong candidate for a $10k workstation for local LLM development.
Who It's For
Developers needing a local, low-compression agentic coding setup with minimal quantization artifacts and long context windows.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw Discussion on AI Agent-to-Agent Messaging and Context Sharing
A Reddit discussion explores the implications of AI agents using personal context to communicate with other agents on a user's behalf, examining what information users might be comfortable sharing.

User reports switching from Gemini Pro to Claude Max for academic project assistance
A user switched from Gemini Pro to Claude Max after experiencing frustration with Gemini's performance on practical tasks. They report Claude successfully reviewed their academic project, asked clarifying questions, and suggested logging learned information to a memory.md file.

Anthropic Drops Key Safety Pledge from Responsible Scaling Policy
Anthropic has removed the central commitment from its Responsible Scaling Policy that required guaranteeing adequate safety measures before training AI systems, citing competitive pressure and the need to continue development.

An Open Standard for Agent Run Records: The Case for a Shared Log Schema
Every agent runtime has its own log format, causing fragmentation in debugging, auditing, and tool portability. The fields already converge on a core schema — it's time to standardize.