Qwen 3.6 27B Benchmarks: 52.8 tps on AMD MI50s

A Reddit user has published benchmark results for running Qwen3.6-27B (full precision, no quantization) on eight AMD MI50s (2018 GPUs) using a custom vllm fork. The system achieves 52.8 tokens per second (tps) for text generation and 1569 tps for prompt processing with TP8, no MTP, and no flash attention optimizations that might slow down large prompts.

Key Details

Hardware: 8x AMD MI50s, PCIe (no PCIe switch used yet)
Engine: vllm fork v0.20.1 with ROCm 7.2.1 – github.com/ai-infos/vllm-gfx906-mobydick
Model: Qwen/Qwen3.6-27B (HuggingFace full precision FP16)
Quantization: None – full FP16 precision
MTP: Disabled (slower for large prompts)
Flash attention: Not used (triton-based AMD flash attention also slower for big prompts)
Prompt: Single inference with 1K and 15K token prompts (bench used 10K input, 1K output)

Benchmark Results

Successful requests: 4
Total input tokens: 40000
Total generated tokens: 4000
Output token throughput (tok/s): 32.91
Peak output token throughput (tok/s): 56.00
Total token throughput (tok/s): 362.03
Mean TTFT (ms): 32874.56
Mean TPOT (ms): 88.66
Mean ITL (ms): 88.66

Note: The user reports 52.8 tps TG for single inference with 15K prompt; the benchmark shows aggregate results over 4 requests at 10K input each. With TP2, the model also fits and runs at ~34 tps TG.

Setup Commands (Docker + vllm serve)

docker run -it --name vllm-gfx906-mobydick \
  -v /llm:/llm --network host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add $(getent group render | cut -d: -f3) \
  --ipc=host \
  aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0 \
  FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
  /llm/models/Qwen3.6-27B \
  --served-model-name Qwen3.6-27B \
  --dtype float16 \
  --max-model-len auto \
  --max-num-batched-tokens 8192 \
  --block-size 64 \
  --gpu-memory-utilization 0.98 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-processor-cache-gb 1 \
  --limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 \
  --skip-mm-profiling \
  --default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
  --tensor-parallel-size 8 \
  --host 0.0.0.0 --port 8000 2>&1 | tee log.txt

Who It's For

Developers running agentic coding tools (e.g., Claude Code, Hermes) on AMD hardware, especially with large prompts and full-precision requirements.

The user notes that further improvements are possible with PCIe switches (lower latency), more optimized flash attention/MTP for ROCm/gfx906, and updated software stacks.

📖 Read the full source: r/LocalLLaMA