Qwen 3.6 27B at 52.8 tps TG on AMD MI50s: Full Precision, No MTP, No Quant

A Reddit user has published benchmark results for running Qwen3.6-27B (full precision, no quantization) on eight AMD MI50s (2018 GPUs) using a custom vllm fork. The system achieves 52.8 tokens per second (tps) for text generation and 1569 tps for prompt processing with TP8, no MTP, and no flash attention optimizations that might slow down large prompts.
Key Details
- Hardware: 8x AMD MI50s, PCIe (no PCIe switch used yet)
- Engine: vllm fork v0.20.1 with ROCm 7.2.1 – github.com/ai-infos/vllm-gfx906-mobydick
- Model:
Qwen/Qwen3.6-27B(HuggingFace full precision FP16) - Quantization: None – full FP16 precision
- MTP: Disabled (slower for large prompts)
- Flash attention: Not used (triton-based AMD flash attention also slower for big prompts)
- Prompt: Single inference with 1K and 15K token prompts (bench used 10K input, 1K output)
Benchmark Results
Successful requests: 4 Total input tokens: 40000 Total generated tokens: 4000 Output token throughput (tok/s): 32.91 Peak output token throughput (tok/s): 56.00 Total token throughput (tok/s): 362.03 Mean TTFT (ms): 32874.56 Mean TPOT (ms): 88.66 Mean ITL (ms): 88.66
Note: The user reports 52.8 tps TG for single inference with 15K prompt; the benchmark shows aggregate results over 4 requests at 10K input each. With TP2, the model also fits and runs at ~34 tps TG.
Setup Commands (Docker + vllm serve)
docker run -it --name vllm-gfx906-mobydick \
-v /llm:/llm --network host \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add $(getent group render | cut -d: -f3) \
--ipc=host \
aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0 \
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
/llm/models/Qwen3.6-27B \
--served-model-name Qwen3.6-27B \
--dtype float16 \
--max-model-len auto \
--max-num-batched-tokens 8192 \
--block-size 64 \
--gpu-memory-utilization 0.98 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--mm-processor-cache-gb 1 \
--limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 \
--skip-mm-profiling \
--default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000 2>&1 | tee log.txt
Who It's For
Developers running agentic coding tools (e.g., Claude Code, Hermes) on AMD hardware, especially with large prompts and full-precision requirements.
The user notes that further improvements are possible with PCIe switches (lower latency), more optimized flash attention/MTP for ROCm/gfx906, and updated software stacks.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw 2026.4.2 and 2026.3.31 break local LLM connections
OpenClaw versions 2026.4.2 and 2026.3.31 are causing connection timeouts to locally hosted Ollama instances. The issue appears when connecting to Ubuntu boxes running locally, with error logs showing LLM request timeouts and failover decisions.

Reddit discussion highlights 68% token reduction for AI agents through infrastructure changes
A Reddit user reports cutting AI agent token usage by 68.5% by switching from standard infrastructure to an agent-native OS with JSON-native state access, reducing state checks from ~9 shell commands to 1 structured call.

OpenClaw auto-update bug leaves orphaned preflight directories filling /tmp
OpenClaw's auto-update mechanism creates preflight copies in /tmp that persist when updates fail, potentially filling disk space and blocking further updates. A user found 9 orphaned directories totaling 6.5GB on a 38GB VPS.

AI Agents Hiring Other AI Agents: From Solo Workers to Networked Economies
A Reddit post argues that AI agents will evolve from isolated tools into networked workers that delegate tasks, specialize, build reputation, and exchange value — shifting the hard problem from intelligence to coordination.