hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

hipEngine is a new ROCm-native inference engine for Qwen 3.6 MoE and dense models, from the developer behind FastDMS and ParoQuant. It is Python-based with hot paths in HIP/C++, and it uses AMD-native libraries such as hipBLASLt, hipGraph, and AOTriton. PyTorch is kept out of the hot path.
Target Hardware
gfx1100: Radeon RX 7900 XTX / Radeon Pro W7900 (RDNA3). Strix Halo is also supported.
Benchmarks vs llama.cpp
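On Qwen 3.6 35B MoE, hipEngine was tested with both ParoQuant (4.68 bpw) and GGUF Q4_K_S weights against llama.cpp's HIP and Vulkan backends at context lengths from 512 to 128K. It is reported to match or beat llama.cpp at all tested context lengths. In the 512-token figures below, the ParoQuant build leads, while the GGUF Q4_K_S build beats Vulkan but trails llama.cpp HIP. Key numbers (prefill tok/s, 512-token prompt / 128 generated tokens):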
- hipEngine PARO: 2718.497 tok/s
- hipEngine GGUF Q4_K_S: 2258.847 tok/s
- llama.cpp HIP: 2436.049 tok/s
- llama.cpp Vulkan: 1816.927 tok/s
At 128K context, hipEngine PARO prefill reaches 1055 tok/s vs 710 tok/s for llama.cpp HIP, roughly 49% faster. Decode speeds are comparable (60–127 tok/s range).
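To make the comparison easier to check, here is a minimal Python sketch that recomputes the relative differences from the prefill figures quoted above. The dictionary and the `relative_change` helper are illustrative names only and are not part of hipEngine or llama.cpp.

```python
# Prefill throughput (tok/s) at a 512-token prompt, as quoted in the article.
PREFILL_512 = {
    "hipEngine PARO": 2718.497,
    "hipEngine GGUF Q4_K_S": 2258.847,
    "llama.cpp HIP": 2436.049,
    "llama.cpp Vulkan": 1816.927,
}


def relative_change(candidate: float, baseline: float) -> float:
    """Return the percentage change of candidate relative to baseline."""
    return (candidate / baseline - 1.0) * 100.0


if __name__ == "__main__":
    baseline = PREFILL_512["llama.cpp HIP"]
    for name, tps in PREFILL_512.items():
        change = relative_change(tps, baseline)
        print(f"{name:24s} {change:+6.1f}% vs llama.cpp HIP")
    # 128K-context prefill: hipEngine PARO vs llama.cpp HIP.
    print(f"128K PARO vs HIP: {relative_change(1055, 710):+.1f}%")
```

A negative value means the configuration is slower than llama.cpp HIP, which is the case for the GGUF Q4_K_S build at this prompt length.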
Memory Efficiency
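hipEngine offers a near-lossless INT8 KV cache with little speed penalty: about 1.4% on prefill and 3.5% on decode in the 128K figures below. According to the post, this allows running the full Qwen 3.6 256K context window in under 24 GB on a single 7900 XTX. Measurements at 128K context: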
- 128K context, BF16 KV: sampled peak 21.04 GiB, prefill 1091.9 tok/s, decode 62.2 tok/s
- 128K context, INT8 KV: sampled peak 19.80 GiB, prefill 1076.5 tok/s, decode 60.0 tok/s
- Peak memory at 128K (hipEngine PARO): 22.122 GiB vs llama.cpp HIP 23.605 GiB
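The 128K measurements can be turned into a rough estimate for 256K. The sketch below is a back-of-envelope calculation, not something from the post: it assumes the INT8 cache is exactly half the size of the BF16 cache (ignoring quantization scales) and that everything other than the KV cache stays constant as context grows. All names in it are hypothetical.

```python
# Figures quoted above for 128K context (hipEngine, sampled peak).
BF16 = {"peak_gib": 21.04, "prefill_tps": 1091.9, "decode_tps": 62.2}
INT8 = {"peak_gib": 19.80, "prefill_tps": 1076.5, "decode_tps": 60.0}
CONTEXT_128K = 128 * 1024


def pct_change(new: float, old: float) -> float:
    """Percentage change of new relative to old."""
    return (new / old - 1.0) * 100.0


def extrapolate_int8_peak_gib(context_tokens: int) -> float:
    """Rough linear extrapolation of INT8-KV peak memory.

    ASSUMPTIONS (not from the source): the INT8 cache is exactly half the
    BF16 cache, so the memory saved equals the INT8 cache size, and all
    other memory use is independent of context length.
    """
    int8_kv_gib = BF16["peak_gib"] - INT8["peak_gib"]
    fixed_gib = INT8["peak_gib"] - int8_kv_gib
    return fixed_gib + int8_kv_gib * context_tokens / CONTEXT_128K


if __name__ == "__main__":
    saved = BF16["peak_gib"] - INT8["peak_gib"]
    print(f"memory saved by INT8 KV: {saved:.2f} GiB")
    print(f"prefill change: {pct_change(INT8['prefill_tps'], BF16['prefill_tps']):+.1f}%")
    print(f"decode change:  {pct_change(INT8['decode_tps'], BF16['decode_tps']):+.1f}%")
    print(f"estimated 256K peak: {extrapolate_int8_peak_gib(256 * 1024):.2f} GiB")
```

Under these assumptions the 256K estimate lands near 21 GiB, which would be consistent with the under-24 GB claim. Real usage may differ, since activation buffers and quantization metadata can also grow with context.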
Features
- AGPLv3 open source
- ROCm-native, no PyTorch dependency in hot path
- Uses hipBLASLt, hipGraph, AOTriton
- ParoQuant ported to ROCm
- INT8 KV cache (near-lossless, minimal speed impact)
- Supports Qwen 3.6 MoE and dense models
If you're running Qwen 3.6 on RDNA3 hardware, hipEngine is worth a look, especially for memory-constrained 256K context use cases.
📖 Read the full source: r/LocalLLaMA