hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

By OpenClawRadar | Published: May 25, 2026

A new ROCm-native inference engine for Qwen 3.6 MoE and dense models has appeared: hipEngine, by the developer behind FastDMS and ParoQuant. It is Python-based, with hot paths written in HIP/C++ and built on AMD-native libraries such as hipBLASLt, hipGraph, and AOTriton. It avoids a heavy PyTorch dependency by keeping PyTorch out of the hot path.

Target Hardware

gfx1100: Radeon RX 7900 XTX / Radeon Pro W7900 (RDNA3). Strix Halo is also supported.

Benchmarks vs llama.cpp

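On Qwen 3.6 35B MoE (ParoQuant at 4.68 bpw and GGUF Q4_K_S), hipEngine with ParoQuant weights is reported to match or beat llama.cpp HIP and Vulkan at all tested context lengths (512–128K). In the 512-token figures listed here, hipEngine's GGUF Q4_K_S path beats llama.cpp Vulkan but trails llama.cpp HIP. Key numbers (prefill tok/s, 512-token prompt / 128-token generation):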

hipEngine PARO: 2718.497 tok/s
hipEngine GGUF Q4_K_S: 2258.847 tok/s
llama.cpp HIP: 2436.049 tok/s
llama.cpp Vulkan: 1816.927 tok/s

At 128K context, hipEngine PARO prefill reaches 1055 tok/s versus 710 tok/s for llama.cpp HIP, roughly 49% faster. Decode throughput is comparable (60–127 tok/s range).
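The relative gaps can be rechecked directly from the quoted throughputs. The following is a minimal sketch in Python that uses only the numbers reported above; the helper name `pct_faster` is illustrative and not part of hipEngine or llama.cpp.

```python
# Prefill throughput (tok/s) as quoted in the article.
prefill_512 = {
    "hipEngine PARO": 2718.497,
    "hipEngine GGUF Q4_K_S": 2258.847,
    "llama.cpp HIP": 2436.049,
    "llama.cpp Vulkan": 1816.927,
}
prefill_128k = {"hipEngine PARO": 1055.0, "llama.cpp HIP": 710.0}


def pct_faster(candidate: float, baseline: float) -> float:
    """Percent by which candidate exceeds baseline (negative if slower)."""
    return (candidate / baseline - 1.0) * 100.0


# Compare every 512-token result against llama.cpp HIP.
baseline = prefill_512["llama.cpp HIP"]
for name, tps in prefill_512.items():
    print(f"{name:24s} {pct_faster(tps, baseline):+6.1f}% vs llama.cpp HIP")

# Long-context comparison at 128K.
gain_128k = pct_faster(prefill_128k["hipEngine PARO"], prefill_128k["llama.cpp HIP"])
print(f"128K prefill: {gain_128k:+.1f}%")
```

With these inputs the PARO path comes out about 11.6% ahead of llama.cpp HIP at the 512-token prompt and about 48.6% ahead at 128K, while the GGUF Q4_K_S path is about 7.3% behind llama.cpp HIP at 512 tokens.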

Memory Efficiency

hipEngine uses a near-lossless INT8 KV cache with almost no speed penalty. According to the post, this allows running the full Qwen 3.6 256K context window in under 24 GB on a single 7900 XTX. The measurements quoted here are for 128K context:

128K context, BF16 KV: sampled peak 21.04 GiB, prefill 1091.9 tok/s, decode 62.2 tok/s
128K context, INT8 KV: sampled peak 19.80 GiB, prefill 1076.5 tok/s, decode 60.0 tok/s
Peak memory at 128K (hipEngine PARO): 22.122 GiB vs llama.cpp HIP 23.605 GiB
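To put the INT8 versus BF16 trade-off in concrete terms, the two 128K runs above can be compared directly. This is a small sketch using only the quoted figures; the helper name `pct_drop` is illustrative.

```python
# 128K-context measurements quoted above (peak in GiB, throughput in tok/s).
bf16 = {"peak_gib": 21.04, "prefill": 1091.9, "decode": 62.2}
int8 = {"peak_gib": 19.80, "prefill": 1076.5, "decode": 60.0}


def pct_drop(before: float, after: float) -> float:
    """Percent decrease going from `before` to `after`."""
    return (before - after) / before * 100.0


saved_gib = bf16["peak_gib"] - int8["peak_gib"]
prefill_cost = pct_drop(bf16["prefill"], int8["prefill"])
decode_cost = pct_drop(bf16["decode"], int8["decode"])

print(f"Memory saved with INT8 KV: {saved_gib:.2f} GiB")
print(f"Prefill slowdown: {prefill_cost:.1f}%")
print(f"Decode slowdown:  {decode_cost:.1f}%")
```

On these numbers, INT8 KV saves about 1.24 GiB of sampled peak memory at 128K, at a cost of roughly 1.4% prefill and 3.5% decode throughput.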

Features

AGPLv3 open source
ROCm-native, no PyTorch dependency in hot path
Uses hipBLASLt, hipGraph, AOTriton
ParoQuant ported to ROCm
INT8 KV cache (near-lossless, minimal speed impact)
Supports Qwen 3.6 MoE and dense models
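For readers unfamiliar with INT8 KV caching, the general idea can be sketched in a few lines. The source does not describe hipEngine's actual quantization scheme, so the following is an assumption-laden illustration of plain symmetric absmax quantization, with one scale per vector. The function names and the sample vector are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: map a float vector to int8 values plus one scale."""
    absmax = max((abs(x) for x in vec), default=0.0)
    if absmax == 0.0:
        # All-zero (or empty) vector: any scale works, use 1.0.
        return [0] * len(vec), 1.0
    scale = absmax / 127.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes and their scale."""
    return [v * scale for v in q]


# Hypothetical key vector standing in for one cached K row.
key_vector = [0.12, -1.5, 0.75, 0.0, 2.0, -0.33]
q, scale = quantize_int8(key_vector)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(key_vector, restored))
print(q, scale, max_err)
```

Each stored value takes one byte instead of the two used by BF16, plus a small overhead for the scales. The per-element reconstruction error is bounded by half a quantization step, which is why such schemes are often described as near-lossless.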

If you're running Qwen 3.6 on RDNA3 hardware, hipEngine is worth a look, especially for memory-constrained long-context use cases up to 256K.

📖 Read the full source: r/LocalLLaMA
