Custom llama.cpp Backend Offloads LLM Matrix Multiplication to AMD XDNA2 NPU on Ryzen AI MAX 385

Custom Backend for AMD XDNA2 NPU Offload
A developer has created a custom llama.cpp backend that dispatches GEMM operations directly to the AMD XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). This approach avoids iGPU usage and shared memory contention.
Hardware and Software Configuration
Model: Meta-Llama-3.1-8B-Instruct Q4_K_M
Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75
Performance Results
- Vulkan prefill + NPU decode: 930 t/s prefill (pp512), 43.7 t/s decode (tg64), 41.5W avg power, 0.947 J/tok
- Vulkan only: 833 t/s prefill, 41.6 t/s decode, 52.2W avg power, 1.3 J/tok
- CPU only: 4.6 t/s prefill, 3.76 t/s decode
The NPU decode path draws roughly 10.7 W less than Vulkan-only (41.5 W versus 52.2 W) while slightly exceeding its decode throughput (43.7 versus 41.6 t/s). It also leaves the iGPU free for other work during decode.
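The energy-per-token figures can be sanity-checked from the reported power and decode throughput, since one watt is one joule per second. The sketch below is a minimal check using only the numbers quoted above; the function name is illustrative and not taken from the project. The results come out near 0.95 and 1.25 J/tok, close to the reported 0.947 and 1.3 J/tok. The small differences presumably come from rounding or from how the power averages were measured, which the source does not specify.

```python
def joules_per_token(avg_power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return avg_power_w / tokens_per_s


# Figures as reported in the results above (decode throughput, tg64).
hybrid = joules_per_token(41.5, 43.7)   # Vulkan prefill + NPU decode
vulkan = joules_per_token(52.2, 41.6)   # Vulkan only

power_saved_w = 52.2 - 41.5             # average power difference in watts
energy_saving = 1 - hybrid / vulkan     # fractional energy saved per token

print(f"hybrid: {hybrid:.3f} J/tok, vulkan: {vulkan:.3f} J/tok")
print(f"power saved: {power_saved_w:.1f} W, energy saved per token: {energy_saving:.1%}")
```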
Technical Stack
- Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
- Runtime dispatch: XRT 2.21.75
- Base: Fork of ggml-org/llama.cpp (MIT)
- Kernel routing: 4 xclbin slots covering different K-dimension tiles with MIN_N/MAX_N routing to select the appropriate kernel at runtime
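The routing idea can be sketched as a small lookup table. This is a hypothetical illustration, not the project's code: the slot names, the K values (4096 and 14336, chosen because they match Llama 3.1 8B's hidden and feed-forward sizes), the N ranges, and the exact-match rule on K are all assumptions made for the example. Only the concept of four slots selected by K dimension and a MIN_N/MAX_N range comes from the source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelSlot:
    name: str    # hypothetical label for a loaded xclbin
    k: int       # K dimension this kernel was built for (assumed exact match)
    min_n: int   # smallest N (batch/token count) the slot accepts
    max_n: int   # largest N the slot accepts


# Four illustrative slots; every value here is invented for the sketch.
SLOTS: Sequence[KernelSlot] = (
    KernelSlot("k4096_small", 4096, 1, 8),
    KernelSlot("k4096_large", 4096, 9, 64),
    KernelSlot("k14336_small", 14336, 1, 8),
    KernelSlot("k14336_large", 14336, 9, 64),
)


def select_slot(k: int, n: int, slots: Sequence[KernelSlot] = SLOTS) -> Optional[KernelSlot]:
    """Return the first slot whose K matches and whose N range contains n."""
    for slot in slots:
        if slot.k == k and slot.min_n <= n <= slot.max_n:
            return slot
    return None  # no NPU kernel fits; a caller would fall back to another backend


print(select_slot(4096, 1))     # single-token decode GEMM
print(select_slot(14336, 32))   # larger-N GEMM
print(select_slot(1234, 1))     # unsupported shape
```

Returning `None` for unmatched shapes keeps the fallback decision with the caller, which is one plausible way to let unsupported GEMMs run on a different backend.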
Performance Ceiling Investigation
The developer attempted to push beyond 43.7 t/s decode with several approaches:
- Batch sweep N=1..64: No improvement (flat performance)
- Int4 double-quant: Killed SNR (44.8 → 19.7 dB) - dead end
- Cascade offload: Ruled out by AMD documentation
- Speculative decoding with Llama-3.2-1B draft: 44% accept rate, 212 t/s draft, but zero effective gain
The developer reads the lack of gain from speculative decoding, despite a 44% accept rate and a 212 t/s draft model, as evidence that the bottleneck is LPDDR5 bandwidth rather than compute. The flat batch sweep is taken to point the same way. On this interpretation the NPU is already hitting the memory wall, which makes 43.7 t/s the practical ceiling for this model on this hardware.
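A back-of-the-envelope check of the memory-wall argument: in single-stream decode, every generated token has to read roughly all model weights once, so throughput is bounded by memory bandwidth divided by weight size. The sketch below rests on two assumptions that are not in the source: a weight footprint of about 4.9 GB for the 8B Q4_K_M model, and a theoretical peak of 256 GB/s for the memory subsystem. It also ignores KV-cache and activation traffic. Treat it as a rough estimate rather than a measurement.

```python
def implied_bandwidth_gb_s(tokens_per_s: float, weights_gb: float) -> float:
    """Weight traffic implied by a decode rate, assuming one full weight read per token."""
    return tokens_per_s * weights_gb


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode tokens/s if weight reads were the only memory traffic."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 4.9            # ASSUMPTION: approximate size of the 8B Q4_K_M weights
PEAK_BANDWIDTH_GB_S = 256.0 # ASSUMPTION: theoretical peak, not a measured figure

implied = implied_bandwidth_gb_s(43.7, WEIGHTS_GB)
ceiling = decode_ceiling_tps(PEAK_BANDWIDTH_GB_S, WEIGHTS_GB)

print(f"implied weight traffic at 43.7 t/s: {implied:.0f} GB/s")
print(f"ideal ceiling at assumed peak bandwidth: {ceiling:.1f} t/s")
print(f"fraction of assumed peak in use: {implied / PEAK_BANDWIDTH_GB_S:.0%}")
```

Under these assumptions the reported decode rate already accounts for most of the assumed peak bandwidth, which is consistent with the developer's conclusion that more compute would not help.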
Project Links
- GitHub: https://github.com/BrandedTamarasu-glitch/OllamaAMDNPU
- Changelog: https://brandedtamarasu-glitch.github.io/OllamaAMDNPU/xdna-npu/
The project was built with Claude Sonnet 4.6 / Claude Code, disclosed for reproducibility purposes. The developer is seeking feedback from others running Strix Halo or Phoenix with the amdxdna driver to compare decode throughput on comparable quants and determine if other XDNA2 configurations encounter the same performance ceiling.
📖 Read the full source: r/LocalLLaMA