llama.cpp Q8_0 quantization gets 3.1x speedup on Intel Arc GPUs with SYCL reorder fix

A fix for llama.cpp's SYCL backend makes Q8_0 quantized models run about 3.1x faster on Intel Arc GPUs. It addresses a memory access pattern issue that had limited Q8_0 to only 21% of theoretical memory bandwidth.
Performance Problem and Root Cause
On an Intel Arc Pro B70 GPU with 32GB of GDDR6 and 608 GB/s of memory bandwidth, Q8_0 models were running at only 4.88 tokens/second while Q4_K_M achieved 20.56 tokens/second. This roughly 4.2x performance gap was unexpected, since a Q8_0 model holds only about 1.7x as much data as its Q4_K_M counterpart.
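As a rough sanity check on why the gap is surprising, the sketch below assumes token generation is purely memory-bandwidth-bound, so throughput scales inversely with the bytes read per token. That assumption and the variable names are illustrative; the figures are the ones quoted above.

```python
# Measured decode speeds on the Arc Pro B70 (figures from the article).
q4_k_m_tps = 20.56
q8_0_tps = 4.88

# Q8_0 holds about 1.7x as much weight data as Q4_K_M (from the article).
size_ratio = 1.7

# Assumption: decoding is bandwidth-bound, so tokens/s ~ 1 / bytes per token.
expected_q8_0_tps = q4_k_m_tps / size_ratio
observed_gap = q4_k_m_tps / q8_0_tps
shortfall = expected_q8_0_tps / q8_0_tps

print(f"expected Q8_0 speed: {expected_q8_0_tps:.1f} t/s")
print(f"observed gap vs Q4_K_M: {observed_gap:.1f}x")
print(f"shortfall vs expectation: {shortfall:.1f}x")
```

Under that assumption Q8_0 should have landed near 12 tokens/second, about 2.5x faster than what was measured, which points at the kernel path rather than the data size.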
After ruling out VRAM pressure, driver issues, and backend problems, the investigation traced the bottleneck to llama.cpp's SYCL kernel dispatch path. The SYCL backend includes a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This optimization was implemented for Q4_0, Q4_K, and Q6_K quantizations, but Q8_0 was never added to the reorder framework.
Q8_0 blocks are 34 bytes each (a 2-byte fp16 scale plus 32 one-byte weights), which is not a power of two, so the non-reordered layout was particularly inefficient for GPU caches.
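To make the layout issue concrete, here is a minimal Python sketch of the idea. The interleaved format follows the standard ggml Q8_0 block (one fp16 scale and 32 int8 weights). The reordered format shown here, all weights first and all scales after, is an assumption based on the description above, not code taken from the SYCL backend, and the function names are hypothetical.

```python
import struct

QK8_0 = 32                    # weights per Q8_0 block
BLOCK_BYTES = 2 + QK8_0       # fp16 scale + 32 int8 weights = 34 bytes

def pack_blocks(blocks):
    """Interleaved layout: [scale, 32 weights] repeated for each block."""
    return b"".join(struct.pack("<e32b", d, *qs) for d, qs in blocks)

def reorder(buf):
    """Assumed reordered layout: all weights first, then all scales."""
    n_blocks = len(buf) // BLOCK_BYTES
    weights, scales = bytearray(), bytearray()
    for i in range(n_blocks):
        block = buf[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES]
        scales += block[:2]
        weights += block[2:]
    return bytes(weights + scales)

def dequantize_reordered(buf, n_blocks):
    """Read weights and scales from their separate regions."""
    scale_base = n_blocks * QK8_0
    out = []
    for i in range(n_blocks):
        (d,) = struct.unpack_from("<e", buf, scale_base + 2 * i)
        qs = struct.unpack_from("<32b", buf, i * QK8_0)
        out.extend(d * q for q in qs)
    return out

blocks = [(0.5, list(range(-16, 16))), (2.0, [1] * 32)]
interleaved = pack_blocks(blocks)
reordered = reorder(interleaved)
values = dequantize_reordered(reordered, len(blocks))

# Block start offsets modulo a 64-byte line, before and after reordering.
interleaved_offsets = [i * BLOCK_BYTES % 64 for i in range(4)]
reordered_offsets = [i * QK8_0 % 64 for i in range(4)]
print(interleaved_offsets, reordered_offsets)
```

In the interleaved layout the block starts drift across 64-byte lines, while the reordered weight region keeps every block on a 32-byte boundary, which is what lets neighboring GPU threads issue coalesced loads.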
The Fix and Results
The solution took approximately 200 lines of code that extend the existing reorder framework to support Q8_0. The most critical bug was a single-line issue: Q8_0 tensors were not getting the "extra" struct allocated during buffer initialization, so the reorder flag was never set.
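The shape of that single-line bug can be modeled in a few lines of Python. This is a toy illustration only: the class names, the type set, and the functions are hypothetical and do not mirror the actual llama.cpp SYCL code, which is C++.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extra:
    reordered: bool = False   # stands in for the reorder flag

@dataclass
class Tensor:
    qtype: str
    extra: Optional[Extra] = None

def init_tensor(tensor, types_with_extra):
    """Hypothetical buffer init: only listed types get an Extra allocated."""
    if tensor.qtype in types_with_extra:
        tensor.extra = Extra()

def try_reorder(tensor):
    """Reordering silently does nothing when no Extra was allocated."""
    if tensor.extra is None:
        return False
    tensor.extra.reordered = True
    return True

# Before the fix: Q8_0 is missing from the (hypothetical) type list.
before = Tensor("Q8_0")
init_tensor(before, {"Q4_0", "Q4_K", "Q6_K"})
reordered_before = try_reorder(before)

# After the fix: adding Q8_0 to the list lets the flag be set.
after = Tensor("Q8_0")
init_tensor(after, {"Q4_0", "Q4_K", "Q6_K", "Q8_0"})
reordered_after = try_reorder(after)

print(reordered_before, reordered_after)
```

The point of the model is that nothing fails loudly: a tensor without the extra struct simply falls back to the slow path, which is why the omission was easy to miss.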
Results on Qwen3.5-27B (Intel Arc Pro B70):
- Q8_0 before: 4.88 t/s (21% bandwidth)
- Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster
- Q4_K_M: 20.12 t/s (unchanged)
- Q6_K: 13.83 t/s (no reorder)
With this fix, Q8_0 now outperforms Q6_K (15.24 vs 13.83 tokens/second) while providing higher quality than lower-bit quantizations.
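The reported speeds and bandwidth percentages can be cross-checked against each other. The sketch below assumes that decoding reads a fixed number of bytes per token and that utilization is measured against the 608 GB/s peak; the helper name is illustrative.

```python
PEAK_GBPS = 608.0  # Arc Pro B70 memory bandwidth quoted in the article

def implied_gb_per_token(tokens_per_s, utilization):
    """GB read per generated token, given speed and bandwidth utilization."""
    return utilization * PEAK_GBPS / tokens_per_s

before = implied_gb_per_token(4.88, 0.21)
after = implied_gb_per_token(15.24, 0.66)
speedup = 15.24 / 4.88

print(f"before: {before:.1f} GB/token, after: {after:.1f} GB/token")
print(f"speedup: {speedup:.1f}x")
```

Both measurements imply roughly 26 GB read per token, so the before and after figures are mutually consistent, and the ratio of the two speeds matches the stated 3.1x.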
Validation and Implementation
Before implementing the fix, the team binary-patched Intel's closed-source IPEX-LLM to run on the B70, whose PCI device ID it does not officially support. IPEX-LLM's optimized Q8_0 kernels achieved 61% of theoretical bandwidth, confirming the problem was solvable. The open-source implementation in llama.cpp achieves 66%.
The fix has been submitted as a pull request to the llama.cpp repository.
📖 Read the full source: r/LocalLLaMA