Benchmark Results: Qwen3.5 Models on Apple Silicon vs AMD GPUs with ROCm vs Vulkan

Hardware and Software Setup
The benchmark compared three systems: a MacBook Pro with Apple M5 Max (48GB unified memory), a Mac Studio with Apple M1 Max (64GB unified memory), and a Fedora 43 GPU server with Intel Core Ultra 7 265K processor and three AMD GPUs: Radeon Pro W7900 (48GB, RDNA 3), Radeon AI PRO R9700 (32GB, RDNA 4), and Radeon Pro W6800 (32GB, RDNA 2). The motherboard provided x8/x8/x4 electrical connections, with the W6800 on a chipset-connected x4 slot bottlenecked by the DMI link.
Inference Engines and Models
Apple systems used mlx-lm (versions 0.31.1 and 0.31.0). The Fedora server ran llama.cpp with both HIP/ROCm build (b5065) and AMDVLK Vulkan build (b5065). ROCm version was 7.2, AMDVLK version was 2025.Q2.1. All Fedora runs used a single GPU except the 122B model which used W7900 + R9700 with --split-mode layer.
Models tested were Qwen3.5-35B-A3B MoE (3B active params, mlx-community 4-bit or unsloth Q4_K_M), Qwen3.5-27B dense (27B params, mlx-community 4-bit or unsloth Q4_K_M), and Qwen3.5-122B-A10B MoE (10B active params, unsloth Q3_K_XL).
Benchmark Methodology
The benchmark reflected pharmacovigilance data analysis use cases: writing extraction scripts, reasoning about clinical data, generating regulatory narratives, and structured data extraction from clinical text. Prompts were domain-specific, not general-purpose LLM benchmarks.
Standard benchmark used 8K context with 7 prompts: 2 prompt-processing tests (short ~27 token and long ~2.9K token input with minimal output to isolate prefill speed) and 5 generation tasks (short coding, medium coding, math reasoning, regulatory safety narrative writing, structured AE extraction). Single-user, single-request, temperature 0.3, /no_think to disable thinking mode, no prompt caching between requests.
Context-scaling benchmark used the same model and GPU with progressively larger prompts (512 to 16K+ tokens) consisting of synthetic adverse event listings, with only 64 max output tokens to isolate how prompt processing and generation scale with input size.
Key Findings
The benchmark revealed interesting ROCm vs AMDVLK Vulkan findings, including context-scaling tests showing when each backend performs best. The source notes that most available comparisons don't help decide between configurations like an M5 Max laptop and a W7900 workstation, or whether ROCm is worth the setup hassle over Vulkan.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Supreme Court Declines Review, AI-Generated Art Remains Uncopyrightable
The US Supreme Court declined to hear a case on copyrighting AI-generated art, letting stand lower court rulings that require 'human authorship' for copyright protection. This follows the Copyright Office's 2022 rejection of Stephen Thaler's request to copyright an image created by his algorithm.

Claude Code on the Web Partial Outage Reported
An automatic status update from r/ClaudeAI reports a partial outage for Claude Code on the web starting 2026-05-09T23:33:21.000Z. Check the official status page and community megathread for updates.

OpenClaw Users Report Model Replacements After Anthropic Ban
A community survey of Reddit, X, YouTube, and GitHub reveals GPT-5.x as the most-adopted replacement for Claude in OpenClaw workflows, with Kimi K2.5 leading community votes and hybrid setups gaining popularity.

OpenClaw: Disappointing Experience or Setup Error?
Users report issues with OpenClaw failing to perform beyond simple chatbot interactions despite correct setup following official guidelines.