Qwen3.6 Plus benchmark comparison against Western SOTA models

A Reddit post on r/LocalLLaMA compares Qwen3.6 Plus against three Western state-of-the-art models (GPT‑5.4, Claude Opus 4.6, and Gemini 3.1 Pro Preview) on four benchmarks: SWE-bench Verified, GPQA / GPQA Diamond, HLE (no tools), and MMMU-Pro.
Benchmark Results
The source provides these exact scores:
- Qwen3.6-Plus: SWE-bench Verified 78.8, GPQA / GPQA Diamond 90.4, HLE (no tools) 28.8, MMMU-Pro 78.8
- GPT‑5.4 (xhigh): SWE-bench Verified 78.2, GPQA / GPQA Diamond 93.0, HLE (no tools) 39.8, MMMU-Pro 81.2
- Claude Opus 4.6 (thinking heavy): SWE-bench Verified 80.8, GPQA / GPQA Diamond 91.3, HLE (no tools) 34.44, MMMU-Pro 77.3
- Gemini 3.1 Pro Preview: SWE-bench Verified 80.6, GPQA / GPQA Diamond 94.3, HLE (no tools) 44.7, MMMU-Pro 80.5
The post includes a visual comparison chart available at: https://preview.redd.it/6kq4tt07yrsg1.png?width=714&format=png&auto=webp&s=ad8b207fb13729ae84f5b74cec5fd84a81dcface
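To make the listed scores easier to compare, the following is a minimal Python sketch that ranks one model on each benchmark and computes its gap to the top score. The numbers are copied from the list above; the names `BENCHMARKS`, `SCORES`, and `rank_and_gap` are hypothetical helpers introduced for illustration and are not part of the original post.

```python
# Scores copied from the list above, in the same benchmark order.
BENCHMARKS = ["SWE-bench Verified", "GPQA / GPQA Diamond", "HLE (no tools)", "MMMU-Pro"]
SCORES = {
    "Qwen3.6-Plus": [78.8, 90.4, 28.8, 78.8],
    "GPT-5.4 (xhigh)": [78.2, 93.0, 39.8, 81.2],
    "Claude Opus 4.6 (thinking heavy)": [80.8, 91.3, 34.44, 77.3],
    "Gemini 3.1 Pro Preview": [80.6, 94.3, 44.7, 80.5],
}


def rank_and_gap(model, scores=SCORES, benchmarks=BENCHMARKS):
    """Return {benchmark: (rank, gap_to_best)} for one model (hypothetical helper)."""
    result = {}
    for i, name in enumerate(benchmarks):
        # All models' scores on this benchmark, best first.
        column = sorted((s[i] for s in scores.values()), reverse=True)
        own = scores[model][i]
        rank = column.index(own) + 1          # 1 = best score on this benchmark
        gap = round(column[0] - own, 2)       # points behind the top score
        result[name] = (rank, gap)
    return result


if __name__ == "__main__":
    for bench, (rank, gap) in rank_and_gap("Qwen3.6-Plus").items():
        print(f"{bench}: rank {rank} of {len(SCORES)}, {gap} points behind the top score")
```

On these numbers, Qwen3.6-Plus ranks third of four on SWE-bench Verified and MMMU-Pro and fourth on GPQA and HLE, with the largest gap (15.9 points) on HLE.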
User Assessment
The original poster notes that Qwen3.6 Plus is "competitive but not the bench" and states: "Will be my new model given how cheap it is, but whether it's actually good irl will depend more than benchmarks." They also observe that "Opus destroys all others despite being 3rd or 4th on artificalanalysis."