Gemma 4 vs Qwen 3.5 Blind Evaluation Results with Claude Opus as Judge

A Reddit user ran a three-way evaluation of Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B, using Claude Opus 4.6 as the scoring judge.
Evaluation Setup
The test used 30 questions across five categories: code, reasoning, analysis, communication, and meta-alignment (6 questions per category). All models answered the same questions blind, with identical system prompts and the same temperature settings. Claude Opus 4.6 judged each response independently on a 0-10 scale using a structured rubric, scoring each response in absolute terms rather than by pairwise comparison. The evaluation used a single judge (Opus 4.6) to prioritize consistency, though this means any systematic bias in that judge carries into every score. Total cost was $4.50.
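The scoring scheme described above (absolute per-response scores, with the top score on each question counted as a win) can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: the model names and scores are made up, the `tally` helper is hypothetical, and crediting every top scorer on a tie is an assumption, since the source does not say how ties were handled.

```python
from collections import Counter
from statistics import mean

# Hypothetical scores for illustration only; these are NOT the evaluation's data.
# None marks a response that errored out and was never scored.
scores = {
    "Q1": {"model_a": 9.0, "model_b": 8.5, "model_c": 0.0},
    "Q2": {"model_a": 8.0, "model_b": None, "model_c": 9.5},
    "Q3": {"model_a": 8.5, "model_b": 9.0, "model_c": 9.5},
}

def tally(scores):
    """Return (win counts, average score per model) from absolute scores."""
    wins = Counter()
    per_model = {}
    for by_model in scores.values():
        # Skip responses that were never scored.
        scored = {m: s for m, s in by_model.items() if s is not None}
        best = max(scored.values())
        for model, score in scored.items():
            per_model.setdefault(model, []).append(score)
            if score == best:  # assumption: ties credit every top scorer
                wins[model] += 1
    averages = {m: round(mean(v), 2) for m, v in per_model.items()}
    return wins, averages

wins, averages = tally(scores)
print(dict(wins))
print(averages)
```

In this toy data, `model_c` takes the most wins yet ends up with the lowest average because of a single 0.0, which is how win counts and averages can point in different directions.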
Results
Win counts (highest score per question):
- Qwen 3.5 27B: 14 wins (46.7%)
- Gemma 4 31B: 12 wins (40.0%)
- Gemma 4 26B-A4B: 4 wins (13.3%)
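As a quick check, the percentages follow directly from the win counts over the 30 questions. The snippet below only reproduces that arithmetic from the reported numbers; the variable names are illustrative.

```python
# Win counts as reported in the write-up
wins = {"Qwen 3.5 27B": 14, "Gemma 4 31B": 12, "Gemma 4 26B-A4B": 4}

total = sum(wins.values())  # one winner per question
shares = {model: round(100 * count / total, 1) for model, count in wins.items()}

for model, share in shares.items():
    print(f"{model}: {share}%")
```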
Average scores:
- Gemma 4 31B: 8.82 (30 evals)
- Gemma 4 26B-A4B: 8.82 (28 evals)
- Qwen 3.5 27B: 8.17 (30 evals)
Qwen won more matchups but had a lower average score due to three 0.0 scores on CODE-001, REASON-004, and ANALYSIS-017, which appeared to be format failures or refusals rather than genuinely terrible answers. Without those three scores, Qwen's average jumps to approximately 9.08, which would be the highest of the three models.
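The adjusted average can be reproduced from the reported figures alone: the three 0.0 scores contribute nothing to the point total, so removing them only shrinks the divisor. A minimal sketch, assuming the published 8.17 is the mean over all 30 evaluations (it is itself rounded, so the result is approximate):

```python
# Reported figures from the write-up
qwen_avg_all = 8.17  # average over all 30 evaluations
n_all = 30
n_zero = 3           # CODE-001, REASON-004, ANALYSIS-017 each scored 0.0

total_points = qwen_avg_all * n_all  # the zeros add nothing to this sum
avg_without_zeros = total_points / (n_all - n_zero)

print(round(avg_without_zeros, 2))
```

The result rounds to 9.08, above the 8.82 average of both Gemma models.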
Category Breakdown
- Code: Tied between Gemma 4 31B and Qwen (3 wins each)
- Reasoning: Qwen dominated (5 of 6 wins)
- Analysis: Qwen dominated (4 of 6 wins)
- Communication: Gemma 4 31B dominated (5 of 6 wins)
- Meta-alignment: Three-way split (2-2-2 wins)
Observations
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly with the same 8.82 average.
- Gemma 4 31B had some absurdly long response times, including multiple 5-minute generations that appeared to involve heavy internal chain-of-thought, but this didn't correlate with better scores.
- Qwen 3.5 27B generated 3-5x more tokens per response on average, a verbosity tax, though the judge didn't seem to penalize or reward the extra length consistently.
Methodology Caveats
- 30 questions is a small sample, and no statistical significance is claimed
- Single judge (Opus 4.6) means any systematic bias affects every score
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias
- Questions were original rather than drawn from standard benchmarks, so they reflect the evaluator's own biases
📖 Read the full source: r/LocalLLaMA