Qwen 3.5 27B Beats Gemma 4 in Blind Evaluation

A Reddit user ran a three-way evaluation of Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B, using Claude Opus 4.6 as the scoring judge.

Evaluation Setup

The test used 30 questions across five categories: code, reasoning, analysis, communication, and meta-alignment (6 questions per category). All models answered the same questions blind, with no differences in system prompt and the same temperature settings. Claude Opus 4.6 judged each response independently on a 0-10 scale using a structured rubric, scoring each response in absolute terms rather than by pairwise comparison. A single judge was used to prioritize consistency, though this means any systematic bias in that judge carries into every score. Total cost was $4.50.

Results

Win counts (highest score per question):

Qwen 3.5 27B: 14 wins (46.7%)
Gemma 4 31B: 12 wins (40.0%)
Gemma 4 26B-A4B: 4 wins (13.3%)

Average scores:

Gemma 4 31B: 8.82 (30 evals)
Gemma 4 26B-A4B: 8.82 (28 evals)
Qwen 3.5 27B: 8.17 (30 evals)

Qwen won the most questions but had the lowest average score because of three 0.0 scores on CODE-001, REASON-004, and ANALYSIS-017, which appeared to be format failures or refusals rather than genuinely terrible answers. Excluding those three scores raises Qwen's average to approximately 9.08, which would be the highest of the three models.
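
The adjusted figure can be checked from the reported numbers alone. This is a minimal sketch, assuming the 8.17 average covers all 30 evaluations and that the three excluded scores are exactly 0.0; the variable names are illustrative, not from the original post.

```python
# Figures as reported in the results above.
reported_avg = 8.17   # Qwen 3.5 27B average over all evaluations
total_evals = 30
zero_scores = 3       # CODE-001, REASON-004, ANALYSIS-017

# Zeros contribute nothing to the score total, so dropping them
# leaves the numerator unchanged and only shrinks the denominator.
score_total = reported_avg * total_evals
adjusted_avg = score_total / (total_evals - zero_scores)

print(f"adjusted average: {adjusted_avg:.2f}")
```

Because the reported average is itself rounded to two decimals, the result is only accurate to about that precision.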

Category Breakdown

Code: Tied between Gemma 4 31B and Qwen (3 wins each)
Reasoning: Qwen dominated (5 of 6 wins)
Analysis: Qwen dominated (4 of 6 wins)
Communication: Gemma 4 31B dominated (5 of 6 wins)
Meta-alignment: Three-way split (2-2-2 wins)
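
The category counts can be cross-checked against the overall win totals. The sketch below is a consistency check only: it assumes every question had exactly one winner (no ties), which the summary does not state explicitly but which matches the totals summing to 30. The inferred numbers are derived here and were not reported in the source.

```python
# Overall win totals and per-category wins as reported above.
TOTAL_WINS = {"Qwen 3.5 27B": 14, "Gemma 4 31B": 12, "Gemma 4 26B-A4B": 4}
QUESTIONS = 30

# Assumption: one winner per question, so totals must cover all questions.
assert sum(TOTAL_WINS.values()) == QUESTIONS

# Wins the summary states explicitly for each model.
qwen_stated = {"code": 3, "reasoning": 5, "analysis": 4, "meta-alignment": 2}
gemma31_stated = {"code": 3, "communication": 5, "meta-alignment": 2}
gemma26_stated = {"meta-alignment": 2}

# Whatever is left over must come from the categories not listed per model.
qwen_communication = TOTAL_WINS["Qwen 3.5 27B"] - sum(qwen_stated.values())
gemma31_reasoning_analysis = TOTAL_WINS["Gemma 4 31B"] - sum(gemma31_stated.values())
gemma26_other = TOTAL_WINS["Gemma 4 26B-A4B"] - sum(gemma26_stated.values())

print("Qwen communication wins (inferred):", qwen_communication)
print("Gemma 4 31B reasoning + analysis wins (inferred):", gemma31_reasoning_analysis)
print("Gemma 4 26B-A4B wins outside meta-alignment (inferred):", gemma26_other)
```

Under that assumption the reported figures are mutually consistent: Qwen's stated category wins already account for all 14 of its wins.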

Observations

Gemma 4 26B-A4B (the MoE variant) errored out entirely on 2 questions, which is why it has 28 evals instead of 30. When it worked, its scores matched the dense 31B almost exactly, with the same 8.82 average.
Gemma 4 31B had some very long response times, including multiple 5-minute generations that appeared to involve heavy internal chain-of-thought, but these did not correlate with better scores.
Qwen 3.5 27B generated 3-5x more tokens per response on average than the Gemma models, creating a verbosity tax, though the judge did not seem to penalize or reward this consistently.

Methodology Caveats

30 questions is a small sample, and no statistical significance is claimed
Single judge (Opus 4.6) means any systematic bias affects every score
LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias
Questions were original, not from standard benchmarks, reflecting the evaluator's biases
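
To give a rough sense of how little 30 questions can resolve, the sketch below computes a 95% Wilson score interval for each model's win rate. It is illustrative only and was not part of the original evaluation: it assumes each question is an independent trial and treats each model's win rate separately, which ignores that the three win counts are linked.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Win counts as reported in the results, out of 30 questions.
wins = {"Qwen 3.5 27B": 14, "Gemma 4 31B": 12, "Gemma 4 26B-A4B": 4}
intervals = {model: wilson_interval(w, 30) for model, w in wins.items()}

for model, (low, high) in intervals.items():
    print(f"{model}: {low:.1%} to {high:.1%}")
```

The intervals for Qwen 3.5 27B and Gemma 4 31B overlap substantially, which is consistent with the caveat that a 14-to-12 lead on 30 questions is not strong evidence of a real difference.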

📖 Read the full source: r/LocalLLaMA