AI Code Review Benchmark: Claude vs Gemini vs Codex vs Qwen vs MiniMax

AI Code Review Performance Comparison

A recent experiment benchmarked five flagship AI models for code review using 15 pull requests from Milvus, an open-source vector database. Each PR contained known bugs that surfaced in production after merging, providing a realistic test set.

Models and Setup

The models tested were:

Claude Opus 4.6
Gemini 3 Pro
GPT-5.2-Codex
Qwen-3.5-Plus
MiniMax-M2.5

The benchmark used Magpie, an open-source tool that prepares context for a review by pulling in surrounding code, call chains, and related modules, then passes that context to the model along with the PR.

Bug Difficulty Levels

Bugs were categorized by difficulty:

L1: Visible from diff alone (all models caught these, so excluded from scoring)
L2 (10 cases): Requires understanding surrounding code (interface changes, concurrency races)
L3 (5 cases): Requires system-level understanding (cross-module inconsistencies, upgrade compatibility)
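The difficulty split above can be sketched as a small data structure. This is only an illustration: the `Level` enum, the `BugCase` record, the `scoring_set` helper, and the placeholder case IDs are all hypothetical and do not come from the benchmark or from Milvus.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    L1 = 1  # visible from the diff alone
    L2 = 2  # requires understanding surrounding code
    L3 = 3  # requires system-level understanding


@dataclass(frozen=True)
class BugCase:
    case_id: str  # hypothetical label, not a real Milvus PR number
    level: Level


def scoring_set(cases):
    """Keep only L2 and L3 cases; L1 is excluded because every model caught it."""
    return [c for c in cases if c.level is not Level.L1]


# Placeholder cases matching the stated counts (10 at L2, 5 at L3),
# plus one illustrative L1 case to show the exclusion.
cases = (
    [BugCase(f"l2-{i}", Level.L2) for i in range(10)]
    + [BugCase(f"l3-{i}", Level.L3) for i in range(5)]
    + [BugCase("l1-0", Level.L1)]
)
scored = scoring_set(cases)
print(len(scored))
```

Filtering before scoring keeps the denominator at the 15 cases that actually separate the models.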

Results by Model

Two evaluation modes were used:

Raw: Model sees only PR diff and content
R1: Magpie provides surrounding context

Overall detection rates (L2 + L3 only):

Claude: 53% raw, 47% with context
Gemini: 13% raw, 33% with context
Codex: 33% raw, 27% with context
MiniMax: 27% raw, 33% with context
Qwen: 33% raw, 40% with context
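
With 15 scored cases, each percentage above corresponds to a whole number of bugs. The sketch below recovers those implied counts and the change from raw to context-assisted mode. The counts are inferred from the rounded percentages, not stated in the source, and the helper name `implied_count` is made up for this example.

```python
TOTAL_CASES = 15  # 10 L2 cases + 5 L3 cases

# (raw %, with-context %) as reported in the list above
reported = {
    "Claude": (53, 47),
    "Gemini": (13, 33),
    "Codex": (33, 27),
    "MiniMax": (27, 33),
    "Qwen": (33, 40),
}


def implied_count(percent, total=TOTAL_CASES):
    """Nearest whole number of detected bugs consistent with a rounded percentage."""
    return round(percent * total / 100)


for model, (raw, ctx) in reported.items():
    raw_n, ctx_n = implied_count(raw), implied_count(ctx)
    print(f"{model}: raw {raw_n}/{TOTAL_CASES}, "
          f"context {ctx_n}/{TOTAL_CASES}, change {ctx_n - raw_n:+d}")
```

Read this way, a swing of one or two bugs moves a model's rate by 7 to 13 points, so small differences between models should be interpreted with care.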

Key Findings

Claude led raw review with 53% detection and a perfect 5/5 on L3 bugs. Its rate dropped to 47% when Magpie supplied context, which the experiment attributes to Claude already being good at organizing its own context.

Gemini performed poorly in raw mode (13%) but improved significantly with context (33%), suggesting it needs context provided upfront.

Qwen reached 40% with context, second only to Claude's 47% in that mode, and had the highest L2 bug detection (5/10).
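
The per-level figures quoted in these findings can be cross-checked against the overall rates. The sketch below derives the level that is not quoted from the one that is. Two assumptions apply: overall hits are inferred from rounded percentages over 15 cases, and Qwen's 5/10 on L2 is taken to refer to context-assisted mode, which the source does not say explicitly. The function name `remaining_hits` is hypothetical.

```python
L2_TOTAL, L3_TOTAL = 10, 5


def remaining_hits(overall_percent, known_hits):
    """Hits on the other difficulty level, given the overall rate and one level's hits."""
    total = L2_TOTAL + L3_TOTAL
    overall_hits = round(overall_percent * total / 100)
    return overall_hits - known_hits


# Claude, raw mode: 53% overall with 5/5 on L3, so the rest must be L2 hits.
claude_raw_l2 = remaining_hits(53, known_hits=5)

# Qwen, with context (assumed mode): 40% overall with 5/10 on L2.
qwen_ctx_l3 = remaining_hits(40, known_hits=5)

print(f"Claude raw L2: {claude_raw_l2}/{L2_TOTAL}")
print(f"Qwen context L3: {qwen_ctx_l3}/{L3_TOTAL}")
```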

Adversarial Debate Results

When models debated each other for five rounds, bug detection jumped from 53% (best single model) to 80%. The hardest L3 bugs reached 100% detection in debate mode.
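
The source does not describe how the debate was run, so the following is only a rough sketch of one way a multi-round debate could be wired up. Everything here is an assumption: the `Reviewer` signature, the rule that each round's findings are shared with all reviewers in the next round, and the stub reviewers, which stand in for real model calls.

```python
from typing import Callable, Dict, Set

# A reviewer takes the diff plus the previous round's shared findings
# and returns the findings it currently endorses (hypothetical interface).
Reviewer = Callable[[str, Set[str]], Set[str]]


def debate(diff: str, reviewers: Dict[str, Reviewer], rounds: int = 5) -> Set[str]:
    """Run a fixed number of rounds; each round sees the prior round's findings."""
    shared: Set[str] = set()
    for _ in range(rounds):
        round_findings: Set[str] = set()
        for review in reviewers.values():
            round_findings |= review(diff, shared)
        shared = round_findings
    return shared


# Stub reviewers standing in for model calls.
def stub_a(diff: str, peers: Set[str]) -> Set[str]:
    return {"race in cache refresh"}


def stub_b(diff: str, peers: Set[str]) -> Set[str]:
    found = {"interface mismatch"}
    # Only notices the deeper issue after a peer points out the race.
    if "race in cache refresh" in peers:
        found.add("upgrade incompatibility")
    return found


single_round = debate("example diff", {"a": stub_a, "b": stub_b}, rounds=1)
five_rounds = debate("example diff", {"a": stub_a, "b": stub_b}, rounds=5)
print(sorted(single_round))
print(sorted(five_rounds))
```

In this toy setup the extra finding only appears once one reviewer has seen the other's output, which is the kind of complementary effect the debate results point to.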

The experiment reveals that different models have complementary strengths: Claude's thoroughness, Gemini's design-focused analysis when given context, Codex's concrete actionable feedback, and Qwen's strong context-assisted performance.
