AI Code Review Benchmark: Claude, Gemini, Codex, Qwen, and MiniMax Compared

By OpenClawRadar | Published: February 27, 2026

AI Code Review Performance Comparison

A recent experiment benchmarked five flagship AI models for code review using 15 pull requests from Milvus, an open-source vector database. Each PR contained known bugs that surfaced in production after merging, providing a realistic test set.

Models and Setup

The models tested were:

  • Claude Opus 4.6
  • Gemini 3 Pro
  • GPT-5.2-Codex
  • Qwen-3.5-Plus
  • MiniMax-M2.5

The benchmark used Magpie, an open-source tool that prepares context by pulling in surrounding code, call chains, and related modules before feeding it to the model.

Bug Difficulty Levels

Bugs were categorized by difficulty:

  • L1: Visible from diff alone (all models caught these, so excluded from scoring)
  • L2 (10 cases): Requires understanding surrounding code (interface changes, concurrency races)
  • L3 (5 cases): Requires system-level understanding (cross-module inconsistencies, upgrade compatibility)

Results by Model

Two evaluation modes were used:

  • Raw: Model sees only PR diff and content
  • R1: Magpie provides surrounding context
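As a rough illustration of how the two modes differ, the sketch below assembles a review prompt with and without extra context. It is not Magpie's actual API or prompt format: the ReviewInput class, its field names, and build_prompt are all hypothetical, chosen only to mirror the kinds of context described above (surrounding code, call chains, related modules).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewInput:
    """One PR to review. All field names are illustrative assumptions."""
    diff: str
    description: str
    # Extra context that a context-preparation tool might gather.
    surrounding_code: List[str] = field(default_factory=list)
    call_chains: List[str] = field(default_factory=list)
    related_modules: List[str] = field(default_factory=list)


def build_prompt(item: ReviewInput, with_context: bool) -> str:
    """Build the text sent to the model for one review.

    with_context=False corresponds to the raw mode (diff and PR content only);
    with_context=True appends whatever extra context is available.
    """
    parts = [
        "PR description:\n" + item.description,
        "Diff:\n" + item.diff,
    ]
    if with_context:
        sections = (
            ("Surrounding code", item.surrounding_code),
            ("Call chains", item.call_chains),
            ("Related modules", item.related_modules),
        )
        for title, chunks in sections:
            if chunks:  # skip empty sections
                parts.append(title + ":\n" + "\n".join(chunks))
    return "\n\n".join(parts)


# Hypothetical example input; the content is made up for illustration.
example = ReviewInput(
    diff="- old_call(x)\n+ new_call(x, timeout)",
    description="Add timeout to new_call",
    call_chains=["handler -> service -> new_call"],
)
raw_prompt = build_prompt(example, with_context=False)
context_prompt = build_prompt(example, with_context=True)
```

The only difference between the two prompts is the appended context sections, so any change in detection rate between modes can be attributed to that extra material rather than to a different task description.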

Overall detection rates (L2 + L3 only):

  • Claude: 53% raw, 47% with context
  • Gemini: 13% raw, 33% with context
  • Codex: 33% raw, 27% with context
  • MiniMax: 27% raw, 33% with context
  • Qwen: 33% raw, 40% with context
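The percentages above can be converted back into whole bug counts, which makes the gaps between models easier to judge. The sketch below assumes the rates are computed over the 15 scored cases (10 L2 plus 5 L3) and rounded to the nearest percent; the source does not state the counts directly, so the output is an inference from the reported figures.

```python
# Scored cases: 10 L2 + 5 L3 bugs (L1 was excluded from scoring).
TOTAL_CASES = 10 + 5

# Reported detection rates in percent: (raw, with context).
reported = {
    "Claude": (53, 47),
    "Gemini": (13, 33),
    "Codex": (33, 27),
    "MiniMax": (27, 33),
    "Qwen": (33, 40),
}


def implied_count(percent: int, total: int = TOTAL_CASES) -> int:
    """Return the whole number of detected bugs implied by a rounded rate."""
    return round(percent / 100 * total)


# Map each model to its implied (raw, with-context) bug counts.
counts = {
    model: (implied_count(raw), implied_count(ctx))
    for model, (raw, ctx) in reported.items()
}

for model, (raw, ctx) in counts.items():
    change = ctx - raw  # positive means context helped
    print(f"{model}: {raw}/{TOTAL_CASES} raw, {ctx}/{TOTAL_CASES} with context ({change:+d})")
```

Under that assumption, a shift such as 53% to 47% corresponds to a single bug out of 15, so small differences between modes should be read with the sample size in mind.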

Key Findings

Claude led raw review with 53% detection and a perfect 5/5 on L3 bugs. Its score fell to 47% when Magpie supplied context, which suggests it already organizes its own context well and gains little from additional material.

Gemini performed poorly in raw mode (13%) but reached 33% with context, suggesting it benefits from having context provided upfront.

Qwen reached 40% with context, the best score among the models that improved with context and second only to Claude's 47% in that mode. It also had the highest L2 bug detection (5/10).

Adversarial Debate Results

When models debated each other for five rounds, bug detection jumped from 53% (best single model) to 80%. The hardest L3 bugs reached 100% detection in debate mode.
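The source does not describe the debate protocol, so the following is only a simplified sketch of one way a multi-round review could surface more bugs than any single reviewer. It pools findings across rounds and lets each reviewer see what was raised earlier; it does not model rebuttals or voting. The debate function, the two toy reviewers, and the finding names are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Dict, Set

# A reviewer receives the diff and the findings raised in earlier rounds,
# and returns the findings it raises this round. Real reviewers would call
# a model; these signatures are an assumption made for illustration.
Reviewer = Callable[[str, Set[str]], Set[str]]


def debate(diff: str, reviewers: Dict[str, Reviewer], rounds: int = 5) -> Set[str]:
    """Run a fixed number of rounds and return the pooled findings."""
    findings: Set[str] = set()
    for _ in range(rounds):
        round_findings: Set[str] = set()
        for review in reviewers.values():
            # Every reviewer in a round sees the same prior findings.
            round_findings |= review(diff, findings)
        findings |= round_findings
    return findings


def reviewer_a(diff: str, seen: Set[str]) -> Set[str]:
    """Toy reviewer: flags a race whenever the diff touches flush()."""
    return {"race in flush path"} if "flush" in diff else set()


def reviewer_b(diff: str, seen: Set[str]) -> Set[str]:
    """Toy reviewer: only spots the second issue after the race is raised."""
    if "race in flush path" in seen:
        return {"upgrade incompatibility"}
    return set()


panel = {"a": reviewer_a, "b": reviewer_b}
single_round = debate("flush() signature changed", panel, rounds=1)
five_rounds = debate("flush() signature changed", panel, rounds=5)
print(sorted(single_round))
print(sorted(five_rounds))
```

In this toy setup the second finding appears only after the first round, because one reviewer builds on what another raised. That is one plausible mechanism behind the reported jump, not a confirmed account of how the experiment worked.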

The experiment reveals that different models have complementary strengths: Claude's thoroughness, Gemini's design-focused analysis when given context, Codex's concrete actionable feedback, and Qwen's strong context-assisted performance.

📖 Read the full source: HN AI Agents
