AI Code Review Benchmark: Claude, Gemini, Codex, Qwen, and MiniMax Compared

AI Code Review Performance Comparison
A recent experiment benchmarked five flagship AI models for code review using 15 pull requests from Milvus, an open-source vector database. Each PR contained known bugs that surfaced in production after merging, providing a realistic test set.
Models and Setup
The models tested were:
- Claude Opus 4.6
- Gemini 3 Pro
- GPT-5.2-Codex
- Qwen-3.5-Plus
- MiniMax-M2.5
The benchmark used Magpie, an open-source tool that prepares context by pulling in surrounding code, call chains, and related modules before feeding it to the model.
Bug Difficulty Levels
Bugs were categorized by difficulty:
- L1: Visible from diff alone (all models caught these, so excluded from scoring)
- L2 (10 cases): Requires understanding surrounding code (interface changes, concurrency races)
- L3 (5 cases): Requires system-level understanding (cross-module inconsistencies, upgrade compatibility)
Results by Model
Two evaluation modes were used:
- Raw: Model sees only PR diff and content
- R1: Magpie provides surrounding context
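The difference between the two modes comes down to what text the model is handed. A minimal sketch of that idea, assuming a hypothetical `ReviewInput` record and a hypothetical `build_review_prompt` helper; neither is part of Magpie, whose actual interface the source does not describe:

```python
from dataclasses import dataclass, field


@dataclass
class ReviewInput:
    """Hypothetical container for one pull request under review."""
    title: str
    diff: str
    # Extra material a context tool might gather:
    # surrounding code, call chains, related modules.
    context_snippets: list[str] = field(default_factory=list)


def build_review_prompt(pr: ReviewInput, mode: str) -> str:
    """Assemble the prompt text for 'raw' or 'r1' mode (illustrative only)."""
    if mode not in ("raw", "r1"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = [f"Review this pull request: {pr.title}", "Diff:", pr.diff]
    # Only the context-assisted mode appends the gathered snippets.
    if mode == "r1" and pr.context_snippets:
        parts.append("Surrounding context:")
        parts.extend(pr.context_snippets)
    return "\n\n".join(parts)
```

In this sketch the raw prompt is a strict prefix of the R1 prompt, so any change in detection between modes can be attributed to the appended context rather than to different instructions.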
Overall detection rates (L2 + L3 only):
- Claude: 53% raw, 47% with context
- Gemini: 13% raw, 33% with context
- Codex: 33% raw, 27% with context
- MiniMax: 27% raw, 33% with context
- Qwen: 33% raw, 40% with context
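Because only 15 cases are scored (10 L2 plus 5 L3), each percentage corresponds to a whole number of bugs. A minimal sketch of that conversion, assuming the reported percentages are rounded from counts out of 15; the source gives only percentages, so the counts are inferred:

```python
# Scored cases: 10 L2 + 5 L3 (L1 was excluded from scoring).
TOTAL_CASES = 10 + 5

# Detection rates as reported, in percent: (raw, with context).
reported = {
    "Claude": (53, 47),
    "Gemini": (13, 33),
    "Codex": (33, 27),
    "MiniMax": (27, 33),
    "Qwen": (33, 40),
}


def to_count(percent: int, total: int = TOTAL_CASES) -> int:
    """Infer the number of bugs caught from a rounded percentage."""
    return round(percent / 100 * total)


# Inferred (raw, context) bug counts per model.
counts = {m: (to_count(raw), to_count(ctx)) for m, (raw, ctx) in reported.items()}

for model, (raw, ctx) in counts.items():
    print(f"{model:8s} raw {raw:2d}/{TOTAL_CASES}  "
          f"context {ctx:2d}/{TOTAL_CASES}  change {ctx - raw:+d}")
```

With a test set this small, a single bug moves a score by roughly 7 percentage points, so narrow gaps between models should be read with caution.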
Key Findings
Claude led raw review with 53% detection and a perfect 5/5 on L3 bugs. Its score dropped to 47% with added context, which suggests the model already organizes context well on its own and gains little from extra material.
Gemini performed poorly in raw mode (13%) but improved significantly with context (33%), suggesting it needs context provided upfront.
Qwen was the strongest context-assisted performer at 40%, with the highest L2 bug detection (5/10).
Adversarial Debate Results
When models debated each other for five rounds, bug detection jumped from 53% (best single model) to 80%. The hardest L3 bugs reached 100% detection in debate mode.
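The source does not describe the debate protocol beyond the five rounds, so the following is only one possible shape and not the benchmark's implementation. It assumes each reviewer is a hypothetical callable that receives the diff plus every finding raised so far and returns the findings it currently endorses. The majority-vote rule at the end is also an assumption.

```python
from typing import Callable

# Hypothetical reviewer signature:
# (diff, findings raised so far) -> findings this reviewer endorses now.
Reviewer = Callable[[str, set[str]], set[str]]


def debate(diff: str, reviewers: dict[str, Reviewer], rounds: int = 5) -> set[str]:
    """Run a fixed number of rounds and keep findings that a majority
    of reviewers endorse in the final round (assumed rule)."""
    shared: set[str] = set()
    final_votes: dict[str, set[str]] = {}
    for _ in range(rounds):
        # Each reviewer sees a copy of everything raised in earlier rounds.
        round_votes = {
            name: review(diff, set(shared)) for name, review in reviewers.items()
        }
        # Findings raised this round become visible to everyone next round.
        shared = shared.union(*round_votes.values())
        final_votes = round_votes
    majority = len(reviewers) // 2 + 1
    return {
        finding
        for finding in shared
        if sum(finding in votes for votes in final_votes.values()) >= majority
    }
```

In a real setup each callable would wrap a model call. The point of the sketch is that a finding one reviewer raises can be picked up by the others in later rounds, which is one plausible way combined detection could exceed that of any single model.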
The experiment suggests that the models have complementary strengths: Claude's thoroughness, Gemini's design-focused analysis when given context, Codex's concrete, actionable feedback, and Qwen's strong context-assisted performance.
📖 Read the full source: HN AI Agents