Claude Opus 4.1 scores 17.75% on SWE-Bench Pro's private dataset, highlighting memorization vs. reasoning gap

✍️ OpenClawRadar · 📅 Published: March 9, 2026

Benchmark results show significant performance gap

Claude Opus 4.1 achieved 80%+ on SWE-Bench Verified but scored only 17.75% on SWE-Bench Pro's private dataset. That dataset contains 276 tasks from 18 proprietary startup codebases that have never been on GitHub, so the code cannot have leaked into training data. This goes a step beyond the benchmark's public portion, which relies on GPL-licensed public repositories to limit contamination.

Other model results on the same private dataset: GPT-5.2 scored 23.81% (topping the leaderboard) and Gemini 3 Pro scored 17.95%.
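To put the Claude Opus 4.1 numbers in perspective, the drop can be expressed both in percentage points and as the share of the public score that survives on private code. The following is a minimal sketch, not part of the original analysis: the function name benchmark_gap is hypothetical, and 80.0 is used as a lower bound because the reported figure is "80%+".

```python
def benchmark_gap(public_pct, private_pct):
    """Return (gap in percentage points, fraction of the public score retained)."""
    gap_points = public_pct - private_pct
    retained = private_pct / public_pct
    return gap_points, retained


# Assumption: 80.0 is a lower bound, since the reported score is "80%+".
gap, retained = benchmark_gap(80.0, 17.75)
print(f"gap: {gap:.2f} points, retained: {retained:.1%}")
# prints "gap: 62.25 points, retained: 22.2%"
```

Because 80 is a lower bound, the real gap is at least 62.25 points and the retained share is at most about 22%.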

Trajectory analysis reveals memorization behavior

Scale AI's analysis found that, on familiar repositories, models could identify the correct file paths to modify before fully reading the problem description. This indicates they were navigating by memory rather than reasoning through the problem.
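The kind of signal described above can be illustrated with a simple heuristic over an agent's action log. This is only a sketch under stated assumptions, not Scale AI's actual method: the Step record, the action names ("read_issue", "open_file", "edit_file"), and the file paths are all hypothetical, and reading the problem description is simplified to a single step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One hypothetical agent action in a trajectory log."""
    kind: str         # assumed values: "read_issue", "open_file", "edit_file"
    target: str = ""  # file path for file actions, empty otherwise


def touched_gold_file_before_reading(steps, gold_files):
    """Return True if the agent opened or edited a file from the reference
    patch before its first "read_issue" step."""
    gold = set(gold_files)
    for step in steps:
        if step.kind == "read_issue":
            return False  # the issue was read first; nothing to flag
        if step.kind in ("open_file", "edit_file") and step.target in gold:
            return True
    return False


# Hypothetical trajectories for illustration only.
gold = ["src/billing/invoice.py"]
memorized = [
    Step("open_file", "src/billing/invoice.py"),
    Step("read_issue"),
    Step("edit_file", "src/billing/invoice.py"),
]
reasoned = [
    Step("read_issue"),
    Step("open_file", "README.md"),
    Step("open_file", "src/billing/invoice.py"),
]

print(touched_gold_file_before_reading(memorized, gold))  # True
print(touched_gold_file_before_reading(reasoned, gold))   # False
```

A flag from a check like this is suggestive rather than conclusive, since an agent could also reach the right file early by chance or through a conventional project layout.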

The 80% score on SWE-Bench Verified was real, but it measured a different capability than most people assumed: primarily recall of training data rather than reasoning about novel code.

Practical implications for AI coding tool deployment

For developers deciding where to deploy AI coding tools in their workflow, the distinction between memory and reasoning matters more than headline benchmark numbers. Models that perform well on contaminated benchmarks may struggle with codebases they have not seen during training.

SWE-Bench Pro was created specifically to address this contamination issue by using code that has never been publicly available on GitHub or in training datasets.

📖 Read the full source: r/ClaudeAI
