Claude Opus 4.1 scores 17.75% on SWE-Bench Pro's private dataset, highlighting memorization vs. reasoning gap

✍️ OpenClawRadar | 📅 Published: March 9, 2026

Benchmark results show significant performance gap

Claude Opus 4.1 achieved 80%+ on SWE-Bench Verified, but scored only 17.75% on SWE-Bench Pro's private dataset. That dataset contains 276 tasks from 18 proprietary startup codebases that have never been on GitHub, a design meant to keep the tasks out of model training data. The benchmark's public portion takes a different route to limiting contamination, drawing on GPL-licensed public repositories.

Other model results on the same private dataset: GPT-5.2 scored 23.81% (topping the leaderboard) and Gemini 3 Pro scored 17.95%.
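To put the reported numbers side by side, here is a minimal Python sketch. It uses only the scores quoted above and treats the "80%+" SWE-Bench Verified figure as a floor of exactly 80.0, which is an assumption made for the arithmetic rather than a reported value.

```python
# Scores on SWE-Bench Pro's private dataset, as quoted in this article.
private_scores = {
    "Claude Opus 4.1": 17.75,
    "GPT-5.2": 23.81,
    "Gemini 3 Pro": 17.95,
}

# Assumption: "80%+" on SWE-Bench Verified is treated as a floor of 80.0.
verified_floor = 80.0

# Rank models from highest to lowest private-set score.
ranked = sorted(private_scores.items(), key=lambda item: item[1], reverse=True)

# Minimum drop, in percentage points, between the two benchmarks for Opus 4.1.
drop_points = verified_floor - private_scores["Claude Opus 4.1"]

for name, score in ranked:
    print(f"{name}: {score:.2f}%")
print(f"Claude Opus 4.1 drop: at least {drop_points:.2f} points")
```

Because the Verified figure is a floor, the computed drop is a lower bound on the real gap.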

Trajectory analysis reveals memorization behavior

Scale AI's analysis found that, on familiar repositories, models could identify the correct file paths to modify before fully reading the problem description. This suggests they were navigating by memory rather than reasoning through the problem.
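One way to picture this kind of trajectory check is the sketch below. It is not Scale AI's method: the `Step` record, the action labels, the file path, and the `jumped_to_gold_file` helper are all hypothetical, invented for illustration. The heuristic flags a run when the first file the agent touches is one of the files changed in the reference fix, with no search or directory listing beforehand.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Hypothetical action labels: "search", "list_dir", "open_file", "edit_file".
    action: str
    target: str = ""


EXPLORATION = {"search", "list_dir"}
FILE_ACCESS = {"open_file", "edit_file"}


def jumped_to_gold_file(trajectory, gold_files):
    """Return True if the first file touched is a gold file and no
    exploration step (search or directory listing) came before it."""
    explored = False
    for step in trajectory:
        if step.action in EXPLORATION:
            explored = True
        elif step.action in FILE_ACCESS:
            # Only the first file access decides the outcome.
            return step.target in gold_files and not explored
    return False


# Hypothetical example data: one gold file and two made-up trajectories.
gold = {"src/billing/invoice.py"}
memorized = [
    Step("open_file", "src/billing/invoice.py"),
    Step("edit_file", "src/billing/invoice.py"),
]
reasoned = [
    Step("search", "invoice total"),
    Step("open_file", "src/billing/invoice.py"),
]

print(jumped_to_gold_file(memorized, gold))
print(jumped_to_gold_file(reasoned, gold))
```

A flag from a heuristic like this is only suggestive: an agent could also land on the right file through a lucky guess or a path named in the issue text, so flagged runs would still need manual review.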

The 80%+ score on SWE-Bench Verified was real, but it measured a different capability than most people assumed: primarily recall of training data rather than reasoning about novel code.

Practical implications for AI coding tool deployment

For developers deciding where to deploy AI coding tools in their workflow, the distinction between memory and reasoning matters more than headline benchmark numbers. Models that perform well on contaminated benchmarks may struggle with truly novel codebases they haven't seen during training.

SWE-Bench Pro was created specifically to address this contamination issue by using code that has never been publicly available on GitHub or in training datasets.

📖 Read the full source: r/ClaudeAI
