Claude Opus 4.1 scores 17.75% on SWE-Bench Pro's private dataset, highlighting memorization vs. reasoning gap

Benchmark results show significant performance gap
Claude Opus 4.1 achieved over 80% on SWE-Bench Verified but only 17.75% on SWE-Bench Pro's private dataset. That dataset contains 276 tasks drawn from 18 proprietary startup codebases that have never been on GitHub, a design meant to eliminate the data contamination that can occur through GPL-licensed public repositories.
Other model results on the same private dataset: GPT-5.2 scored 23.81% (topping the leaderboard) and Gemini 3 Pro scored 17.95%.
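To make the size of the gap concrete, here is a minimal arithmetic sketch using only the scores quoted above. It assumes 80.0 as a lower bound for the "80%+" SWE-Bench Verified figure; the function name `score_gap` is illustrative and not part of any benchmark tooling.

```python
# Scores quoted in the text, in percent.
# Assumption: 80.0 is a lower bound standing in for "80%+".
VERIFIED_SCORE = 80.0
PRIVATE_SCORES = {
    "Claude Opus 4.1": 17.75,
    "GPT-5.2": 23.81,
    "Gemini 3 Pro": 17.95,
}


def score_gap(public: float, private: float) -> tuple:
    """Return (drop in percentage points, fraction of the public score retained)."""
    return public - private, private / public


drop, retained = score_gap(VERIFIED_SCORE, PRIVATE_SCORES["Claude Opus 4.1"])
print(f"drop: {drop:.2f} points, retained: {retained:.1%}")

# Spread between the best and worst private-set scores listed above.
spread = max(PRIVATE_SCORES.values()) - min(PRIVATE_SCORES.values())
print(f"spread on private set: {spread:.2f} points")
```

Under that assumption, the drop from the public to the private benchmark (about 62 points) is roughly ten times the spread between the three models on the private set (about 6 points).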
Trajectory analysis reveals memorization behavior
Scale AI's analysis found that, on familiar repositories, models could identify the correct file paths to modify before fully reading the problem description. This indicates they were navigating by memory rather than reasoning through the problem.
The 80% score on SWE-Bench Verified was real, but it measured a different capability than most people assumed: primarily recall of training data rather than reasoning about novel code.
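One rough way to look for this pattern in your own agent logs is to count how many steps a trajectory takes before it first touches a file that the reference fix modifies. The sketch below is a hypothetical proxy, not Scale AI's method: the `Step` record, its action labels, and the file paths are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class Step:
    action: str      # hypothetical labels, e.g. "search", "list_dir", "open", "edit"
    path: str = ""   # file or directory touched by the step, if any


def steps_before_first_hit(steps: Sequence[Step], gold_files: Set[str]) -> Optional[int]:
    """Return how many steps precede the first touch of a gold file.

    Returns None if the trajectory never touches a gold file.
    A value of 0 means the agent went straight to the right file.
    """
    for index, step in enumerate(steps):
        if step.path in gold_files:
            return index
    return None


# Hypothetical example data.
gold = {"src/auth/session.py"}
direct = [
    Step("open", "src/auth/session.py"),
    Step("edit", "src/auth/session.py"),
]
explored = [
    Step("search"),
    Step("list_dir", "src"),
    Step("open", "src/auth/login.py"),
    Step("open", "src/auth/session.py"),
]

print(steps_before_first_hit(direct, gold))    # 0
print(steps_before_first_hit(explored, gold))  # 3
```

A consistently low count on public repositories alongside a much higher count on private ones would be consistent with recall-driven navigation, though it is not proof on its own: a well-written issue can also point directly at the file.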
Practical implications for AI coding tool deployment
For developers deciding where to deploy AI coding tools in their workflow, the distinction between memory and reasoning matters more than headline benchmark numbers. Models that perform well on contaminated benchmarks may struggle with codebases they did not see during training.
SWE-Bench Pro was created specifically to address this contamination issue by using code that has never been publicly available on GitHub or in training datasets.
📖 Read the full source: r/ClaudeAI