Open-source models match or beat Claude Opus 4.6 on benchmarks

Benchmark Results
A detailed comparison of open-source models against Claude Opus 4.6 shows competitive or superior performance across multiple categories.
General Reasoning: DeepSeek V3.2
DeepSeek V3.2 holds its own against proprietary models, with its high-compute variant (V3.2-Speciale) surpassing GPT-5.
- SWE-bench Verified: Claude Opus 4.6: 80.8%, DeepSeek V3.2: 73.0%
- LiveCodeBench: Claude Opus 4.6: 76, DeepSeek V3.2: 74.1
- MMLU-Pro: DeepSeek V3.2: 85.0%, Claude Opus 4.6: 82.0%
DeepSeek V3.2 has strong multilingual support (CJK, Arabic, European languages) and a 128K context window with sparse attention, but it falls short on creative writing and some structured-output edge cases. Inference: ~60 tok/s output, 1.18s TTFT. Production-ready for 90%+ of general use cases. 5x cheaper than GPT-5, 20x cheaper than Opus 4.6.
Reasoning: DeepSeek R1
DeepSeek R1 beats expensive reasoning models on several benchmarks.
- Humanity's Last Exam: DeepSeek R1: 50.2%, Claude Opus 4.6: 40.0%
- MMLU-Pro: DeepSeek R1: 88.9%, Claude Opus 4.6: 82.0%
Inference: ~30 tok/s output, ~2s TTFT. Slower than non-reasoning models due to chain-of-thought processing. Best open-source reasoning model. Matches GPT-5.2 Pro on HLE. 30x cheaper than o1.
Agentic: Kimi K2.5
Kimi K2.5 has 1 trillion total parameters, with 32B active per token via mixture-of-experts (MoE), and a 256K context window. It is open-source under a modified MIT license.
- Tool use improvement: Kimi K2.5: +20.1 pts, Claude Opus 4.6: +12.4 pts, GPT-5.2: +11.0 pts
- SWE-bench Verified: Claude Opus 4.6: 80.8%, Kimi K2.5: 76.8%
- Humanity's Last Exam: Kimi K2.5: 50.2%, Claude Opus 4.6: 40.0%
Kimi K2.5 can autonomously spawn up to 100 sub-agents in parallel and handle 1,500+ tool calls without human intervention. Inference (Turbo variant): 334 tok/s output, 0.31s TTFT. Best model for autonomous agent workloads: the highest output speed and the largest tool-use gain listed here, a TTFT second only to Llama 3.1 8B (0.2s), and competitive scores on every benchmark listed.
Code: MiniMax M2.5
MiniMax M2.5 is now one of the best coding models, landing within a point of Claude Opus 4.6 on SWE-bench Verified.
- SWE-bench Verified: Claude Opus 4.6: 80.8%, MiniMax M2.5: 80.2%, GLM-5: 77.8%
MiniMax released M2.7 on March 18, a "self-evolving" model priced at $0.30/$1.20 per M tokens. 96th percentile on coding accuracy, perfect score on general knowledge. One of the cheapest frontier models available. On SWE-bench Verified, open-source coding models effectively match the best proprietary model.
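To put the quoted M2.7 price in concrete terms, here is a minimal sketch of the per-request cost arithmetic. It assumes the $0.30/$1.20 figures are the input and output prices per million tokens, in that order (the usual convention, but not stated in the source). The function name and the example token counts are illustrative only.

```python
# Assumption: $0.30 per 1M input tokens and $1.20 per 1M output tokens.
# The source quotes "$0.30/$1.20 per M tokens" without labeling the two sides.
INPUT_USD_PER_M = 0.30
OUTPUT_USD_PER_M = 1.20


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one request at the quoted prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical coding request: 20K tokens of context in, 2K tokens out.
    print(f"${request_cost_usd(20_000, 2_000):.4f}")
```

Under that assumption, a 20K-in, 2K-out request costs a little under one cent.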
Speed Comparison
For production, latency matters as much as quality.
Output speed (tokens/second):
- Kimi K2.5 Turbo: 334
- Llama 3.1 8B: ~200
- GLM 4.7 Flash: ~150
- DeepSeek V3.2: ~60
- Claude Opus 4.6: 46
- DeepSeek R1: ~30
Time to first token (TTFT):
- Llama 3.1 8B: 0.2s
- Kimi K2.5 Turbo: 0.31s
- GLM 4.7 Flash: 0.51s
- DeepSeek V3.2: 1.18s
Kimi K2.5 Turbo at 334 tok/s is roughly 7.3x faster than Opus 4.6 at 46 tok/s.
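Output speed and TTFT combine into the wall-clock time a user actually waits. The sketch below is a rough estimate that assumes a response streams at a constant rate after the first token, which is a simplification because real throughput varies with load and provider. It uses only the models for which both figures are listed above, and the 500-token response length is an arbitrary example.

```python
# (output tokens/second, TTFT in seconds), copied from the two lists above.
# Values marked "~" in the source are approximate.
MODELS = {
    "Llama 3.1 8B": (200, 0.20),
    "Kimi K2.5 Turbo": (334, 0.31),
    "GLM 4.7 Flash": (150, 0.51),
    "DeepSeek V3.2": (60, 1.18),
}


def response_time_s(tokens_per_s: float, ttft_s: float, n_tokens: int) -> float:
    """Estimated time to finish n_tokens: TTFT plus steady-rate generation."""
    return ttft_s + n_tokens / tokens_per_s


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two output speeds."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    for name, (tps, ttft) in MODELS.items():
        print(f"{name}: {response_time_s(tps, ttft, 500):.2f}s for 500 tokens")
    print(f"Kimi K2.5 Turbo vs Opus 4.6: {speedup(334, 46):.1f}x")
```

On these numbers, Llama 3.1 8B's lower TTFT only wins for very short replies; beyond roughly 55 output tokens, Kimi K2.5 Turbo's higher throughput makes it finish first.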
Vision
Open-source vision has caught up for document processing and standard image analysis. Llama 4 Scout, Qwen VL, and others handle document extraction (invoices, receipts, forms), diagram understanding, and multi-image reasoning well. They still fall short on fine-grained spatial reasoning and non-Latin handwriting.
Overall Comparison
Best open-source model in each category compared to Claude Opus 4.6 (Opus = 100% on each axis):
- Code (SWE-bench): Open-source 80.2% vs Opus 80.8% — Opus wins by 0.6 pts. Basically tied.
- Knowledge (MMLU-Pro): Open-source 88.9% vs Opus 82.0% — Open-source wins by 6.9 pts.
- Speed (tok/s): Open-source 334 vs Opus 46 — Open-source is 7.3x faster.
- Tool Use (improvement): Open-source +20.1 pts vs Opus +12.4 pts — Open-source wins by 7.7 pts.
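The "Opus = 100%" framing can be reproduced by dividing each open-source figure by the Opus figure on the same axis. Here is a minimal sketch using the four rows above. The axis labels and function name are illustrative. Note that the point gap is meaningful for the score axes, while speed is better read as a ratio.

```python
# (axis, best open-source value, Claude Opus 4.6 value), from the list above.
AXES = [
    ("Code (SWE-bench Verified, %)", 80.2, 80.8),
    ("Knowledge (MMLU-Pro, %)", 88.9, 82.0),
    ("Speed (tok/s)", 334.0, 46.0),
    ("Tool use (improvement, pts)", 20.1, 12.4),
]


def normalized(open_source: float, opus: float) -> float:
    """Open-source value as a percentage of the Opus value (Opus = 100)."""
    return 100.0 * open_source / opus


if __name__ == "__main__":
    for axis, oss, opus in AXES:
        print(f"{axis}: {normalized(oss, opus):.1f}% of Opus, gap {oss - opus:+.1f}")
```

Normalized this way, code lands just under 100%, knowledge and tool use land above it, and speed dominates the scale at over 700%, so a chart of these axes is easier to read with speed plotted separately or on a log scale.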
📖 Read the full source: r/LocalLLaMA