DeepSeek V3.2, Kimi K2.5 Beat Claude Opus 4.6 on Benchmarks

Benchmark Results

A detailed comparison of open-source models against Claude Opus 4.6 shows competitive or superior performance across multiple categories.

General Reasoning: DeepSeek V3.2

DeepSeek V3.2 holds its own against proprietary models, with its high-compute variant (V3.2-Speciale) surpassing GPT-5.

SWE-bench Verified: Claude Opus 4.6: 80.8%, DeepSeek V3.2: 73.0%
LiveCodeBench: Claude Opus 4.6: 76, DeepSeek V3.2: 74.1
MMLU-Pro: DeepSeek V3.2: 85.0%, Claude Opus 4.6: 82.0%

DeepSeek V3.2 has strong multilingual support (CJK, Arabic, European languages) and a 128K context window with sparse attention, but it falls short on creative writing and some structured-output edge cases. Inference: ~60 tok/s output, 1.18s TTFT. It is production-ready for 90%+ of general use cases, 5x cheaper than GPT-5, and 20x cheaper than Opus 4.6.

Reasoning: DeepSeek R1

DeepSeek R1 beats expensive reasoning models on several benchmarks.

Humanity's Last Exam: DeepSeek R1: 50.2%, Claude Opus 4.6: 40.0%
MMLU-Pro: DeepSeek R1: 88.9%, Claude Opus 4.6: 82.0%

Inference: ~30 tok/s output, ~2s TTFT, which is slower than non-reasoning models because of chain-of-thought processing. It is the best open-source reasoning model, matches GPT-5.2 Pro on HLE, and is 30x cheaper than o1.

Agentic: Kimi K2.5

1 trillion parameters (32B active per token via MoE). 256K context. Open-source under modified MIT.
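
As a quick illustration of what the mixture-of-experts figures above imply, the share of weights active for any one token can be worked out directly. This is only arithmetic on the two numbers quoted above (treating "1 trillion" and "32B" as exact, which is an assumption); it says nothing about how the routing itself works.

```python
# Parameter counts as quoted above; treated as exact for illustration.
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B parameters active per token

# Fraction of the model's weights used for any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Active per token: {active_fraction:.1%}")  # prints "Active per token: 3.2%"
```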

Tool use improvement: Kimi K2.5: +20.1 pts, Claude Opus 4.6: +12.4 pts, GPT-5.2: +11.0 pts
SWE-bench Verified: Claude Opus 4.6: 80.8%, Kimi K2.5: 76.8%
Humanity's Last Exam: Kimi K2.5: 50.2%, Claude Opus 4.6: 40.0%

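Kimi K2.5 can autonomously spawn up to 100 sub-agents in parallel and handle 1,500+ tool calls without human intervention. Inference: 334 tok/s output, 0.31s TTFT. It is the best model for autonomous agent workloads: highest output speed, second-fastest TTFT among the models compared here (after Llama 3.1 8B at 0.2s), best tool-use improvement, and competitive on the other benchmarks.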

Code: MiniMax M2.5

MiniMax M2.5 has become one of the best coding models.

SWE-bench Verified: Claude Opus 4.6: 80.8%, MiniMax M2.5: 80.2%, GLM-5: 77.8%

MiniMax released M2.7 on March 18, a "self-evolving" model priced at $0.30/$1.20 per M tokens. It scores in the 96th percentile on coding accuracy and gets a perfect score on general knowledge, and it is one of the cheapest frontier models available. Open-source coding models now effectively match the best proprietary model.
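
To make the per-token pricing concrete, here is a minimal sketch that turns the quoted $0.30/$1.20 figures into a cost estimate. It assumes the first figure is the input price and the second the output price (the usual ordering, but not stated in the source), and the workload sizes are hypothetical.

```python
# Prices quoted above for MiniMax M2.7, in USD per million tokens.
# Assumption: the first figure is the input price, the second the output price.
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 1.20


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-million-token prices."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


# Hypothetical workload: 10,000 requests, each 4,000 tokens in and 1,000 tokens out.
REQUESTS = 10_000
per_request = request_cost(4_000, 1_000)
total = per_request * REQUESTS

print(f"per request: ${per_request:.4f}, total for {REQUESTS} requests: ${total:.2f}")
```

Because output tokens cost four times as much as input tokens under this assumption, output-heavy workloads such as long code generation will see a different balance than this example.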

Speed Comparison

For production, latency matters as much as quality.

Output speed (tokens/second):

Kimi K2.5 Turbo: 334
Llama 3.1 8B: ~200
GLM 4.7 Flash: ~150
DeepSeek V3.2: ~60
Claude Opus 4.6: 46
DeepSeek R1: ~30

Time to first token (TTFT):

Llama 3.1 8B: 0.2s
Kimi K2.5 Turbo: 0.31s
GLM 4.7 Flash: 0.51s
DeepSeek V3.2: 1.18s

Kimi K2.5 Turbo at 334 tok/s is about 7.3x faster than Opus 4.6 at 46 tok/s.
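
The two lists above can be combined into a rough end-to-end latency estimate. The sketch below assumes a simple model, total time = TTFT + output tokens / output speed, which ignores network overhead, queuing, and prompt length; the 500-token response size is a hypothetical choice, and the "~" figures are treated as exact.

```python
# (output tokens per second, time to first token in seconds), copied from the lists above.
# Figures marked "~" in the text are approximate and treated as exact here.
MODELS = {
    "Kimi K2.5 Turbo": (334, 0.31),
    "Llama 3.1 8B": (200, 0.2),
    "GLM 4.7 Flash": (150, 0.51),
    "DeepSeek V3.2": (60, 1.18),
}


def response_time(tok_per_s: float, ttft_s: float, output_tokens: int) -> float:
    """Estimate seconds to receive a full response: TTFT plus steady-state generation."""
    return ttft_s + output_tokens / tok_per_s


OUTPUT_TOKENS = 500  # hypothetical response length

estimates = {
    name: response_time(speed, ttft, OUTPUT_TOKENS)
    for name, (speed, ttft) in MODELS.items()
}

# Print fastest first.
for name, seconds in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name:<16} {seconds:5.2f} s")
```

For short responses TTFT dominates, so Llama 3.1 8B can finish first; for longer responses output speed dominates and Kimi K2.5 Turbo pulls ahead.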

Vision

Open-source vision has caught up for document processing and standard image analysis. Llama 4 Scout, Qwen VL, and others handle document extraction (invoices, receipts, forms), diagram understanding, and multi-image reasoning well. They still fall short on fine-grained spatial reasoning and non-Latin handwriting.

Overall Comparison

Best open-source model in each category compared to Claude Opus 4.6 (Opus = 100% on each axis):

Code (SWE-bench): Open-source 80.2% vs Opus 80.8%. Opus wins by 0.6 pts, basically tied.
Knowledge (MMLU-Pro): Open-source 88.9% vs Opus 82.0%. Open-source wins by 6.9 pts.
Speed (tok/s): Open-source 334 vs Opus 46. Open-source is 7.3x faster.
Tool Use (improvement): Open-source +20.1 pts vs Opus +12.4 pts. Open-source wins by 7.7 pts.
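
The "Opus = 100%" framing can be reproduced from the raw figures listed here. The following is a small sketch of that normalization, assuming each axis is simply the open-source value divided by the Opus value; the source does not state how its normalized values were computed.

```python
# (best open-source value, Claude Opus 4.6 value) for each axis, as listed above.
AXES = {
    "Code (SWE-bench, %)": (80.2, 80.8),
    "Knowledge (MMLU-Pro, %)": (88.9, 82.0),
    "Speed (tok/s)": (334, 46),
    "Tool use (improvement, pts)": (20.1, 12.4),
}


def relative_to_opus(open_source: float, opus: float) -> float:
    """Express the open-source score as a percentage of the Opus score (Opus = 100)."""
    return open_source / opus * 100


normalized = {
    axis: relative_to_opus(open_source, opus)
    for axis, (open_source, opus) in AXES.items()
}

for axis, value in normalized.items():
    print(f"{axis:<28} {value:6.1f}% of Opus")
```

On this scale, anything above 100% is an axis where the best open-source model leads, and the speed axis is far larger than the others, so a chart of these values would need a capped or logarithmic scale to stay readable.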

📖 Read the full source: r/LocalLLaMA