Xiaomi Open-Sources MiMo-V2.5-Pro: Nears Claude Opus 4.6 on Coding Benchmarks

Xiaomi has released the MiMo-V2.5 family of open-source models. The Pro variant posts coding benchmark scores competitive with Claude Opus 4.6 and GPT-5.4.
Real-World Tests
V2.5-Pro completed a Peking University compiler project (a SysY compiler written in Rust) in 4.3 hours with a perfect score of 233/233, higher than most students achieve after weeks of work. Given a vague prompt like "build a video editor," it autonomously produced an 8,192-line desktop application with a multi-track timeline, clip trimming, crossfades, audio mixing, and an export pipeline, taking 11.5 hours and 1,868 tool calls. In a graduate-level analog circuit design task (a Flipped-Voltage-Follower LDO in TSMC 180nm), it iterated via ngspice simulation and improved line regulation by 22× and load regulation by 17× over its own initial attempt.
Benchmarks vs. Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek V4 Pro
- SWE-Bench Pro: 57.2 (vs. 57.3 Claude, 57.7 GPT, 54.2 Gemini, 55.4 DeepSeek)
- SWE-Bench Verified: 78.9 (vs. 80.8 Claude, n/a GPT, 76.2 Gemini, 80.6 DeepSeek)
- Terminal-Bench 2.0: 68.4 (vs. 65.4 Claude, 75.1 GPT, 68.5 Gemini, 67.9 DeepSeek); leads Claude and DeepSeek, 0.1 behind Gemini
- Claw-Eval Pass@3: 63.8 (vs. 70.4 Claude, 60.3 GPT, 57.8 Gemini, 59.8 DeepSeek); beats GPT, Gemini, and DeepSeek
- HLE with tools: 48.0 (vs. 53.0 Claude, 58.7 GPT, 51.4 Gemini, 48.2 DeepSeek); lags on general reasoning
- GDPVal-AA: 1581 (vs. 1606 Claude, 1674 GPT, 1317 Gemini, 1554 DeepSeek); lags GPT and Claude
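The per-benchmark comparisons above are easy to misread, so here is a minimal sketch that recomputes margins and rankings from the listed numbers. The scores are transcribed from the list; the short model labels, the `SCORES` table, and the `margins` and `rank` helpers are illustrative names, and the sketch assumes higher is better on every benchmark.

```python
# Scores transcribed from the list above. None marks the score reported as n/a.
# Assumption: higher is better on every benchmark.
SCORES = {
    "SWE-Bench Pro":      {"MiMo": 57.2, "Claude": 57.3, "GPT": 57.7, "Gemini": 54.2, "DeepSeek": 55.4},
    "SWE-Bench Verified": {"MiMo": 78.9, "Claude": 80.8, "GPT": None, "Gemini": 76.2, "DeepSeek": 80.6},
    "Terminal-Bench 2.0": {"MiMo": 68.4, "Claude": 65.4, "GPT": 75.1, "Gemini": 68.5, "DeepSeek": 67.9},
    "Claw-Eval Pass@3":   {"MiMo": 63.8, "Claude": 70.4, "GPT": 60.3, "Gemini": 57.8, "DeepSeek": 59.8},
    "HLE with tools":     {"MiMo": 48.0, "Claude": 53.0, "GPT": 58.7, "Gemini": 51.4, "DeepSeek": 48.2},
    "GDPVal-AA":          {"MiMo": 1581, "Claude": 1606, "GPT": 1674, "Gemini": 1317, "DeepSeek": 1554},
}


def margins(benchmark):
    """Return MiMo's score minus each competitor's score (positive = MiMo ahead)."""
    row = SCORES[benchmark]
    mimo = row["MiMo"]
    return {
        name: round(mimo - score, 1)
        for name, score in row.items()
        if name != "MiMo" and score is not None
    }


def rank(benchmark):
    """Return (MiMo's position, number of models with a reported score)."""
    reported = {name: s for name, s in SCORES[benchmark].items() if s is not None}
    ordered = sorted(reported, key=reported.get, reverse=True)
    return ordered.index("MiMo") + 1, len(ordered)


if __name__ == "__main__":
    for name in SCORES:
        position, total = rank(name)
        print(f"{name}: rank {position}/{total}, margins {margins(name)}")
```

Running the script prints one line per benchmark, which makes it straightforward to check each summary remark against the underlying numbers.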
On Claw-Eval, Xiaomi's token efficiency chart also claims that V2.5-Pro (63.8) beats Claude Sonnet 4.6. V2.5-Pro supports sustained task execution over 1,000+ tool calls with self-correction; in one case, a refactoring pass that introduced a regression at turn 512 was caught and fixed autonomously.
The weights are now openly available for download and self-hosting.
📖 Read the full source: HN AI Agents