MiMo-V2.5-Pro Benchmarked: Strong Social Deduction Reasoning, Good Value vs K2.6

MiMo-V2.5-Pro, Xiaomi's latest open-weights model, has been benchmarked in autonomous games of Blood on the Clocktower — a complex social deduction game similar to Mafia/Werewolf. The benchmark, created by Reddit user cjami, pits models against each other in full games, measuring reasoning, deception, and tool use.
Key Results
- Win rate: 88% on the Good team, 48% on the Evil team. High overall but lopsided; Evil play is its main weakness relative to Kimi K2.6.
- Token efficiency: 183,639 output tokens per game, similar to Gemini 3.1 Pro. Kimi K2.6 uses about 580k, roughly 3x as many.
- Cost per game: $0.99 — less than half Kimi K2.6 ($2.65) and far below Claude Opus 4.6 ($3.76).
- Match duration: 2-3 hours, versus 10-15 hours for Kimi K2.6 because of its verbose reasoning.
- Tool call error rate: 0.4% — reliable for autonomous agent workflows.
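The token and cost comparisons above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the list; the variable names are illustrative, and "580k" is assumed to mean roughly 580,000 output tokens.

```python
# Figures quoted in the Key Results list (per game).
mimo_tokens = 183_639
kimi_tokens = 580_000      # assumption: "580k" taken as 580,000
mimo_cost = 0.99           # USD
kimi_cost = 2.65           # USD
opus_cost = 3.76           # USD

# How many times more output tokens Kimi K2.6 uses than MiMo-V2.5-Pro.
token_ratio = kimi_tokens / mimo_tokens

# MiMo's cost as a fraction of each competitor's cost.
cost_vs_kimi = mimo_cost / kimi_cost
cost_vs_opus = mimo_cost / opus_cost

print(f"Kimi K2.6 output tokens vs MiMo: {token_ratio:.2f}x")
print(f"MiMo cost as share of Kimi K2.6: {cost_vs_kimi:.0%}")
print(f"MiMo cost as share of Claude Opus 4.6: {cost_vs_opus:.0%}")
```

On these numbers, Kimi K2.6 produces a little over three times the output tokens, and MiMo-V2.5-Pro costs well under half as much per game as either competitor.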
Notable Performance
Strong reasoning under uncertainty: one example shows the model thinking from other players' perspectives in a game against GPT 5.5, and another shows clean deductions winning a game.
Notable Mistakes
- Expected an evil Baron to reveal itself, which led to a loss against Claude Opus 4.6.
- A Minion confessing its own role (see transcript).
Practical Takeaway
For developers who need an open-weights model with strong reasoning in multi-agent or game-theoretic settings, MiMo-V2.5-Pro offers the best value among the top-tier models compared here: lower cost per game, shorter matches, and a low tool call error rate, with room for improvement in adversarial (Evil) roles.
Full model transcripts and game logs: MiMo-V2.5-Pro on Clocktower Radio. Methodology: How-it-works.
📖 Read the full source: r/LocalLLaMA