Benchmark Results: 15 LLMs Tested on 38 Real Workflow Tasks

A developer built a benchmark harness to determine which LLMs to route work to, testing 15 models on 38 tasks from their real workflow. Tasks included CSV transforms, letter counting, modular arithmetic, format compliance, and multi-step instructions. All tasks were scored programmatically using regex and exact match—no LLM judge was involved.
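The post does not show the scoring code, so the following is only a minimal sketch of what programmatic scoring by exact match and regex can look like. The function names and the sample tasks are hypothetical, not taken from the developer's harness.

```python
import re


def score_exact(output: str, expected: str) -> float:
    """Return 1.0 when the trimmed output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def score_regex(output: str, pattern: str) -> float:
    """Return 1.0 when the whole trimmed output matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


# Hypothetical letter-counting task: the answer must be exactly "3".
print(score_exact(" 3\n", "3"))

# Hypothetical format-compliance task: the reply must be an ISO-style date only.
date_pattern = r"\d{4}-\d{2}-\d{2}"
print(score_regex("2024-01-31", date_pattern))
print(score_regex("The date is 2024-01-31", date_pattern))
```

Using `re.fullmatch` rather than `re.search` is a deliberate choice in this sketch: a reply that wraps the right answer in extra prose fails the format check instead of passing it.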
Benchmark Results
The benchmark involved 570 API calls costing $2.29 total. Key findings:
- Claude 3.5 Opus: 100% score, $0.69 per run, 14.2 seconds
- Claude 3.5 Sonnet: 100% score, $0.20 per run, 5.1 seconds
- MiniMax M2.5: 98.6% score, $0.02 per run, 2.3 seconds
- Kimi K2.5: 98.6% score, $0.05 per run, 3.8 seconds
- GPT-oss-20b (local): 98.3% score, $0 per run, 4.1 seconds
- Gemini 2.5 Flash: 97.1% score, about $0.003 per run, 1.1 seconds
- Claude 3.5 Haiku: 96.9% score, $0.02 per run, 1.8 seconds
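Since the stated goal was deciding which model to route work to, one way to use these numbers is to pick the cheapest model that clears a required score. The sketch below is an illustration under assumptions, not the developer's routing logic: the `Result` type and `cheapest_at_least` function are hypothetical, and the Gemini Flash cost uses the $0.003 figure quoted in the cost-performance analysis.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float    # percent of tasks passed
    cost: float     # USD per benchmark run
    seconds: float  # latency figure reported in the results


RESULTS: List[Result] = [
    Result("Claude 3.5 Opus", 100.0, 0.69, 14.2),
    Result("Claude 3.5 Sonnet", 100.0, 0.20, 5.1),
    Result("MiniMax M2.5", 98.6, 0.02, 2.3),
    Result("Kimi K2.5", 98.6, 0.05, 3.8),
    Result("GPT-oss-20b (local)", 98.3, 0.0, 4.1),
    Result("Gemini 2.5 Flash", 97.1, 0.003, 1.1),
    Result("Claude 3.5 Haiku", 96.9, 0.02, 1.8),
]


def cheapest_at_least(min_score: float,
                      results: List[Result] = RESULTS) -> Optional[Result]:
    """Return the cheapest model meeting min_score; ties go to the faster one."""
    eligible = [r for r in results if r.score >= min_score]
    if not eligible:
        return None
    return min(eligible, key=lambda r: (r.cost, r.seconds))


for threshold in (100.0, 98.5, 97.0):
    choice = cheapest_at_least(threshold)
    print(threshold, choice.model if choice else None)
```

With these figures, a 100% requirement selects Sonnet, a 98.5% requirement selects MiniMax M2.5, and a 97% requirement selects the local GPT-oss-20b.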
Cost-Performance Analysis
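Sonnet and Opus both scored 100%, but Opus costs about 3.5x as much per run ($0.69 versus $0.20). For the developer's day-to-day tasks, Sonnet handles everything Opus does. Gemini Flash at roughly $0.003 per run versus Opus at $0.69 per run represents a reported 265x cost difference for a 2.9-point score gap (the rounded figures shown here give about 230x, so the reported ratio presumably comes from unrounded costs).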
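The arithmetic behind these comparisons can be checked directly from the published per-run figures. This is a small sketch; `cost_ratio` is a hypothetical helper, and the last line only back-computes the Flash cost that a 265x ratio would imply.

```python
def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive option costs per run."""
    if cheap <= 0:
        raise ValueError("ratio is undefined for a zero or negative cost")
    return expensive / cheap


opus_cost, sonnet_cost, flash_cost = 0.69, 0.20, 0.003  # USD per run, as reported
opus_score, flash_score = 100.0, 97.1                    # percent, as reported

print(round(cost_ratio(opus_cost, sonnet_cost), 2))  # Opus vs Sonnet
print(round(cost_ratio(opus_cost, flash_cost)))      # Opus vs Flash, rounded costs
print(round(opus_score - flash_score, 1))            # score gap in points
print(round(opus_cost / 265, 4))                     # Flash cost implied by 265x
```

The Opus-to-Sonnet ratio comes out at about 3.45, the score gap at 2.9 points, and a 265x ratio would imply a Flash cost near $0.0026 per run, which rounds to the $0.003 quoted.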
Surprising Findings
MiniMax M2.5 and Kimi K2.5 both achieved 98.6% with 100% format compliance—the developer hadn't used either model before running the benchmark. GPT-oss-20b running locally scored 98.3% for $0, outperforming Haiku and DeepSeek R1.
QA Process
The quality assurance process revealed scoring bugs. Initial results showed Haiku beating Sonnet, which turned out to be a scorer bug producing quality scores above 100%. Five QA passes were conducted, each with a different model, and each found bugs the previous ones missed.
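The post does not describe what the scorer bug actually was. As a generic illustration only, a range check like the one sketched below would make an impossible score (anything above 100%) fail loudly instead of silently reordering the leaderboard. The `percent_score` function and its arguments are hypothetical.

```python
def percent_score(points_earned: float, points_possible: float) -> float:
    """Convert raw points to a percentage, rejecting impossible inputs."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    if not 0 <= points_earned <= points_possible:
        raise ValueError(
            f"points_earned={points_earned} is outside 0..{points_possible}"
        )
    return 100.0 * points_earned / points_possible


print(round(percent_score(37, 38), 2))

try:
    percent_score(39, 38)  # more points than tasks: a scorer bug, not a result
except ValueError as err:
    print("rejected:", err)
```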
The developer is changing their daily driver to Sonnet based on these results but plans to switch between models more frequently given the performance variations.
📖 Read the full source: r/ClaudeAI