Benchmark Results for Small Local and OpenRouter Models on Agentic Text-to-SQL Task

A developer has published benchmark results for small local and OpenRouter models on an agentic text-to-SQL task. The benchmark takes English queries like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and converts them to SQL that is tested against database tables.
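The parenthetical in the sample question matters: revenue per unit is defined as total revenue divided by total units sold, which is not the same as averaging each order line's own ratio. The sketch below illustrates the difference using Python's built-in sqlite3 module. The `order_lines` table, its columns, and the sample rows are hypothetical and are not taken from the benchmark's actual schema.

```python
import sqlite3

# Hypothetical toy schema; the benchmark's real tables are not described in the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (subcategory TEXT, revenue REAL, units INTEGER)")
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [("Bikes", 100.0, 2), ("Bikes", 50.0, 3), ("Helmets", 40.0, 4)],
)

# Ratio of sums: total revenue / total units sold, per subcategory.
ratio_of_sums = conn.execute(
    """
    SELECT subcategory,
           COUNT(*)                          AS order_lines,
           SUM(revenue)                      AS revenue,
           SUM(units)                        AS units_sold,
           SUM(revenue) * 1.0 / SUM(units)   AS revenue_per_unit
    FROM order_lines
    GROUP BY subcategory
    ORDER BY subcategory
    """
).fetchall()

# Average of per-line ratios: a plausible but different reading of the question.
avg_of_ratios = conn.execute(
    """
    SELECT subcategory, AVG(revenue * 1.0 / units) AS avg_line_ratio
    FROM order_lines
    GROUP BY subcategory
    ORDER BY subcategory
    """
).fetchall()

print(ratio_of_sums)
print(avg_of_ratios)
```

For the toy "Bikes" rows the two readings disagree, which is the kind of detail a question like this can use to tell a careful model from a careless one.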
Benchmark Details
The agent can see query results and revise its SQL to fix problems, subject to a cap on debugging rounds. The benchmark is deliberately short: 25 questions, running in well under 5 minutes for most models, which makes it practical for comparing configurations. It is designed to be hard enough to separate the best models from the rest.
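A minimal sketch of such a capped debugging loop is shown here, assuming a SQLite database and a model exposed as a plain callable. The function name `run_agent`, the `ask_model` signature, the default of three rounds, and the stop-on-first-successful-execution rule are all assumptions made for illustration; the source does not describe the benchmark's actual harness or its round limit.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def run_agent(
    question: str,
    conn: sqlite3.Connection,
    ask_model: Callable[[str, Optional[str]], str],
    max_rounds: int = 3,  # hypothetical cap; the source only says rounds are limited
) -> Tuple[str, Optional[List[tuple]]]:
    """Ask for SQL, run it, and feed errors back until it executes or rounds run out."""
    feedback: Optional[str] = None
    sql = ""
    for _ in range(max_rounds):
        sql = ask_model(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Give the model the database error so it can revise its query.
            feedback = f"SQL error: {exc}"
            continue
        return sql, rows
    # Every round failed; return the last attempt with no result.
    return sql, None


# Demo with a stand-in "model" that fixes a misspelled table name on the second try.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product TEXT, units INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("a", 2), ("b", 3)])


def fake_model(question: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "SELECT SUM(units) FROM orderz"  # deliberate typo in the table name
    return "SELECT SUM(units) FROM orders"


final_sql, rows = run_agent("How many units were sold?", conn, fake_model)
print(final_sql, rows)
```

A real harness would also let the model inspect the returned rows before deciding whether to stop, as the benchmark description implies; this sketch only reacts to execution errors to keep the control flow easy to follow.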
Key Findings
- The best open models identified were kimi-k2.5, Qwen 3.5 397B-A17B, and Qwen 3.5 27B
- NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
- Mimo v2 Flash was described as "a gem of a model"
Self-Hosted Option
The benchmark can now be run against your own server using the WASM version of llama.cpp. The developer is asking for feedback on what to change in version 2 and wants to see the scores others get with different configurations.
📖 Read the full source: r/LocalLLaMA