Local LLM Benchmark: Backend Generation by Function Calling – GLM, Qwen, DeepSeek Compared

Five months after an initial uncontrolled measurement, AutoBe.dev has published a proper benchmark of local and frontier LLMs for backend code generation using function calling. The benchmark uses a controlled-variable setup with a real scoring rubric, and tests how well models generate recursive-union AST schemas through a function calling harness.
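To make "recursive-union AST schema" concrete, here is a minimal sketch of the idea rather than AutoBe's actual schema. The node kinds (`literal`, `binary`, `call`) and the `validate` helper are hypothetical names chosen for illustration. Each node is one variant of a union, and some variants contain further nodes of the same union, so a checker has to recurse. A harness could plausibly hand the returned error paths back to the model for a retry.

```python
from typing import Any

# Hypothetical node kinds for illustration only; not AutoBe's real AST.
ALLOWED_OPS = {"+", "-", "*", "/"}


def validate(node: Any, path: str = "$") -> list[str]:
    """Return human-readable errors for a node; an empty list means valid."""
    if not isinstance(node, dict):
        return [f"{path}: expected an object"]
    kind = node.get("type")

    if kind == "literal":
        if isinstance(node.get("value"), (int, float, str, bool)):
            return []
        return [f"{path}.value: expected a scalar"]

    if kind == "binary":
        errors = []
        if node.get("op") not in ALLOWED_OPS:
            errors.append(f"{path}.op: unknown operator")
        # Recursion: each operand may be any member of the union.
        for side in ("left", "right"):
            errors += validate(node.get(side), f"{path}.{side}")
        return errors

    if kind == "call":
        errors = []
        if not isinstance(node.get("name"), str):
            errors.append(f"{path}.name: expected a string")
        args = node.get("args")
        if not isinstance(args, list):
            return errors + [f"{path}.args: expected an array"]
        for i, arg in enumerate(args):
            errors += validate(arg, f"{path}.args[{i}]")
        return errors

    return [f"{path}.type: unknown variant {kind!r}"]
```

The path prefix in each message points at the exact nested node that failed, which matters more as the tree gets deeper.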
Key Findings
- The function calling harness has effectively closed the gap between frontier and local models on backend generation. Specifically, gpt-5.4's DB/API design scores are approximately equal to qwen3.5-35b-a3b, and claude-sonnet-4.6's logic scores match qwen3.5-27b.
- This is the last round including frontier models. Running them monthly costs ~200–300M tokens (~$1,000–$1,500 per model on GPT 5.5 pricing). From next month, only OpenRouter endpoints under $0.25/M tokens or models that fit on a 64GB unified-memory laptop will be included.
- Frontend automation will be added to the benchmark in the June/July round, using the SDK AutoBe already emits to drive end-to-end AI-built frontends (visuals rough, but all functions work).
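The cost figures above can be checked with simple arithmetic. The sketch below assumes a single blended price per million tokens and ignores any split between input and output pricing, which the article does not break down. The function names are illustrative.

```python
def implied_price_per_million(cost_usd: float, tokens_millions: float) -> float:
    """Blended USD price per million tokens implied by a total cost."""
    return cost_usd / tokens_millions


def monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Total USD cost of one benchmark round at a blended price."""
    return tokens_millions * price_per_million


# Both ends of the quoted range imply the same blended rate.
low_end = implied_price_per_million(1_000, 200)   # 5.0 USD per million
high_end = implied_price_per_million(1_500, 300)  # 5.0 USD per million

# The same token volume at the $0.25/M cutoff.
capped_low = monthly_cost(200, 0.25)   # 50.0 USD
capped_high = monthly_cost(300, 0.25)  # 75.0 USD
```

Under these assumptions the quoted range works out to about $5 per million tokens, and the same workload at the $0.25/M cutoff would cost roughly $50 to $75 per model per round.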
Unexpected Inversions
Several results are still under investigation:
- openai/gpt-5.4 scores below its own mini sibling.
- deepseek-v4-pro lands one notch below qwen3.5-35b-a3b and barely separates from its own Flash sibling.
- Within the Qwen family, dense 27B beats every MoE variant, including 397B-A17B.
Possible explanations under investigation include a CoT-compliance phenomenon (larger and frontier models tend to skip procedural instructions that the harness enforces) and benchmark defects (only n=4 reference projects, a narrow score band, and the harness scoring its own pipeline).
Recommended Models
Three locked-in candidates for next month:
- openai/gpt-5.4-nano: $0.25/M tokens
- qwen/qwen3.6-27b: $0.195/M tokens
- deepseek/deepseek-v4-flash: $0.14/M tokens
All are priced at or under $0.25/M on OpenRouter or can run on a 64GB unified-memory laptop, and all handle function calling cleanly.
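The inclusion rule can be written as a small filter. This is a sketch under stated assumptions, not AutoBe's selection code: the `Candidate` type and `eligible` function are hypothetical, the prices are the ones listed above, and the laptop-fit field is left unset because the article does not say which models fit in 64GB. Whether the price cap is inclusive is also an assumption, so the comparison is a parameter.

```python
from dataclasses import dataclass
from typing import Optional

PRICE_CAP_USD_PER_M = 0.25  # cutoff quoted in the article


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_m: float                # OpenRouter price, USD per million tokens
    fits_64gb: Optional[bool] = None  # None: not stated in the article


def eligible(c: Candidate, inclusive: bool = True) -> bool:
    """Apply the rule: cheap enough on OpenRouter, or runnable on a 64GB laptop."""
    if inclusive:
        cheap = c.price_per_m <= PRICE_CAP_USD_PER_M
    else:
        cheap = c.price_per_m < PRICE_CAP_USD_PER_M
    return cheap or c.fits_64gb is True


CANDIDATES = [
    Candidate("openai/gpt-5.4-nano", 0.25),
    Candidate("qwen/qwen3.6-27b", 0.195),
    Candidate("deepseek/deepseek-v4-flash", 0.14),
]

selected = [c.name for c in CANDIDATES if eligible(c)]
excluded_if_strict = [c.name for c in CANDIDATES if not eligible(c, inclusive=False)]
```

With an inclusive cap all three candidates pass. With a strict "under $0.25/M" reading and no laptop-fit data, `openai/gpt-5.4-nano` sits exactly on the boundary and would be excluded, so the inclusive reading appears to be the intended one.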
References
- Benchmark Dashboard: https://autobe.dev/benchmark/
- Generation Results: GitHub: autobe-examples
- GitHub Repository: https://github.com/wrtnlabs/autobe
📖 Read the full source: r/LocalLLaMA