Local LLM Benchmark: Backend Generation by Function Calling – GLM, Qwen, DeepSeek Compared

✍️ OpenClawRadar · 📅 Published: May 3, 2026 · 🔗 Source

Five months after an initial uncontrolled measurement, AutoBe.dev has published a proper benchmark of local and frontier LLMs for backend code generation using function calling. The benchmark uses a controlled-variable setup with a real scoring rubric and tests how well models generate recursive-union AST schemas through a function calling harness.
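To make "recursive-union AST schema" concrete, here is a minimal sketch of the idea, not AutoBe's actual schema. The node types (Literal, Column, Binary), the "type" discriminator field, and parse_expr are all hypothetical names chosen for illustration. The model's function-call arguments arrive as plain JSON, and the harness validates them into a tree whose nodes can contain further nodes of the same union.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Literal:
    value: float


@dataclass
class Column:
    name: str


@dataclass
class Binary:
    op: str
    left: "Expr"
    right: "Expr"


# The recursive union: a Binary node holds two more Expr nodes.
Expr = Union[Literal, Column, Binary]

ALLOWED_OPS = {"+", "-", "*", "/"}


def parse_expr(raw: object) -> Expr:
    """Validate JSON-like function-call arguments and build the AST.

    Raises ValueError on the first invalid node; a harness could feed
    that message back to the model as correction feedback (assumption).
    """
    if not isinstance(raw, dict):
        raise ValueError("node must be an object")
    kind = raw.get("type")
    if kind == "literal":
        value = raw.get("value")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("literal.value must be a number")
        return Literal(float(value))
    if kind == "column":
        name = raw.get("name")
        if not isinstance(name, str) or not name:
            raise ValueError("column.name must be a non-empty string")
        return Column(name)
    if kind == "binary":
        op = raw.get("op")
        if not isinstance(op, str) or op not in ALLOWED_OPS:
            raise ValueError(f"binary.op must be one of {sorted(ALLOWED_OPS)}")
        # Recursion: each operand is itself validated as an Expr.
        return Binary(op, parse_expr(raw.get("left")), parse_expr(raw.get("right")))
    raise ValueError(f"unknown node type: {kind!r}")
```

The difficulty for a model is that every nested node must pick a valid variant and satisfy that variant's fields, so a single malformed leaf deep in the tree invalidates the whole call.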

Key Findings

  • The function calling harness has effectively closed the gap between frontier and local models on backend generation. Specifically, gpt-5.4's DB/API design scores are approximately equal to those of qwen3.5-35b-a3b, and claude-sonnet-4.6's logic scores match those of qwen3.5-27b.
  • This is the last round including frontier models. Running them monthly costs ~200–300M tokens (~$1,000–$1,500 per model on GPT 5.5 pricing). From next month, only OpenRouter endpoints under $0.25/M tokens or models that fit on a 64GB unified-memory laptop will be included.
  • Frontend automation will be added to the benchmark in the June/July round, using the SDK AutoBe already emits to drive end-to-end AI-built frontends (visuals rough, but all functions work).
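The cost figures above imply a per-token price, which a few lines of arithmetic make explicit. This is a rough sketch that assumes a single blended price per million tokens (the source does not split input and output tokens) and pairs the low token count with the low dollar figure and the high with the high. The function names are illustrative.

```python
def implied_price_per_million(cost_usd: float, tokens_millions: float) -> float:
    """Blended USD price per million tokens implied by a total cost."""
    return cost_usd / tokens_millions


def monthly_cost(tokens_millions: float, price_usd_per_m: float) -> float:
    """Total USD cost for one model at a blended per-million price."""
    return tokens_millions * price_usd_per_m


# Frontier pricing implied by the article's figures.
low = implied_price_per_million(1000, 200)    # 5.0 USD per million tokens
high = implied_price_per_million(1500, 300)   # 5.0 USD per million tokens

# The same token volume at the new $0.25/M ceiling.
cheap_low = monthly_cost(200, 0.25)    # 50.0 USD
cheap_high = monthly_cost(300, 0.25)   # 75.0 USD

print(low, high, cheap_low, cheap_high)
```

Under these assumptions, both ends of the range work out to about $5 per million tokens, and the same run at the $0.25/M ceiling would cost roughly $50 to $75 per model, about a twentieth of the frontier figure.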

Unexpected Inversions

Several results are still under investigation:

  • openai/gpt-5.4 scores below its own mini sibling.
  • deepseek-v4-pro lands one notch below qwen3.5-35b-a3b and barely separates from its own Flash sibling.
  • Within the Qwen family, dense 27B beats every MoE variant, including 397B-A17B.

Possible explanations under investigation include a CoT-compliance phenomenon (larger and frontier models tend to skip procedural instructions enforced by the harness) and benchmark defects (only n=4 reference projects, a narrow score band, and the harness scoring its own pipeline).
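The n=4 concern can be illustrated with a standard-error calculation. The scores below are made-up placeholders, not benchmark results; they only show how a small gap between two models can sit inside the noise when each model is scored on four projects.

```python
import math
import statistics


def mean_and_standard_error(scores: list[float]) -> tuple[float, float]:
    """Return the mean and the standard error of the mean (sample stdev / sqrt(n))."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(n)


# Hypothetical per-project scores for two models (illustration only).
model_a = [70.0, 74.0, 78.0, 82.0]
model_b = [72.0, 76.0, 80.0, 84.0]

mean_a, se_a = mean_and_standard_error(model_a)
mean_b, se_b = mean_and_standard_error(model_b)
gap = mean_b - mean_a

print(f"A: {mean_a:.1f} ± {se_a:.2f}  B: {mean_b:.1f} ± {se_b:.2f}  gap: {gap:.1f}")
```

With only four samples the standard error is half the sample standard deviation, so when scores fall in a narrow band, rank inversions of a point or two may not reflect a real difference.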

Recommended Models

Three locked-in candidates for next month:

  • openai/gpt-5.4-nano — $0.25/M tokens
  • qwen/qwen3.6-27b — $0.195/M tokens
  • deepseek/deepseek-v4-flash — $0.14/M tokens

All are priced at or below $0.25/M on OpenRouter or runnable on a 64GB unified-memory laptop, and all handle function calling cleanly.
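The inclusion rule can be written as a small filter. This is a sketch under two assumptions. First, the price cap is treated as inclusive, because the article says "under $0.25/M" while listing one candidate at exactly $0.25/M. Second, fits_64gb_laptop is a hypothetical flag that defaults to False, since the source does not say which of these models fit in 64GB. Prices are the ones listed above; the Candidate class and is_eligible are illustrative names.

```python
from dataclasses import dataclass

PRICE_CAP_USD_PER_M = 0.25  # assumed inclusive; see the note above


@dataclass(frozen=True)
class Candidate:
    name: str
    price_usd_per_m: float
    fits_64gb_laptop: bool = False  # not stated in the source


def is_eligible(c: Candidate) -> bool:
    """A model qualifies if it is cheap enough OR runs on a 64GB laptop."""
    return c.price_usd_per_m <= PRICE_CAP_USD_PER_M or c.fits_64gb_laptop


CANDIDATES = [
    Candidate("openai/gpt-5.4-nano", 0.25),
    Candidate("qwen/qwen3.6-27b", 0.195),
    Candidate("deepseek/deepseek-v4-flash", 0.14),
]

# Keep eligible models, cheapest first.
eligible = sorted(
    (c for c in CANDIDATES if is_eligible(c)),
    key=lambda c: c.price_usd_per_m,
)

for c in eligible:
    print(f"{c.name}: ${c.price_usd_per_m}/M tokens")
```

With a strict "less than" comparison, openai/gpt-5.4-nano would be excluded on price alone, which is why the boundary choice matters here.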

References

📖 Read the full source: r/LocalLLaMA
