Local vs Cloud Models: Qwen-3.6-27B, Gemma-4-31B, Claude Haiku, Codex-Spark on Hard Code Gen

OpenClawRadar | Published: April 30, 2026

A Reddit user compared a locally run Qwen-3.6-27B (GGUF q4_k_m) against four API-hosted models: Qwen-3.6-27B via OpenRouter, Gemma-4-31B via OpenRouter, Claude Haiku 4.5, and GPT-Codex-Spark. The test involved implementing an autoresearch loop from a design document. The task was deliberately hard, chosen to show how cleanly each model fails rather than how often it succeeds.

Hardware Setup

  • CPU: Ryzen 7 7800X3D
  • RAM: 64 GB DDR5-6400
  • GPU: RTX 5080 (16 GB VRAM)
  • Local model: Qwen-3.6-27B q4_k_m (GGUF) — fits 16 GB VRAM via quantization
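
As a rough sanity check on the VRAM claim, the weight footprint can be estimated from parameter count and bits per weight. The sketch below assumes exactly 27 billion parameters and an average of about 4.85 bits per weight for q4_k_m; both figures are assumptions for illustration and do not come from the source. It also ignores the KV cache and runtime buffers, which need additional memory.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 27e9      # assumption: "27B" taken as exactly 27 billion parameters
Q4_K_M_BPW = 4.85    # assumption: approximate average bits per weight for q4_k_m
FP16_BPW = 16.0      # 16-bit weights, for comparison
VRAM_GIB = 16.0      # RTX 5080, from the hardware list

quantized = weight_footprint_gib(N_PARAMS, Q4_K_M_BPW)
full = weight_footprint_gib(N_PARAMS, FP16_BPW)
headroom = VRAM_GIB - quantized

print(f"q4_k_m weights: {quantized:.1f} GiB")
print(f"fp16 weights:   {full:.1f} GiB")
print(f"headroom left for KV cache and buffers: {headroom:.1f} GiB")
```

Under these assumptions the quantized weights sit just under the card's capacity with little room to spare, so whether the model runs entirely on the GPU would depend on context length and offload settings, which the source does not specify.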

Results

  • Gemma-4-31B (API): Failed completely. Wrote a skeleton with mocked modules, no tests, and none of the expected project files (__init__.py, requirements.txt, pyproject.toml). Cost: $0.112, with 803k context tokens consumed and 21k generated.
  • Codex-Spark (API): Produced a clean folder structure and well-organized code, but the imports were hallucinated. No unit tests. Used 1% of the $100/month Spark limit.
  • Claude Haiku 4.5 (API): Produced a detailed implementation that failed on correctness. (Further details truncated in source.)
  • Qwen-3.6-27B (local q4_k_m): Not explicitly scored, but the user notes that quantized inference degrades quality compared with the full-precision API version.
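
The failure modes listed above (missing project files, no tests, hallucinated imports) can be checked mechanically. The following is a minimal sketch of such a check, not the method used in the original post. The function names and the `EXPECTED_FILES` list are hypothetical. An import is treated as unresolved when it is neither a module in the project nor importable in the current environment, so a real package that simply is not installed would also be flagged.

```python
import ast
import importlib.util
from pathlib import Path

# Hypothetical list of files a generated project is expected to contain.
EXPECTED_FILES = ("requirements.txt", "pyproject.toml")


def top_level_imports(source: str) -> set[str]:
    """Return the top-level module names imported by a source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def is_importable(name: str) -> bool:
    """True if the current environment can locate the module."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def failure_report(project: Path) -> dict:
    """Summarize the failure signs named in the results list."""
    py_files = list(project.rglob("*.py"))
    # Names that resolve inside the project itself (modules and packages).
    local = {p.stem for p in py_files}
    local |= {p.name for p in project.rglob("*") if p.is_dir()}

    unresolved, unparsable = set(), []
    for path in py_files:
        try:
            imported = top_level_imports(path.read_text(encoding="utf-8"))
        except SyntaxError:
            unparsable.append(str(path.relative_to(project)))
            continue
        unresolved |= {n for n in imported if n not in local and not is_importable(n)}

    return {
        "missing_files": [f for f in EXPECTED_FILES if not (project / f).exists()],
        "has_init": any(p.name == "__init__.py" for p in py_files),
        "has_tests": any(
            p.name.startswith("test_") or p.name.endswith("_test.py") for p in py_files
        ),
        "unresolved_imports": sorted(unresolved),
        "unparsable_files": sorted(unparsable),
    }
```

Calling `failure_report(Path("path/to/generated_project"))` returns a dictionary that can be compared across models. It only detects structural problems, so a project that passes may still be incorrect, as with the Claude Haiku result.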

Context

The user argues that typical local-model evals use trivial tasks (e.g., Snake in HTML) where both local and frontier models succeed, making local models look better than they are. This test used a real work project with a design document; only Codex-Spark produced fully written (but broken) code. The point: local models are not yet ready for complex code generation without substantial fixes.

📖 Read the full source: r/LocalLLaMA
