Local vs Cloud Models: Qwen-3.6-27B, Gemma-4-31B, Claude Haiku, Codex-Spark on Hard Code Gen

A Reddit user compared locally run Qwen-3.6-27B (GGUF q4_k_m) against API-served models: Qwen-3.6-27B via OpenRouter, Gemma-4-31B via OpenRouter, Claude Haiku 4.5, and GPT-Codex-Spark. The test was to implement an autoresearch loop from a design document, a deliberately hard task chosen to evaluate how cleanly each model fails rather than how often it succeeds.
Hardware Setup
- CPU: Ryzen 7 7800X3D
- RAM: 64 GB DDR5-6400
- GPU: RTX 5080 (16 GB VRAM)
- Local model: Qwen-3.6-27B q4_k_m (GGUF) — fits 16 GB VRAM via quantization
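
As a rough check on the "fits 16 GB VRAM" note, the weight footprint of a quantized model can be estimated from the parameter count and the average bits per weight. The sketch below is back-of-the-envelope arithmetic only. The 4.85 bits-per-weight figure for q4_k_m is an assumed average, not a number from the post, and the real value varies by architecture. KV cache and activations are not counted.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: q4_k_m averages roughly 4.85 bits per weight (illustrative value).
Q4_K_M_BPW = 4.85

weights_gb = weight_memory_gb(27e9, Q4_K_M_BPW)  # quantized 27B model
fp16_gb = weight_memory_gb(27e9, 16)             # same model at 16-bit precision

print(f"q4_k_m weights: ~{weights_gb:.1f} GB")
print(f"fp16 weights:   ~{fp16_gb:.1f} GB")
```

Under these assumptions the quantized weights alone come close to the card's capacity, so the fit is tight and the usable context length may be limited, or some layers may need to be offloaded to system RAM.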
Results
- Gemma-4-31B (API): Failed completely. Wrote skeleton with mocked modules, no tests, no config files (__init__.py, requirements.txt, pyproject.toml). Cost: $0.112, 803k context tokens consumed, 21k generated.
- Codex-Spark (API): Produced beautiful folder structure and code, but imports were hallucinated. No unit tests. Used 1% of $100/mo Spark limits.
- Claude Haiku 4.5 (API): Detailed implementation but failed on correctness. (Further details truncated in source.)
- Qwen-3.6-27B (local q4_k_m): Not explicitly scored, but user notes quantized inference degrades quality vs full-precision API version.
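
Hallucinated imports of the kind reported for Codex-Spark can be caught mechanically before running anything. The following is a minimal sketch, not the method used in the original post: the helper name `unresolved_imports` is hypothetical, and it only checks whether each top-level module can be located in the current Python environment.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported in `source` that cannot be found."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> check the top-level package "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Absolute "from a.b import c"; relative imports are skipped
            names.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module is not installed
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


generated = "import os\nfrom totally_made_up_pkg_xyz import thing\n"
print(unresolved_imports(generated))
```

This catches only modules that do not exist at all. It does not detect invented functions or attributes inside real packages, and it ignores relative imports within the generated project, so it is a first filter rather than a correctness check.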
Context
The user argues that typical local-model evals use trivial tasks (e.g., Snake in HTML) where both local and frontier models succeed, making local models look better than they are. This test used a real work project with a design document; only Codex-Spark produced fully written (but broken) code. The point: local models are not yet ready for complex code generation without substantial fixes.
📖 Read the full source: r/LocalLLaMA