What Breaks When Running Coding Agents on Small Local Models

After weeks of running real multi-file coding tasks through small local models (sub-7B) and small cloud models on free tiers, a Reddit user documented failure points that show up consistently, beyond typical benchmark noise. Here's what actually breaks.
Markdown Fences Are the Most Common Failure
Even with "output only raw code, no markdown formatting" in the system prompt, most models wrap responses in triple backticks. Qwen3.5:9b and Gemma4:e4b follow the instruction most consistently but still slip occasionally. The fix isn't better prompting; it's stripping fences in post-processing by default.
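A minimal sketch of that post-processing step in Python. The function name `strip_fences` is hypothetical, and the sketch assumes the whole reply is wrapped in a single fence; anything else is passed through unchanged rather than guessed at.

```python
import re

# Built programmatically so the example itself contains no literal fence.
FENCE = "`" * 3

# One fence wrapping the entire reply, with an optional language tag on the opening line.
_WRAPPED = re.compile(
    r"^\s*" + FENCE + r"[^\n]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(reply: str) -> str:
    """Return the code inside a single wrapping fence, or the reply unchanged."""
    match = _WRAPPED.match(reply)
    return match.group(1) if match else reply
```

Running this on every model reply, fenced or not, keeps the executor from writing backticks to disk when the model ignores the system prompt.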
Structured Output Is Unreliable Below 7B
When agents need JSON for task lists or action types, small models fail far more often than benchmarks suggest. Benchmarks test whether the output is valid JSON; real use adds complex multi-step instructions with edge cases. Gemma4:e4b is the most reliable among local models; Qwen3.5:9b is close behind. Codellama struggles. On cloud, Llama 3.3 70B on Groq is rock solid. Practical workaround: validate the JSON, retry once with an explicit instruction, then fall back to a permissive parser that extracts JSON from prose.
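The validate, retry, fall back sequence could look roughly like this. It is a sketch, not the author's harness: `call_model` stands in for whatever function sends a prompt and returns the reply text, and the wording of the retry instruction is an assumption.

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[object]:
    """Permissive fallback: first decodable JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here, keep scanning
    return None


def get_json(call_model: Callable[[str], str], prompt: str) -> object:
    """Strict parse, one retry with an explicit instruction, then permissive parse."""
    reply = call_model(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    # Hypothetical retry wording; adjust to the model in use.
    reply = call_model(prompt + "\n\nRespond with valid JSON only. No prose.")
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    value = extract_json(reply)
    if value is None:
        raise ValueError("model did not return parseable JSON")
    return value
```

Parsed JSON still needs a schema check (required keys, allowed action types) before the agent acts on it; this sketch only covers getting from text to a Python object.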
Models Edit the Wrong File
Give a small model a task to rename validateToken to verifyToken, along with a project map that contains similar names, and it may rename validateUser or modify the wrong file entirely. The model treats the project map as suggestions, not constraints. Fix this at the orchestration layer: validate that file paths exist and that function names actually appear in the files the model claims. Throw errors on mismatch, because small models lie confidently.
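One way to sketch that orchestration-layer check, assuming the model's plan names a relative file path and a symbol. `PlanMismatch` and `validate_edit` are hypothetical names, and the identifier match is a plain regex rather than a real parser.

```python
import re
from pathlib import Path


class PlanMismatch(Exception):
    """Raised when a proposed edit names a file or symbol that is not there."""


def validate_edit(project_root: Path, rel_path: str, symbol: str) -> Path:
    """Check that rel_path exists inside the project and contains symbol."""
    root = project_root.resolve()
    target = (root / rel_path).resolve()
    # Reject paths that escape the project as well as files that do not exist.
    if root not in target.parents or not target.is_file():
        raise PlanMismatch(f"no such file in project: {rel_path}")
    source = target.read_text(encoding="utf-8")
    # Whole-identifier match, so validateToken does not match validateTokenAsync.
    pattern = r"(?<![\w$])" + re.escape(symbol) + r"(?![\w$])"
    if not re.search(pattern, source):
        raise PlanMismatch(f"{symbol} not found in {rel_path}")
    return target
```

Calling this before any write means a plan that targets validateUser's file for a validateToken rename fails loudly instead of producing a wrong edit.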
Question vs. Action Classification
Asking "how many lines does utils.js have" should be read-only. But if the only mode the executor has is edit, it will edit the file to contain the answer. The fix: the planner must classify requests into action types before execution. Read-only queries route to a separate code path that never touches disk.
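A rough sketch of that routing, assuming the planner emits a dict with `action`, `file`, and an optional `payload` field. Those field names, the `Action` enum, and both handlers are illustrative; the line-count answer is a stand-in for whatever the read path would really do.

```python
from enum import Enum
from pathlib import Path


class Action(Enum):
    READ = "read"
    EDIT = "edit"


def handle_read(path: Path, payload: str) -> str:
    """Read-only path: answers from file contents and never writes."""
    text = path.read_text(encoding="utf-8")
    return f"{path.name} has {len(text.splitlines())} lines"


def handle_edit(path: Path, payload: str) -> str:
    """Edit path: the only handler allowed to write to disk."""
    path.write_text(payload, encoding="utf-8")
    return f"wrote {path.name}"


HANDLERS = {Action.READ: handle_read, Action.EDIT: handle_edit}


def dispatch(plan: dict, root: Path) -> str:
    """Route a classified plan to its handler; unknown action types raise ValueError."""
    action = Action(plan["action"])
    return HANDLERS[action](root / plan["file"], plan.get("payload", ""))
```

Rejecting unknown action types, rather than defaulting them to edit, is the design choice that matters here: a misclassified question then fails visibly instead of overwriting a file.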
What Works Better Than Expected
- Token budget enforcement in code: Count tokens before every call; small models have no concept of context limits and will not stay brief on their own.
- Per-file isolation: Sending one file at a time is dramatically more reliable than sending two, because models mix up the fixes.
- Synthesis-style memory: Store a one-sentence summary of what the model did, not the full task list. Works for undo and additive requests.
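Token budget enforcement from the list above might be sketched like this. The four-characters-per-token estimate is a crude assumption standing in for the model's real tokenizer, and `fit_to_budget` is a hypothetical helper name.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(
    system: str,
    task: str,
    context_chunks: List[str],
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> str:
    """Build a prompt, adding context chunks in order until the budget is hit."""
    used = count(system) + count(task)
    if used > budget:
        raise ValueError("system prompt and task alone exceed the token budget")
    kept = []
    for chunk in context_chunks:
        cost = count(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    # Separators add a few uncounted tokens; leave headroom in the budget.
    return "\n\n".join([system, *kept, task])
```

Passing the model's own tokenizer as `count` gives exact numbers; the point of the sketch is that the limit is enforced before the call, in code, rather than requested in the prompt.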
Still Figuring Out
Whether any local model under 7B is viable in an agent role is still open: the author hasn't found one that doesn't fail at structured output too often. The test harness is open source at github.com/razvannec, and contributions are welcome.
📖 Read the full source: r/LocalLLaMA