Coding Agents on Small Local Models: 4 Failure Points

After weeks of running real multi-file coding tasks through small local models (sub-7B) and small cloud models on free tiers, a Reddit user documented failure points that recur consistently and are not just benchmark noise. Here's what actually breaks.

Markdown Fences Are the Most Common Failure

Even with "output only raw code, no markdown formatting" in the system prompt, most models still wrap responses in triple backticks. Qwen3.5:9b and Gemma4:e4b follow the instruction most consistently but still slip occasionally. The fix isn't better prompting; it's stripping fences in post-processing by default.
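A minimal sketch of that post-processing step, assuming the model wraps its whole reply in a single fence with an optional language tag. The function name strip_fences is hypothetical and not taken from the original post; replies that mix prose with a fence would need a looser pattern.

```python
import re

# Matches a reply that is entirely one fenced block, with an optional language tag
# after the opening backticks. The closing fence must be the last thing in the reply.
_FENCE_RE = re.compile(r"^\s*```[^\n]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_fences(response: str) -> str:
    """Return the code inside a single wrapping fence, or the text unchanged."""
    match = _FENCE_RE.match(response)
    return match.group(1) if match else response
```

Running this on every reply, fenced or not, means a model that does obey the system prompt passes through untouched.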

Structured Output Is Unreliable Below 7B

When agents need JSON for task lists or action types, small models fail far more often than benchmarks suggest. Benchmarks check whether a model can emit valid JSON; real use layers complex multi-step instructions and edge cases on top. Gemma4:e4b is the most reliable among local models; Qwen3.5:9b is close behind. Codellama struggles. On cloud, Llama 3.3 70B on Groq is rock solid. Practical workaround: validate the JSON, retry once with an explicit instruction, then fall back to a permissive parser that extracts JSON from prose.
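The validate, retry, fall back sequence could look roughly like the sketch below. The names get_json and extract_json, the wording of the retry instruction, and the call_model callable (any function that takes a prompt and returns the model's text) are assumptions for illustration, not the author's harness.

```python
import json
from typing import Any, Callable, Optional


def extract_json(text: str) -> Optional[Any]:
    """Permissive fallback: return the first decodable JSON object or array in prose."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at index and ignores the rest.
                value, _ = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    return None


def get_json(prompt: str, call_model: Callable[[str], str]) -> Any:
    """Validate the reply, retry once with an explicit instruction, then fall back."""
    reply = call_model(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass

    # Hypothetical retry wording; tune it for the model in use.
    retry_prompt = prompt + "\n\nRespond with valid JSON only. No prose, no code fences."
    reply = call_model(retry_prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass

    value = extract_json(reply)
    if value is None:
        raise ValueError("no JSON found in model reply")
    return value
```

Capping the retries at one keeps latency bounded, and raising when even the permissive parser finds nothing lets the orchestrator decide what to do next.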

Models Edit the Wrong File

Ask a small model to rename validateToken to verifyToken, give it a project map containing similar names, and it may rename validateUser instead or modify the wrong file entirely. The model treats the project map as suggestions, not constraints. Fix this at the orchestration layer: check that every file path exists and that each function name actually appears in the file the model claims it is in. Throw an error on any mismatch, because small models lie confidently.
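One way to sketch that orchestration-layer check, assuming the model's plan names a relative file path and a function to change. validate_edit and PlanValidationError are hypothetical names, and the whole-word text search is a cheap stand-in for real symbol resolution. Path.is_relative_to requires Python 3.9 or later.

```python
import re
from pathlib import Path


class PlanValidationError(Exception):
    """Raised when the model's proposed edit does not match the project on disk."""


def validate_edit(project_root: Path, file_path: str, function_name: str) -> Path:
    """Return the resolved target file, or raise if the model's claim is wrong."""
    root = project_root.resolve()
    target = (root / file_path).resolve()

    # Reject paths that escape the project root or do not point at a real file.
    if not target.is_relative_to(root) or not target.is_file():
        raise PlanValidationError(f"no such file in project: {file_path}")

    # Whole-word match, so validateToken does not match validateTokenExpiry.
    source = target.read_text(encoding="utf-8")
    if not re.search(rf"\b{re.escape(function_name)}\b", source):
        raise PlanValidationError(f"{function_name} not found in {file_path}")

    return target
```

The check runs before any edit is applied, so a plan that targets validateUser's file while claiming to rename validateToken fails loudly instead of silently changing the wrong code.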

Question vs. Action Classification

Asking "how many lines does utils.js have" should be a read-only operation. But if the executor's only mode is editing, it will edit the file to contain the answer. The fix: the planner must classify each request into an action type before execution, and read-only queries must route to a separate code path that never touches disk.
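The routing can be sketched as follows, assuming the planner emits a short label for each request. The ActionType values, parse_action, and execute are illustrative names rather than the author's code; the point is only that the read-only branch has no write call in it.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ActionType(Enum):
    READ_ONLY = "read_only"
    EDIT = "edit"


def parse_action(label: str) -> ActionType:
    """Map the planner's label to an action type; unknown labels raise ValueError."""
    return ActionType(label.strip().lower())


def count_lines(path: Path) -> int:
    """Read-only handler: opens the file for reading and never writes."""
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def execute(action: ActionType, path: Path, new_content: Optional[str] = None) -> str:
    """Dispatch on the classified action; only the EDIT branch can modify the file."""
    if action is ActionType.READ_ONLY:
        return f"{path.name} has {count_lines(path)} lines"
    if new_content is None:
        raise ValueError("edit actions require new_content")
    path.write_text(new_content, encoding="utf-8")
    return f"wrote {path.name}"
```

Rejecting unknown labels matters as much as the split itself: a misclassified or malformed label should stop the run rather than default to editing.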

What Works Better Than Expected

Token budget enforcement in code: Count tokens before every call; small models have no concept of context limits and will not keep things brief on their own.
Per-file isolation: Sending one file per call is dramatically more reliable than sending two, because models mix up fixes between files.
Synthesis-style memory: Store a one-sentence summary of what the model did, not the full task list. Works for undo and additive requests.
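As a rough illustration of enforcing the token budget in code, the sketch below refuses a call whose prompt would not leave room for the reply. The four-characters-per-token estimate is a crude assumption and should be replaced with the model's actual tokenizer; check_budget, BudgetExceeded, and the parameter names are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a prompt would not fit in the context window."""


def estimate_tokens(text: str) -> int:
    """Crude estimate of about four characters per token, rounded up."""
    return (len(text) + 3) // 4


def check_budget(system_prompt: str, file_text: str,
                 context_limit: int, reserved_for_reply: int) -> int:
    """Return the spare prompt tokens, or raise before the model is ever called."""
    used = estimate_tokens(system_prompt) + estimate_tokens(file_text)
    available = context_limit - reserved_for_reply
    if used > available:
        raise BudgetExceeded(f"prompt needs ~{used} tokens, only {available} available")
    return available - used
```

Calling this once per file before each request also fits the per-file isolation approach, since each call carries exactly one file's text.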

Still Figuring Out

Whether any local model under 7B is viable in an agent role is still an open question: the author hasn't found one that handles structured output reliably enough. The test harness is open-sourced at github.com/razvannec and open to contributions.

📖 Read the full source: r/LocalLLaMA