Applying Claude Code's Architecture to Local 9B Models: Key Findings and Optimizations

Experimental Setup and Key Discovery
The developer used an RTX 5070 Ti (16GB VRAM) with qwen3.5:9b via Ollama (6.6GB) and the OpenClaw local agent framework. After 18 tests and 10 optimizations, the key finding was that qwen3.5:9b emits native structured tool_calls, whereas qwen2.5-coder:14b and qwen2.5:14b put the tool-call JSON in the content field instead, which requires extra parsing.
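
The difference between the two behaviours can be handled with a small fallback parser. The sketch below is illustrative only and is not the developer's code: it assumes an Ollama-style chat message dict with an optional tool_calls list, and it assumes (hypothetically) that models which leak the call into content write a JSON object with "name" and "arguments" keys.

```python
import json
from typing import Any


def extract_tool_calls(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Return tool calls as a list of {"name": ..., "arguments": ...} dicts."""
    calls = []
    # Preferred path: the model emitted structured tool_calls.
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):  # tolerate arguments sent as a JSON string
            args = json.loads(args)
        calls.append({"name": fn.get("name"), "arguments": args})
    if calls:
        return calls

    # Fallback path: look for a JSON object embedded in the content field.
    # The {"name", "arguments"} shape is an assumption for this sketch.
    content = (message.get("content") or "").strip()
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        obj = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return []
    if isinstance(obj, dict) and "name" in obj:
        return [{"name": obj["name"], "arguments": obj.get("arguments", {})}]
    return []
```

With a parser like this, the agent loop can treat both model families the same way, although the fallback path remains fragile compared with native tool_calls.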
Performance Comparison
Models compared:
- qwen3.5:9b: native tool_calls structure, thinking chain enabled, 39 tok/s
- qwen2.5-coder:14b: broken tool calling (JSON lands in the content field), no thinking chain, ~30 tok/s
- qwen2.5:14b: broken tool calling (JSON lands in the content field), no thinking chain, ~35 tok/s
10 Optimizations from Claude Code's Architecture
- Structured system prompt → roughly 6x output quality, measured as issues found in an A/B test (4 vs 25+)
- MicroCompact (tool result compression) → 80-93% compression, 11KB down to 367 chars
- Hard cutoff (explore→produce forced transition) → Solved exploration loops where 9B models get stuck reading files without producing output
- think=false → 8-10x token efficiency, eliminates language contamination
- ToolSearch deferred loading → -60% prompt space (229 vs 568 tokens)
- Four-type memory system (user/feedback/project/reference) → Personalized responses
- KV cache forking → Minimal effect on single GPU (1.1x), needs vLLM
- Strict write discipline → Verify before updating memory, prevents memory corruption
- Parallel bootstrap → 9% faster cold start
- Cache break tracking → Ollama caches identical prompts (182ms→75ms)
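
The post does not say how MicroCompact compresses tool results, so the following is only a guess at the general idea: keep the first and last few lines of a long tool output, replace the middle with a one-line marker, and clip overly long lines. The function name micro_compact and all of its parameters are hypothetical.

```python
def micro_compact(text: str, head: int = 5, tail: int = 3, max_line: int = 120) -> str:
    """Shrink a long tool result before it is appended to the model context.

    Hypothetical sketch: keeps `head` leading and `tail` trailing lines,
    summarises the omitted middle, and truncates each kept line to `max_line`.
    """
    lines = text.splitlines()
    if len(lines) <= head + tail:
        kept = lines  # short outputs pass through (only line clipping applies)
    else:
        omitted = len(lines) - head - tail
        marker = f"... [{omitted} lines omitted] ..."
        kept = lines[:head] + [marker] + lines[len(lines) - tail:]
    return "\n".join(line[:max_line] for line in kept)
```

The compression ratio of a scheme like this depends entirely on the input, so it should not be read as reproducing the 80-93% figure reported above.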
Core Finding: Self-Discipline as the Real Ceiling
The biggest finding was that the real ceiling for 9B models isn't reasoning ability or tool-use accuracy but self-discipline: knowing when to stop exploring and start producing output. Without the hard cutoff, the model used all 12 steps reading files and produced 0 bytes of report. With the hard cutoff, it spent 5 steps reading and 1 step writing, and produced a 6080-byte structured report.
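
A hard cutoff of this kind is enforced by the surrounding loop rather than by the model. The sketch below is a minimal illustration under stated assumptions, not the developer's engine: step is a hypothetical callable wrapping the model that returns either a tool observation or a {"final": text} dict, and max_explore stands in for the exploration budget.

```python
from typing import Callable


def run_agent(step: Callable[[list[dict], bool], dict], max_explore: int = 5) -> str:
    """Drive an explore-then-produce loop with a hard cutoff.

    `step(history, allow_tools)` is a hypothetical model wrapper. While tools
    are allowed it may return a tool observation; it returns {"final": text}
    when it writes the report.
    """
    history: list[dict] = []
    for _ in range(max_explore):
        out = step(history, True)
        if "final" in out:  # the model stopped exploring on its own
            return out["final"]
        history.append(out)
    # Cutoff reached: call once more with tools disabled, so the only
    # available move is to write the report from what was gathered.
    out = step(history, False)
    return out.get("final", "")
```

The design point is that the transition is structural: once the budget is spent, tools are simply not offered, so the model cannot keep reading files.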
What qwen3.5:9b Can Actually Do
- Read 800-line bash scripts and find real bugs (race conditions, non-atomic operations) — 2 min
- Design a sales feedback system architecture — 8.7KB document in 2.5 min
- Build a complete project (calculator + tests + run tests) — 28 seconds
- 10-step autonomous execution: write web scraper → pip install fails → find workaround → retry → tests pass — zero human intervention
- Full mini-factory pipeline: search → write article → review → publish to HTML — 2.5 min
Complete Engine Performance
All 10 optimizations were packaged into a single Python engine (~280 lines). First run results:
- Bootstrap: 527ms (parallel memory + model warmup)
- Explore: 5 tool steps with MicroCompact (88% compression)
- Produce: 1947 chars structured report
- Total: 39.4s / zero API cost
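
The bootstrap phase above runs memory loading and model warmup in parallel. One way this could look, as a sketch only, uses a thread pool from the standard library; load_memory and warm_model are hypothetical callables standing in for whatever the engine actually does at startup.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def bootstrap(
    load_memory: Callable[[], dict],
    warm_model: Callable[[], None],
) -> tuple[dict, float]:
    """Run memory loading and model warmup concurrently.

    Returns the loaded memory and the elapsed wall-clock time in milliseconds.
    Both callables are placeholders for I/O-bound startup work.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        memory_future = pool.submit(load_memory)
        warm_future = pool.submit(warm_model)
        memory = memory_future.result()
        warm_future.result()  # re-raises here if warmup failed
    elapsed_ms = (time.perf_counter() - start) * 1000
    return memory, elapsed_ms
```

Threads are a reasonable fit here because both tasks mostly wait on disk or on the local model server rather than on Python computation.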
What Didn't Work
- KV cache forking on single GPU (needs multi-GPU or vLLM)
- Step budget in system prompt (model ignores meta-instructions about its own behavior)
- qwen2.5 series for tool calling (format issues)
The developer ran this on WSL2 + Ubuntu 24.04 and is willing to share more details or the engine code.
📖 Read the full source: r/LocalLLaMA