Qwen 3.6 27B F16 Passes Pacman Coding Test, But 8-Bit Quants Fail — Key Lessons on Templates and MTP Speculative Decoding

A developer on r/LocalLLaMA shared a practical coding benchmark: one-shot a single-page Pacman clone from a good prompt, three attempts, keep the best. Qwen 3.6 27B F16 produced two nearly perfect games — the first local model to succeed. However, dropping to 8-bit quantization made good results unreproducible even after five attempts, reinforcing the claim that 8-bit quant is not lossless for complex generative tasks.
Key technical findings from the post:
- Chat template is critical: The official Qwen chat template is tuned for vLLM and contains errors in llama.cpp and other runners. The author fixed bugs iteratively, and after fine-tuning, the model felt "a new level of intelligence."
- MTP speculative decoding speeds vary by task: For deterministic tasks like coding, generative tok/s ranged from 8 to 18 tok/s (baseline without MTP: 6.6 tok/s). Creative tasks see less acceleration.
- Harness choice affects speed more than code quality: Qwen CLI performed surprisingly well — comparable to Claude Code in output quality, but much faster because Claude Code's extra prompts slow down local models. With a slow model like Qwen 3.6 27B at ~6 tok/s, every extra prompt adds painful latency.
- Don't interfere with context management: The model's native context caching and compaction work well. Plugins or tools that manipulate cache or context confuse the model and degrade performance.
- Tool calls and subagents work flawlessly after proper chat template fixes. Context compaction, shell usage, and parallel subagents all function as expected.
The author warns that your mileage depends heavily on runner configuration: use F16 weights, a corrected chat template, and avoid heavy harnesses unless you have fast inference. The full playable Pacman result is available at guigand.com/pacman.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Applying Claude Code's Architecture to Local 9B Models: Key Findings and Optimizations
A developer extracted architectural patterns from Claude Code's leaked source code and applied 10 optimizations to qwen3.5:9b running locally on an RTX 5070 Ti. The key discovery was that qwen3.5:9b has native structured tool_calls, and the biggest limitation for 9B models is self-discipline in knowing when to stop exploring and start producing output.

Spec-Driven Development Workflow for Claude Code: Decomposition, Context Clearing, and Cost Control
A spec-driven development approach for Claude Code that uses two-dimensional decomposition, context clearing between steps, and specs written to disk to improve agent performance and reduce costs.

Claude Code Ultracode Mode Spawns 70-Agent Pipeline for Deep Search
A single 'deep search' request in Claude Code's ultracode mode auto-generated a 4-phase pipeline with ~70 agents, each fetching and cross-checking projects independently. The orchestrator script keeps intermediate results out of the context window, preventing context overload.

Introducing Lean Collab: A Multi-Agent Orchestrator for Long-Running LLM Tasks
Lean Collab is an open-source orchestrator designed to manage long-running LLM tasks using coordinated, parallel sub-agents.