Benching local Qwen 3.6 27B as a Codex validator co-agent

A developer on r/LocalLLaMA has been running a local Qwen model beside OpenAI's Codex as a validator and challenger, and built a small reproducible eval suite to quantify which GGUF quant profiles work best for this role. The workflow: Codex handles main repo work; local Qwen challenges the plan, checks for overbuilding, missed hard directives, UI/design issues, bad assumptions, and long-context misses. The author reviews each interaction before proceeding.
Eval suite setup
The suite tests Qwen 3.6 27B GGUF profiles through llama.cpp, including Bartowski and Unsloth variants at different context sizes and KV cache formats (q8, f16). The focus is on real-world failures: missed directives, bad challenge behavior, overbuilding, UI judgment, and long-context misses.
Key findings
- The top-performing profiles on this suite were:
bartowski-128k-f16,bartowski-128k-q8, andunsloth-128k-q8. All three tied on accuracy. - q8 KV cache showed no measured accuracy loss in this specific suite.
- Context size mattered more than f16-vs-q8 KV for this workflow. 65k profiles failed when the suite required >65k tokens.
unsloth-128k-f16loaded but hit memory/throughput pressure on long-context cases on an RTX 5090.
Practical observations
The author reports Qwen is extremely good at catching silent bypasses, overbuilding, and coding-to-completion shortcuts in Codex. For UI-related tasks, Qwen takes the lead in design while Codex implements. The roles reverse: Qwen challenges the plan, and the human reviews before each stage.
Resources
- Project page: https://robert896r1.github.io/qwen-realworld-accuracy-evals/
- Repo: https://github.com/robert896r1/qwen-realworld-accuracy-evals
📖 Read the full source: r/LocalLLaMA
👀 See Also

Eqho: Local Voice-to-Text App for Claude Code Sessions
Eqho is a free, open-source voice-to-text app that uses OpenAI's Whisper model locally to type spoken input into any focused application. Currently Windows-only with command-line setup required.

Claude Desktop + Blender via MCP: Real-Time 3D Workflow Closes the Feedback Loop
An open-source Blender add-on runs an MCP server inside Blender, letting Claude Desktop inspect scenes, create objects, render images, and read results—closing the script-paste feedback loop.

Reddit User Shares AI Tool for Gathering Financial Account Balances
A Reddit post on r/openclaw presents an AI agent designed to streamline the collection of financial account balances using Python. Users discuss automation potential via custom scripts leveraging APIs like Plaid.

Open Source Second Brain System Built on Claude Code for Task Management
An open source system called Kipi System uses Claude Code to track open threads, draft follow-ups, and manage tasks by pulling from calendar, email, CRM, and social feeds. It generates a daily HTML file with pre-written actions sorted by friction.