Optimizing Qwen 3.6 27B/35B on RTX 3090: Flags, Quantization, and Auto-Routing

A developer running Qwen 3.6 models locally on an RTX 3090 (24GB VRAM), Ryzen 5700X, 64GB RAM, Windows 11, is hitting performance and reliability issues. They're using llama-server with custom flags and seeking advice on quant choice, throughput, and automatic model routing.
Commands and Quantizations
35B (UD Q4_K_M):
llama-server.exe -m "path\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.027B (UD Q4_K_XL):
llama-server.exe -m "path\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0Reported Issues
- 35B too slow – even simple iterative tasks feel unusable.
- 27B faster but unreliable – code output breaks; simple tasks can take 20–30 minutes.
- Manual model switching – must kill server, paste new command, reload model.
Specific Questions
- Are the flags suboptimal? (e.g., context size, batch size, cache type)
- Which quant / model gives best balance of speed and coding accuracy on 24GB VRAM?
- How to auto-switch models per request, or keep multiple models warm and route?
Context
The user runs Hermes agent on a Raspberry Pi 5 for scraping and automation, and local coding with OpenCode/QwenCode. They want a setup that doesn't require manual server restarts.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Developer shares 25 tested Claude prompts for SaaS development workflows
A developer has shared 25 specific prompts they use daily for SaaS development, covering backend architecture, API design, frontend copy, product documentation, and go-to-market tasks. The prompts are designed to save time on repetitive tasks like code review, documentation generation, and edge case testing.

How Small Model Evaluation Prompts Can Mislead and How to Fix Them
A Reddit post explains that small model evaluation prompts often produce misleading results due to triggering the wrong cognitive pathways in transformers, specifically identifying three distinct modes: factual recall, application/instruction following, and emotional/empathic inference.

Building API endpoints with Claude: Practical prompt engineering lessons from a 70+ endpoint project
A developer built 70+ LinkedIn automation API endpoints with Claude writing 80% of the code, discovering that treating prompts like contracts with explicit constraints works better than natural language instructions for action-taking agents.

Free OpenClaw Gateway with Local LLM on Oracle Cloud
A developer shares how to run OpenClaw Gateway with a local Qwen3.5 27B A3B 4-bit LLM on Oracle Cloud's free tier using a VM.Standard.A2.Flex instance with 4 OCPUs, 24GB RAM, and 200GB SSD, managed remotely via the QCAI app.