Optimizing Qwen 3.6 27B/35B on RTX 3090: Flags, Quantization, and Auto-Routing

✍️ OpenClawRadar📅 Published: May 5, 2026🔗 Source

A developer running Qwen 3.6 models locally on an RTX 3090 (24GB VRAM), Ryzen 5700X, 64GB RAM, Windows 11, is hitting performance and reliability issues. They're using llama-server with custom flags and seeking advice on quant choice, throughput, and automatic model routing.

Commands and Quantizations

35B (UD Q4_K_M):

llama-server.exe -m "path\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

27B (UD Q4_K_XL):

llama-server.exe -m "path\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

Reported Issues

35B too slow – even simple iterative tasks feel unusable.
27B faster but unreliable – code output breaks; simple tasks can take 20–30 minutes.
Manual model switching – must kill server, paste new command, reload model.

Specific Questions

Are the flags suboptimal? (e.g., context size, batch size, cache type)
Which quant / model gives best balance of speed and coding accuracy on 24GB VRAM?
How to auto-switch models per request, or keep multiple models warm and route?

Context

The user runs Hermes agent on a Raspberry Pi 5 for scraping and automation, and local coding with OpenCode/QwenCode. They want a setup that doesn't require manual server restarts.

📖 Read the full source: r/LocalLLaMA

👀 See Also

Guides

Developer shares 25 tested Claude prompts for SaaS development workflows

A developer has shared 25 specific prompts they use daily for SaaS development, covering backend architecture, API design, frontend copy, product documentation, and go-to-market tasks. The prompts are designed to save time on repetitive tasks like code review, documentation generation, and edge case testing.

Mar 16, 2026, 03:45 AM UTC

OpenClawRadar

Guides

How Small Model Evaluation Prompts Can Mislead and How to Fix Them

A Reddit post explains that small model evaluation prompts often produce misleading results due to triggering the wrong cognitive pathways in transformers, specifically identifying three distinct modes: factual recall, application/instruction following, and emotional/empathic inference.

Mar 9, 2026, 11:45 AM UTC

OpenClawRadar

Guides

Building API endpoints with Claude: Practical prompt engineering lessons from a 70+ endpoint project

A developer built 70+ LinkedIn automation API endpoints with Claude writing 80% of the code, discovering that treating prompts like contracts with explicit constraints works better than natural language instructions for action-taking agents.

Mar 22, 2026, 11:45 PM UTC

OpenClawRadar

Guides

Free OpenClaw Gateway with Local LLM on Oracle Cloud

A developer shares how to run OpenClaw Gateway with a local Qwen3.5 27B A3B 4-bit LLM on Oracle Cloud's free tier using a VM.Standard.A2.Flex instance with 4 OCPUs, 24GB RAM, and 200GB SSD, managed remotely via the QCAI app.

Apr 13, 2026, 05:21 PM UTC

OpenClawRadar