Optimizing Qwen 3.6 27B/35B on RTX 3090: Flags, Quantization, and Auto-Routing

✍️ OpenClawRadar · 📅 Published: May 5, 2026

A developer running Qwen 3.6 models locally on an RTX 3090 (24GB VRAM) with a Ryzen 5700X, 64GB RAM, and Windows 11 is hitting performance and reliability issues. They use llama-server with custom flags and are seeking advice on quant choice, throughput, and automatic model routing.

Commands and Quantizations

35B (UD Q4_K_M):

llama-server.exe -m "path\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

27B (UD Q4_K_XL):

llama-server.exe -m "path\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
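
The two commands differ mainly in context size (-c) and KV-cache type (-ctk/-ctv), and both settings add to VRAM use on top of the model weights. The Python sketch below gives a rough way to compare them, assuming a standard attention KV cache. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen 3.6's actual values; the real figures would have to be read from the GGUF metadata. If the model has layers that keep no KV cache, only the attention layers should be counted.

```python
# Rough KV-cache size estimate for a transformer with standard attention.
# The architecture numbers used in the demo are HYPOTHETICAL placeholders,
# not real Qwen 3.6 values; read the real ones from the GGUF metadata.

# Approximate bytes per cached element for two llama.cpp cache types.
# q8_0 packs 32 values into 34 bytes (32 int8 values plus one f16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Return the estimated K+V cache size in GiB for n_ctx tokens."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    total_bytes = elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture: 48 cached layers, 4 KV heads, head_dim 128.
    # Context sizes and cache types mirror the two commands, plus a smaller one.
    for n_ctx, cache_type in [(131072, "f16"), (196608, "q8_0"), (32768, "q8_0")]:
        size = kv_cache_gib(n_ctx, 48, 4, 128, cache_type)
        print(f"ctx={n_ctx:>6} cache={cache_type:<5} ~{size:.2f} GiB")
```

Whatever the real architecture numbers turn out to be, the estimate scales linearly with context length, so it can help judge whether a smaller -c value or a quantized cache would leave more of the 24GB for weights.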

Reported Issues

  • 35B too slow – even simple iterative tasks feel unusable.
  • 27B faster but unreliable – code output breaks; simple tasks can take 20–30 minutes.
  • Manual model switching – must kill server, paste new command, reload model.
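
The manual kill-and-paste step could be reduced with a small launcher. The following is a minimal Python sketch, not the user's setup: it stores the two commands from this article as named profiles and restarts llama-server when a different profile is requested. The profile names ("35b", "27b"), the `ServerSwitcher` class, and the `build_command` helper are hypothetical, and the model paths are kept as the placeholders shown above. The model still has to reload on each switch.

```python
import subprocess

# Flags that are identical in both commands from the article.
COMMON = ["-ngl", "99", "-fa", "on", "-b", "2048", "-ub", "512", "-t", "8",
          "-rea", "on", "--reasoning-format", "deepseek", "--jinja",
          "--metrics", "--slots", "--port", "8081", "--host", "0.0.0.0"]

# Per-model flags, copied from the article. Paths are placeholders.
PROFILES = {
    "35b": ["-m", r"path\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf", "-c", "131072",
            "-np", "2", "-ctk", "f16", "-ctv", "f16", "--mlock",
            "--reasoning-budget", "2048"],
    "27b": ["-m", r"path\Qwen3.6-27B-UD-Q4_K_XL.gguf", "-c", "196608",
            "-np", "1", "-ctk", "q8_0", "-ctv", "q8_0", "--no-mmap",
            "--reasoning-budget", "-1"],
}


def build_command(profile, exe="llama-server.exe"):
    """Return the full argument list for the given profile name."""
    return [exe, *PROFILES[profile], *COMMON]


class ServerSwitcher:
    """Keeps at most one llama-server process running (hypothetical helper)."""

    def __init__(self):
        self.proc = None
        self.profile = None

    def stop(self):
        # Terminate the running server, if any, and wait for it to exit.
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait()
        self.proc = None
        self.profile = None

    def switch(self, profile):
        # Do nothing if the requested profile is already running.
        if self.profile == profile and self.proc and self.proc.poll() is None:
            return
        self.stop()
        self.proc = subprocess.Popen(build_command(profile))
        self.profile = profile
```

A call such as `ServerSwitcher().switch("27b")` would start the 27B command. Calling `switch("35b")` on the same object later would stop that server and start the other one.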

Specific Questions

  • Are the flags suboptimal (e.g., context size, batch size, cache type)?
  • Which quant and model give the best balance of speed and coding accuracy on 24GB VRAM?
  • How can models be switched automatically per request, or several kept warm with requests routed between them?
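
One way to picture the "keep multiple models warm and route" option is a thin router that reads the `model` field of an OpenAI-style request and forwards it to a matching llama-server instance. The Python sketch below rests on assumptions: the model aliases, the second port (8082), and the helper names are hypothetical. It also assumes two servers are already running on separate ports, whereas both commands in this article use port 8081. Whether both models fit in 24GB VRAM at the same time is not established by the source.

```python
import json
import urllib.request

# HYPOTHETICAL mapping of model aliases to running llama-server instances.
BACKENDS = {
    "qwen3.6-27b": "http://127.0.0.1:8081",
    "qwen3.6-35b-a3b": "http://127.0.0.1:8082",
}
DEFAULT_MODEL = "qwen3.6-27b"


def pick_backend(payload):
    """Choose a backend base URL from the request's 'model' field."""
    model = str(payload.get("model", "")).lower()
    return BACKENDS.get(model, BACKENDS[DEFAULT_MODEL])


def forward_chat(payload, timeout=600):
    """Send a chat request to the chosen backend and return the parsed JSON."""
    url = pick_backend(payload) + "/v1/chat/completions"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With this shape, a client would select a model only by changing the `model` string in its request, and unknown names would fall back to the default backend.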

Context

The user runs the Hermes agent on a Raspberry Pi 5 for scraping and automation, and does local coding with OpenCode/QwenCode. They want a setup that doesn't require manual server restarts.

📖 Read the full source: r/LocalLLaMA
