Speculative Decoding Benchmarks on RTX 3090 with Qwen Models for HVAC Business Use

Hardware and Setup
The developer used an RTX 3090 24GB, Ryzen 7600X, 32GB RAM, and WSL2 Ubuntu. They moved from Ollama on Windows to llama.cpp on WSL Linux with speculative decoding for an internal AI platform handling customer lookups, quote formatting, equipment research, and parsing messy job notes.
Testing Methodology
They tested 16 GGUF models across the Qwen2.5, Qwen3, and Qwen3.5 families, covering every target+draft combination that fits in 24GB VRAM, including cross-generation draft pairings (Qwen2.5 drafts on Qwen3 targets and vice versa). VRAM was monitored on every combination to catch CPU offloading. Quality evaluation used real HVAC business prompts for SQL generation, quote formatting, messy field note parsing, and equipment compatibility reasoning. Speed sweeps ran on draftbench and llama-throughput-lab, with Claude Code automating the process overnight.
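The per-combination VRAM check could look something like the sketch below. It is not the developer's actual tooling: it assumes nvidia-smi is on the PATH, and the helper names and the 512 MiB headroom threshold are hypothetical choices made for illustration.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB)."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def headroom_mib(line: str) -> int:
    """Free VRAM in MiB for one parsed line."""
    used, total = parse_vram(line)
    return total - used


def near_limit(line: str, min_headroom_mib: int = 512) -> bool:
    """Flag a run whose free VRAM is below a (hypothetical) safety margin."""
    return headroom_mib(line) < min_headroom_mib


def read_vram() -> tuple[int, int]:
    """Query the first GPU; requires an NVIDIA driver with nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out.strip().splitlines()[0])
```

Sampling read_vram() while a benchmark runs and flagging near_limit results is one way to mark combinations that may have spilled out of VRAM and need a second look.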
Top Speed Results
- Qwen3-8B Q8_0 + Qwen3-1.7B Q4_K_M: 279.9 tok/s (+236% speedup, 13.6 GB VRAM)
- Qwen2.5-7B Q4_K_M + Qwen2.5-0.5B Q8_0: 205.4 tok/s (+50% speedup, ~6 GB VRAM)
- Qwen3-8B Q8_0 + Qwen3-0.6B Q4_0: 190.5 tok/s (+129% speedup, 12.9 GB VRAM)
- Qwen3-14B Q4_K_M + Qwen3-0.6B Q4_0: 159.1 tok/s (+115% speedup, 13.5 GB VRAM)
- Qwen2.5-14B Q8_0 + Qwen2.5-0.5B Q4_K_M: 137.5 tok/s (+186% speedup, ~16 GB VRAM)
- Qwen3.5-35B-A3B Q4_K_M (baseline, no draft): 133.6 tok/s (22 GB VRAM)
- Qwen2.5-32B Q4_K_M + Qwen2.5-1.5B Q4_K_M: 91.0 tok/s (+156% speedup, ~20 GB VRAM)
The Qwen3-8B + 1.7B draft combo achieved a 100% acceptance rate: a perfect draft match, meaning the 1.7B predicted exactly what the 8B would have generated.
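The speedup percentages above can be sanity-checked with a little arithmetic. The sketch below assumes each percentage is measured against the same target model running without a draft, which the post does not spell out; the function names are illustrative.

```python
def implied_baseline(tok_s: float, speedup_pct: float) -> float:
    """Back out the no-draft throughput from a result and its speedup.

    Assumes speedup_pct is relative to the same target model without a draft,
    so tok_s = baseline * (1 + speedup_pct / 100).
    """
    return tok_s / (1 + speedup_pct / 100)


def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens the target model accepted."""
    if drafted == 0:
        return 0.0
    return accepted / drafted


# Both Qwen3-8B Q8_0 rows should imply roughly the same baseline.
with_1_7b = implied_baseline(279.9, 236)
with_0_6b = implied_baseline(190.5, 129)
print(f"{with_1_7b:.1f} vs {with_0_6b:.1f} tok/s")
```

Under that assumption, both Qwen3-8B Q8_0 rows imply a baseline of roughly 83 tok/s, so the two results are at least consistent with each other.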
Qwen3.5 Thinking Mode Issue
Qwen3.5 models enter thinking mode by default on llama.cpp, generating hidden reasoning tokens before responding. This caused erratic benchmark results: 0 tok/s alternating with 700 tok/s, and TTFT jumping between 1s and 28s. Only two approaches disabled it; everything else failed:
- --jinja + patched chat template with enable_thinking=false hardcoded ✅
- Raw /completion endpoint (bypasses chat template entirely) ✅
- Everything else (system prompts, /no_think suffix, temperature tricks) ❌
If running Qwen3.5 on llama.cpp, you need the patched template or you'll get garbage benchmarks.
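For the raw /completion route, a standard-library client is enough. This is a minimal sketch, assuming a llama-server instance listening on 127.0.0.1:8080 (the address is an assumption, not from the post). The request fields prompt and n_predict and the response field timings.predicted_per_second are believed to match the llama.cpp server API, but they should be checked against the installed build.

```python
import json
import urllib.request


def build_payload(prompt: str, n_predict: int = 256) -> dict:
    """Request body for the raw completion route (no chat template applied)."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}


def tokens_per_second(response: dict) -> float:
    """Read generation speed from the response's timings block, if present."""
    timings = response.get("timings", {})
    return float(timings.get("predicted_per_second", 0.0))


def complete(prompt: str, base_url: str = "http://127.0.0.1:8080",
             n_predict: int = 256) -> dict:
    """POST a prompt to /completion and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt, n_predict)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

Because the prompt is sent verbatim, any chat formatting the model expects has to be written into the prompt string by hand. That is the trade-off for skipping the template.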
Quality Evaluation Findings
They ran four hard HVAC-specific prompts testing ambiguous customer requests, complex quotes, messy notes with typos, and equipment compatibility reasoning. Key findings:
- Every single model failed the pricing formula math: 8B, 14B, 32B, 35B—none could correctly compute $4,811 / (1 - 0.47) = $9,077. LLMs cannot do business math reliably—put your formulas in code.
- The 8B handled 3/4 hard prompts—good on ambiguous requests, messy notes, daily tasks—but failed on technical equipment reasoning.
- The 35B-A3B was the only model with real HVAC domain knowledge—correctly sized a mini split for an uninsulated Chicago garage, knew to recommend Hyper-Heat series for cold climate, correctly said no branch box needed for single zone—but missed a model number in messy notes and failed the math.
- Bigger ≠ better across the board: The Qwen3-14B Q4_K_M (159 tok/s) performed worse than the 8B on most prompts. The 32B recommended a 5-ton unit for a 400 sqft garage.
- Qwen2.5-7B hallucinated on every note parsing test—consistently invented details.
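The pricing formula from the first finding is a one-liner once it lives in code. The sketch below reads 0.47 as a target margin, which is an interpretation and not something the post states; the function name is hypothetical. Decimal is used so rounding is explicit.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_for_margin(cost: str, margin: str) -> Decimal:
    """Selling price such that (price - cost) / price equals the margin.

    Inputs are strings so Decimal parses them exactly. The margin must be
    in the range [0, 1).
    """
    cost_d = Decimal(cost)
    margin_d = Decimal(margin)
    if not (Decimal("0") <= margin_d < Decimal("1")):
        raise ValueError("margin must be in [0, 1)")
    price = cost_d / (Decimal("1") - margin_d)
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


quote = price_for_margin("4811", "0.47")
print(quote)  # 9077.36, i.e. the $9,077 figure when rounded to whole dollars
```

The model can still extract the cost and margin from a messy request. The division itself then runs deterministically in code.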
📖 Read the full source: r/LocalLLaMA