Setting Up Qwen3.5-27B Locally: vLLM vs llama.cpp Comparison

Qwen3.5-27B Performance and Capabilities
According to the source, Qwen3.5-27B performs strongly on several benchmarks: MMLU-Pro 85.3, MMLU-Redux 93.3, and C-Eval 90.2. It also reports an overall intelligence score of 42.1 (better than 91% of compared models) and a coding index of 34.9 (better than 88% of compared models). The model uses a dense architecture with a native 262k-token context window that is extensible to 1M+ tokens.
Backend Comparison: llama.cpp vs vLLM
The source compares two main approaches for local deployment:
Option 1: llama.cpp
- Pros: Low footprint, easy setup, supports q4 KV cache for reasonable VRAM usage
- Cons: The KV cache can get wiped at random, forcing full prompt reprocessing mid-session. Speculative decoding via MTP (multi-token prediction) doesn't work. Known bug with no solid fixes yet.
Option 2: vLLM
- Pros: Stable sessions, no KV wipeouts, supports speculative decoding with MTP for faster generations
- Cons: No q4 KV support, so VRAM usage spikes at 256k context. Tool call parsing is buggy for Qwen3.5 in v0.17.1; fixes exist in open GitHub PRs but are not merged yet. The bug produces malformed JSON output, which breaks agentic coding flows.
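To see why a 4-bit KV cache matters at long context, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3.5-27B architecture, and the formula ignores quantization scale overhead and any layers that do not use a standard attention cache. It only illustrates how cache size scales with bytes per element.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element):
    """Approximate KV cache size for one sequence.

    The factor of 2 accounts for storing both keys and values.
    """
    return int(2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element)


# Hypothetical architecture values, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256
CONTEXT = 262_144  # 256k tokens

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 2.0)  # 16-bit cache
q4 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 0.5)    # 4-bit cache

print(f"16-bit KV cache: {fp16 / 2**30:.1f} GiB")  # 16.0 GiB
print(f"4-bit KV cache:  {q4 / 2**30:.1f} GiB")    # 4.0 GiB
```

Whatever the real dimensions are, a 4-bit cache is roughly a quarter the size of a 16-bit one at the same context length, which is the trade-off the two backends differ on.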
Recommended vLLM Configuration
The source provides specific configuration recommendations for stable, high-speed runs using the Hugging Face model osoleve/Qwen3.5-27B-Text-NVFP4-MTP:
- Use the flashinfer cutlass backend for optimized performance
- Set context window to 128k (balances VRAM and usability; bump to 256k if you have the hardware)
- Limit GPU memory utilization to 0.82 to avoid OOM crashes
- Set max-num-seqs to 2 (handles a single session fine without overcommitting)
- Enable MTP speculative decoding for speed improvements
- Patch vLLM with the Qwen tool call parsing fixes from the open PRs
- Use the Claude Code CLI as the client; OpenCode still has tool call parsing issues that don't appear in Claude Code after the patch
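As a minimal sketch, the numeric settings above could be assembled into a `vllm serve` command like this. The helper `build_vllm_serve_command` is a hypothetical name introduced here, and 131072 is an assumed reading of "128k". The flashinfer backend selection, the exact MTP speculative-decoding settings, and the tool call parser patch are left out because the source does not give their exact values; the optional `speculative_config` argument only shows where a JSON config would be passed.

```python
import json
import shlex


def build_vllm_serve_command(model, max_model_len, gpu_memory_utilization,
                             max_num_seqs, speculative_config=None):
    """Build the argument list for `vllm serve` (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", str(max_num_seqs),
    ]
    if speculative_config is not None:
        # vLLM takes the speculative decoding settings as a JSON string.
        cmd += ["--speculative-config", json.dumps(speculative_config)]
    return cmd


cmd = build_vllm_serve_command(
    model="osoleve/Qwen3.5-27B-Text-NVFP4-MTP",
    max_model_len=131072,          # assumed value for "128k"
    gpu_memory_utilization=0.82,
    max_num_seqs=2,
)
print(shlex.join(cmd))
```

Printing the command rather than running it makes it easy to review before launch. Check the documentation for your vLLM version for the MTP method name and backend options before adding them.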
Performance Results
According to the source, generation speed in tokens per second (TPS) varies by hardware:
- On an RTX 5090 (32GB VRAM): ~50 TPS
- On an RTX Pro 6000 (96GB VRAM): 70 TPS at full 256k context
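To put these figures in practical terms, the short sketch below estimates how long a response would take to generate at each reported rate. The 2,000-token response length is a hypothetical example, and the estimate ignores prompt processing time.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Estimate decode time for a response at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2000  # hypothetical response length

for name, tps in [("RTX 5090", 50), ("RTX Pro 6000", 70)]:
    print(f"{name}: {generation_seconds(RESPONSE_TOKENS, tps):.1f} s")
# RTX 5090: 40.0 s
# RTX Pro 6000: 28.6 s
```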
📖 Read the full source: r/LocalLLaMA