Qwen3.5 27B vs Larger Models: Local Coding Test

A developer tested several large language models for local coding tasks, comparing performance and hardware requirements. The testing focused on Qwen3.5 variants and Nemotron models, with comparisons to GPT-5.4 High.

Test Results and Findings

The developer tested these specific models:

unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL
unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL
unsloth/Qwen3.5-122B-A10B-GGUF
unsloth/Qwen3.5-27B-GGUF:UD-Q6_K_XL
unsloth/Qwen3.5-27B-GGUF:UD-Q8_K_XL
unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ4_XS
unsloth/gpt-oss-120b-GGUF:F16

Key findings from the testing:

Nemotron-3-Super-120B performed "very, very good," on par with GPT-5.4 High
Qwen3.5-27B performed well for development tasks
GPT-OSS-120B and Qwen3.5-122B performed worse than the other two models
Nemotron-3-Super-120B consistently responded in Spanish (the tester's native language) while others responded in English

Performance Metrics

The developer provided specific performance numbers:

Nemotron-3-Super-120B: 80 tokens per second (tg/s), ~2000 prompt processing (pp), 100k context on vast.ai with 4x RTX 3090
Qwen3.5-27B Q6: 803 pp, 25 tg/s, 256k context on vast.ai

Hardware Requirements

The developer noted hardware constraints:

Qwen3.5-122B would require a new motherboard and 1-2 more RTX 3090 cards, making it too expensive
Qwen3.5-27B runs on existing 2x RTX 3090 hardware without additional investment
If they had the hardware for Nemotron-3-Super-120B, they would use it instead

Implementation Details

The developer plans to use Qwen3.5-27B-GGUF:UD-Q6_K_XL for real development tasks locally and provided the llama.cpp command used for testing:

./llama.cpp/llama-server -hf unsloth/Qwen3.5-27B-GGUF:UD-Q6_K_XL --ctx-size 262144 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -ngl 999

The developer mentioned they'll continue using CODEX for complex tasks but can replace API subscriptions for daily tasks with the local setup.

📖 Read the full source: r/LocalLLaMA

Developer Tests Qwen3.5 27B vs Larger Models for Local Coding Tasks

Test Results and Findings

Performance Metrics

Hardware Requirements

Implementation Details

👀 See Also

Distilled Qwen 3.5 27B Model Shows Strong Performance with Cursor AI Coding Agent

Project Headroom: Netflix Engineer's Open Source Tool Slashes AI Token Costs by 90%

Open-source structural hallucination checker for AI agent pipelines

ClawMetry adds remote monitoring with E2E encryption for OpenClaw agents