Qwen3-30B-A3B vs Qwen3.5-35B-A3B Performance Comparison on RTX 5090

A benchmark comparing Qwen3-30B-A3B and the newly released Qwen3.5-35B-A3B on an NVIDIA RTX 5090 reveals trade-offs between speed and context handling. Both are Mixture of Experts models with 3B active parameters; the 3.5 version has 5B more total parameters and includes a vision projector.
Hardware and Setup
- GPU: NVIDIA RTX 5090 (32 GB VRAM, Blackwell)
- Server: llama.cpp b8115 (Docker: ghcr.io/ggml-org/llama.cpp:server-cuda)
- Quantization: Q4_K_M for both models
- KV Cache: Q8_0 (-ctk q8_0 -ctv q8_0)
- Context: 32,768 tokens (-c 32768)
- Parameters: -ngl 999 -np 4 --flash-attn on -t 12
- Model A: Qwen3-30B-A3B-Q4_K_M (17 GB on disk)
- Model B: Qwen3.5-35B-A3B-Q4_K_M (21 GB on disk)
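For reference, the settings listed above can be assembled into a single server launch command. The following is a minimal sketch, not the exact command used in the benchmark: the `/models` volume mount, the `.gguf` file name, the port, and the `--gpus all` Docker option are assumptions added for illustration, while the llama.cpp flags are the ones given in the list.

```python
import shlex


def llama_server_command(model_path: str, port: int = 8080) -> list[str]:
    """Build a `docker run` argument list from the benchmark settings.

    The volume mount, port, and model path are hypothetical; only the
    image name and the llama.cpp flags come from the setup list.
    """
    return [
        "docker", "run", "--gpus", "all",
        "-p", f"{port}:{port}",
        "-v", "/models:/models",            # hypothetical host directory with GGUF files
        "ghcr.io/ggml-org/llama.cpp:server-cuda",
        "-m", model_path,                   # hypothetical path inside the container
        "--host", "0.0.0.0", "--port", str(port),
        "-c", "32768",                      # context size
        "-ngl", "999",                      # offload all layers to the GPU
        "-np", "4",                         # parallel slots
        "--flash-attn", "on",
        "-t", "12",                         # CPU threads
        "-ctk", "q8_0", "-ctv", "q8_0",     # Q8_0 KV cache
    ]


cmd = llama_server_command("/models/Qwen3-30B-A3B-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Swapping the model path is the only change needed to launch the second model with identical settings, which keeps the comparison like-for-like.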
Both models were warmed up with a throwaway request before timing. Server-side timings came from API responses, not wall-clock measurements.
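Reading server-side timings rather than wall-clock time might look like the sketch below. It assumes the llama.cpp server response includes a `timings` object with `prompt_n`, `prompt_ms`, `predicted_n`, and `predicted_ms` fields; check these names against the server version in use. The sample response is made up for illustration and is not benchmark data.

```python
def generation_speed(response: dict) -> float:
    """Generated tokens per second, from the server-reported timings.

    Assumes `timings.predicted_n` is the number of generated tokens and
    `timings.predicted_ms` is the generation time in milliseconds.
    """
    timings = response["timings"]
    return timings["predicted_n"] * 1000 / timings["predicted_ms"]


def prompt_speed(response: dict) -> float:
    """Prompt tokens processed per second, from the server-reported timings."""
    timings = response["timings"]
    return timings["prompt_n"] * 1000 / timings["prompt_ms"]


# Made-up response fragment, only to show the expected shape.
sample = {
    "timings": {
        "prompt_n": 9, "prompt_ms": 12.0,
        "predicted_n": 200, "predicted_ms": 800.0,
    }
}

print(f"generation: {generation_speed(sample):.1f} tok/s")
print(f"prompt:     {prompt_speed(sample):.1f} tok/s")
```

Because these values are measured inside the server, they exclude network and HTTP overhead, which is why they differ from a client-side stopwatch.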
Raw Inference Speed Results
Testing directly against the llama.cpp /v1/chat/completions endpoint gave the following generation speeds:
- Short prompts (8-9 tokens): 30B: 248.2 tok/s, 3.5: 169.5 tok/s
- Medium prompts (73-78 tokens): 30B: 236.1 tok/s, 3.5: 163.5 tok/s
- Long-form (800 tokens): 30B: 232.6 tok/s, 3.5: 116.3 tok/s
- Code generation (298-400 tokens): 30B: 233.9 tok/s, 3.5: 161.6 tok/s
- Reasoning (200 tokens): 30B: 234.8 tok/s, 3.5: 158.2 tok/s
Average generation speed: 30B: 237.1 tok/s, 3.5: 153.8 tok/s (the 3.5 is about 35% slower; equivalently, the 30B is about 54% faster)
Prompt processing averages: 30B: 773.5 tok/s, 3.5: 518.1 tok/s (the 3.5 is about 33% slower)
The 3.5 model regresses on long-form output (800 tokens), dropping to 116.3 tok/s versus roughly 158-170 tok/s in the other categories, while the 30B stays within a narrow 232-248 tok/s band throughout. The source attributes the 3.5's slower prompt processing to its larger vocabulary (248K vs 152K tokens).
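The averages and the relative gap can be rechecked from the per-category figures listed above. This short sketch uses only those numbers; the dictionary keys are labels chosen here for illustration. It also shows why the gap reads differently depending on the baseline: a model that is about 35% slower than another corresponds to the other being about 54% faster.

```python
from statistics import mean

# Generation speed in tok/s per category: (Qwen3-30B-A3B, Qwen3.5-35B-A3B)
gen_speed = {
    "short": (248.2, 169.5),
    "medium": (236.1, 163.5),
    "long_form": (232.6, 116.3),
    "code": (233.9, 161.6),
    "reasoning": (234.8, 158.2),
}

avg_30b = mean(a for a, _ in gen_speed.values())
avg_35 = mean(b for _, b in gen_speed.values())

# Same gap, two baselines.
pct_30b_faster = (avg_30b / avg_35 - 1) * 100   # relative to the 3.5
pct_35_slower = (1 - avg_35 / avg_30b) * 100    # relative to the 30B

print(f"30B average: {avg_30b:.1f} tok/s")
print(f"3.5 average: {avg_35:.1f} tok/s")
print(f"30B is {pct_30b_faster:.0f}% faster; 3.5 is {pct_35_slower:.0f}% slower")
```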
Memory Usage
VRAM usage: 30B uses 27.3 GB idle, 3.5 uses 29.0 GB idle. Both fit within the RTX 5090's 32 GB, leaving roughly 4.7 GB and 3.0 GB of headroom respectively.
Response Quality Observations
Testing at temperature=0.7 showed both models produce competent output. Key observations:
- Creative writing: Both solid, with 3.5 showing slightly more atmospheric prose
- Haiku generation: Both produce valid 5-7-5 structures
- Coding tasks: Both correctly implement LRU cache with O(1) get/put operations
The 3.5 model handles long context significantly better, with flat token scaling versus a 21% degradation for the 30B. Quality differences are minimal, with a slight edge to the 3.5 in structure and formatting.
📖 Read the full source: r/LocalLLaMA