Optimizing Qwen3.5-9B on RTX 3070 Mobile with ik_llama.cpp: Config Tweaks and Benchmarks

Hardware and Software Setup
A developer documented their experience optimizing local inference on a laptop with an RTX 3070 Mobile GPU (8GB VRAM, effectively ~7.7GB usable). The system runs CachyOS (Arch-based Linux 6.19) with 32GB RAM and an Intel i7-10750H CPU. They used ik_llama.cpp (ikawrakow's optimized fork of llama.cpp) with the Qwen3.5-9B Q4_K_M model from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF.
Initial Configuration Issues
The initial naive configuration included several problems:
- MoE-specific flags (--n-cpu-moe, -ger, -ser) were incorrectly applied to a non-MoE model (n_expert = 0)
- --mlock was silently failing due to memory allocation limits (requires ulimit -l unlimited or a limits.conf entry)
- Batch size -b 4096 was consuming excessive VRAM (2004 MiB compute buffer), nearly 2GB on an 8GB card
This configuration produced ~47.8 t/s generation speed and ~82 t/s prompt evaluation with VRAM at ~97%.
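Because the --mlock failure was silent, it can help to check the locked-memory limit before launching the server. The following is a minimal sketch that assumes a Linux system and uses Python's standard resource module. The helper name memlock_allows and the 6 GiB model size are illustrative assumptions, not values from the original post.

```python
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if the memlock soft limit is large enough to pin model_bytes."""
    # RLIM_INFINITY is what `ulimit -l unlimited` corresponds to.
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


if __name__ == "__main__":
    # Hypothetical model size, used only for illustration.
    assumed_model_bytes = 6 * 1024**3

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if memlock_allows(assumed_model_bytes, soft):
        print("memlock limit looks sufficient for --mlock")
    else:
        print(f"memlock soft limit is {soft} bytes; --mlock would likely fail")
        print("raise it with `ulimit -l unlimited` or a limits.conf entry")
```

Run it in the same shell that starts llama-server, since the limit is inherited per process.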
Optimization Results
After fixing the configuration issues and adjusting batch sizes to -b 2048 -ub 512 (reducing compute buffer to 501 MiB), the developer tested different KV cache configurations:
- Original (q4_0/q4_0, b4096): 47.8 t/s gen, 82.6 t/s prompt, ~97% VRAM
- Fixed flags + b2048/ub512, q8_0K/q4_0V: 48.4 t/s gen, 189.9 t/s prompt, ~80% VRAM
- q8_0K/q8_0V: 50.0 t/s gen, 213.0 t/s prompt, ~84% VRAM
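As a quick sanity check, the relative changes between the three runs listed above can be recomputed directly. This is a small sketch: the helper name pct_change is illustrative, and the throughput figures are copied from the list.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# (generation t/s, prompt t/s) as reported for each run
runs = {
    "original q4_0/q4_0, b4096": (47.8, 82.6),
    "q8_0K/q4_0V": (48.4, 189.9),
    "q8_0K/q8_0V": (50.0, 213.0),
}

base_gen, base_prompt = runs["original q4_0/q4_0, b4096"]
for name, (gen, prompt) in runs.items():
    print(
        f"{name}: gen {pct_change(base_gen, gen):+.1f}%, "
        f"prompt {pct_change(base_prompt, prompt):+.1f}%"
    )
```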
Prompt evaluation speed rose from ~82 to ~213 t/s, which the developer attributes primarily to the smaller batch size freeing up GPU memory. Generation speed barely changed (about 3% between the q4_0 and q8_0 V cache, 48.4 vs 50.0 t/s), but the q8_0/q8_0 configuration produced noticeably more coherent and complete responses on longer outputs, which the developer judged worth the extra ~256 MiB of VRAM.
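The VRAM cost of a cache type follows from the ggml block layouts: q4_0 stores 32 values in 18 bytes and q8_0 stores 32 values in 34 bytes (each block carries one fp16 scale). The sketch below estimates the size of one cache tensor (K or V) from those sizes. The layer count, KV head count, and head dimension are hypothetical placeholders chosen so the result lands near the reported ~256 MiB difference. They are not read from this model's GGUF metadata and should be replaced with the real values.

```python
# Bytes per block of 32 values in ggml's quantized formats.
BLOCK_BYTES = {"q4_0": 18, "q8_0": 34}
BLOCK_SIZE = 32


def cache_tensor_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, cache_type: str) -> int:
    """Approximate size in bytes of the K (or V) cache for one cache type."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    return elements // BLOCK_SIZE * BLOCK_BYTES[cache_type]


if __name__ == "__main__":
    # Hypothetical shape: only layers that keep a KV cache should be counted.
    shape = dict(n_ctx=65536, n_layers=8, n_kv_heads=4, head_dim=256)

    v_q4 = cache_tensor_bytes(**shape, cache_type="q4_0")
    v_q8 = cache_tensor_bytes(**shape, cache_type="q8_0")
    mib = 1024**2
    print(f"V cache q4_0: {v_q4 / mib:.0f} MiB")
    print(f"V cache q8_0: {v_q8 / mib:.0f} MiB")
    print(f"extra for q8_0: {(v_q8 - v_q4) / mib:.0f} MiB")
```

With this assumed shape the V cache comes to 288 MiB at q4_0 and 544 MiB at q8_0, a 256 MiB difference. A model with a different number of attention layers or heads would give a different figure.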
Final Configuration
The optimized command for single-user local server use:
./build/bin/llama-server \
  -m ./models/Qwen3.5-9B.Q4_K_M.gguf \
  -ngl 999 \
  -fa on \
  -c 65536 \
  -b 2048 \
  -ub 512 \
  -ctk q8_0 \
  -ctv q8_0 \
  --threads 6 \
  --threads-batch 12
Open Questions and Future Testing
The developer identified several areas for further investigation:
- GPU power limit tuning on mobile GPUs (potential to reduce TGP with minimal speed loss since inference is memory-bandwidth bound)
- Other 8GB-compatible models with good coding or reasoning performance
- Comparison of ik_llama.cpp vs mainline llama.cpp (ik-specific optimizations include fused ops and graph reuse)
- Tips for hybrid SSM architecture (context shift warnings cause hard stops when context fills, no sliding window)
The testing used a prompt requesting implementation of a Rust Sieve of Eratosthenes program with algorithm explanation, complexity analysis, and example output for N=50.
📖 Read the full source: r/LocalLLaMA