12GB VRAM Benchmarks: Running Qwen 3.6 and Gemma 4 Models on an RTX 4070 Super

A Reddit user has published speed benchmarks for running several large models, including MoE variants, on a 12 GB RTX 4070 Super (with a +10% overclock), paired with an AMD 9800X3D CPU and 64 GB of DDR5-6000 RAM. The user drives the display from the iGPU to save VRAM, noting a ~10% performance penalty when the display runs on the RTX card instead. The setup uses CUDA 13.1 and the latest llama.cpp with the following shared settings:
n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
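The settings above are plain key = value pairs. As a minimal sketch, assuming each key is passed to llama.cpp as a long command-line option of the same name (the post does not show how the user actually supplies them, and the helper names parse_settings and to_cli_args are hypothetical), they could be turned into an argument list like this:

```python
# Shared settings copied from the post.
SETTINGS = """\
n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
"""


def parse_settings(text):
    """Parse 'key = value' lines into a dict of strings (insertion-ordered)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def to_cli_args(settings):
    """Assumption: every key maps to a '--key value' option pair."""
    args = []
    for key, value in settings.items():
        args += [f"--{key}", value]
    return args


if __name__ == "__main__":
    print(" ".join(to_cli_args(parse_settings(SETTINGS))))
```

How boolean values such as flash-attn are spelled can differ between llama.cpp builds, so the generated arguments should be checked against the help output of the installed version.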
Benchmark Results
The user tested four models, all Unsloth GGUF quants, in VS Code with Cline and KiloCode and reported no tool-call issues. Results are given as token generation (tgs) and prompt processing (pps) rates, both in tokens per second.
- Qwen3.6-35B-A3B-GGUF Q6_K_XL: 40 tgs, 2100 pps
- Qwen3.6-27B-IQ3_XXS: 16 tgs, 1000 pps
- Gemma 4 26B-A4B-it-UD-Q8: 26 tgs, 2150 pps
- Gemma-4-31B-it-IQ3_XXS: 13-16 tgs, 650 pps
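To put these rates in perspective, here is a rough back-of-the-envelope estimate of request latency. It assumes the reported rates hold constant over the whole request, which is a simplification since throughput usually changes as the context fills. The 30,000-token prompt and 500-token reply are illustrative values, not figures from the post.

```python
def estimate_seconds(prompt_tokens, output_tokens, pps, tgs):
    """Prompt processing time plus generation time, in seconds."""
    return prompt_tokens / pps + output_tokens / tgs


# (pps, tgs) as reported in the post; the lower bound is used for
# the 13-16 tgs range of the Gemma 4 31B run.
RESULTS = {
    "Qwen3.6-35B-A3B Q6_K_XL": (2100, 40),
    "Qwen3.6-27B IQ3_XXS": (1000, 16),
    "Gemma 4 26B-A4B Q8": (2150, 26),
    "Gemma 4 31B IQ3_XXS": (650, 13),
}

if __name__ == "__main__":
    # Illustrative workload: 30,000 prompt tokens, 500 generated tokens.
    for name, (pps, tgs) in RESULTS.items():
        total = estimate_seconds(30_000, 500, pps, tgs)
        print(f"{name}: ~{total:.0f} s")
```

For agentic coding tools that send long prompts, the prompt processing rate dominates this estimate, which is why the pps column matters as much as tgs.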
Notable Config Details
The user shared individual model configs with specific tuning. Key highlights:
- For Qwen3.6-35B-A3B: n-cpu-moe = 35 (keeps the MoE expert weights of the first 35 layers on the CPU), cache-type-k = q8_0, cache-type-v = q8_0, swa-full = true, cache-reuse = 512, context size 131072, reasoning enabled with a budget of 8096.
- For Gemma 4 26B: n-cpu-moe = 27, context size 102400, fit = on with fit-target = 256 and fit-ctx = 32768.
- For Gemma 4 31B: speculative decoding with ngram-mod (spec-type = ngram-mod), n-gpu-layers = 58 (partial GPU offload), cache-type-k = q4_0, no-kv-offload = true.
- All models use flash-attn = true and no-mmproj-offload = true.
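The per-model settings listed above can be read as overrides layered on top of the shared defaults. A minimal sketch, assuming a simple "override wins" merge; the post does not say how the user's configs are combined, effective_config is a hypothetical helper, and only a subset of the reported values is included:

```python
# A few of the shared defaults from the post.
DEFAULTS = {
    "n-gpu-layers": "999",
    "ctx-size": "65536",
    "flash-attn": "true",
}

# A subset of the per-model values reported in the post.
OVERRIDES = {
    "Qwen3.6-35B-A3B": {
        "n-cpu-moe": "35",
        "ctx-size": "131072",
        "cache-type-k": "q8_0",
        "cache-type-v": "q8_0",
    },
    "Gemma 4 26B": {"n-cpu-moe": "27", "ctx-size": "102400"},
    "Gemma 4 31B": {"n-gpu-layers": "58", "cache-type-k": "q4_0"},
}


def effective_config(model):
    """Return defaults with the model's overrides applied on top."""
    return {**DEFAULTS, **OVERRIDES.get(model, {})}


if __name__ == "__main__":
    for model in OVERRIDES:
        print(model, effective_config(model))
```

With this layering, Gemma 4 31B keeps the default context size while replacing the full GPU offload with 58 layers, and the two MoE models only add their CPU expert placement and larger contexts.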
The user prefers Qwen3.6-35B-A3B for web development, praising its quality and the absence of tool-call issues in VS Code extensions.
📖 Read the full source: r/LocalLLaMA