12GB VRAM Benchmarks: Running Qwen 3.6 and Gemma 4 Models on an RTX 4070 Super

✍️ OpenClawRadar · 📅 Published: April 30, 2026

A Reddit user has published speed benchmarks for running several large models, both MoE and dense, on a 12 GB RTX 4070 Super (with a +10% overclock), paired with an AMD Ryzen 7 9800X3D CPU and 64 GB of DDR5-6000 RAM. The user drives the display from the iGPU to free up VRAM, noting a ~10% performance penalty otherwise. The setup uses CUDA 13.1 and the latest llama.cpp with the following baseline runtime settings:

n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
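
As a rough illustration of how these settings might be passed to llama.cpp, the sketch below turns the key/value pairs into a llama-server command line. The model path is hypothetical, and the mapping of true/false to "on"/"off" is an assumption (older builds may expect a bare flag instead), so check the output against `llama-server --help` for your build.

```python
import shlex

# Baseline settings as listed in the article.
SHARED = {
    "n-gpu-layers": 999,
    "threads": 8,
    "threads-batch": 16,
    "batch-size": 4096,
    "ubatch-size": 4096,
    "ctx-size": 65536,
    "flash-attn": True,
}


def to_argv(settings, binary="llama-server", model="model.gguf"):
    """Build an argument list from key/value settings.

    The binary name and model path are placeholders, not values from the post.
    """
    argv = [binary, "--model", model]
    for key, value in settings.items():
        if isinstance(value, bool):
            # Assumption: booleans are passed as "on"/"off".
            argv += [f"--{key}", "on" if value else "off"]
        else:
            argv += [f"--{key}", str(value)]
    return argv


command = to_argv(SHARED)
print(shlex.join(command))
```

Per-model overrides (for example a larger ctx-size) could be merged into a copy of SHARED before calling to_argv.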

Benchmark Results

The user tested four models, all Unsloth GGUF quants, in VS Code with Cline and KiloCode and reported no tool-call issues. Speeds are given in tokens per second for token generation (tgs) and prompt processing (pps).

Qwen3.6-35B-A3B-GGUF Q6_K_XL: 40 tgs, 2100 pps
Qwen3.6-27B-IQ3_XXS: 16 tgs, 1000 pps
Gemma 4 26B-A4B-it-UD-Q8: 26 tgs, 2150 pps
Gemma-4-31B-it-IQ3_XXS: 13-16 tgs, 650 pps
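
To put these figures in practical terms, the sketch below estimates how long one agent turn might take: prompt tokens divided by pps plus output tokens divided by tgs. The prompt and output sizes are hypothetical, the estimate assumes no prompt caching, and it treats the reported speeds as constant even though real speeds usually drop as the context fills.

```python
# Speeds reported in the article: (generation tok/s, prompt tok/s).
# Gemma 4 31B generation was reported as a 13-16 range; the low end is used.
RESULTS = {
    "Qwen3.6-35B-A3B Q6_K_XL": (40, 2100),
    "Qwen3.6-27B IQ3_XXS": (16, 1000),
    "Gemma 4 26B-A4B Q8": (26, 2150),
    "Gemma 4 31B IQ3_XXS": (13, 650),
}


def turn_seconds(tgs, pps, prompt_tokens, output_tokens):
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pps + output_tokens / tgs


PROMPT_TOKENS = 20_000  # hypothetical prompt size for a coding agent turn
OUTPUT_TOKENS = 500     # hypothetical reply length

for name, (tgs, pps) in RESULTS.items():
    seconds = turn_seconds(tgs, pps, PROMPT_TOKENS, OUTPUT_TOKENS)
    print(f"{name}: ~{seconds:.1f} s per turn")
```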

Notable Config Details

The user shared individual model configs with specific tuning. Key highlights:

For Qwen3.6-35B-A3B: n-cpu-moe = 35 (keeps the MoE expert weights of the first 35 layers on the CPU), cache-type-k = q8_0, cache-type-v = q8_0, swa-full = true, cache-reuse = 512, context size 131072 (up from the 65536 baseline), reasoning enabled with budget 8096.
For Gemma 4 26B: n-cpu-moe = 27, context 102400, fit = on with fit-target = 256 and fit-ctx = 32768.
For Gemma 4 31B: speculative decoding with spec-type = ngram-mod, n-gpu-layers = 58 (partial GPU offload), cache-type-k = q4_0, no-kv-offload = true (keeps the KV cache in system RAM).
All models use flash-attn = true and no-mmproj-offload = true.
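
The cache-type settings above trade KV cache precision for memory. The sketch below shows the arithmetic, based on ggml's block layouts for q8_0 and q4_0 (32 values plus one 16-bit scale per block) against the default f16 cache. The architecture numbers in the usage example are hypothetical and do not describe any model in this article. The formula also assumes every counted layer caches the full context, which does not hold for sliding-window layers.

```python
# Bits stored per cached value. q8_0 and q4_0 pack 32 values per block
# together with one 16-bit scale.
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits
}


def relative_size(cache_type, baseline="f16"):
    """Cache size for cache_type as a fraction of the baseline type."""
    return BITS_PER_VALUE[cache_type] / BITS_PER_VALUE[baseline]


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, ctx, type_k, type_v):
    """Approximate KV cache size, assuming each layer caches all ctx tokens."""
    values_per_token = n_attn_layers * n_kv_heads * head_dim
    bits = values_per_token * ctx * (BITS_PER_VALUE[type_k] + BITS_PER_VALUE[type_v])
    return bits / 8


# Hypothetical architecture, for illustration only.
for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q4_0", "f16")]:
    size = kv_cache_bytes(10, 2, 256, 131072, k_type, v_type)
    print(f"K={k_type}, V={v_type}: {size / 2**30:.2f} GiB")
```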

The user's preferred model for web development is Qwen3.6-35B-A3B, which they praise for its quality and for working without tool-call issues in VS Code extensions.

📖 Read the full source: r/LocalLLaMA
