Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

A Reddit user has posted a detailed setup for running Qwen3.6-35B-A3B GGUF models with ~190k context on a laptop with 8GB VRAM (RTX 4060) and 32GB DDR5 RAM. They report 37-43 tok/s out of the box, with tweaks pushing to ~51 tok/s.
Hardware & Models
- GPU: RTX 4060 8GB VRAM
- RAM: 32GB DDR5 5600MHz
- OS: Linux (performance noted as better than Windows)
- Models tested (Q5 quant):
mudler/Qwen3.6-35B-A3B-APEX-GGUF– ~40 tok/s to 37 tok/shesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF– ~43 tok/s to 37 tok/s
Key Configuration
Using a fork of llama.cpp with TurboQuant support (turboquant_plus), the user runs llama-server with the following flags:
--model "<path>" \
--host 0.0.0.0 \
--port 8085 \
--ctx-size 192640 \
--n-gpu-layers 430 \
--n-cpu-moe 35 \
--cache-type-k "turbo4" \
--cache-type-v "turbo4" \
--flash-attn on \
--batch-size 2048 \
--parallel 1 \
--no-mmap \
--mlock \
--ubatch-size 512 \
--threads 6 \
--cont-batching \
--timeout 300 \
--temp 0.2 \
--top-p 0.95 \
--min-p 0.05 \
--top-k 20 \
--metrics \
--chat-template-kwargs '{"preserve_thinking": true}'
To push speeds to ~51 tok/s, adjust three flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35 (tweak slightly based on stability/memory).
Caveats
- Q4 quant is noticeably worse for long-context reasoning vs Q5.
--no-mmap+--mlockreduces stuttering slowdowns.- TurboQuant KV cache is critical at high context sizes.
- High RAM bandwidth (DDR5) is important for these speeds.
- Linux outperforms Windows significantly for this workload.
Who This Is For
Developers running local LLMs with very long contexts (170k+ tokens) on consumer hardware, especially those with 8-12GB VRAM and fast system RAM.
📖 Read the full source: r/LocalLLaMA
👀 See Also

How to fix OpenClaw 'Cannot find module' error after update
After updating OpenClaw from version 2026.3.24 to 2026.4.5, users are encountering a 'Cannot find module @buape/carbon' error. The solution involves manually running a post-installation script instead of installing the package globally.

How to run OpenClaw agents for free using cloud APIs or local models
A detailed guide explains how to run OpenClaw agents at zero cost using free cloud tiers from OpenRouter, Gemini, and Groq, or by running local models via Ollama with specific configuration tips to avoid common pitfalls.

Understanding AI Agent Architecture: Deterministic vs Probabilistic Layers
A Reddit user shares a mental model for AI agent systems that separates deterministic layers (scripts, commands, APIs) from probabilistic layers (LLM reasoning and decisions). The key insight: push as much work as possible to the deterministic side.

Making an MCP Server Install Itself: Three Hosts, Three Mechanisms, Gotchas
A deep dive into programmatically installing MCP servers across VS Code, Cursor, and Claude Code — covering APIs, file writes, and edge cases like malformed JSON, atomic writes, and idempotent updates.