Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

✍️ OpenClawRadar📅 Published: May 10, 2026🔗 Source

Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

Ad

A Reddit user has posted a detailed setup for running Qwen3.6-35B-A3B GGUF models with ~190k context on a laptop with 8GB VRAM (RTX 4060) and 32GB DDR5 RAM. They report 37-43 tok/s out of the box, with tweaks pushing to ~51 tok/s.

Hardware & Models

GPU: RTX 4060 8GB VRAM
RAM: 32GB DDR5 5600MHz
OS: Linux (performance noted as better than Windows)
Models tested (Q5 quant):
- mudler/Qwen3.6-35B-A3B-APEX-GGUF – ~40 tok/s to 37 tok/s
- hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF – ~43 tok/s to 37 tok/s

Key Configuration

Using a fork of llama.cpp with TurboQuant support (turboquant_plus), the user runs llama-server with the following flags:

--model "<path>" \
--host 0.0.0.0 \
--port 8085 \
--ctx-size 192640 \
--n-gpu-layers 430 \
--n-cpu-moe 35 \
--cache-type-k "turbo4" \
--cache-type-v "turbo4" \
--flash-attn on \
--batch-size 2048 \
--parallel 1 \
--no-mmap \
--mlock \
--ubatch-size 512 \
--threads 6 \
--cont-batching \
--timeout 300 \
--temp 0.2 \
--top-p 0.95 \
--min-p 0.05 \
--top-k 20 \
--metrics \
--chat-template-kwargs '{"preserve_thinking": true}'

To push speeds to ~51 tok/s, adjust three flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35 (tweak slightly based on stability/memory).

Ad

Caveats

Q4 quant is noticeably worse for long-context reasoning vs Q5.
--no-mmap + --mlock reduces stuttering slowdowns.
TurboQuant KV cache is critical at high context sizes.
High RAM bandwidth (DDR5) is important for these speeds.
Linux outperforms Windows significantly for this workload.

Who This Is For

Developers running local LLMs with very long contexts (170k+ tokens) on consumer hardware, especially those with 8-12GB VRAM and fast system RAM.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

How to fix OpenClaw 'Cannot find module' error after update

How to fix OpenClaw 'Cannot find module' error after update

After updating OpenClaw from version 2026.3.24 to 2026.4.5, users are encountering a 'Cannot find module @buape/carbon' error. The solution involves manually running a post-installation script instead of installing the package globally.

Apr 16, 2026, 08:45 AM UTC

How to run OpenClaw agents for free using cloud APIs or local models

How to run OpenClaw agents for free using cloud APIs or local models

A detailed guide explains how to run OpenClaw agents at zero cost using free cloud tiers from OpenRouter, Gemini, and Groq, or by running local models via Ollama with specific configuration tips to avoid common pitfalls.

Apr 14, 2026, 09:53 PM UTC

Understanding AI Agent Architecture: Deterministic vs Probabilistic Layers

Understanding AI Agent Architecture: Deterministic vs Probabilistic Layers

A Reddit user shares a mental model for AI agent systems that separates deterministic layers (scripts, commands, APIs) from probabilistic layers (LLM reasoning and decisions). The key insight: push as much work as possible to the deterministic side.

Mar 9, 2026, 04:45 AM UTC

Making an MCP Server Install Itself: Three Hosts, Three Mechanisms, Gotchas

Making an MCP Server Install Itself: Three Hosts, Three Mechanisms, Gotchas

A deep dive into programmatically installing MCP servers across VS Code, Cursor, and Claude Code — covering APIs, file writes, and edge cases like malformed JSON, atomic writes, and idempotent updates.

Jun 2, 2026, 12:15 AM UTC