Running Qwen3.6 27B and 35B on 6GB VRAM with ik_llama: Practical Configs and Benchmarks

A Reddit user reports successfully running Qwen3.6 27B and 35B A3B models on an old gaming laptop with an RTX 2060 Mobile (6 GB VRAM) and 32 GB RAM using ik_llama and llama.cpp. Key optimizations include double speculative decoding with MTP and ngram, --fit and --mtp-requantize-output-tensor, plus output tensor repacking. Below are the exact configs and observed speeds.
Config for Qwen3.6 27B (Q3_K_XL)
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-27B-MTP-UD-Q3_K_XL.gguf \
-c 16000 \
-b 512 -ub 512 \
--fit --fit-margin 3076 \
-fa on \
-np 1 \
-ctk q4_0 -ctv q4_0 \
--mtp-requantize-output-tensor q4_0 \
-khad -vhad -rtr \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--port 8888 --host 0.0.0.0 \
--spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
--spec-stage mtp:n_max=1,draft-p-min=0.0 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
Config for Qwen3.6 35B A3B (IQ4_XS, Claude Opus Distill)
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/lordx64-Claude-4.7-Opus-Reasoning-Distilled-Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
-c 80000 \
-b 1024 -ub 1024 \
--fit --fit-margin 2048 \
-fa on \
-np 1 \
-ctk q8_0 -ctv q4_0 \
--mtp-requantize-output-tensor q4_0 \
-khad -vhad -rtr \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--mlock --no-mmap \
--port 8888 --host 0.0.0.0 \
--spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
--spec-stage mtp:n_max=3,draft-p-min=0.0 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
Performance Numbers
- 27B: prefill ~100 t/s, first token up to 4 t/s, ~1 t/s at 10k context
- 35B A3B: prefill ~40 t/s, first token up to 15 t/s, constant ~11 t/s at 10k context
The user notes that 27B became usable for reasoning about files up to 1000 lines (taking minutes but useful), and the 35B Opus distill runs at a steady 11 t/s output. They use it to generate mermaid charts, images, markdown, and PDFs with little-coder or agentic coding workflows.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Guide to Setting Up OpenClaw on a Hostinger VPS
A step-by-step guide for deploying OpenClaw on a Hostinger VPS, connecting AI APIs from OpenAI and Entropics, and integrating with Telegram for 24/7 operation.

Workaround for OpenClaw Claude Access via Claude Code CLI
A method routes OpenClaw through Claude Code CLI to maintain Claude subscription access after Anthropic blocked direct third-party harnesses. The process involves installing the CLI, setting up an OAuth token, and configuring OpenClaw to use the ACP plugin.

Claude Certified Agent Foundations Exam Guide Discrepancies Identified
A recent CCA-F exam taker reports significant discrepancies between the official exam guide, practice exam, and actual test content. The real exam may include up to 13 scenarios while the guide only lists 6, and the practice exam covers just 4 of them.

Two $0 OpenClaw setups using free cloud models or local Ollama
A Reddit post details two approaches to run OpenClaw agents at zero cost: using free tiers from OpenRouter, Gemini, and Groq with rate limits, or running local models via Ollama with no API keys or data leaving your machine.