Local Claude Code Setup with Qwen3.5 27B via llama.cpp

Local Claude Code Configuration
A developer documented their setup for running Claude Code completely offline against a local LLM served by llama.cpp. The setup uses Qwen3.5 27B in Unsloth's UD-Q4_K_XL quantization on Arch Linux with Strix Halo hardware.
Environment Configuration
To disable telemetry and make Claude Code fully offline, the following environment variables were set in ~/.bashrc:
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
The developer noted that using claude/settings.json is more stable and controllable than environment variables.
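As a rough sketch of the settings-file route, the function below prints a fragment that moves a few of the variables above into an `env` block. It assumes Claude Code's settings file accepts an `env` object with string values. The helper name `print_settings_fragment` is hypothetical, and the post does not show the author's actual file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a settings.json fragment to stdout so it can be
# reviewed and merged into a real settings file by hand.
# Assumption: the settings file supports an "env" object of string values.
print_settings_fragment() {
  cat <<'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_AUTH_TOKEN": "not-set",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "DISABLE_TELEMETRY": "1",
    "DISABLE_AUTOUPDATER": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "4096"
  }
}
EOF
}

print_settings_fragment
```

Printing to stdout rather than writing a file keeps the sketch from overwriting any existing settings.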
llama.cpp Server Configuration
The llama.cpp server was launched with these parameters:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-Q4_K_M.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 --ctx-size 65536 --n-gpu-layers 999 \
  --flash-attn on --jinja --threads 8 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --cache-type-k q8_0 --cache-type-v q8_0
The ROCBLAS_USE_HIPBLASLT=1 environment variable was required on Strix Halo hardware, and the developer emphasized researching your own hardware so the llama.cpp setup can be tuned for it.
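Before pointing Claude Code at the server, it can help to confirm that it is answering. The sketch below assumes `curl` is installed and that llama-server exposes its `/health` endpoint on the port from the launch command. The function name `server_ready` and the `BASE_URL` variable are illustrative and not from the original post.

```shell
#!/usr/bin/env bash
# Assumed base URL: matches --port 8001 in the launch command above.
BASE_URL="${BASE_URL:-http://127.0.0.1:8001}"

# Returns 0 if the server answers its health endpoint with a success
# status within 2 seconds, non-zero otherwise.
server_ready() {
  curl --silent --fail --max-time 2 "$1/health" > /dev/null 2>&1
}

if server_ready "$BASE_URL"; then
  echo "llama-server is ready at $BASE_URL"
else
  echo "llama-server is not reachable at $BASE_URL"
fi
```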
Performance Benchmarks
Seven runs were conducted with the following results:
- Run 1 (File operations): 1m44s, 9.71 t/s, 23K context, correct output
- Run 2 (Git clone + code read): 2m31s, 9.56 t/s, 32.5K context, excellent quality
- Run 3 (7-day plan + guide): 4m57s, 8.37 t/s, 37.9K context, excellent quality
- Run 4 (Skills assessment): 4m36s, 8.46 t/s, 40K context, very good quality (web search broken)
- Run 5 (Write Python script): 10m25s, 7.54 t/s, 60.4K context, good quality (7/10)
- Run 6 (Code review + fix): 9m29s, 7.42 t/s, 65,535 context (CRASH), very good quality (8.5/10)
- Run 7 (/compact command): ~10m, ~8.07 t/s, 66,680 context (failed), quality N/A
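To make the trend in these runs easier to see, the sketch below prints each run's slowdown relative to run 1. The figures are copied from the list above. Run 7 is left out because its numbers are approximate, and the function name `slowdown_table` is hypothetical.

```shell
#!/usr/bin/env bash
# Runs 1-6 from the list above: "run context_tokens tokens_per_second".
runs="1 23000 9.71
2 32500 9.56
3 37900 8.37
4 40000 8.46
5 60400 7.54
6 65535 7.42"

# Print each run's generation speed and its slowdown relative to run 1.
slowdown_table() {
  printf '%s\n' "$runs" | awk '
    NR == 1 { base = $3 }   # run 1 is the baseline
    { printf "run %d  ctx %6d  %.2f t/s  %5.1f%% slower\n",
             $1, $2, $3, (1 - $3 / base) * 100 }'
}

slowdown_table
```

The last row works out to about 23.6% slower than run 1.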
Key Findings
- Generation speed degraded approximately 24% across the context range: from 9.71 t/s at 23K context down to 7.42 t/s at 65K context
- Claude Code system prompt consumes 22,870 tokens (35% of the 65K budget)
- Auto-compaction never triggered: Claude Code assumed a 200K context window, so its 95% threshold sat at 190K tokens, while the actual 65K limit was reached at about 33% of that assumed window
- The /compact command needs output headroom: with a 4096-token output cap, the compaction summary could not fit, and it needed 16K+ tokens of output
- Web search functionality is broken without Anthropic connectivity; potential solutions include SearXNG via MCP
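The context-budget figures in these findings can be reproduced with a few lines of integer arithmetic. This is a sketch using only numbers quoted above. The variable names are illustrative, and percentages are rounded to the nearest whole number.

```shell
#!/usr/bin/env bash
ctx_size=65536          # --ctx-size passed to llama-server
system_prompt=22870     # reported size of the Claude Code system prompt
assumed_window=200000   # window Claude Code assumed
compact_pct=95          # auto-compaction threshold, in percent
output_cap=4096         # CLAUDE_CODE_MAX_OUTPUT_TOKENS

# Token count at which auto-compaction would have triggered.
compact_at=$(( assumed_window * compact_pct / 100 ))

# Share of the real window used by the system prompt (rounded).
prompt_share=$(( (system_prompt * 100 + ctx_size / 2) / ctx_size ))

# Real window as a share of the window Claude Code assumed (rounded).
real_share=$(( (ctx_size * 100 + assumed_window / 2) / assumed_window ))

# Tokens left for conversation after the system prompt and one capped reply.
working=$(( ctx_size - system_prompt - output_cap ))

echo "auto-compact threshold: $compact_at tokens"
echo "system prompt share of real window: ${prompt_share}%"
echo "real window as share of assumed window: ${real_share}%"
echo "tokens left for conversation: $working"
```

Because the 190,000-token threshold is far above the 65,536-token limit, the server runs out of context long before compaction would start.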
📖 Read the full source: r/LocalLLaMA