Running OmniCoder-9B locally with llama.cpp configuration details

Hardware and Model Setup
The setup uses mid-range hardware: AMD Ryzen 9 5900X CPU (12 threads used for inference), 62GB DDR4 RAM, NVIDIA RTX 3080 with 10GB VRAM, NVMe SSD, and Ubuntu 22.04 on a remote server.
The model is OmniCoder-9B, based on Qwen3.5-9B, fine-tuned on 425k+ coding agent trajectories by Tesslate. It uses Q6_K quantization (6.85GB file size) with 128K token context window, sourced from HuggingFace.
llama.cpp Configuration
The model runs via llama.cpp server with these specific flags:
llama-server \ --model /home/openclaw/models/omnicoder-9b/omnicoder-9b-q6_k.gguf \ --host 0.0.0.0 --port 8080 \ --ctx-size 131072 \ --n-gpu-layers 99 \ --cache-type-k q8_0 \ --cache-type-v q4_0 \ --threads 12 \ --batch-size 128 \ --flash-attn on \ --temp 0.4 \ --top-k 20 \ --top-p 0.95 \ --jinja \ --reasoning-budget 0
Key parameters explained:
--ctx-size 131072: 128K context window for large codebases--n-gpu-layers 99: Offload all layers to GPU--cache-type-k q8_0 --cache-type-v q4_0: Compressed KV cache to fit 128K context in 10GB VRAM--threads 12: Match physical cores (not hyperthreads)--flash-attn on: Faster attention computation--reasoning-budget 0: Disables chain-of-thought output in the reasoning_content field, making the model output code directly
Performance and Testing
Performance metrics: prompt evaluation at ~300 tokens/s, generation at ~80-90 tokens/s, VRAM usage ~8.5GB/10GB, latency 1-5 seconds for typical coding tasks.
The testing was conducted by Agent Zero, an autonomous agent framework using GLM-5 as its main brain. Agent Zero discovered the --reasoning-budget 0 flag, SSH'd into the remote server, updated the systemd service, created benchmark scripts from scratch, ran multiple benchmarks (HumanEval base, HumanEval Pro, MBPP, MultiPL-E), and iterated on prompt engineering.
Benchmark Results
Benchmark results compared to official claims:
- HumanEval base: Official 92.7%, Run 1: 100%, Run 2: 95%, Run 3: 95%, Average: 96.7%
- HumanEval Pro: Official 70.1%, Run 1: 70%, Average: 70%
The average HumanEval base score of 96.7% exceeds the official 92.7%, while HumanEval Pro matches exactly at 70%.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Setting Up MCP Servers in llama-server Web UI: A Practical Guide
A Reddit user shares specific steps to configure MCP servers in llama-server's web UI, including installing uv, creating a config.json file with server definitions, running mcp-proxy, and modifying URLs for proper integration.

How OpenCLAW Memory Actually Works: Fixing Agent 'Forgetting'
OpenCLAW agents don't have persistent memory between conversations - they reconstruct context from files like SOUL.md, USER.md, and MEMORY.md each time. Common 'forgetting' issues stem from old sessions, unstructured memory files, and storing important info in chat history instead of permanent files.

Guide to Setting Up OpenClaw on a Hostinger VPS
A step-by-step guide for deploying OpenClaw on a Hostinger VPS, connecting AI APIs from OpenAI and Entropics, and integrating with Telegram for 24/7 operation.

OpenClaw Memory Plugin Testing Results and Recommended Stack
A Reddit user tested every OpenClaw memory plugin and found the default markdown setup causes token bloat and instruction compression. The recommended setup combines Obsidian for human-readable notes, QMD for token-free searching, and SQLite for structured data.