MTP + Unified Memory Boosts llama.cpp Inference 30% on RTX 5090
With GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 set, enabling Multi-Token Prediction (MTP) speculation in llama.cpp raised throughput by about 30%: 64 tok/sec versus 49 tok/sec on a Qwen3.6-27B Q8_0 model. Both figures were measured with unified memory enabled, so the gain comes from MTP. The benchmark ran on an RTX 5090 paired with 128GB of DDR5-5600 CL36 and a Ryzen 9 9950X3D.
Command & Configuration
CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
-m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
--threads 16 \
-c 262144 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 3 \
--webui-mcp-proxy \
--chat-template-kwargs '{"preserve_thinking": true}' \
--host 0.0.0.0 \
--port 8090 \
--jinja
Key flags:
- GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: makes llama.cpp allocate its GPU buffers as CUDA unified (managed) memory on Linux, so allocations that exceed VRAM can spill into system RAM instead of failing. This is what makes very large contexts workable.
- --spec-type mtp --spec-draft-n-max 3: enables Multi-Token Prediction speculation with up to 3 draft tokens per step.
- Qwen3.6-27B-Q8_0.gguf: a 27B-parameter Qwen3.6 model quantized to Q8_0, prepared with Unsloth's MTP support.
- -c 262144: a 256K-token context window.
- -fa on: enables flash attention.
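A rough size estimate shows why spilling into system RAM can matter for this setup. The sketch below assumes exactly 27e9 parameters (the nominal "27B"), the Q8_0 layout of 34 bytes per block of 32 weights (8.5 bits per weight), and 32 GiB of VRAM on the RTX 5090. It ignores the KV cache, compute buffers, and any tensors stored at other precisions, so treat the result as an approximation, not a measurement.

```shell
# Assumed inputs (not measured): nominal parameter count and card VRAM.
params=27000000000      # "27B" taken literally
bits_per_weight=8.5     # Q8_0: 32 int8 weights + one fp16 scale = 34 bytes per block
vram_gib=32             # RTX 5090

# Approximate weight size in GiB; awk is used because POSIX sh has no floating point.
weights_gib=$(awk -v p="$params" -v b="$bits_per_weight" \
    'BEGIN { printf "%.1f", p * b / 8 / 1073741824 }')

# VRAM left over for the KV cache and compute buffers under these assumptions.
headroom_gib=$(awk -v v="$vram_gib" -v w="$weights_gib" \
    'BEGIN { printf "%.1f", v - w }')

echo "weights: ${weights_gib} GiB, headroom: ${headroom_gib} GiB"
```

Under these assumptions only a few GiB remain for everything else, which a 256K-token context may well exceed.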
Results
- Without MTP (only unified memory): 49 tok/sec
- With MTP + unified memory: 64 tok/sec
- Gain: 30% higher throughput
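The gain follows directly from the two reported figures. A minimal sketch of the arithmetic, using only the numbers above (the variable names are illustrative):

```shell
# Throughput figures reported in the post, in tokens per second.
baseline_tps=49   # unified memory only
mtp_tps=64        # unified memory + MTP

# Relative gain in percent: (new / old - 1) * 100.
gain_pct=$(awk -v b="$baseline_tps" -v m="$mtp_tps" \
    'BEGIN { printf "%.1f", (m / b - 1) * 100 }')

echo "MTP gain: ${gain_pct}%"
```

This prints a gain of 30.6%, which the post rounds to 30%.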
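A --spec-draft-n-max of 3 means the model drafts up to 3 tokens ahead and then verifies them together, so every accepted draft token saves a sequential decoding step. Unified memory plays a separate role: it lets allocations that do not fit in VRAM spill into system RAM rather than fail. Memory that lands in system RAM is still reached over PCIe, so unified memory does not remove that cost.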
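How much a draft depth of 3 can help depends on how often drafted tokens are accepted. The sketch below uses a simplified model that is an assumption, not something from the post: each draft token is accepted independently with probability p, and drafting stops at the first rejection. Under that model, the expected number of tokens produced per verification pass is 1 + p + p^2 + ... + p^n. The function name and the acceptance rates are hypothetical.

```shell
# Expected tokens emitted per verification pass, assuming each of the n draft
# tokens is accepted independently with probability p (illustrative model only).
# Usage: expected_tokens_per_pass <p> <n>
expected_tokens_per_pass() {
    awk -v p="$1" -v n="$2" 'BEGIN {
        total = 0
        # k = 0 is the token the verification pass always yields;
        # k >= 1 adds the chance that the first k drafts were all accepted.
        for (k = 0; k <= n; k++) total += p ^ k
        printf "%.3f\n", total
    }'
}

expected_tokens_per_pass 0.5 3   # prints 1.875
expected_tokens_per_pass 0.8 3   # prints 2.952
```

These values are an upper bound on tokens per pass, not a throughput prediction: drafting and verification take time themselves, so a measured speedup such as the 64 vs 49 tok/sec above will be smaller.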
Who This Is For
Developers running large-context local inference on high-end consumer GPUs (RTX 5090) with ample system RAM (≥128GB). Suitable for chatbots, code assistants, or any latency-sensitive LLM workload where speculative sampling is supported.
📖 Read the full source: r/LocalLLaMA