llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping

OpenClawRadar | Published: May 14, 2026

A developer on r/LocalLLaMA reports a serious performance problem with llama.cpp when running long-context coding agents (opencode + pi.dev) via llama-swap. Even when consecutive prompts are nearly identical (longest-common-prefix, or LCP, similarity often above 0.99), the server periodically discards the KV cache and reprocesses 40k+ tokens, pushing time to first token (TTFT) to multiple minutes.

Observed Behavior

Context grows to 50k+ tokens.
After several normal cache reuses (e.g., prompt eval time = 473 ms / 19 tokens), n_past suddenly drops to roughly 4-5k.
llama.cpp then reprocesses nearly the whole prompt: the log shows n_tokens = 4750, followed by prompt eval time = 222411 ms / 44016 tokens (about 3.7 minutes).
Reported cache usage reaches 4676 MiB, exceeding the configured limit of 2500 MiB.
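One way to spot these events in a long server log is to scan for the timing lines quoted above. The following is a minimal sketch, assuming the log contains lines in the "prompt eval time = X ms / N tokens" form shown in the report; the 10,000-token cutoff and the function name are hypothetical choices for illustration, not anything llama.cpp defines.

```python
import re

# Matches timing lines of the form quoted in the report, e.g.
#   prompt eval time = 222411 ms / 44016 tokens
# Extra whitespace and decimal milliseconds are tolerated.
PROMPT_EVAL = re.compile(
    r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens"
)

# Hypothetical cutoff: anything at or above this many prompt tokens in a
# single evaluation is treated as a likely full reprocess.
REPROCESS_THRESHOLD = 10_000


def scan_prompt_evals(log_text, threshold=REPROCESS_THRESHOLD):
    """Return one record per prompt-eval line found in log_text."""
    events = []
    for match in PROMPT_EVAL.finditer(log_text):
        elapsed_ms = float(match.group(1))
        n_tokens = int(match.group(2))
        tokens_per_s = n_tokens / (elapsed_ms / 1000.0) if elapsed_ms > 0 else 0.0
        events.append({
            "elapsed_ms": elapsed_ms,
            "tokens": n_tokens,
            "tokens_per_s": tokens_per_s,
            "reprocess": n_tokens >= threshold,
        })
    return events


# The two timing lines quoted in the report.
sample_log = """
prompt eval time = 473 ms / 19 tokens
prompt eval time = 222411 ms / 44016 tokens
"""

events = scan_prompt_evals(sample_log)
for event in events:
    label = "REPROCESS" if event["reprocess"] else "reuse"
    print(f'{label}: {event["tokens"]} tokens in {event["elapsed_ms"] / 1000:.1f} s')
```

Run against a full log, the flagged entries give timestamps to correlate with whatever the agent sent just before each drop in n_past.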

Current Configuration

llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shift

Suspected Causes

Cache invalidation caused by exceeding the --cache-ram limit: the log shows 4676 MiB in use against a 2500 MiB limit.
KV reuse breaking down when early prompt tokens change (opencode may be altering the start of the prompt frequently).
--ctx-checkpoints or --cache-reuse values that are too low for a 150k context.
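To test the second suspicion, it helps to measure where two consecutive prompts first diverge. The sketch below is a simplified model, not llama.cpp's implementation: it assumes similarity is the shared prefix length divided by the new prompt's length, and it uses synthetic integer token IDs whose sizes only loosely mirror the report. All function names are hypothetical.

```python
def common_prefix_len(cached, incoming):
    """Count how many leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def prefix_report(cached, incoming):
    """Summarize reuse potential under a prefix-only caching model."""
    lcp = common_prefix_len(cached, incoming)
    return {
        "lcp": lcp,
        # Assumed definition: shared prefix as a fraction of the new prompt.
        "similarity": lcp / len(incoming) if incoming else 0.0,
        # Tokens after the first mismatch would need to be evaluated again.
        "to_reprocess": len(incoming) - lcp,
    }


# Synthetic conversation of 50,000 tokens.
cached = list(range(50_000))

# Case 1: the agent only appends 19 new tokens to the end.
appended = cached + [-1] * 19
stable = prefix_report(cached, appended)

# Case 2: a single token near the start (index 4750) changes.
edited = cached[:4750] + [-1] + cached[4751:]
mutated = prefix_report(cached, edited)

print(stable)
print(mutated)
```

Under this definition an early edit shows up as low similarity, so comparing the logged similarity with the position where n_past lands may help separate this cause from the other two.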

Recommendations from the Community

The thread has few answers so far. Obvious first steps are to raise --cache-ram to cover observed usage (e.g., 5000+ MiB) or to reduce --ctx-size so the cache stays under the limit. It is also worth checking whether opencode mutates the prompt prefix between requests; if it does, pinning the system prompt or using a fixed prefix could improve reuse.
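The first suggestion can be sketched as a small helper that derives a new --cache-ram value from the observed usage and rewrites the reported command line. The 10% headroom factor and the helper names are assumptions for illustration; the flags and values are copied from the configuration in the report, and whether a larger limit actually prevents the reprocessing is untested here.

```python
import math


def suggest_cache_ram(observed_mib, headroom=1.10):
    """Round observed usage up with some headroom (the 10% is an arbitrary choice)."""
    return math.ceil(observed_mib * headroom)


def with_cache_ram(args, new_limit_mib):
    """Return a copy of args with the value after --cache-ram replaced."""
    updated = list(args)
    idx = updated.index("--cache-ram")
    updated[idx + 1] = str(new_limit_mib)
    return updated


# Command line as given in the report.
reported_args = [
    "llama-server",
    "--ctx-size", "150000",
    "--parallel", "1",
    "--ctx-checkpoints", "32",
    "--cache-ram", "2500",
    "--cache-reuse", "256",
    "-no-kvu",
    "--no-context-shift",
]

new_limit = suggest_cache_ram(4676)  # 4676 MiB is the usage seen in the log
new_args = with_cache_ram(reported_args, new_limit)
print(" ".join(new_args))
```

Changing one flag at a time and re-checking the log keeps it clear which setting, if any, made the difference.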

If you run a similar setup, consider sharing your working config in the source thread.

📖 Read the full source: r/LocalLLaMA
