llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping

A developer on r/LocalLLaMA is hitting a serious performance issue with llama.cpp when running long-context coding agents (opencode + pi.dev) via llama-swap. Even with highly similar prompts (LCP similarity often >0.99), the system periodically discards the KV cache and reprocesses 40k+ tokens, causing TTFT of multiple minutes.
Observed Behavior
- Context grows to 50k+ tokens.
- After several normal reuses (e.g.,
prompt eval time = 473 ms / 19 tokens),n_pastsuddenly drops to ~4-5k. - llama.cpp then reprocesses the full prompt:
n_tokens = 4750 prompt eval time = 222411 ms / 44016 tokens. - Cache usage hits 4676 MiB, exceeding the configured limit (2500 MiB).
Current Configuration
llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shiftSuspected Causes
- Cache invalidation due to overflow of
--cache-ramlimit – the log shows 4676 MiB used vs 2500 MiB limit. - Bad KV reuse mechanism when early prompt tokens change (possibly frequent alterations by opencode).
- Insufficient
--ctx-checkpointsor--cache-reusefor the 150k context size.
Recommendations from the Community
The thread is thin on answers so far, but obvious first steps include increasing --cache-ram to match typical usage (e.g., 5000+ MiB), or reducing --ctx-size to stay under the cache limit. Also check if opencode is intentionally mutating prompt prefixes; if so, locking the system prompt or using a fixed prefix could improve reuse.
For developers running similar setups, share your working configs in the source thread.
📖 Read the full source: r/LocalLLaMA
👀 See Also

The Prompt Structure That Fixed Claude AI Summaries of Large PDF Reports
A developer shares how switching from 'summarize this' to role + decision + specific extraction prompts turned Claude's generic summary output into actionable risk flags and concrete action items.

TLS Interception by Antivirus Breaks Claude Desktop’s Connection; Workaround with AV Exclusions
Antivirus TLS inspection on bridge.claudeusercontent.com causes Cowork (Claude desktop companion) to fail with 'Claude in Chrome is not connected'. Fix: add *.claudeusercontent.com and *.anthropic.com to AV HTTPS exclusions. Node.js --use-system-ca would prevent this.

Routing cuts OpenClaw Max usage cost by 85%: $200/mo to $30/mo with API routing
A user tracked token usage and found only 15% of tasks need Opus. By routing routine work to Sonnet via API, monthly cost dropped from $200 to $30 with identical output quality.

Reducing Claude Hallucinations with Pre-Output Prompt Injection
A Reddit post details a method to cut Claude AI hallucinations by half using a pre-output prompt that forces the model to record uncertainties and next steps before responding. The approach involves adding specific markdown instructions to Claude's system prompt and creating a Python script.