Ctxpact: Context Compaction Proxy for Local LLMs

✍️ OpenClawRadar📅 Published: April 13, 2026🔗 Source
Ctxpact: Context Compaction Proxy for Local LLMs
Ad

Ctxpact is a lightweight OpenAI-compatible proxy that sits between AI agents and local LLMs to intelligently compress oversized inputs before they hit models with limited context windows. It's designed for agentic workflows like OpenClaw and Hermes that send 100k+ token payloads to models with only 16k context windows, where truncation would lose critical information.

How It Works

The system uses a 3-stage compaction pipeline:

  • DCP (Dynamic Context Pruning): Dedups tool calls, strips superseded file writes, truncates error stack traces. Zero LLM calls, purely structural.
  • Summarize: Evicts old conversation turns, replaces with LLM-generated summaries. Keeps a sliding window of recent turns intact.
  • Extract: When input is still too large (like a 110k novel), uses one of 16 extraction strategies to pull the most relevant content within token budget.

Extraction Strategies

The extraction stage implements 16 strategies ranging from:

  • 0 LLM calls: Embedding similarity (ChromaDB), section headers, heuristic keyword grep, LLMLingua compression
  • 1 LLM call: LLM generates search terms, IDF-weighted word-level matching assembles context
  • 2 LLM calls (best accuracy): readagent — embed + BM25 + RRF fusion, dual LLM term expansion, position-aware excerpting
  • N LLM calls: Multi-turn tool-calling loops, DSPy code generation, map-reduce chunking

Benchmark Results

Tested 12 strategies across 2 models (LFM2-8B-A1B and Qwen3.5-9B) on 331 GGUF models total:

  • Frankenstein test: 110k tokens compressed to 12k tokens, 8 reading comprehension questions; 8/8 correct, deterministic across 3 consecutive runs, 0% variance
  • LoCoMo-MC10: Multi-session conversation QA, 10-choice, random baseline is 10%; readagent + Qwen3.5-9B scores 15/20 (75%)
  • Combined performance: readagent + Qwen3.5-9B achieves 87.5%, rlm + Qwen3.5-9B achieves 80.0%
Ad

Key Findings

  • Model choice matters more than strategy choice: Switching from LFM2 to Qwen3.5 improved every single strategy by +25-50 percentage points. Median strategy went from 5/8 to 7/8 just by changing model.
  • NR-MMLU predicts context engineering performance: LFM2's 47% NR-MMLU vs Qwen3.5's 65% maps directly to accuracy differences.
  • 2 LLM extraction calls is the sweet spot: Going from 0 to 1 call gives meaningful boost; 1 to 2 calls reaches peak accuracy. Beyond 2 calls, accuracy drops.
  • readagent and rlm are breakthrough strategies: Both achieve 8/8 on Frankenstein. Only strategies that solve Q4 (Ireland question). readagent leads cross-domain at 75% LoCoMo vs rlm's 60%.

Technical Details

  • Architecture: Standalone proxy (considered LiteLLM plugin and sidecar process) because breakthrough strategies need mid-pipeline LLM calls
  • Implementation: ~11k lines of Python, FastAPI server, 3 endpoints, OpenAI-compatible, no heavy frameworks
  • Compatibility: Drops in front of any llama-server / Ollama / vLLM backend. No API keys, no cloud, everything runs on your hardware

For developers running local LLMs with agentic workflows that exceed context windows, Ctxpact provides a practical solution to maintain information integrity while staying within hardware constraints.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

Claude Code gains TLA+ model checking via tla-mcp MCP server
Tools

Claude Code gains TLA+ model checking via tla-mcp MCP server

tla-mcp is a new MCP server that lets Claude Code call the TLA+ model checker tla-rs as a first-class tool — validate specs, run bounded checks with counterexample traces, and replay scenarios from the chat.

OpenClawRadar
The Human Creativity Benchmark: Separating Convergence from Divergence in AI Creative Evaluation
Tools

The Human Creativity Benchmark: Separating Convergence from Divergence in AI Creative Evaluation

Contra Labs introduces the Human Creativity Benchmark (HCB), a framework that distinguishes objectively verifiable criteria (e.g., prompt adherence) from subjective taste (e.g., visual appeal) in evaluating generative AI for creative work. The benchmark reveals that no current model is reliably both correct and steerable, addressing mode collapse and the need for differentiated output.

OpenClawRadar
GitVelocity: AI Scoring of 50k PRs Reveals Insights on Code Complexity
Tools

GitVelocity: AI Scoring of 50k PRs Reveals Insights on Code Complexity

GitVelocity uses Claude to score merged pull requests 0-100 across six dimensions: scope, architecture, implementation, risk, quality, and performance/security. After analyzing 50,000+ PRs across TypeScript, Python, Rust, Go, Java, and Elixir, the team found surprising patterns about PR size, test coverage, and AI adoption.

OpenClawRadar
OpenClaw Client Adds Live API Cost Tracking, Spending Caps, and Granular Agent Controls
Tools

OpenClaw Client Adds Live API Cost Tracking, Spending Caps, and Granular Agent Controls

OpenClaw Client now features live usage UI with circular progress bars, per-agent spending caps, sub-agent management, skill toggling, and model switching from different providers.

OpenClawRadar