Merlin: Local-first LLM context dedup – measure up to 71% chunk overlap, free & open-core

The author has released Merlin, a local-first deduplication tool for LLM context windows. Benchmarks across 22 million passages from real agent sessions and RAG pipelines show 22% duplicate content in typical agent context and up to 71% on RAG-heavy queries. For local models with 8K/16K/32K context, stripping that redundancy means more useful tokens fit before truncation.
Three integration modes
1. HTTP proxy mode
Best for Ollama, vLLM, SGLang, OpenWebUI, llama.cpp server, or anything with an OpenAI-compatible endpoint. Run the proxy locally and point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before reaching the model.
Default is cache-aware: leaves the conversation prefix untouched (so vLLM/SGLang prefix-caching still hits) and only dedups the most recent user message. There's an opt-in aggressive mode if your cache hit rate is already low.
2. MCP server
For Claude Desktop, Claude Code, OpenClaw, Cursor. Exposes tools:
merlin_dedupe– dedup textmerlin_dedupe_file– dedup file contentsmerlin_savings_summary– show statsmerlin_status– check service
These tools are not auto-invoked; you must instruct the model to call them on chunky pastes.
3. Standalone CLI
For shell pipelines and preprocessing. Single-threaded, ~250 KB binary, no runtime dependencies, no network calls. Takes a positional input file and writes deduped lines via --output-dedup=path.txt.
Installation (one command per setup)
curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable
Where <integration> is claude_desktop, claude_code, openclaw, cursor, or proxy.
Measurements & tradeoffs
- Papers: arXiv:2605.09611 (architecture), arXiv:2605.09990 (22M-passage measurement), Zenodo: 10.5281/zenodo.20090991
- Community tier caps: 50 MB per run, 200 MB per day, 2 GB per month. Refuses oversized work cleanly (verified on 51 MB file). Hobby use is fine.
- Open-core: Public repo is the community edition; a separate closed-source Pro engine exists for high-throughput servers.
- Doesn't fix session fragmentation where the whole conversation is replayed every turn — that's an orchestration problem above this tool's scope.
- Binary availability: Windows x64 in v0.2.1. Linux + macOS CI pipeline pending.
Who it's for
Local LLM users running agents or RAG with Ollama, vLLM, SGLang, llama.cpp, or any OpenAI-compatible backend who want to pack more real tokens into limited context windows.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Qwen2-0.5B Fine-Tuned for Local Task Automation with llama.cpp
A developer fine-tuned Qwen2-0.5B for task automation using LoRA on ~1000 custom examples, creating a 300MB GGUF model that runs locally on CPU via llama.cpp. The model takes natural language tasks, detects task types, and generates execution plans with CLI commands and hotkeys.

Hardware widget and Chrome extension monitor Claude API rate limits
A developer built a hardware widget using ESP8266 and OLED display that tracks Claude's rate limits in real time, plus a Chrome extension that intercepts Claude's internal /usage API and shows usage patterns. The total BOM cost is approximately $6.50.

Sense: Go SDK for LLM-powered test assertions and structured text extraction
Sense is a Go SDK that uses Claude for two main functions: evaluating non-deterministic output in tests with plain English assertions, and extracting typed structs from unstructured text through reflection and forced tool_use.

Mike: Open-Source Legal AI with Self-Hosting, Multi-Model Support
Mike is an open-source alternative to Harvey and Legora, offering document chat, tabular extraction, and workflow templates — all self-hostable with your own Claude or Gemini API keys.