CAL: Open-Source Library Cuts Claude Token Use 83%

What CAL Does

CAL is a Python library that sits between your existing code and LLM API calls, intelligently selecting, compressing, and assembling context for each request. It addresses the cost and context problems in token-heavy agent setups, particularly relevant with recent Claude Pro/Max subscription changes.

Performance Benchmarks

In production with Claude Opus 4 and 103 context chunks:

Without CAL: Every request sends all 103 chunks (~23,000 tokens) at $0.043 per request
With CAL: Drops to ~6 chunks and 4,100 tokens at $0.008 per request
Results: 83% reduction in tokens, 81% reduction in cost

Validated against 5,000 WildChat prompts (an open academic dataset of real LLM conversations across 57 languages) with 97.6% average savings.

Key Features

Selector: IDF-weighted scoring picks only relevant chunks per query. Uses stable prefix + dynamic chunks selected per request.
Tool Stubs: Three-tier lazy tool loading with lightweight stubs until the model signals intent to use a specific tool.
Cost Engine: Provider-aware savings calculator that knows Anthropic's 4 input tiers and Google's cache storage pricing.
Noise Suppression: IDF floor + require-any gates to stop common words from loading irrelevant chunks on every request.
Cache-Stable Ordering: Uses scores only for selection, then alphabetical order for position to maintain cache hits.

Technical Details

Multi-turn context handling: Tool stubs are history-aware. If the model used a tool in a previous turn, the full schema stays loaded to maintain conversation continuity.

Provider support: CAL is provider-agnostic and works with any provider having a chat completions endpoint. The cost engine already handles Anthropic's 4 input tiers and Google's cache storage pricing.

Edge cases: Uses IDF floors and noise suppression for ambiguous queries. Hybrid keyword+semantic scoring is on the roadmap.