AVP Protocol Enables LLM Agents to Share KV-Cache Instead of Text for Token Efficiency

What AVP Does
AVP (Agent Vector Protocol) lets LLM agents in multi-agent setups pass KV-cache directly to one another instead of text. This eliminates the redundant tokenization and forward passes that occur when each agent re-processes the entire conversation history.
How It Works
Instead of the traditional text-based approach, where each agent re-tokenizes everything, AVP lets Agent A serialize its key-value attention states after reasoning, and Agent B injects them directly. How the transfer happens depends on the models involved:
- Same model on both sides: Direct KV-cache transfer with zero overhead
- Same family, different size (e.g., Qwen2.5-7B talking to 1.5B): Vocabulary-mediated projection with no learned parameters or calibration data needed
- Different families: Falls back to JSON
- Transport-agnostic: Works alongside A2A, MCP, gRPC, or whatever you're already using
- Binary wire format: Not JSON+Base64 (which has 33% overhead on tensor data)
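The mode selection and the Base64 figure above can be sketched in a few lines. This is only an illustration: `AgentModel` and `choose_mode` are hypothetical names invented here, not part of the avp package, and the real handshake may compare models differently.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentModel:
    """Hypothetical descriptor of the model behind an agent (not the avp API)."""
    family: str  # e.g. "qwen2.5"
    name: str    # e.g. "Qwen2.5-7B-Instruct"


def choose_mode(sender: AgentModel, receiver: AgentModel) -> str:
    """Pick a transfer mode following the three cases listed above."""
    if sender == receiver:
        return "kv_cache"      # same model: direct KV-cache transfer
    if sender.family == receiver.family:
        return "projection"    # same family, different size
    return "json"              # different families: text fallback


def base64_overhead(num_bytes: int) -> float:
    """Relative size increase when raw bytes are Base64-encoded."""
    raw = bytes(num_bytes)     # zero-filled stand-in for tensor data
    return len(base64.b64encode(raw)) / len(raw) - 1.0


qwen_7b = AgentModel("qwen2.5", "Qwen2.5-7B-Instruct")
qwen_1_5b = AgentModel("qwen2.5", "Qwen2.5-1.5B-Instruct")
llama_3b = AgentModel("llama3.2", "Llama-3.2-3B-Instruct")

print(choose_mode(qwen_7b, qwen_7b))     # kv_cache
print(choose_mode(qwen_7b, qwen_1_5b))   # projection
print(choose_mode(qwen_7b, llama_3b))    # json
print(f"{base64_overhead(3000):.1%}")    # 33.3%
```

Base64 emits 4 output bytes for every 3 input bytes, which is where the 33% overhead on tensor data comes from.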
Performance Results
Testing across Qwen2.5, Llama 3.2, and DeepSeek-R1-Distill models showed:
- Token savings of 73-78%
- 2-4x speedups
- These results held consistent across all three model families
- The gap widens with chain length: at 4 agents it's roughly 2x, at 16 agents (projected) it would be around 6x
The efficiency comes from text prompt sizes ballooning at each hop (186 → 545 → 1,073 → 1,397 tokens in a 4-agent GSM8K chain), while the latent (KV-cache) path stays flat at ~164-207 tokens per hop because prior context arrives as pre-computed KV-cache.
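As a rough sanity check, the per-hop figures above can be summed over the 4-agent chain. The sketch counts prompt tokens only and assumes every latent hop sits at one end of the quoted 164-207 range, so it approximates the reported savings rather than reproducing the benchmark.

```python
# Prompt sizes per hop in the 4-agent GSM8K chain, as quoted above.
text_prompt_tokens = [186, 545, 1073, 1397]

# Per-hop range for the latent path, as quoted above.
latent_low, latent_high = 164, 207

hops = len(text_prompt_tokens)
text_total = sum(text_prompt_tokens)

# Assumption: each hop is at the low or the high end of the range.
latent_total_low = latent_low * hops
latent_total_high = latent_high * hops

saving_best = 1 - latent_total_low / text_total
saving_worst = 1 - latent_total_high / text_total

print(f"text prompt tokens over the chain: {text_total}")
print(f"latent tokens over the chain: {latent_total_low}-{latent_total_high}")
print(f"prompt-token saving: {saving_worst:.1%} to {saving_best:.1%}")
```

Under these assumptions the text chain processes 3,201 prompt tokens against 656-828 for the latent chain, a saving of roughly 74% to 80%, which is in line with the 73-78% reported.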
Limitations
- Sample sizes are n=20 per model (enough for token/speed claims but not for accuracy claims)
- Tested on small models only (1.5B-3B on an RTX 3070 Ti) with 7B+ results pending
- Requires 1 Gbps+ bandwidth minimum (KV-cache for a 3B model runs about 130 MB per sample)
- Self-hosted only (requires KV-cache access, won't work with OpenAI/Anthropic/etc. APIs)
- Benchmarks cover same-model transfer only for now (the cross-model implementation exists but has not been benchmarked)
- Latent uses 17-54x more VRAM than text because you're holding KV-cache across hops
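The bandwidth and memory limits follow from simple arithmetic. The sketch below uses the standard KV-cache size formula (keys and values, per layer, per KV head, per token) and a plain bits-over-link-rate transfer time. The model dimensions are illustrative values for a roughly 3B grouped-query-attention model in fp16 and are not taken from the source, so check your own model's config. MB is treated as 10^6 bytes.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a KV-cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(payload_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return payload_bytes * 8 / link_bits_per_sec


# Illustrative dimensions (assumption, not from the source).
size = kv_cache_bytes(layers=28, kv_heads=8, head_dim=128, seq_len=1000)
print(f"KV-cache for 1,000 tokens: {size / 1e6:.1f} MB")

# The ~130 MB per-sample figure quoted above, over a 1 Gbps link.
print(f"130 MB at 1 Gbps: {transfer_seconds(130e6, 1e9):.2f} s")
```

At about one second per 130 MB transfer on a 1 Gbps link, slower networks would quickly eat into the 2-4x speedup, which is why the bandwidth floor matters.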
Getting Started
Install with: pip install avp
Two API levels available:
    import avp
    msg = avp.pack("Hello", model="Qwen/Qwen2.5-7B-Instruct", think_steps=20)
    answer = avp.unpack(msg, model="Qwen/Qwen2.5-7B-Instruct")
Or with more control:
    from avp import HuggingFaceConnector
    connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
    context = connector.think("Analyze this problem", steps=20)
    answer = connector.generate("Solve it.", context=context)
vLLM connector also available: pip install "avp[vllm]"
Project Links
- SDK: github.com/VectorArc/avp-python (MIT, 377 tests, 7 benchmarks)
- Spec: github.com/VectorArc/avp-spec
- Benchmark details: BENCHMARKS.md
📖 Read the full source: r/LocalLLaMA