Monarch v3: NES-Inspired KV Paging for 78% Faster LLM Inference

By OpenClawRadar · Published: April 13, 2026

What Monarch v3 Does

Monarch v3 is an open-source implementation of NES-inspired memory paging for transformer inference. It addresses the linear growth of the KV cache with sequence length: by 4K tokens, most of the KV cache sits unused while still consuming VRAM at full precision.

How It Works

The system splits KV cache into two regions:

  • Hot region: Recent tokens kept at full precision
  • Cold region: Older tokens compressed to ~20 bytes each (vs 64-128 bytes hot)
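To make the hot/cold split concrete, here is a rough back-of-the-envelope sketch using the per-token byte figures quoted above. The function name and defaults are hypothetical, and it assumes the byte counts apply uniformly per token (the article does not say whether they are per head or per layer), taking the upper 128-byte figure for hot tokens.

```python
def kv_footprint_bytes(hot_tokens: int, cold_tokens: int,
                       hot_bytes_per_token: int = 128,
                       cold_bytes_per_token: int = 20) -> dict:
    """Estimate cache size for a hot/cold split versus keeping every token hot.

    The byte defaults come from the figures quoted in the article; treating
    them as uniform per-token costs is an assumption.
    """
    paged = hot_tokens * hot_bytes_per_token + cold_tokens * cold_bytes_per_token
    all_hot = (hot_tokens + cold_tokens) * hot_bytes_per_token
    return {
        "paged": paged,
        "all_hot": all_hot,
        "saved_fraction": 1 - paged / all_hot,
    }


# Same hot/cold token counts as the benchmark configuration in the article.
est = kv_footprint_bytes(hot_tokens=512, cold_tokens=1024)
print(est)
```

Under these assumptions the saving grows with the share of tokens that sit in the cold region, so longer sequences with a fixed hot window benefit more.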

Four components work together:

  • TurboQuant Compression: Quantizes KV to 4-bit integers with polar encoding and residual correction, achieving ~97% size reduction with ~0.3% perplexity loss
  • Sliding Window Eviction: The most recent N tokens stay hot by default; older tokens are compressed into cold storage
  • Attention-Weighted Promotion: High-attention tokens move back to hot, with a sticky mechanism to prevent thrashing
  • Page Swaps: Small batches of cold tokens materialize on access, with a local decode loop replacing the batch matmul
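The compression step can be illustrated with a much simpler stand-in. The sketch below is not TurboQuant and omits polar encoding entirely; it only shows the general idea of 4-bit quantization followed by a residual correction pass, using plain symmetric scaling. All function names are hypothetical.

```python
def quantize_4bit(values):
    """Symmetric 4-bit quantization: floats -> integer codes in [-7, 7]."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]


def compress(values):
    """Quantize, then quantize the leftover error as a residual correction."""
    codes, scale = quantize_4bit(values)
    residual = [v - a for v, a in zip(values, dequantize(codes, scale))]
    res_codes, res_scale = quantize_4bit(residual)
    return codes, scale, res_codes, res_scale


def decompress(codes, scale, res_codes, res_scale):
    """Rebuild the vector from the coarse codes plus the residual codes."""
    base = dequantize(codes, scale)
    fix = dequantize(res_codes, res_scale)
    return [b + f for b, f in zip(base, fix)]


vector = [0.9, -0.35, 0.12, -0.8, 0.05, 0.41]
packed = compress(vector)
coarse = dequantize(packed[0], packed[1])
restored = decompress(*packed)
err_coarse = max(abs(v - c) for v, c in zip(vector, coarse))
err_restored = max(abs(v - r) for v, r in zip(vector, restored))
print(f"max error without residual: {err_coarse:.4f}, with residual: {err_restored:.4f}")
```

The residual pass costs a second set of 4-bit codes per value, which is the usual trade: more bits stored in exchange for a smaller reconstruction error.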

Benchmark Results

Setup: TinyLlama-1.1B fp16, 50 generated tokens

  • Standard: 17.01 tok/s, 2112 MB VRAM
  • Monarch-v3: 30.42 tok/s, 2131 MB VRAM, 512 hot tokens, 1024 cold tokens
  • Gain: +78.7% throughput, +0.9% VRAM
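As a quick sanity check, the gain figures can be recomputed from the rounded numbers listed above. This is only arithmetic on the published values; the helper name is hypothetical, and small differences from the reported gain are expected if the original was computed from unrounded measurements.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new / baseline - 1.0) * 100.0


throughput_gain = percent_change(17.01, 30.42)   # tok/s, Standard vs Monarch-v3
vram_overhead = percent_change(2112, 2131)       # MB, Standard vs Monarch-v3
print(f"throughput: {throughput_gain:+.1f}%  VRAM: {vram_overhead:+.1f}%")
```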

Simplified Decode Loop

for step in range(100):
    q = project_query(next_token)
    # Compute attention over the hot region only (fast path)
    scores_hot = q @ kv_hot.T
    # Fall back to cold pages only when no hot token scores highly (rare)
    if max(scores_hot) < threshold:
        kv_cold_promoted = decompress(cold_pages)
        scores_cold = q @ kv_cold_promoted.T
        # Promoted entries move to hot for the next step
    # Aggregate, softmax, apply attention ...
    # Evict the oldest tokens from hot to cold
    if len(kv_hot) > window_size:
        evict_oldest_to_cold()
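The loop above can be turned into a small runnable toy. The version below is a sketch under several assumptions: it uses plain Python lists instead of tensors, keeps cold entries uncompressed (a real cold store would decompress pages on access), skips promotion back to hot, and feeds random vectors in place of a model. The function names, threshold value, and window size are all hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(q, hot, cold, threshold, window_size):
    """One decode step over hot and cold lists of (key, value) pairs."""
    scores = [dot(q, k) for k, _ in hot]
    values = [v for _, v in hot]
    used_cold = False
    # Fall back to cold entries only when no hot key matches the query well.
    if cold and max(scores) < threshold:
        scores += [dot(q, k) for k, _ in cold]
        values += [v for _, v in cold]
        used_cold = True
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    # Evict the oldest hot entries once the window is exceeded.
    while len(hot) > window_size:
        cold.append(hot.pop(0))
    return out, used_cold


random.seed(0)
dim, window_size = 4, 8
hot, cold = [], []
cold_hits = 0
for step in range(20):
    k = [random.gauss(0, 1) for _ in range(dim)]
    v = [random.gauss(0, 1) for _ in range(dim)]
    hot.append((k, v))  # the new token's key/value enters the hot region
    q = [random.gauss(0, 1) for _ in range(dim)]
    out, used_cold = decode_step(q, hot, cold, threshold=0.5, window_size=window_size)
    cold_hits += used_cold
print(f"hot={len(hot)} cold={len(cold)} steps that touched cold={cold_hits}")
```

With random queries the cold path fires far more often than it would with real attention patterns, so the hit count here says nothing about actual behavior; the toy only shows the control flow.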

Current Status

  • Implementation: Working on Hugging Face Transformers with custom cache backend
  • License: Apache 2.0
  • Paper: Full technical spec available
  • Next: CUDA kernel fusion for cold decompression planned

How to Try It

git clone https://github.com/JohannaWeb/Monarch.git
cd Monarch
pip install -r requirements.txt
python train_tinyllama_fp16.py
python src/benchmark_monarch.py \
    --model models/tinyllama_fp16 \
    --mode both \
    --max-new-tokens 100 \
    --promotion-threshold 0.15 \
    --sticky-threshold 3 \
    --json

Limitations

The approach assumes that recent tokens receive the most attention. That holds for most tasks but may fail for retrieval-heavy workloads, where the relevant tokens can sit far back in the context. Attention extraction is available in base models but not in chat variants, which fall back to window-only paging.

📖 Read the full source: r/LocalLLaMA
