Ninetails Memory Engine V4.5: Int8 Quantization + LRU Cache Cuts Local MCP Memory to 60MB

The Ninetails Memory Engine V4.5 addresses the memory bottleneck in local MCP (Model Context Protocol) tools by implementing Int8 scalar quantization combined with LRU cache eviction. The solution keeps the entire engine process running inside a Tauri desktop app at 40-60MB of RAM.
The Memory Problem
A standard 1536-dimension float32 embedding takes 6,144 bytes (~6KB). Storing 10,000 memories means ~60MB just for vectors, scaling to ~600MB for 100,000 memories. For a local tool running on SQLite, this resource consumption is unacceptable.
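The arithmetic behind those figures can be checked with a short sketch. The function name and constants are illustrative only, and the calculation counts raw vector payload, ignoring any SQLite row or index overhead.

```python
BYTES_PER_FLOAT32 = 4
DIMENSIONS = 1536

def vector_storage_bytes(num_memories, dims=DIMENSIONS, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector payload in bytes, ignoring any index or row overhead."""
    return num_memories * dims * bytes_per_value

for count in (1, 10_000, 100_000):
    size = vector_storage_bytes(count)
    print(f"{count:>7} memories: {size:>11,} bytes ({size / 1e6:.1f} MB)")
```

Passing bytes_per_value=1 gives the corresponding int8 payload: 1,536 bytes per vector, or about 15MB for 10,000 memories.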
Technical Implementation
Layer 1: Int8 Scalar Quantization
By compressing float32 (4 bytes/dim) down to int8 (1 byte/dim), storage volume is reduced to a quarter of its original size. The implementation calculates the numerical range of each dimension, maps floats to a -128 to 127 integer range, and dequantizes back to float32 during retrieval for cosine similarity.
import numpy as np

# Quantize: float32 → int8
def quantize_vector(vector_fp32, scale, zero_point):
    # Scale, shift by the zero point, then clip into the int8 range before casting
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)

# Dequantize: int8 → float32 (Approximation)
def dequantize_vector(vector_int8, scale, zero_point):
    # Cast up first so the subtraction cannot overflow int8
    return (vector_int8.astype(np.float32) - zero_point) * scale
Real-world result: A 1536-dim vector drops from 6144 bytes to 1536 bytes. Factoring in global scale and zero_point overhead, the real compression ratio is around 3.8x - 4.0x.
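The snippet above takes scale and zero_point as inputs without showing where they come from. A minimal sketch of one common way to derive them, assuming a per-vector min/max (asymmetric) scheme, follows. The helper compute_qparams, the cosine function, and the 8-byte overhead figure (a float32 scale plus an int32 zero_point) are assumptions for illustration, not details confirmed by the project.

```python
import numpy as np

def compute_qparams(vector_fp32, qmin=-128, qmax=127):
    """Derive scale and zero_point from the vector's min/max (assumed scheme)."""
    lo, hi = float(vector_fp32.min()), float(vector_fp32.max())
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # constant vector: any non-zero scale round-trips it
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize_vector(vector_fp32, scale, zero_point):
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)

def dequantize_vector(vector_int8, scale, zero_point):
    return (vector_int8.astype(np.float32) - zero_point) * scale

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Round-trip a random 1536-dim vector and measure what is lost
rng = np.random.default_rng(0)
original = rng.standard_normal(1536).astype(np.float32)
scale, zero_point = compute_qparams(original)
packed = quantize_vector(original, scale, zero_point)
restored = dequantize_vector(packed, scale, zero_point)

similarity = cosine(original, restored)
# int8 payload plus an assumed 8 bytes of per-vector parameters
ratio = original.nbytes / (packed.nbytes + 8)
print(f"cosine(original, restored) = {similarity:.6f}, compression = {ratio:.2f}x")
```

With per-vector parameters the overhead is a few bytes on top of 1,536, which is why the effective ratio lands just under 4x.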
Layer 2: LRU Cache Eviction
Quantized vectors are stored in a SQLite database (vector_cache.sqlite) using a Least Recently Used strategy with a hard cap of 10,000 entries. High-frequency vectors stay in RAM while stale ones are evicted.
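One way to picture the eviction policy is a SQLite table with a last-used counter, trimmed to the cap on every write. This is a sketch under assumptions: the class name, table schema, and column names are hypothetical, and the engine's actual layout of vector_cache.sqlite is not described in the post.

```python
import sqlite3

class VectorLRUCache:
    """LRU cache of int8 vector blobs backed by SQLite (illustrative schema)."""

    def __init__(self, path=":memory:", capacity=10_000):
        self.capacity = capacity
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors ("
            "memory_id TEXT PRIMARY KEY, payload BLOB NOT NULL, last_used INTEGER NOT NULL)"
        )
        # Resume the access counter from existing rows so ordering survives restarts
        self.tick = self.db.execute(
            "SELECT COALESCE(MAX(last_used), 0) FROM vectors"
        ).fetchone()[0]

    def _next_tick(self):
        self.tick += 1
        return self.tick

    def put(self, memory_id, payload):
        self.db.execute(
            "INSERT OR REPLACE INTO vectors (memory_id, payload, last_used) VALUES (?, ?, ?)",
            (memory_id, payload, self._next_tick()),
        )
        # Keep the `capacity` most recently used rows and delete the rest
        self.db.execute(
            "DELETE FROM vectors WHERE memory_id IN ("
            "SELECT memory_id FROM vectors ORDER BY last_used DESC LIMIT -1 OFFSET ?)",
            (self.capacity,),
        )
        self.db.commit()

    def get(self, memory_id):
        row = self.db.execute(
            "SELECT payload FROM vectors WHERE memory_id = ?", (memory_id,)
        ).fetchone()
        if row is None:
            return None
        # A read counts as a use, so refresh the row's position
        self.db.execute(
            "UPDATE vectors SET last_used = ? WHERE memory_id = ?",
            (self._next_tick(), memory_id),
        )
        self.db.commit()
        return row[0]

cache = VectorLRUCache(capacity=2)
cache.put("a", b"\x01")
cache.put("b", b"\x02")
cache.get("a")            # "a" becomes the most recently used entry
cache.put("c", b"\x03")   # over capacity: "b" is the stalest and is evicted
```

A small capacity is used here only to make the eviction visible; the post states a cap of 10,000 entries.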
Precision Considerations
Int8 quantization is lossy but acceptable for memory retrieval because:
- The engine uses hybrid search: 70% vector similarity + 30% BM25. Even if quantization slightly skews vector ranking, exact keyword matching via BM25 pulls relevant memories back up.
- AI memory retrieval only needs to surface context into the Top-5 results, unlike recommendation algorithms that need absolute precision for the #1 spot.
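The 70/30 blend can be sketched as a weighted sum over per-memory scores. The function names, the dict-based inputs, and the max-normalization of BM25 scores are assumptions made for this example; the post does not say how the engine normalizes BM25 before blending.

```python
def hybrid_scores(vector_sims, bm25_scores, vector_weight=0.7, bm25_weight=0.3):
    """Blend per-memory scores; both inputs map memory_id -> score."""
    max_bm25 = max(bm25_scores.values(), default=0.0)
    blended = {}
    for memory_id in set(vector_sims) | set(bm25_scores):
        vec = vector_sims.get(memory_id, 0.0)
        # Assumption: BM25 is scaled by the batch maximum into [0, 1]
        kw = bm25_scores.get(memory_id, 0.0) / max_bm25 if max_bm25 > 0 else 0.0
        blended[memory_id] = vector_weight * vec + bm25_weight * kw
    return blended

def top_k(scores, k=5):
    """Return the k memory ids with the highest blended score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# "m2" trails "m1" slightly on vector similarity but has a strong keyword match
vector_sims = {"m1": 0.82, "m2": 0.80, "m3": 0.40}
bm25_scores = {"m1": 1.0, "m2": 6.0, "m3": 0.0}
ranking = top_k(hybrid_scores(vector_sims, bm25_scores))
print(ranking)
```

In this toy input the keyword signal lifts "m2" above "m1", which is the correction effect described above: a small quantization-induced shift in vector similarity matters less when BM25 contributes to the final order.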
Clarification on "TurboQuant"
The engine uses standard Int8 scalar quantization for SQLite vector storage, not Google's TurboQuant (ICLR 2026), which is a 3-bit compression algorithm (PolarQuant + QJL) designed for KV Cache during LLM GPU inference. The branding "TurboQuant Compression" in the UI is a nod to the philosophy of aggressive bit-reduction.
Full Tech Stack
- Vector Compression: Int8 Scalar Quantization (~4x real compression)
- Cache Management: SQLite + LRU Eviction (Cap: 10,000 entries)
- Search Engine: Hybrid: 70% Vector Similarity + 30% BM25
- Profile Manager: Automatic STATIC/DYNAMIC fact extraction
- Fact Extraction: asyncio.to_thread background async LLM calls
- Data Storage: 3x SQLite Databases (100% Local)
- Desktop App: Tauri + Vue 3 + PyInstaller sidecar
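For the fact-extraction entry in the list above, the pattern of pushing a blocking LLM call onto a worker thread with asyncio.to_thread (Python 3.9+) might look like the following sketch. The function extract_facts_blocking is a hypothetical stand-in that splits sentences instead of calling a model; the engine's real extraction logic is not shown in the post.

```python
import asyncio
import time

def extract_facts_blocking(text):
    """Hypothetical stand-in for a blocking LLM call."""
    time.sleep(0.05)  # simulate network or inference latency
    return [s.strip() for s in text.split(".") if s.strip()]

async def extract_facts(text):
    # Run the blocking function in a worker thread so the event loop stays free
    return await asyncio.to_thread(extract_facts_blocking, text)

async def main():
    task = asyncio.create_task(
        extract_facts("User prefers Rust. User lives in Berlin.")
    )
    # Other coroutines can run here while the extraction is in flight
    await asyncio.sleep(0)
    return await task

facts = asyncio.run(main())
print(facts)
```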
The engine is open-source under MIT License at GitHub: sunhonghua1/ninetails-memory-engine.
📖 Read the full source: r/LocalLLaMA