12ms Semantic Search: Build With fastembed & LanceDB

A developer has implemented a local semantic search system for AI conversation history, processing 368K messages without cloud dependencies or API keys. The project uses fastembed with the BAAI/bge-small-en-v1.5 model for CPU-based embeddings and LanceDB as a vector store that operates as a single directory without a server process.

Technical Stack

Embeddings: fastembed with BAAI/bge-small-en-v1.5 model (384 dimensions)
Vector store: LanceDB - single directory, no server process, append-friendly
Ingest: Pulls from JSONL session transcripts (Claude Code, any chat export)
Embedding performance: ~500 docs/sec on M2 CPU
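The stack above can be sketched end to end in a few lines. This is a minimal illustration, not the project's actual code: the JSONL field names ("role", "text"), the table name "messages", and both function names are assumptions made for the example, and real chat exports will likely need their own field mapping.

```python
import json
from pathlib import Path


def read_messages(jsonl_path):
    """Yield (role, text) pairs from a JSONL transcript.

    Assumes one JSON object per line with "role" and "text" keys
    (a hypothetical schema chosen for this sketch).
    """
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record.get("text", "")
            if isinstance(text, str) and text.strip():
                yield record.get("role", "unknown"), text


def index_and_search(jsonl_path, db_dir, query, k=5):
    """Embed messages on CPU, store them in LanceDB, and run one query."""
    # Imported here so read_messages() works without these packages installed.
    import lancedb
    from fastembed import TextEmbedding

    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # 384-dim vectors
    messages = list(read_messages(jsonl_path))  # must be non-empty
    vectors = model.embed([text for _, text in messages])

    rows = [
        {"vector": vec.tolist(), "role": role, "text": text}
        for (role, text), vec in zip(messages, vectors)
    ]

    db = lancedb.connect(db_dir)  # a plain directory on disk, no server
    table = db.create_table("messages", data=rows, mode="overwrite")

    query_vec = list(model.embed([query]))[0].tolist()
    return table.search(query_vec).limit(k).to_list()
```

Later batches could be appended by opening the existing table with db.open_table("messages") and calling table.add(rows) instead of recreating it.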

Key Implementation Details

The developer learned several practical lessons during the 4-month iteration:

Selective embedding: Early versions embedded every message, which lowered the signal-to-noise ratio of search results. The current implementation embeds only user messages and assistant messages with substance (skipping responses like "sure, here's that code"), cutting vector count by 60% while improving search quality.
Chunking strategy: Switching from fixed-size chunks to conversation-turn chunks markedly improved retrieval relevance. Swapping embedding models (nomic-embed-text, bge-large, and all-MiniLM were tried) made only marginal differences by comparison.
LanceDB advantages: The developer found LanceDB "stupidly underrated for personal-scale" - no server, no Docker, just a directory with instant appending of new vectors, replacing an overengineered pgvector setup.
Re-embedding workflow: The bge-small-en-v1.5 model at 384 dimensions is fast enough to re-embed hourly as a cron job. A full re-index of 117K vectors takes approximately 4 minutes on M2 hardware.
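The first two lessons, selective embedding and conversation-turn chunking, can be illustrated with plain Python. The source does not say how "substance" is judged or exactly where a turn begins and ends, so the word-count threshold and the "user message plus the replies that follow" grouping below are assumptions for the sake of a sketch.

```python
MIN_ASSISTANT_WORDS = 8  # hypothetical cutoff for a "substantive" reply


def has_substance(role, text):
    """Keep every non-empty user message; keep assistant messages only if long enough."""
    words = text.split()
    if role == "user":
        return len(words) > 0
    return len(words) >= MIN_ASSISTANT_WORDS


def turn_chunks(messages):
    """Group (role, text) pairs into turns: a user message plus the replies after it."""
    chunks, current = [], []
    for role, text in messages:
        if role == "user" and current:
            chunks.append(current)  # a new user message closes the previous turn
            current = []
        current.append((role, text))
    if current:
        chunks.append(current)
    # One string per turn, ready to hand to the embedding model.
    return ["\n".join(f"{r}: {t}" for r, t in chunk) for chunk in chunks]


conversation = [
    ("user", "How do I append to a LanceDB table?"),
    ("assistant", "sure, here's that code"),
    ("assistant", "Open the table and call add with the new rows; nothing else needs to restart."),
    ("user", "thanks"),
]
kept = [(r, t) for r, t in conversation if has_substance(r, t)]
chunks = turn_chunks(kept)
```

Here the four-word filler reply is dropped before chunking, and the remaining three messages collapse into two turn-level chunks, so each vector carries a question together with its answer rather than an arbitrary slice of text.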

Performance Metrics

Total messages ingested: 368K
Vectors indexed: 117K
Search latency (p50): 12ms across 117K vectors
Full re-index time: ~4 minutes (M2)
Storage: ~180MB on disk
API keys needed: 0
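Two of these figures are easy to sanity-check. The sketch below estimates raw vector storage, assuming 32-bit floats (the source does not state the stored precision), and shows one way a p50 latency could be measured. Both function names are hypothetical, and search_fn stands in for whatever search call is being timed.

```python
import statistics
import time


def raw_vector_bytes(n_vectors, dims=384, bytes_per_value=4):
    """Bytes needed for the vectors alone, ignoring text and metadata columns."""
    return n_vectors * dims * bytes_per_value


def p50_latency_ms(search_fn, queries):
    """Median wall-clock time of search_fn over the given queries, in milliseconds."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# 117K vectors x 384 dims x 4 bytes, expressed in megabytes
print(raw_vector_bytes(117_000) / 1e6)  # 179.712
```

Under the float32 assumption, 117K vectors at 384 dimensions come to roughly 180MB, which lines up with the reported on-disk size.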

The project is open source under MIT license and available at github.com/mordechaipotash/brain-mcp. Installation is via pipx install brain-mcp && brain-mcp setup.

📖 Read the full source: r/LocalLLaMA

