Pali v0.1: Open Source Memory Infrastructure for LLMs with Reproducible Benchmarks

What Pali Is
Pali is open-source, infrastructure-first memory for LLMs. It is written in Go and ships as a single binary, with config-driven integrations for Qdrant, Neo4j, Ollama, and OpenRouter. The project is MIT-licensed and fully self-hostable.
Key Features
- Multi-tenant memory APIs with tenant-scoped isolation
- Hybrid retrieval across lexical, dense, fusion, reranking, and optional multi-hop expansion
- MCP server with memory-first tools and tenant-aware resolution
- REST API, with Python and JavaScript client packages already published
- Dashboard for operators inspecting tenants, memories, and system state
- Plug-and-play extension points for vector stores, embedders, entity-fact backends, and scoring/routing
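
The feature list mentions fusing lexical and dense results but does not say which fusion method Pali uses. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common technique swapped in here as an assumption. The function name, the memory IDs, and the constant k=60 are hypothetical and not taken from Pali.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists into one list (illustrative, not Pali's code).

    Each ID earns 1 / (k + rank) from every list it appears in;
    higher totals rank first, ties are broken by ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda memory_id: (-scores[memory_id], memory_id))


# Hypothetical results from a lexical retriever and a dense retriever.
lexical = ["m1", "m2", "m3"]
dense = ["m3", "m1", "m4"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)  # ['m1', 'm3', 'm2', 'm4']
```

Memories found by both retrievers (m1, m3) rise above those found by only one, which is the usual motivation for fusing before any reranking step.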
Benchmark Approach
To address common reproducibility problems in memory-stack benchmarks, the creator applies the following rules:
- Every run stores the exact config files used (profile + rendered)
- Hardware is fully disclosed (CPU, GPU, RAM, model versions)
- Paired comparisons only: the same fixture, eval, and top_k across all profiles
- Speed lanes and retrieval quality lanes are kept separate
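
The rules above can be sketched as a small run manifest plus a pairing check. This is a minimal illustration, not Pali's benchmark code: the field names, function names, and sample configs are all assumptions. Only the host details available from the standard library are captured here; GPU, RAM, and model versions would have to be recorded separately.

```python
import hashlib
import json
import platform


def sha256_text(text):
    """Return a hex digest so a stored config can be verified later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_run_manifest(profile_name, profile_text, rendered_text, eval_params):
    """Bundle everything needed to rerun one benchmark (hypothetical layout)."""
    return {
        "profile": profile_name,
        "profile_config": profile_text,    # config as written
        "rendered_config": rendered_text,  # config after rendering
        "rendered_sha256": sha256_text(rendered_text),
        "eval": dict(eval_params),         # fixture, eval set, top_k
        "host": {
            "machine": platform.machine(),
            "python": platform.python_version(),
        },
    }


def is_paired(manifests):
    """True when every run used identical fixture/eval/top_k settings."""
    evals = {json.dumps(m["eval"], sort_keys=True) for m in manifests}
    return len(evals) <= 1


shared_eval = {"fixture": "demo-fixture", "eval": "demo-eval", "top_k": 5}
run_a = build_run_manifest("sqlite-lexical", "store: sqlite", "store: sqlite\n", shared_eval)
run_b = build_run_manifest("qdrant-ollama", "store: qdrant", "store: qdrant\n", shared_eval)
print(is_paired([run_a, run_b]))  # True
```

Comparing two profiles is only meaningful when `is_paired` holds; a run with a different top_k would be rejected rather than placed in the same table.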
Performance Numbers
Benchmarks from testing on a Ryzen 9 7950X + RTX 5070:
- sqlite + lexical: 208 store ops/s, Top1=0.32, Recall@5=0.54
- qdrant + ollama (all-minilm): 98 store ops/s, Top1=0.34, Recall@5=0.52
- parser+graph (structured memory stress lane): 2.4 store ops/s. This lane is slow because of structured extraction cost, but it averages ~30 on LoCoMo, with temporal highs around 40
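
The post does not define Top1 or Recall@5. The sketch below assumes the usual readings: Top1 is the share of queries whose first result is relevant, and Recall@k is the share of relevant items found in the first k results, averaged over queries. The function names and toy data are hypothetical, and Pali's own evaluator may compute these differently.

```python
def top1_accuracy(results, relevant):
    """Fraction of queries whose top-ranked result is relevant."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if ranked and ranked[0] in rel)
    return hits / len(results)


def recall_at_k(results, relevant, k=5):
    """Mean fraction of each query's relevant items found in the top k."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        if rel:  # queries with no relevant items contribute 0
            total += len(set(ranked[:k]) & rel) / len(rel)
    return total / len(results)


# Toy data: ranked memory IDs per query, and the relevant IDs per query.
results = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"a", "c"}, {"z", "q"}]

print(top1_accuracy(results, relevant))      # 0.5
print(recall_at_k(results, relevant, k=5))   # 0.75
```

Under these definitions the two metrics can move in opposite directions, as they do between the first two lanes above.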
Important Clarification
Pali is not LLM memory in the SaaS sense. It returns raw retrieval results that you tune for your own workflow, with no black-box scoring and no locked-in provider decisions. You can swap vector backends, embedders, and scorers through config without changing your app contract.
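
One way to picture "swap through config without changing your app contract" is an interface plus a config-keyed registry. Everything in this sketch is hypothetical: the `VectorStore` protocol, the `InMemoryStore` class, the `vector_store` config key, and the token-overlap scoring are illustrative assumptions, not Pali's actual API.

```python
from typing import Protocol


class VectorStore(Protocol):
    """The contract the application codes against (hypothetical)."""

    def upsert(self, tenant_id: str, memory_id: str, text: str) -> None: ...
    def search(self, tenant_id: str, query: str, top_k: int) -> list: ...


class InMemoryStore:
    """Toy backend: token-overlap scoring, data partitioned by tenant."""

    def __init__(self):
        self._data = {}

    def upsert(self, tenant_id, memory_id, text):
        self._data.setdefault(tenant_id, {})[memory_id] = text

    def search(self, tenant_id, query, top_k):
        query_tokens = set(query.lower().split())
        scored = []
        # Only the caller's tenant partition is ever scanned.
        for memory_id, text in self._data.get(tenant_id, {}).items():
            score = len(query_tokens & set(text.lower().split()))
            if score > 0:
                scored.append((-score, memory_id))
        return [memory_id for _, memory_id in sorted(scored)[:top_k]]


# Registry keyed by a config value; a new backend is one more entry.
BACKENDS = {"memory": InMemoryStore}


def build_store(config) -> VectorStore:
    name = config["vector_store"]
    if name not in BACKENDS:
        raise ValueError(f"unknown vector store: {name}")
    return BACKENDS[name]()


store = build_store({"vector_store": "memory"})
store.upsert("tenant-1", "m1", "likes green tea")
store.upsert("tenant-2", "m2", "likes green tea")
print(store.search("tenant-1", "green tea", top_k=5))  # ['m1']
```

The application only calls `upsert` and `search`, so changing the config value changes the backend while the calling code stays the same. The search for tenant-1 never sees tenant-2's memory, which mirrors the tenant-scoped isolation listed among the features.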
Project Status
Version 0.1 was recently pushed with a proper benchmark suite added. The creator is looking for contributors.
📖 Read the full source: r/LocalLLaMA