Open Source Auto-Memory System for LLM Agents Achieves 94% Recall Accuracy

✍️ OpenClawRadar | 📅 Published: March 21, 2026 | 🔗 Source

A developer has open-sourced an auto-memory system for LLM-based agents that automatically extracts, classifies, and persists facts across sessions without requiring explicit "save this" commands. The entire project, including plugin code, benchmark design, and test harness, was built using Claude Code as the primary development tool.

How the Memory System Works

The system operates with two layers:

  • Layer 1 (per-turn): A lightweight LLM summarizes each turn in real time and writes the summary to a staging file
  • Layer 2 (session boundary): Asynchronous classification into four skill files: identity, knowledge, lessons, and preferences
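
The two layers above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the staging.md file name, and the one-file-per-category layout are assumptions, and the two LLM calls are replaced by trivial stand-ins. The sketch also runs Layer 2 synchronously, whereas the article describes it as asynchronous.

```python
import tempfile
from pathlib import Path

def summarize_turn(turn_text: str) -> str:
    """Stand-in for the lightweight per-turn LLM summarizer (hypothetical)."""
    return " ".join(turn_text.split())

def classify(note: str) -> str:
    """Stand-in for the LLM classifier; these keyword rules are illustrative only."""
    lowered = note.lower()
    if "my name" in lowered or "i work as" in lowered:
        return "identity"
    if "prefer" in lowered:
        return "preferences"
    if "learned" in lowered or "mistake" in lowered:
        return "lessons"
    return "knowledge"

def on_turn(memory_dir: Path, turn_text: str) -> None:
    """Layer 1: append a one-line summary of the turn to the staging file."""
    with (memory_dir / "staging.md").open("a", encoding="utf-8") as staging:
        staging.write(summarize_turn(turn_text) + "\n")

def on_session_end(memory_dir: Path) -> None:
    """Layer 2: file each staged note under one of four skill files, then clear staging."""
    staging_path = memory_dir / "staging.md"
    if not staging_path.exists():
        return
    for note in staging_path.read_text(encoding="utf-8").splitlines():
        if note:
            skill_path = memory_dir / f"{classify(note)}.md"
            with skill_path.open("a", encoding="utf-8") as skill:
                skill.write(f"- {note}\n")
    staging_path.write_text("", encoding="utf-8")

def demo_memory() -> dict[str, str]:
    """Run one two-turn session in a temporary directory and return the files written."""
    with tempfile.TemporaryDirectory() as tmp:
        memory_dir = Path(tmp)
        on_turn(memory_dir, "My name is Ada")
        on_turn(memory_dir, "I prefer tabs over spaces")
        on_session_end(memory_dir)
        return {p.name: p.read_text(encoding="utf-8") for p in memory_dir.glob("*.md")}
```

Calling demo_memory() files the first note under identity.md and the second under preferences.md, and leaves the staging file empty for the next session.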

Retrieval is keyword-based: the agent loads the relevant skill files by matching keywords against each file's description. Instead of vector databases or RAG pipelines, the approach stores memory in structured markdown files that the agent reads as "skills."
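
One way such keyword-based selection might look is sketched below. The descriptions, the stopword list, and the select_skills function are all hypothetical. The article does not say how the agent actually performs the match, so treat this as an illustration of the idea rather than the project's mechanism.

```python
import re

# Hypothetical descriptions; the real skill files may describe themselves differently.
SKILL_DESCRIPTIONS = {
    "identity": "who the user is: name, role, location, background",
    "knowledge": "facts about projects, tools, people, and dates",
    "lessons": "mistakes, fixes, and things learned from past work",
    "preferences": "likes, dislikes, style, and workflow preferences",
}

# Small illustrative stopword list so common words do not trigger a match.
STOPWORDS = {"a", "about", "and", "did", "from", "i", "is", "my", "of", "the", "what", "which", "who"}

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its words, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def select_skills(query: str, descriptions: dict[str, str]) -> list[str]:
    """Return the skills whose description shares at least one keyword with the query."""
    query_words = tokenize(query)
    return [name for name, text in descriptions.items() if query_words & tokenize(text)]
```

With these descriptions, select_skills("What is my name?", SKILL_DESCRIPTIONS) selects only the identity file. Exact word overlap is brittle (for example, "tool" would not match "tools"), which is one trade-off of matching on keywords instead of embeddings.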

Development with Claude Code

Claude Code assisted in multiple aspects of the project:

  • Architecture design: Helped evaluate LongMemEval as a benchmark candidate, identified the paradigm mismatch (long-context retrieval vs. progressive memory), and proposed an adapted 6-question-type benchmark
  • Benchmark authoring: Designed the full 20-session/48-fact test suite including fact planting table, update chains (A→B→C), interference pairs, abstention questions, and two-hop trigger placement
  • Test harness: Built the entire autotest framework including serial executor, multi-turn polling, lifecycle management, rule evaluator, and LLM judge pipeline
  • Debugging in the loop: Diagnosed issues live during test runs, such as an update popup that blocked agent restarts, which was fixed by locking the updater state file to read-only
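
The read-only fix mentioned in the last item could be done along these lines. The article does not show how the developer applied it, so this is only a sketch using standard permission bits; the file name updater_state.json is hypothetical.

```python
import stat
import tempfile
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def lock_read_only(path: Path) -> None:
    """Clear the write permission bits on the file, leaving the other bits unchanged."""
    path.chmod(path.stat().st_mode & ~WRITE_BITS)

def is_read_only(path: Path) -> bool:
    """Return True if no write permission bit is set on the file."""
    return not (path.stat().st_mode & WRITE_BITS)

def demo_lock() -> bool:
    """Create a throwaway state file, lock it, and report whether it is now read-only."""
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "updater_state.json"  # hypothetical file name
        state_file.write_text("{}", encoding="utf-8")
        lock_read_only(state_file)
        return is_read_only(state_file)
```

Note that permission bits only discourage writes: the file's owner or an administrator can still change the mode back.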

Benchmark Results

The 20-session benchmark was inspired by LongMemEval and tested 48 planted facts across 6 question types:

  • Deep recall: Facts from sessions 1-2 tested 15+ sessions later - 89%
  • Knowledge update: 3-level correction chain (A→B→C) - 100%
  • Cross-session reasoning: Combine facts from 3+ sessions - 100%
  • Interference resistance: Similar names that shouldn't be confused - 100%
  • Temporal reasoning: "Which came first?" ordering questions - 80%
  • Abstention: "I don't know" for never-mentioned facts - 86%
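
Two of these question types lend themselves to simple rule checks, sketched below under stated assumptions. The function names and the phrase list are hypothetical and are not taken from the project's rule evaluator.

```python
# Hypothetical phrases that count as declining to answer.
ABSTAIN_PHRASES = ("i don't know", "never discussed", "not mentioned", "no record")

def check_update(answer: str, current: str, stale: list[str]) -> bool:
    """Pass only if the answer states the latest value and none of the superseded ones."""
    text = answer.lower()
    return current.lower() in text and not any(old.lower() in text for old in stale)

def check_abstention(answer: str) -> bool:
    """Pass only if the answer contains one of the abstention phrases."""
    text = answer.lower()
    return any(phrase in text for phrase in ABSTAIN_PHRASES)
```

For an A→B→C chain, check_update("You now live in Lisbon.", "Lisbon", ["Berlin", "Madrid"]) passes, while an answer naming Berlin fails. String rules like these are strict: a correct answer such as "you moved from Madrid to Lisbon" would be rejected for mentioning a stale value, which is the kind of case where an LLM judge is more forgiving.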

Overall: 49/52 checkpoints passed (94.2%). The one hard failure was an abstention case: the agent inferred "you've done social media marketing" from a vaguely related fact ("promotion work") when the correct answer was "never discussed", a classic LLM over-inference problem.

Availability and Questions

The project is open source with code and benchmark available on GitHub. The developer is looking for feedback on the skill-file approach (structured markdown vs. vector search), better ways to test abstention (identified as the hardest dimension), and information about others benchmarking cross-session memory in agents (not just long-context).

📖 Read the full source: r/ClaudeAI
