Benchmark Results: Claude Agent Swarm with Memory System Shows 30-43% Token Cost Savings

Memory System Benchmark for Claude Agent Swarms
A developer has spent nine months building a memory system called Stompy, moving its storage from flat files to SQLite to PostgreSQL along the way. The goal is to minimize token usage when running Claude agent swarms. To test that, they benchmarked the same swarm with and without the memory system.
Test Setup
The benchmark used a 40-point coding task requiring a full booking feature with backend, frontend, and tests. A 6-agent swarm was tested with three different Claude models as lead: Sonnet 4.6, Opus 4.6, and Haiku 4.5. All tests used the same codebase, same teammates, and same scoring system. Teammate agents always ran Opus regardless of the lead model.
Benchmark Results
- Sonnet 4.6 + memory: 40/40, $3.98, 6.5min, 2 turns
- Sonnet 4.6 no memory: 40/40, $7.04, 9.6min, 4 turns
- Opus 4.6 + memory: 40/40, $4.34, 9.6min, 29 turns
- Opus 4.6 no memory: 40/40, $7.65, 10.0min, 70 turns
- Haiku 4.5 + memory: 39/40, $4.95, 7.5min, 2 turns
- Haiku 4.5 no memory: 0/40, $3.97, 5.8min, 3 turns
Key Findings
With memory, the Opus and Sonnet leads each cost about 43% less than the same lead without memory ($4.34 versus $7.65 for Opus, $3.98 versus $7.04 for Sonnet). The developer notes that these models are smart enough to complete the task without memory, but they burn tokens on codebase exploration that the memory system eliminates.
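The savings figure can be checked directly from the reported costs. The snippet below is a minimal sketch of that arithmetic, not part of the developer's benchmark harness; the dictionary layout and the `savings_pct` helper are illustrative names, and the dollar amounts are copied from the results list above.

```python
# Costs in USD copied from the benchmark results: (with memory, without memory).
costs = {
    "Sonnet 4.6": (3.98, 7.04),
    "Opus 4.6": (4.34, 7.65),
    "Haiku 4.5": (4.95, 3.97),
}


def savings_pct(with_memory: float, without_memory: float) -> float:
    """Percent cost reduction from adding memory; negative means memory cost more."""
    return (without_memory - with_memory) / without_memory * 100


# Hypothetical helper output: one rounded percentage per lead model.
savings = {lead: round(savings_pct(w, wo), 1) for lead, (w, wo) in costs.items()}

for lead, pct in savings.items():
    print(f"{lead}: {pct:+.1f}%")
```

This prints +43.5% for Sonnet, +43.3% for Opus, and -24.7% for Haiku. The Haiku figure is negative because the memory run cost more in absolute terms, but the cheaper memoryless run scored 0/40, so the two runs are not comparable on cost alone.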
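The Haiku result was unexpected: it scored 0/40 without memory but 39/40 with memory. The developer observed that Haiku couldn't coordinate the Opus teammate agents without understanding the project structure, but became a competent lead with memory access. It was also the only lead whose memory run cost more than its memoryless run ($4.95 versus $3.97), though the cheaper run scored zero.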
Sonnet with memory was the best overall configuration. It matched memoryless Opus on score (40/40) and beat it on cost, time, and turns, at roughly half the cost ($3.98 versus $7.65). The takeaway is that making project knowledge available to the model matters more than using expensive models.
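A metric-by-metric comparison of those two runs makes the claim concrete. This is a small sketch using only the numbers reported in the results list; the `Run` class and the `verdict` and `compare` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    score: int       # points out of 40
    cost_usd: float
    minutes: float
    turns: int


def verdict(a: float, b: float, higher_is_better: bool) -> str:
    """Say whether value a is better than, tied with, or worse than value b."""
    if a == b:
        return "tied"
    return "better" if (a > b) == higher_is_better else "worse"


def compare(a: Run, b: Run) -> dict[str, str]:
    """Compare run a against run b on each reported metric."""
    return {
        "score": verdict(a.score, b.score, higher_is_better=True),
        "cost_usd": verdict(a.cost_usd, b.cost_usd, higher_is_better=False),
        "minutes": verdict(a.minutes, b.minutes, higher_is_better=False),
        "turns": verdict(a.turns, b.turns, higher_is_better=False),
    }


# Figures copied from the benchmark results.
sonnet_memory = Run(score=40, cost_usd=3.98, minutes=6.5, turns=2)
opus_no_memory = Run(score=40, cost_usd=7.65, minutes=10.0, turns=70)

result = compare(sonnet_memory, opus_no_memory)
cost_ratio = round(sonnet_memory.cost_usd / opus_no_memory.cost_usd, 2)

print(result)
print(f"cost ratio: {cost_ratio}")
```

The comparison comes out tied on score and better on cost, time, and turns, with a cost ratio of 0.52. With one run per condition, these are single data points rather than averages.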
Technical Details
Stompy is exposed over MCP, an API, and a CLI, and works with Claude Code. The benchmark setup is available on GitHub for others to use or improve. The developer notes that each condition has been run only once so far (n=1), with more runs planned.
📖 Read the full source: r/ClaudeAI