Local Qwen3-0.6B INT8 as Embedding Backbone for AI Memory System

A developer has shared their implementation of a local embedding system using Qwen3-0.6B quantized to INT8 via ONNX Runtime as the backbone for an AI memory lifecycle system that runs inside Claude Code.
Problem and Requirements
The system addresses scaling issues with embedding APIs: a typical AI coding assistant workload of 15-25 sessions a day generates hundreds of embedding API calls, which adds latency to every write and creates a dependency on external services with variable pricing. The requirements were 1024-dimensional vectors, a cosine similarity above 0.75 that indicates genuine semantic relatedness, batch processing of 20+ entries, and zero API calls.
Model Selection and Implementation
After the developer tested several models, Qwen3-0.6B at 1024 dimensions separated genuinely related entries from structural noise (session logs that share a format but not a topic) better than sentence-transformers models did.
The implementation uses ONNX Runtime with INT8 quantization. The cold start problem (3-second model loading) was solved with a persistent embedding server on localhost:52525 that loads the model once at system boot. Warm inference achieves ~12ms per batch, roughly 250x faster than cold start.
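The load-once, serve-many pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual server: the JSON request shape, the handler, and the `embed_batch` callable are assumptions. In a real deployment `embed_batch` would wrap an ONNX Runtime session created once at startup, so every request after the first reuses the loaded model.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(embed_batch):
    """Build a handler around an already-loaded embed function (hypothetical)."""

    class EmbedHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Assumed request body: {"texts": ["...", "..."]}
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            vectors = embed_batch(payload.get("texts", []))
            body = json.dumps({"vectors": vectors}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return EmbedHandler


def start_server(embed_batch, host="127.0.0.1", port=52525):
    """Start the server in a background thread and return it.

    The model behind embed_batch is loaded by the caller exactly once,
    so requests only pay the warm inference cost.
    """
    server = ThreadingHTTPServer((host, port), make_handler(embed_batch))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A startup hook would call `start_server(embed_batch)` once. Port 52525 comes from the post; everything else here is illustrative.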
System Architecture
- The server starts automatically via a startup hook
- If the server goes down, the system falls back to direct ONNX loading (slower but functional)
- All CPU-based, no GPU needed
- Single Python script, ~2,900 lines, SQLite + ONNX
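The fallback behavior in the list above might look roughly like the following client. It is a sketch under assumptions: the `/embed` path, the JSON shape, and the `load_local_embedder` callable are hypothetical stand-ins, and the project's real code may differ. The idea is simply to try the warm server first and pay the cold-start cost only when the server is unreachable.

```python
import json
import urllib.error
import urllib.request

# Port from the post; the /embed path is an assumption.
SERVER_URL = "http://127.0.0.1:52525/embed"


def embed_with_fallback(texts, load_local_embedder, url=SERVER_URL, timeout=2.0):
    """Return (vectors, source), where source is "server" or "local".

    load_local_embedder is a hypothetical callable that loads the model
    in-process and returns a function mapping a list of texts to vectors.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps({"texts": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["vectors"], "server"
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Server down or reply malformed: slower path, but still functional.
        embed_batch = load_local_embedder()
        return embed_batch(texts), "local"
```

Returning the source alongside the vectors makes it easy to log how often the slow path is taken.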
Memory Lifecycle Phases
The system processes knowledge through 5 phases, with embeddings driving phases 2 through 4:
- Buffer
- Connect: New entries are linked to existing entries whose cosine similarity exceeds 0.75. Isolated entries fade over time while connected entries survive, so expiry depends on isolation rather than age.
- Consolidate: Groups of 3+ connected entries get merged into proven knowledge by an LLM (Gemini Flash free tier)
- Route: Proven knowledge gets routed to the right config file based on embedding distance to existing content
- Age
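The embedding-driven phases can be sketched in plain Python. This is a hedged illustration rather than the project's code: it assumes that "groups of 3+ connected entries" means connected components of the link graph, and that routing compares a vector against one representative vector per config file. The function names and data shapes are hypothetical; only the 0.75 threshold and the group size of 3 come from the post.

```python
import math

THRESHOLD = 0.75  # similarity threshold from the post


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def connect(new_vec, existing, threshold=THRESHOLD):
    """Connect: ids of existing entries more similar than the threshold."""
    return [eid for eid, vec in existing.items() if cosine(new_vec, vec) > threshold]


def consolidation_groups(links, min_size=3):
    """Consolidate: connected components of the link graph with min_size+ entries.

    Using connected components is an assumption about what "connected" means.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups


def route(vec, targets):
    """Route: name of the target whose representative vector is most similar."""
    return max(targets, key=lambda name: cosine(vec, targets[name]))
```

Entries left out of every group are the isolated ones that would fade. Each group returned by `consolidation_groups` is what would be handed to the LLM for merging.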
Technical Details
- Model: Qwen3-0.6B quantized to INT8
- Vector dimensions: 1024
- Similarity threshold: 0.75 cosine similarity for genuine semantic relatedness
- Performance: ~12ms per batch for warm inference
- Hardware: Runs on the CPU of any modern machine; no GPU needed
The project is open source at github.com/living0tribunal-dev/claude-memory-lifecycle with a detailed engineering story covering threshold decisions and failure modes after processing 3,874 memories.
📖 Read the full source: r/LocalLLaMA