How to Build a Permanent Local AI Companion: Lessons

Setup and Architecture

A developer has been running a self-hosted AI agent on an M4 Mac mini for several months. The setup uses a Rust runtime with qwen2.5:14b on Ollama for fast local inference. The system implements a model ladder that escalates to cloud models when tasks require more capability. Memory is handled with SQLite and local embeddings using nomic-embed-text for semantic recall across sessions. The agent runs 24/7 via launchd and performs various tasks including monitoring a trading bot, checking email, deploying websites, and delegating heavy implementation work to Claude Code through a task runner.

Key Lessons Learned

Memory architecture is everything: The developer found that hybrid recall combining BM25 keyword search with vector similarity, weighted and merged, was a breakthrough. A 14B model with good memory recall outperforms a 70B model that starts every conversation cold.

The system prompt tax is real: Initial identity files started at ~10K tokens, but were reduced to ~2,800 tokens by cutting anything the agent could look up on demand. The rule: if the agent needs something occasionally, put it in memory; if it needs it every message, put it in the system prompt.

Local embeddings changed the economics: Using nomic-embed-text on Ollama alongside the conversation model makes every memory store and recall operation free, eliminating costs that previously accumulated from OpenAI embedding requests.

The model ladder matters more than the default model: The agent defaults to local qwen for conversation (free, fast) but can escalate to Minimax, Kimi, Haiku, Sonnet, or Opus depending on task requirements. The key insight: let humans switch models manually with commands like /model sonnet for reasoning tasks and /model qwen for chatting, rather than trying to auto-detect.

Tool iteration limits need headroom: Starting with 10 max tool calls per message proved insufficient. Simple tasks burn 3-5 tool calls, while complex tasks need 15-20. The current setup uses 25 tool calls with a 200 action/hour rate limit as a safety net.

The hardest bug was cross-session memory: Memories stored explicitly via a store tool initially had no session_id, and recall queries filtered by current session_id. This made deliberately memorized facts invisible in future sessions. The fix was adding OR session_id IS NULL to the SQL query.

📖 Read the full source: r/LocalLLaMA

Practical Lessons from Building a Permanent Local AI Companion Agent

Setup and Architecture

Key Lessons Learned

👀 See Also

Claude AI Adopts Custom Terminology from 300-Page Specifications Without Prompting

Modified vLLM 0.17.0 runs on Tesla P40 for real-time transcription with Qwen3 ASR 1.7B

Agentic Infrastructure: Replacing Splunk with Claude Code Agents for Server Monitoring

Non-developer builds personalized AI news editor with Claude