Practical Lessons from Building a Permanent Local AI Companion Agent

Setup and Architecture
A developer has been running a self-hosted AI agent on an M4 Mac mini for several months. The setup uses a Rust runtime with qwen2.5:14b on Ollama for fast local inference. The system implements a model ladder that escalates to cloud models when tasks require more capability. Memory is handled with SQLite and local embeddings using nomic-embed-text for semantic recall across sessions. The agent runs 24/7 via launchd and performs various tasks including monitoring a trading bot, checking email, deploying websites, and delegating heavy implementation work to Claude Code through a task runner.

Key Lessons Learned
Memory architecture is everything: The developer found that hybrid recall, which scores memories with both BM25 keyword search and vector similarity and then merges the weighted results, was a breakthrough. In the developer's experience, a 14B model with good memory recall outperforms a 70B model that starts every conversation cold.
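The weighted merge behind hybrid recall can be sketched as follows. This is a minimal illustration, not the developer's implementation: the source says only that the two signals are "weighted and merged", so the min-max normalization, the 0.4 keyword weight, and the function names are all assumptions. The input scores are taken as given (for example from a BM25 index and a cosine-similarity search).

```python
def normalize(scores):
    """Min-max scale a {memory_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_merge(keyword_scores, vector_scores, keyword_weight=0.4, top_k=5):
    """Merge keyword and vector scores into one ranked list.

    keyword_weight is a hypothetical tuning knob; the vector side
    receives the remaining weight (1 - keyword_weight).
    """
    kw = normalize(keyword_scores)
    vec = normalize(vector_scores)
    merged = {}
    for memory_id in set(kw) | set(vec):
        # A memory missing from one result set contributes 0 for that signal.
        merged[memory_id] = (
            keyword_weight * kw.get(memory_id, 0.0)
            + (1.0 - keyword_weight) * vec.get(memory_id, 0.0)
        )
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical scores: "a" only matches keywords, "c" only matches vectors.
ranked = hybrid_merge({"a": 10.0, "b": 2.0}, {"b": 0.9, "c": 0.1})
```

Normalizing before merging matters because BM25 scores and cosine similarities live on different scales; without it, one signal would dominate regardless of the weights.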
The system prompt tax is real: The identity files started at ~10K tokens and were cut to ~2,800 tokens by removing anything the agent could look up on demand. The rule: if the agent needs something occasionally, put it in memory; if it needs it every message, put it in the system prompt.
Local embeddings changed the economics: Using nomic-embed-text on Ollama alongside the conversation model makes every memory store and recall operation free, eliminating costs that previously accumulated from OpenAI embedding requests.
The model ladder matters more than the default model: The agent defaults to local qwen for conversation (free, fast) but can escalate to Minimax, Kimi, Haiku, Sonnet, or Opus depending on task requirements. The key insight: let humans switch models manually with commands like /model sonnet for reasoning tasks and /model qwen for chatting, rather than trying to auto-detect.
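A manual switch of this kind needs very little code. The sketch below is an assumption about how such a command could be parsed, not the agent's actual (Rust) implementation: only /model sonnet and /model qwen appear in the source, so the other short names in the ladder and the handle_message function are hypothetical.

```python
# Cheapest to most capable; names other than "qwen" and "sonnet" are assumed.
MODEL_LADDER = ["qwen", "minimax", "kimi", "haiku", "sonnet", "opus"]


def handle_message(text, current_model):
    """Return (model_to_use, text_to_send) for one incoming message.

    A message starting with '/model <name>' switches the active model;
    anything else is passed through unchanged on the current model.
    """
    parts = text.strip().split(maxsplit=2)
    if not parts or parts[0] != "/model":
        return current_model, text.strip()
    if len(parts) >= 2 and parts[1].lower() in MODEL_LADDER:
        remainder = parts[2] if len(parts) > 2 else ""
        return parts[1].lower(), remainder
    # Unknown or missing model name: keep the current model, send nothing.
    return current_model, ""


model, prompt = handle_message("/model sonnet", "qwen")
```

Keeping the switch explicit means the model choice is always predictable, which is the trade-off the developer preferred over automatic task detection.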
Tool iteration limits need headroom: The initial limit of 10 tool calls per message proved insufficient. Simple tasks burn 3-5 tool calls, while complex tasks need 15-20. The current setup allows 25 tool calls per message, with a rate limit of 200 actions per hour as a safety net.
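The two limits can be combined in one small guard. The following is a sketch under stated assumptions: the ToolBudget class is hypothetical, each tool call is assumed to count as one "action", and the hourly limit is modeled as a sliding one-hour window, which the source does not specify.

```python
from collections import deque


class ToolBudget:
    """Per-message tool-call cap plus a sliding one-hour action limit."""

    def __init__(self, max_calls_per_message=25, max_actions_per_hour=200):
        self.max_calls_per_message = max_calls_per_message
        self.max_actions_per_hour = max_actions_per_hour
        self.calls_this_message = 0
        self.recent = deque()  # timestamps (seconds) of allowed calls

    def start_message(self):
        """Reset the per-message counter when a new user message arrives."""
        self.calls_this_message = 0

    def try_call(self, now):
        """Return True and record the call if both limits allow it."""
        # Drop timestamps that have aged out of the one-hour window.
        while self.recent and now - self.recent[0] >= 3600:
            self.recent.popleft()
        if self.calls_this_message >= self.max_calls_per_message:
            return False
        if len(self.recent) >= self.max_actions_per_hour:
            return False
        self.calls_this_message += 1
        self.recent.append(now)
        return True


budget = ToolBudget()
budget.start_message()
results = [budget.try_call(now=0.0) for _ in range(26)]  # 26th call is refused
```

Passing the timestamp in explicitly (rather than reading the clock inside the class) keeps the guard easy to test; a real agent would pass something like time.monotonic().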
The hardest bug was cross-session memory: Memories stored explicitly via a store tool were saved without a session_id, while recall queries filtered on the current session_id. As a result, deliberately memorized facts were invisible in future sessions. The fix was adding OR session_id IS NULL to the SQL query.
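The bug and its fix can be reproduced with an in-memory SQLite database. The table layout, column names other than session_id, and the recall helper below are assumptions made for illustration; only the session_id filter and the added OR session_id IS NULL clause come from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: explicitly stored facts have session_id = NULL.
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, session_id TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO memories (session_id, content) VALUES (?, ?)",
    [
        ("session-a", "chat turn from an earlier session"),
        ("session-b", "chat turn from the current session"),
        (None, "explicitly stored fact"),
    ],
)


def recall(conn, session_id, include_global):
    """Fetch memories for a session, optionally including NULL-session rows."""
    if include_global:
        # Fixed query: session-less memories are visible from every session.
        where = "(session_id = ? OR session_id IS NULL)"
    else:
        # Original query: 'session_id = ?' never matches NULL in SQL.
        where = "session_id = ?"
    sql = f"SELECT content FROM memories WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, (session_id,))]


before_fix = recall(conn, "session-b", include_global=False)
after_fix = recall(conn, "session-b", include_global=True)
```

The parentheses around the OR condition are worth keeping: if the real query has further AND filters, leaving them out would change which rows those filters apply to.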
📖 Read the full source: r/LocalLLaMA