Local RAG Tool Built with Nemotron Nano 9B v2 and vLLM Tool Calling

Technical Implementation Details
A developer has shared their approach to building a local-first RAG research tool that runs entirely on a single GPU. The entire backend is contained in a single app.py file.
Stack and Configuration
The tool uses Nemotron Nano 9B v2 Japanese on vLLM with FP16 quantization, running on an RTX 5090 GPU. The backend combines FastAPI + SQLite FTS5 + Jinja2. For tool calling, the developer uses NVIDIA's official parser plugins, specifically --tool-call-parser nemotron_json and --tool-parser-plugin, noting that Nemotron v2 requires custom parser plugins rather than the built-in vLLM parsers (which are for v3).
Key Design Decisions
The system implements an extract → execute two-step flow:
- When a question is asked, the system first extracts bilingual keywords (English and Japanese) via LLM
- Runs FTS5 search on local sources AND DuckDuckGo web search in parallel
- Shows results with checkboxes for user selection
- Only after user selection does it generate the final response
This approach avoids dumping 100k+ tokens of context and hoping the model figures it out.
Performance and Features
- Tool Calling: The model autonomously decides when to search the web, working surprisingly well at temperature 0.1
- Prefix Cache Warmup: Instead of caching everything at source load, the KV cache is warmed up when the user sees the source preview. By the time they click Execute, the prefix is already cached using
--enable-prefix-cachingon vLLM - Bilingual FTS5 Search: User query → Nemotron extracts keywords in both English and Japanese → OR-joined FTS5 MATCH query, effective for multilingual patent/research data
Performance Numbers
- ~80-120 tok/s output
- 8192 max tokens
- Source extraction: ~3-5s (keyword extraction + FTS5 + DDG parallel)
- Full response with 5 sources + 3 web results: ~50s for a detailed answer on RTX 5090
Setup and Source
The source code is available at https://github.com/soy-tuber/SoyLM. It's a single file application that can be installed with uv pip install -r requirements.txt. Note that it requires vLLM with the Nemotron parser plugins separately.
📖 Read the full source: r/LocalLLaMA
👀 See Also

PgAdmin 4 9.13 Adds AI Assistant Panel to Query Tool
PgAdmin 4 version 9.13 introduces an AI Assistant panel in the Query Tool that can generate SQL from natural language when AI is configured. The update also includes a Workspace layout for distraction-free query editing and ad-hoc server connections.

yoyo: Local MCP Server for Grounded Codebase Reads and Guarded Writes with Claude Code
yoyo is an open-source local MCP server that provides coding agents like Claude Code with grounded repository reads and guarded writes across 16 languages, including Rust, Go, Python, and TypeScript. It prevents broken edits from silently landing by returning machine-readable guard_failure output and enabling retry_plan for targeted repairs.

Two Patterns for Preventing AI Agent Memory Rot: AutoDream and Skeptical Retrieval
OpenClaw introduces two MIT-licensed patterns to address file-based AI memory rot: AutoDream for nightly memory consolidation and Skeptical Retrieval for decay-weighted memory scoring. Both work together in a self-improving loop to keep agent context current.

Nelson v2.2.3 Released: Multi-Agent Coordination for Claude Code, Plus a Discrete-Event Simulation Benchmark
Nelson v2.2.3 ships a multi-agent coordination skill for Claude Code using a naval metaphor. A 13-configuration benchmark shows opus-4-7 with thinking dominates; skill choice is a smaller delta.