Build an Agentic RAG for Obsidian with Claude: Eval Harness Cuts Hallucinations

A developer on r/ClaudeAI built an agentic RAG system over their Obsidian vault so that Claude could answer questions from engineering PDFs without burning through the weekly token limit. The workflow: convert the PDFs to markdown, drop them into the vault, have a cheap agent (Kimi K2.5) run BM25 retrieval over it, and show Claude only the relevant chunks instead of whole books. This cut the token cost per question from ~50k to ~5k.
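The retrieval step can be sketched as follows. This is a minimal, dependency-free illustration of BM25 ranking over markdown chunks, not the author's code: the function names (`tokenize`, `bm25_rank`), the tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed, very simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, chunks, k1=1.5, b=0.75, top_k=3):
    """Return up to top_k (chunk_index, score) pairs, best first.

    Chunks with a score of zero (no query term present) are dropped,
    so the downstream model only sees chunks with lexical overlap.
    """
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, i))

    # Sort by descending score, breaking ties by original chunk order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, s) for s, i in scored[:top_k] if s > 0]


if __name__ == "__main__":
    vault_chunks = [
        "Bolt torque specifications for flange joints",
        "Weld inspection procedure and acceptance criteria",
        "Torque wrench calibration schedule",
    ]
    for index, score in bm25_rank("weld inspection", vault_chunks):
        print(index, round(score, 3), vault_chunks[index])
```

Only the chunks returned by a function like this would be passed to Claude, which is where the token saving comes from.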

The new problem: the agent was sometimes confidently wrong, for example saying "Marcus Aurelius wrote about death in Book IX section 3" when the canonical passage was in Book IV section 5. Such answers were plausible enough that each one needed manual verification. So the developer built an eval harness using Claude Sonnet 4.6 as the LLM judge, deliberately chosen from a different model family than the Kimi agent so that the agent's model would not be grading its own output.

The initial rubric had four buckets, including a 0.7 for "thin but not wrong." On hand-grading, the human grader (the same developer, grading blind on a different day) also collapsed everything borderline into 0.7, so the agreement number looked respectable but was actually measuring shared bias. After four rubric iterations, the working version collapsed the middle bucket entirely and added a 0.9 bucket for one specific case: "right answer, wrong chunk." Previously this case produced either a false positive (a 1.0 papering over a retrieval miss) or a false negative (a 0.4 punishing a correct answer). The split fixed it.
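The revised bucket logic can be sketched as a small scoring function. This is a hedged illustration only: the summary names the 1.0, 0.9, and 0.4 scores, but the exact meaning of 0.4 and any remaining buckets are not described, so treating 0.4 as the score for an incorrect answer is an assumption, and `rubric_score` with its two boolean inputs is a hypothetical interface.

```python
def rubric_score(answer_correct: bool, cited_chunk_correct: bool) -> float:
    """Map a judged row to a rubric bucket.

    Assumed buckets (only the 0.9 case is described in detail in the source):
      1.0  answer is right and the cited chunk supports it
      0.9  "right answer, wrong chunk": correct, but retrieval missed
      0.4  answer is not correct (assumed meaning of the low bucket)
    There is deliberately no 0.7 "thin but not wrong" middle bucket.
    """
    if answer_correct and cited_chunk_correct:
        return 1.0
    if answer_correct:
        # Keeps the retrieval miss visible without punishing a correct answer.
        return 0.9
    return 0.4


if __name__ == "__main__":
    print(rubric_score(True, True))    # answer right, chunk right
    print(rubric_score(True, False))   # answer right, chunk wrong
    print(rubric_score(False, False))  # answer wrong
```

Separating the 0.9 case means a retrieval miss shows up in the scores even when the final answer happens to be right.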

Under the new rubric, judge agreement with the human grader on 18 rows went from 7/18 (39%) to 17/18 (94%). Caveats: 18 rows is a small sample; there was a single grader, so inter-grader reliability is not established; and BM25 isn't novel, though it works well for technical and literary corpora where query and document vocabulary overlap is high. One negative result: the same chunking technique that lifted one corpus by 33pp regressed another by 17pp on the same eval, and the harness caught it on the first run.
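The agreement figures are simple exact-match rates between judge and human scores on the same rows. A minimal sketch, assuming both sets of scores are stored as parallel lists of rubric values (the `agreement` helper and the example score lists are hypothetical, constructed only to reproduce the reported 7/18 and 17/18 counts):

```python
def agreement(judge_scores, human_scores, tol=1e-9):
    """Return (matches, total) for rows where judge and human picked the same bucket."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human must score the same rows")
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tol
    )
    return matches, len(judge_scores)


def as_percent(matches, total):
    """Agreement as a whole-number percentage."""
    return round(100 * matches / total)


if __name__ == "__main__":
    # Illustrative data only: 18 rows, with the judge disagreeing on one of them.
    human = [1.0] * 18
    judge = [1.0] * 17 + [0.9]
    matches, total = agreement(judge, human)
    print(f"{matches}/{total} = {as_percent(matches, total)}%")
    print(f"7/18 = {as_percent(7, 18)}%")
```

With only 18 rows, a single changed row moves the rate by about 5.6 percentage points, which is worth keeping in mind when reading the before and after numbers.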

The full writeup with the four-iteration rubric story, calibration worksheet, and negative-result note is on Medium. The author is curious about others using Claude Sonnet as judge for their RAG/agent setups, what rubric they landed on, and how they handle inter-grader reliability with a single human in the loop.

📖 Read the full source: r/ClaudeAI
