Building an Agentic RAG for Obsidian with Claude and an Eval Harness to Detect Hallucinations

A developer on r/ClaudeAI built an agentic RAG system over their Obsidian vault so that Claude could answer questions from engineering PDFs without burning through the weekly token limit. The workflow: convert the PDFs to markdown, drop them into an Obsidian vault, use a cheap agent (Kimi K2.5) to run BM25 retrieval over the vault, and show Claude only the relevant chunks instead of whole books. This cut the token cost per question from ~50k to ~5k.
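To make the retrieval step concrete, here is a minimal sketch of BM25 ranking over a handful of text chunks. It is not the author's implementation: the function names (`tokenize`, `bm25_rank`, `top_chunks`), the tokenizer, the sample chunks, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Return (score, chunk_index) pairs sorted from best to worst match."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0

    # Document frequency: in how many chunks does each term appear?
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = set(tokenize(query))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, i))

    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored


def top_chunks(query, chunks, k=3):
    """Keep only the k best chunks that matched at least one query term."""
    ranked = bm25_rank(query, chunks)
    return [chunks[i] for score, i in ranked[:k] if score > 0]


# Hypothetical chunks standing in for sections of a converted PDF.
chunks = [
    "Bolt torque values for flange joints",
    "Weld inspection procedure and acceptance criteria",
    "Flange gasket selection guide",
]
relevant = top_chunks("bolt torque", chunks, k=2)
```

Only the chunks returned by `top_chunks` would be passed on to the larger model, which is where the reported token savings would come from.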
The new problem: the agent was sometimes confidently wrong, for example saying "Marcus Aurelius wrote about death in Book IX section 3" when the canonical passage was in Book IV section 5. The errors were plausible enough that answers needed manual verification. So the developer built an eval harness with Claude Sonnet 4.6 as the LLM judge, deliberately chosen from a different model family than the Kimi agent so that no model grades its own output.
The initial rubric had four buckets, including a 0.7 bucket for "thin but not wrong." When hand-grading, the human grader (the same developer, grading blind on a different day) also put every borderline case into 0.7, so the agreement number looked respectable but was actually measuring shared bias. After four rubric iterations, the working version removed the middle bucket entirely and added a 0.9 bucket for one specific case: "right answer, wrong chunk." Previously this case produced either a false positive (a 1.0 papering over a retrieval miss) or a false negative (a 0.4 punishing a correct answer). Splitting it out fixed that.
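The "right answer, wrong chunk" split can be sketched as a small scoring function. Only the 1.0 and 0.9 buckets below follow the description above; the summary does not define the remaining buckets, so the single 0.0 fallback for a wrong answer is a placeholder assumption, as are the function and parameter names.

```python
def rubric_score(answer_correct: bool, cited_chunk_correct: bool) -> float:
    """Map a judged row to a rubric bucket (illustrative sketch, not the author's rubric)."""
    if answer_correct and cited_chunk_correct:
        # Correct answer grounded in the correct retrieved chunk.
        return 1.0
    if answer_correct:
        # Right answer, wrong chunk: credit the answer but flag the retrieval miss.
        return 0.9
    # Assumption: every wrong answer falls into one placeholder bucket here.
    return 0.0


# A correct answer with a retrieval miss no longer has to be forced into 1.0 or 0.4.
score = rubric_score(answer_correct=True, cited_chunk_correct=False)
```

Keeping the retrieval miss in its own bucket means an aggregate score can still be filtered for 0.9 rows when debugging the retriever.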
Under the new rubric, judge agreement with the human grader on 18 rows went from 7/18 (39%) to 17/18 (94%). Caveats: 18 rows is a small sample; there was a single grader, so inter-grader reliability is not established; and BM25 isn't novel, though it works well for technical and literary corpora where query and document vocabulary overlap is high. One negative result: the same chunking technique that lifted one corpus by 33pp regressed another by 17pp on the same eval, and the harness caught it on the first run.
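The agreement figures are plain exact-match rates between the judge's bucket and the human's bucket on each row. A minimal sketch of that calculation follows; the function name and the three-row example data are hypothetical, and exact-match agreement is an assumption about how the numbers were computed.

```python
def agreement_rate(judge_scores, human_scores):
    """Return (matches, total, fraction) for row-by-row exact bucket agreement."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must have the same length")
    if not judge_scores:
        raise ValueError("need at least one graded row")
    matches = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    total = len(judge_scores)
    return matches, total, matches / total


# Hypothetical three-row example: judge and human disagree on the last row.
matches, total, fraction = agreement_rate([1.0, 0.9, 0.0], [1.0, 0.9, 1.0])

# The reported before/after figures, formatted as whole percentages.
before = f"{7 / 18:.0%}"
after = f"{17 / 18:.0%}"
```

With only 18 rows, a single row changes the rate by about 5.6 percentage points, which is one reason the small-sample caveat matters.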
The full writeup with the four-iteration rubric story, calibration worksheet, and negative-result note is on Medium. The author is curious about others using Claude Sonnet as judge for their RAG/agent setups, what rubric they landed on, and how they handle inter-grader reliability with a single human in the loop.
📖 Read the full source: r/ClaudeAI