Per-File LLM Graphs vs Vector Embeddings for Code Retrieval

A year-long experiment building a code indexing system for AI coding tools yielded clear results: vector embeddings on code chunks and Tree-sitter AST parsing both have critical flaws, while per-file LLM analysis stored in a Neo4j graph with semantic fulltext search works best. The findings echo recent papers like RepoGraph (ICLR 2025) and Code-Craft.

Approaches tested

Vector embeddings on code chunks – discarded entirely. A function named process() in a payments service and one in an image pipeline embed to similar vectors even though they have nothing to do with each other. Vectors also flatten away structural relationships such as call graphs, inheritance, and imports. Retrieval precision was unacceptable.
Tree-sitter AST parsing – precise and fast, but structural-only. It can tell you that a function exists and what it calls, but not that the function handles webhook retries for failed Stripe payments. It falls short when developers phrase questions in business language.
Per-file LLM analysis → graph – works. Every file gets an LLM call generating purpose, summary, and businessContext, stored as nodes in Neo4j with edges to classes, functions, keywords, and imports. Retrieval uses fulltext search across those semantic fields instead of vector similarity. SHA-256 diffing limits reindexing to changed files, making the upfront cost manageable.
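
The diff-aware reindexing and field-based retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the function names, the FileNode record, and the in-memory term-overlap search are all hypothetical, and the search is only a stand-in for Neo4j's fulltext index. The per-file LLM call is omitted; the sketch assumes its output (purpose, summary, businessContext) is already available.

```python
import hashlib
from dataclasses import dataclass


def sha256_of(content: bytes) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def files_to_reindex(files: dict[str, bytes], stored_hashes: dict[str, str]) -> list[str]:
    """Paths that are new or whose content hash differs from the stored one.

    Only these files would need a fresh LLM analysis call.
    """
    return sorted(
        path
        for path, content in files.items()
        if stored_hashes.get(path) != sha256_of(content)
    )


def deleted_files(files: dict[str, bytes], stored_hashes: dict[str, str]) -> list[str]:
    """Paths that were indexed before but no longer exist in the repository."""
    return sorted(set(stored_hashes) - set(files))


@dataclass
class FileNode:
    """Hypothetical record for one file's LLM-generated semantic fields."""
    path: str
    purpose: str
    summary: str
    business_context: str


def search(nodes: list[FileNode], query: str) -> list[str]:
    """Rank files by how many query terms appear in their semantic fields.

    A naive stand-in for a fulltext index: no stemming, no relevance weighting.
    """
    terms = set(query.lower().split())
    scored = []
    for node in nodes:
        text = f"{node.purpose} {node.summary} {node.business_context}".lower()
        score = len(terms & set(text.split()))
        if score:
            scored.append((score, node.path))
    # Highest score first; ties broken alphabetically by path.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored]
```

In this sketch a file is reanalyzed only when its digest changes, so an unchanged repository triggers no new LLM calls. The search matches against the business-language descriptions rather than the code itself, which is the property the approach relies on.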

Benchmarks from literature

RepoGraph (ICLR 2025) reported a 32.8% improvement on SWE-bench from graph-based approaches. Code-Craft reported an 82% gain in top-1 retrieval precision using bottom-up LLM summaries built from code graphs.

Comparison to existing tools

The team published a side-by-side in comparison.md. Key differences:

Bytebell: per-file LLM → purpose + summary + businessContext + entities; Neo4j + MongoDB storage; SHA-256 diff-aware reindex.
PageIndex: TOC reasoning tree for long PDFs/docs; no code-specific semantics.
GitNexus: Tree-sitter AST + community detection; optional per-symbol semantics; uses LadybugDB.
GraphRAG: per-chunk LLM entities + community clustering for general text, not code.
Sourcegraph/Cody: LSIF/SCIP search index; no per-node semantics; deployment is self-hosted or SaaS.
Augment: proprietary semantic index with embeddings; SaaS-only; managed continuous indexing.