Knowledge Follows Code Paths
Why we built code structure awareness into Surchin's retrieval pipeline, and what it means for teams accumulating developer knowledge at scale.
Most knowledge retrieval systems work the same way: embed the query, find similar content, return results. It works well for document search. It works less well for codebases.
The problem is that code has structure that text doesn't. A pitfall discovered in session.ts matters when you're editing login.ts — not because those files share vocabulary, but because one imports the other. The relationship is structural, not semantic. An embedding model can't see it.
We've been thinking about this gap for a while. Surchin has always used file paths and symbol names as locality signals in its scoring pipeline. But those are flat metadata — they tell you where an insight was deposited, not what that code connects to. A developer working in auth/login.ts would see insights tagged to that file, but miss the pitfall their teammate deposited about auth/session.ts three weeks ago — even though login.ts imports directly from it.
What Changed
Surchin now understands the import and call graph of your codebase. When you query for insights, the retrieval pipeline traces the structural connections from the files you're working on — upstream (who depends on this?) and downstream (what does this depend on?) — and surfaces relevant knowledge from across that blast radius: the set of connected files your change could plausibly affect.
This isn't a separate tool or an extra step. The existing query_insights flow handles it transparently. If a local code index is available, the file context you provide is automatically expanded to include structurally connected code before the search runs. If no index exists, everything works exactly as before.
The result is a subtle but meaningful improvement: insights don't just match what you're looking at — they match what your change might affect.
How It Works, Briefly
The indexer runs locally. It builds a lightweight graph of which files import from which other files, and which symbols call which other symbols. The graph is stored in a local SQLite database — no source code leaves your machine.
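Concretely, the local store can be as simple as two edge tables. The schema below is an illustrative sketch (table and column names are assumptions, not Surchin's actual layout):

```python
import sqlite3

# Hypothetical schema for the local index: one table for file-level import
# edges, one for symbol-level call edges. Names are illustrative only.
conn = sqlite3.connect(":memory:")  # the real index would live on disk
conn.executescript("""
CREATE TABLE file_imports (
    src TEXT NOT NULL,    -- importing file
    dst TEXT NOT NULL,    -- imported file
    PRIMARY KEY (src, dst)
);
CREATE TABLE symbol_calls (
    caller TEXT NOT NULL, -- e.g. 'auth/login.ts:handleLogin'
    callee TEXT NOT NULL, -- e.g. 'auth/session.ts:createSession'
    PRIMARY KEY (caller, callee)
);
CREATE INDEX idx_imports_dst ON file_imports (dst);  -- fast upstream lookups
""")

conn.execute("INSERT INTO file_imports VALUES (?, ?)",
             ("auth/login.ts", "auth/session.ts"))

# Downstream question: what does login.ts depend on?
deps = conn.execute(
    "SELECT dst FROM file_imports WHERE src = ?", ("auth/login.ts",)
).fetchall()
```

The reverse index on `dst` is what makes the upstream question ("who depends on this?") as cheap as the downstream one.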
On startup, the indexer works in waves. The first wave covers your immediate working directory and finishes in under a second. Broader waves fill in the rest of the repo in the background. By the time you're a few minutes into a session, the full graph is available.
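In sketch form, wave ordering means sorting files by how far they sit from the working directory. The distance metric below (shared path-prefix length) is a simplification I'm assuming for illustration, not the indexer's actual heuristic:

```python
from pathlib import PurePosixPath

def wave_order(files, working_dir):
    """Group files into indexing waves by distance from the working directory.

    Wave 0 is the working directory itself; later waves are progressively
    farther away. Distance here is based on shared path-prefix length,
    a simplification of whatever the real indexer uses.
    """
    wd = PurePosixPath(working_dir).parts

    def distance(path):
        parts = PurePosixPath(path).parent.parts
        common = 0
        for a, b in zip(parts, wd):
            if a != b:
                break
            common += 1
        # penalize both leaving the working dir and depth elsewhere
        return (len(wd) - common) + (len(parts) - common)

    waves = {}
    for f in files:
        waves.setdefault(distance(f), []).append(f)
    return [sorted(waves[d]) for d in sorted(waves)]
```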
When you query for insights, a BFS traversal walks the graph outward from your files. The files it finds are added to the locality context before scoring. Insights anchored to those connected files now have a path into your results — weighted by their relevance and strength, same as any other result.
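A minimal version of that traversal, assuming the import graph is already loaded as forward and reverse adjacency maps:

```python
from collections import deque

def blast_radius(seeds, imports, imported_by, max_hops=2):
    """BFS outward from the working files over the import graph.

    `imports` maps file -> set of files it imports (downstream);
    `imported_by` is the reverse map (upstream). Returns every file
    reachable within `max_hops`, excluding the seeds themselves.
    """
    seen = set(seeds)
    frontier = deque((f, 0) for f in seeds)
    reached = set()
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop budget exhausted along this path
        for nxt in imports.get(node, set()) | imported_by.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                frontier.append((nxt, hops + 1))
    return reached
```

The hop limit is the knob that keeps the blast radius from swallowing the whole repo; two hops already reaches transitive dependencies like a session module's token helpers.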
Results from the blast radius are presented separately from direct matches, clearly labeled. This is intentional. Expanded context is useful but should be weighed differently by the agent. We track helpful/unhelpful ratings for expanded results independently so we can measure whether the expansion is actually helping.
What We Learned Building This
Accuracy doesn't require a full AST. For building the import graph — which is what powers the BFS traversal — regex extraction is fast and sufficient. A simple pass over import statements gets you 90% of the graph at roughly one millisecond per file. We use tree-sitter for deeper analysis (symbols, call sites, class hierarchies) but only on demand, for files that actually appear in a blast radius result. Cold code stays regex-only. No wasted work.
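For illustration, a couple of regexes cover the common ES-module and Python `from ... import` forms (a real extractor needs a pattern set per language, and plain `import x` lines besides):

```python
import re

# Rough patterns for ES-module imports and Python `from` imports.
# A sketch: real extraction carries one pattern set per supported language.
_PATTERNS = [
    re.compile(r'''^\s*import\s+(?:[\w{}*,\s]+\s+from\s+)?['"]([^'"]+)['"]''',
               re.M),
    re.compile(r'^\s*from\s+([\w.]+)\s+import\b', re.M),
]

def extract_imports(source: str) -> list[str]:
    """Pull import targets out of a source file with regex alone (no AST)."""
    found = []
    for pat in _PATTERNS:
        found.extend(pat.findall(source))
    return found
```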
The graph is surprisingly shareable. The structural relationships in a codebase are the same for everyone on the same branch. When one team member indexes the repo, we upload a lightweight summary — just community labels and directory-level import edges, not source code — so others can skip re-indexing entirely. For a 100-person team, that's the difference between 100 indexing runs and one.
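The directory-level part of that summary is a straightforward aggregation: collapse file edges into directory edges and drop anything intra-directory. A sketch of the idea, not the actual upload format:

```python
from collections import Counter
from pathlib import PurePosixPath

def directory_edges(file_imports):
    """Collapse file-level import edges into directory-level edges.

    This is the shape of summary that can be shared across a team:
    it reveals module-to-module structure without carrying any source code.
    """
    counts = Counter()
    for src, dst in file_imports:
        src_dir = str(PurePosixPath(src).parent)
        dst_dir = str(PurePosixPath(dst).parent)
        if src_dir != dst_dir:  # intra-directory edges add no structure
            counts[(src_dir, dst_dir)] += 1
    return counts
```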
Community detection is useful beyond visualization. We run Louvain community detection on the import graph to identify functional clusters in the codebase. These clusters show up in the Knowledge Map on the dashboard, but they're also useful as metadata in the retrieval pipeline. When the blast radius crosses a community boundary, that's a signal worth surfacing — changes that ripple across functional areas tend to be the ones where accumulated knowledge matters most.
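Given community labels (produced in practice by a Louvain implementation over the import graph; assumed precomputed here), detecting a boundary crossing is a set comparison:

```python
def crossed_communities(seed_files, radius_files, community_of):
    """Flag blast-radius files that fall outside the seeds' communities.

    `community_of` maps file -> community label. The labels are assumed
    to come from community detection run over the import graph.
    """
    seed_communities = {community_of[f] for f in seed_files
                        if f in community_of}
    return {f for f in radius_files
            if f in community_of
            and community_of[f] not in seed_communities}
```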
Worktrees need special handling. Many teams using AI coding agents work extensively with git worktrees — isolated working copies for parallel tasks. We maintain a shared base index with per-worktree deltas, so switching between worktrees doesn't trigger a full re-index. A registry tracks which worktrees exist and cleans up stale ones automatically.
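One way to picture the base-plus-delta arrangement: a worktree's view of the graph is the shared base with a small per-worktree overlay applied on top. A sketch of the idea, not the actual storage format:

```python
def effective_imports(base, delta):
    """Resolve a worktree's view of the import graph.

    `base` is the shared index for the branch point; `delta` records only
    files the worktree has touched (None marks a deleted file). Nothing in
    `base` is copied or re-indexed when a new worktree is created.
    """
    merged = dict(base)
    for path, deps in delta.items():
        if deps is None:
            merged.pop(path, None)  # file removed in this worktree
        else:
            merged[path] = deps     # file added or modified
    return merged
```

Because the delta only grows with the worktree's own edits, switching worktrees costs a dictionary merge rather than a re-index.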
The Knowledge Map
We also shipped a new dashboard page: the Knowledge Map. It visualizes where knowledge has accumulated across your codebase — which modules have deep coverage, which have gaps, where pitfalls cluster.
Two views: a force-directed graph colored by functional community, and a treemap showing insight density by directory. Both are interactive. Click a node to see the insights anchored to that code.
This is available on all tiers, including free. The indexer runs locally and the data it surfaces comes from the insights your team has already deposited. We think making this visible is valuable regardless of plan level — it helps teams see what they know and, more importantly, what they don't.
What's Next
The current implementation covers TypeScript, JavaScript, Python, Go, Rust, and Java with full extraction support, plus fifteen additional languages with grammar-level support. We're looking at cross-repo impact analysis for monorepos with shared packages, and at surfacing knowledge map data through email digests so teams don't need to visit the dashboard to benefit from it.
The core bet is straightforward: the more your knowledge retrieval understands about how code connects, the more relevant the knowledge it surfaces. Embeddings find what's semantically similar. Code structure finds what's structurally related. Both matter. Now Surchin uses both.