perf(index): bulk-write the graph, parallelise parse, resolve in-memory (~12–15× faster) by prom3theu5 · Pull Request #8 · SimCubeLtd/synapse

prom3theu5 · 2026-06-02T10:29:10Z

Summary

Indexing is bound almost entirely by the store-write path, not parsing. Profiling aeontis-backend (119 files / 338 symbols) put the per-file write drain at ~83% of wall-clock, the parse at ~3%, and the resolve passes at ~11%. This PR attacks all three, with the write path being the dominant win.

Every change is verified to leave output byte-identical — symbol/edge counts and deterministic related JSON were checked at each step on real repos.

| Repo | baseline | this PR | speedup |
| --- | --- | --- | --- |
| aeontis-backend (119f / 338s / 648 ref edges) | 6.8s | 0.58s | ~12× |
| aeontis-new-reactnative (1593f / 10935s / 61547 ref edges) | 311s | 21.4s | ~15× |

Counts identical on both at every step; related --symbol <X> --json byte-identical across two re-indexes (determinism) and identical to the pre-change graph.

What changed (all in `src/indexer/mod.rs` + the store layer)

A — Resolve passes stop scanning the store per name. resolve_supertypes / resolve_references issued one unindexed symbols_matching (a full Symbol-table scan under the global mutex) per supertype and per distinct referenced name. Both now resolve against a single in-memory SymbolIndex built from one full scan — by_name_ci (case-insensitive) + by_file, each bucket sorted by (start_line, end_line, id) for deterministic selection. Semantics unchanged: case-insensitive match, same-file → same-project → all-candidates ambiguity policy, deterministic edge order, the no-declared-symbol false-positive guard.
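A minimal sketch of the index described in A, not the PR's actual code. The `Symbol` fields, the `project` string, and the `resolve` signature are assumptions for illustration; only the `by_name_ci` / `by_file` names, the `(start_line, end_line, id)` sort key, and the same-file, same-project, all-candidates order come from the description.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down symbol record (the real type has more fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Symbol {
    pub id: u64,
    pub name: String,
    pub file: String,
    pub project: String,
    pub start_line: u32,
    pub end_line: u32,
}

/// In-memory index built from one full scan of the symbol table.
#[derive(Default)]
pub struct SymbolIndex {
    by_name_ci: HashMap<String, Vec<Symbol>>,
    by_file: HashMap<String, Vec<Symbol>>,
}

impl SymbolIndex {
    pub fn build(symbols: Vec<Symbol>) -> Self {
        let mut idx = SymbolIndex::default();
        for s in symbols {
            idx.by_name_ci.entry(s.name.to_lowercase()).or_default().push(s.clone());
            idx.by_file.entry(s.file.clone()).or_default().push(s);
        }
        // Sort every bucket so selection never depends on scan order.
        for bucket in idx.by_name_ci.values_mut().chain(idx.by_file.values_mut()) {
            bucket.sort_by_key(|s| (s.start_line, s.end_line, s.id));
        }
        idx
    }

    /// All symbols declared in `file`, in deterministic order.
    pub fn symbols_in_file(&self, file: &str) -> &[Symbol] {
        self.by_file.get(file).map(|v| v.as_slice()).unwrap_or(&[])
    }

    /// Case-insensitive lookup: same file first, then same project, then all.
    pub fn resolve(&self, name: &str, from_file: &str, from_project: &str) -> Vec<&Symbol> {
        let Some(candidates) = self.by_name_ci.get(&name.to_lowercase()) else {
            return Vec::new(); // no declared symbol: no edge
        };
        let same_file: Vec<&Symbol> =
            candidates.iter().filter(|s| s.file == from_file).collect();
        if !same_file.is_empty() {
            return same_file;
        }
        let same_project: Vec<&Symbol> =
            candidates.iter().filter(|s| s.project == from_project).collect();
        if !same_project.is_empty() {
            return same_project;
        }
        candidates.iter().collect()
    }
}
```

Each lookup is a hash probe plus a filter over one small bucket, instead of a full table scan under the mutex per name.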

B — The per-file scan parses in parallel (rayon). read + blake3 + detect + 4× tree-sitter extract + manifest parse are pure, owned, Send work, so they run across rayon's pool (par_iter, order-preserving collect). A sequential drain in candidate order then does every store write, so symbol-insertion and pending_* order are byte-identical to the old loop. The store is never touched from a rayon thread (lbug Connection is &mut under one mutex — writes stay serial).
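The two-stage shape in B can be sketched as follows. To stay dependency-free this uses `std::thread::scope` as a stand-in for rayon's `par_iter` plus order-preserving `collect`; `Parsed`, `parse_file`, `parse_all`, and `drain` are hypothetical names, and the "store" is just a `Vec<String>`.

```rust
use std::thread;

/// Hypothetical parse result (the PR's `FileWork` carries symbols, imports, etc.).
#[derive(Debug, PartialEq, Eq)]
pub struct Parsed {
    pub path: String,
    pub symbol_count: usize,
}

/// Pure, owned work with no shared state, so it is safe on any thread.
fn parse_file(path: &str) -> Parsed {
    Parsed { path: path.to_string(), symbol_count: path.len() }
}

/// Stage 1: parse in parallel, returning results in candidate order.
pub fn parse_all(candidates: &[String], workers: usize) -> Vec<Parsed> {
    let workers = workers.max(1);
    let chunk_len = ((candidates.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Collect the handles first so every chunk is running before any join.
        let handles: Vec<_> = candidates
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|p| parse_file(p)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps the output aligned with `candidates`.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("parse worker panicked"))
            .collect()
    })
}

/// Stage 2: the only code that touches the store, on the calling thread.
pub fn drain(parsed: Vec<Parsed>, store: &mut Vec<String>) {
    for work in parsed {
        store.push(format!("{}:{}", work.path, work.symbol_count));
    }
}
```

Because the drain walks results in candidate order, insertion order is the same as in a purely sequential loop regardless of which worker finishes first.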

C — File + symbol node writes go in one transaction. Each upsert_symbol / link_file_declares_symbol used to be its own auto-committed statement (~2 per symbol). New additive GraphStore::write_files_batch writes a batch under one BEGIN/COMMIT (default trait impl falls back to per-file writes, so the in-memory test store is unchanged).
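A sketch of the additive trait method from C, assuming simplified signatures (the real methods presumably return `Result` and take richer types). `FileWrite` here is a trimmed-down stand-in and `MemoryStore` is a hypothetical test store that only logs calls.

```rust
/// Hypothetical, trimmed-down batch payload: one file and its declared symbols.
pub struct FileWrite {
    pub path: String,
    pub symbols: Vec<String>,
}

pub trait GraphStore {
    fn remove_file(&mut self, path: &str);
    fn upsert_file(&mut self, path: &str);
    fn upsert_symbol(&mut self, symbol: &str);
    fn link_file_declares_symbol(&mut self, path: &str, symbol: &str);

    /// Default: fall back to per-file writes in the original order.
    /// A transactional backend overrides this with one BEGIN/COMMIT.
    fn write_files_batch(&mut self, batch: &[FileWrite]) {
        for fw in batch {
            self.remove_file(&fw.path);
            self.upsert_file(&fw.path);
            for sym in &fw.symbols {
                self.upsert_symbol(sym);
                self.link_file_declares_symbol(&fw.path, sym);
            }
        }
    }
}

/// In-memory store that records every call; it inherits the default batch impl.
#[derive(Default)]
pub struct MemoryStore {
    pub log: Vec<String>,
}

impl GraphStore for MemoryStore {
    fn remove_file(&mut self, path: &str) {
        self.log.push(format!("remove {path}"));
    }
    fn upsert_file(&mut self, path: &str) {
        self.log.push(format!("upsert file {path}"));
    }
    fn upsert_symbol(&mut self, symbol: &str) {
        self.log.push(format!("upsert symbol {symbol}"));
    }
    fn link_file_declares_symbol(&mut self, path: &str, symbol: &str) {
        self.log.push(format!("link {path} -> {symbol}"));
    }
}
```

Keeping the batch method additive with a default body means existing implementors compile unchanged and only the backend that benefits from transactions needs new code.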

D — The batched writes use UNWIND $rows. write_files_batch and link_edges pass all rows of a kind as one list-of-structs parameter and MERGE them in a single execute (UNWIND $rows AS r ...). A 61k-edge or 11k-symbol batch becomes a handful of FFI calls instead of tens of thousands — the dominant win at scale (RN: 177s → 21.4s came almost entirely from this).
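To illustrate D, the sketch below groups edges by kind (preserving per-kind order) so that each kind needs one statement with one list parameter. The `EdgeKind` variants, relationship labels, node label, and property names in the query text are assumptions, not the project's schema, and no driver call is shown.

```rust
/// Hypothetical edge kinds; the discriminant order defines statement order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeKind {
    Inherits,
    Implements,
    References,
}

const KINDS: [EdgeKind; 3] = [EdgeKind::Inherits, EdgeKind::Implements, EdgeKind::References];

impl EdgeKind {
    /// Position of this kind in `KINDS`.
    fn kind_ix(self) -> usize {
        match self {
            EdgeKind::Inherits => 0,
            EdgeKind::Implements => 1,
            EdgeKind::References => 2,
        }
    }

    /// Assumed relationship label used in the statement text.
    fn label(self) -> &'static str {
        match self {
            EdgeKind::Inherits => "INHERITS",
            EdgeKind::Implements => "IMPLEMENTS",
            EdgeKind::References => "REFERENCES",
        }
    }
}

pub struct Edge {
    pub kind: EdgeKind,
    pub from: u64,
    pub to: u64,
}

/// One bucket of (from, to) rows per kind, in fixed kind order; empty kinds dropped.
pub fn group_by_kind(edges: &[Edge]) -> Vec<(EdgeKind, Vec<(u64, u64)>)> {
    let mut buckets: Vec<Vec<(u64, u64)>> = vec![Vec::new(); KINDS.len()];
    for e in edges {
        buckets[e.kind.kind_ix()].push((e.from, e.to));
    }
    KINDS
        .iter()
        .copied()
        .zip(buckets)
        .filter(|(_, rows)| !rows.is_empty())
        .collect()
}

/// Illustrative statement text: one execute per kind, all rows bound as `$rows`.
pub fn unwind_statement(kind: EdgeKind) -> String {
    format!(
        "UNWIND $rows AS r MATCH (a:Symbol {{id: r.from}}), (b:Symbol {{id: r.to}}) MERGE (a)-[:{}]->(b)",
        kind.label()
    )
}
```

With this shape the number of `execute` calls depends on the number of edge kinds, not the number of edges.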

Why writes aren't parallelised

lbug allows concurrent connections but only one write transaction at a time (per the LadybugDB transaction docs), so threading the write path wouldn't beat that ceiling — it'd just move the serialization point and add contention. The lever is fewer, bigger write operations (C + D), which is what this does.

Notes

New unconditional dependency: rayon.
No schema / trait-read / async changes beyond the additive write_files_batch.
Version bumped to 0.2.1 (perf-only, no API change).

Test plan

cargo fmt --check && cargo clippy --all-targets && cargo test — all green (91 tests). The resolution correctness guards (index_creates_inherits_and_implements_edges, index_creates_reference_edges_*, reference_ambiguous_name_links_all_candidates, reference_local_variable_creates_no_edge, reference_lookup_is_case_insensitive, reference_same_project_is_segment_safe) pass unchanged — proving identical semantics.
Real-repo before/after on aeontis-backend and aeontis-new-reactnative: identical status --json counts and deterministic, identical related output at every step.

Indexing was bound almost entirely by the store-write path. Profiling aeontis-backend (119 files / 338 symbols) put the per-file write drain at ~83% of wall-clock, the parse at ~3%, and the resolve passes at ~11%.

Four changes, each surgical and verified to leave output byte-identical (symbol/edge counts and deterministic `related` output checked at every step):

A. Resolve passes stop scanning the store per name. resolve_supertypes and resolve_references issued one unindexed `symbols_matching` (a full Symbol-table scan under the global mutex) per supertype and per distinct referenced name. Both now resolve against a single in-memory SymbolIndex built from one full scan: by_name_ci (case-insensitive) and by_file, each bucket sorted by (start_line, end_line, id) for deterministic from/child selection. Semantics unchanged (case-insensitive match, same-file -> same-project -> all ambiguity policy, deterministic edge order, the no-declared-symbol guard).

B. The per-file scan parses in parallel. read + blake3 + detect + the four tree-sitter extractors + the manifest parse are pure, owned, Send work, so they run across rayon's pool (par_iter, order-preserving collect). A sequential drain in candidate order then does every store write, so symbol-insertion and pending_* order are byte-identical to the old loop. The store is never touched from a rayon thread.

C. File + symbol node writes go in one transaction. Each upsert_symbol / link_file_declares_symbol was its own auto-committed statement (~2 per symbol). The new GraphStore::write_files_batch collects a file's nodes and writes them under one BEGIN/COMMIT (default impl falls back to per-file writes for the in-memory test store).

D. The batched writes use UNWIND $rows. write_files_batch and link_edges pass all rows of a kind as one list-of-structs parameter and MERGE them in a single `execute` (UNWIND $rows AS r ...), so a 61k-edge or 11k-symbol batch is a handful of FFI calls instead of tens of thousands. This was the dominant win at scale.

Measured (index --force):
aeontis-backend (119f / 338s / 648 ref edges): 6.8s -> 0.58s (~12x)
aeontis-rn (1593f / 10935s / 61547 edges): 311s -> 21.4s (~15x)

rayon is a new unconditional dependency. No schema, trait-read, or async changes beyond the additive write_files_batch method.

qodo-code-review · 2026-06-02T10:29:30Z

Review Summary by Qodo

Optimize indexing: parallel parse, in-memory resolve, batched writes

✨ Enhancement

Walkthroughs

Description

• Parallelize file parsing with rayon (parsing was ~3% of wall-clock time)
• Build in-memory symbol index to eliminate per-lookup store scans
• Batch file/symbol writes in single transaction via UNWIND
• Batch edge writes using UNWIND $rows, so a 61k-edge batch takes a handful of FFI calls instead of tens of thousands
• Achieves ~12–15× overall indexing speedup on real repos

Diagram

flowchart LR
  A["Parse candidates<br/>in parallel<br/>rayon"] --> B["Drain in order<br/>collect FileWork"]
  B --> C["Batch file+symbol<br/>writes in one<br/>transaction"]
  C --> D["Build SymbolIndex<br/>from one scan"]
  D --> E["Resolve supertypes<br/>& references<br/>against index"]
  E --> F["Batch edge writes<br/>via UNWIND"]
  F --> G["Result:<br/>12–15× faster"]

File Changes

1. src/indexer/mod.rs ✨ Enhancement +406/-188

Parallel parse, in-memory symbol index, deferred manifest writes

• Introduced parallel parse stage using rayon: parse_file extracts symbols, imports, supertypes,
 references, and manifests in parallel threads
• Added FileWork enum and IndexedFileWork struct to represent parsed file results
• Created ParseContext to share immutable parse inputs across rayon workers
• Implemented SymbolIndex struct with by_name_ci and by_file lookups to replace per-lookup
 store scans
• Refactored resolve_supertypes and resolve_references to use in-memory index instead of
 repeated symbols_matching calls
• Split manifest parsing into pure parse_csproj_manifest and parse_package_json_manifest
 functions
• Added ManifestWrite and ManifestOp types to defer manifest writes until sequential drain
• Changed main loop to accumulate file_writes and manifest_writes for batched store operations
• Added write_manifest function to replay parsed manifests in order

src/indexer/mod.rs

2. src/graph/ladybug_store.rs ✨ Enhancement +190/-31

Batch edge writes and add write_files_batch transaction

• Refactored link_edges to use UNWIND $rows instead of per-edge prepared statements
• Changed from executing one statement per edge to grouping edges by kind and executing one
 statement per kind with all rows as a list parameter
• Added kind_ix helper to group edges deterministically while preserving per-kind order
• Implemented new write_files_batch method with six stages: remove declared symbols, remove by
 filepath, remove file, upsert files, upsert symbols, link DECLARES edges
• All stages use UNWIND $rows with list-of-structs parameters inside a single transaction

src/graph/ladybug_store.rs

3. src/graph/model.rs ✨ Enhancement +17/-0

Add FileWrite struct for batched writes

• Added new FileWrite struct containing an IndexedFile and its declared IndexedSymbol vector
• Struct carries complete file node payload for batched write operations

src/graph/model.rs


4. src/graph/store.rs ✨ Enhancement +24/-2

Add write_files_batch trait method

• Added FileWrite to trait imports
• Implemented new write_files_batch method with default implementation that falls back to
 per-file/per-symbol operations
• Default impl preserves original ordering: remove → upsert file → upsert+link each symbol
• Trait documentation explains batching benefit for transaction-supporting backends

src/graph/store.rs

5. Cargo.toml Dependencies +2/-1

Add rayon dependency and bump version

• Bumped version from 0.2.0 to 0.2.1
• Added rayon = "1.10" dependency for parallel iteration

Cargo.toml

qodo-code-review · 2026-06-02T10:29:31Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)

1. Batch rows cloned unnecessarily 🐞 Bug ➹ Performance

Description

LadybugGraphStore::write_files_batch clones the entire Vec<Value> for each executed stage,
adding avoidable allocations/copies for large symbol batches. This undermines the PR’s performance
goals and is straightforward to eliminate by moving (not cloning) owned row vectors into
Value::List.

Code

src/graph/ladybug_store.rs[R585-588]

Evidence

The batched writer builds owned *_rows vectors, but then passes them to execute by cloning
(rows.clone()), which duplicates the row list in memory even when a list is only used once
(notably the symbol and declares stages).

src/graph/ladybug_store.rs[564-590]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`write_files_batch` executes each stage by passing `Value::List(child_ty, rows.clone())`, which clones the full row list even for the large, single-use lists (`symbol_rows`, `declares_rows`).

## Issue Context
This path is explicitly meant to reduce overhead at scale; the extra clones are avoidable.

## Fix Focus Areas
- src/graph/ladybug_store.rs[564-590]

## Suggested fix
Refactor stage execution so the `Vec<Value>` is moved into `Value::List` rather than cloned:
- Execute each stage directly (no `stages` array of `&Vec<_>`), passing `Value::List(child_ty, symbol_rows)` / `declares_rows` by move.
- For the shared `path_rows`, either:
 - build three separate small `Vec<Value>` lists (cheap), or
 - keep a helper that accepts `&[Value]` and internally clones only for those small stages.
This removes large clones without changing query semantics.
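A small sketch of the move-versus-clone point, under loud assumptions: `Value` below is a one-argument stand-in for the driver's list type (the real `Value::List` also takes a child type), and `execute` is a dummy that only counts rows. The stage names mirror the six stages listed for `write_files_batch`, reduced to the ones relevant here.

```rust
/// Hypothetical stand-in for the driver's value type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    List(Vec<Value>),
}

/// Dummy `execute`: consumes the parameter and reports how many rows it held.
pub fn execute(_query: &str, rows: Value) -> usize {
    match rows {
        Value::List(items) => items.len(),
        _ => 0,
    }
}

/// Runs the stages, cloning only the small shared list.
pub fn run_stages(
    path_rows: Vec<Value>,
    symbol_rows: Vec<Value>,
    declares_rows: Vec<Value>,
) -> usize {
    let mut rows_sent = 0;
    // `path_rows` is small (one row per file) and shared by three stages,
    // so cloning it here is cheap.
    for query in ["remove declared symbols", "remove by filepath", "remove file"] {
        rows_sent += execute(query, Value::List(path_rows.clone()));
    }
    // The large lists are used exactly once, so they are moved, not cloned.
    rows_sent += execute("upsert symbols", Value::List(symbol_rows));
    rows_sent += execute("link DECLARES", Value::List(declares_rows));
    rows_sent
}
```

Taking the vectors by value in the signature makes the single-use intent explicit: the compiler rejects any later reuse of `symbol_rows` or `declares_rows`.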


2. Parsing phase progress stalls 🐞 Bug ◔ Observability

Description

During the parallel parsing stage, index_repo emits only a single progress snapshot and then does
not update progress until parsing completes, which can make the CLI/progress UI appear stuck on
large repos. This is a behavioral regression from the prior per-file sequential loop (even if
intentional).

Code

src/indexer/mod.rs[R291-319]

Evidence

The code calls the progress callback once before starting par_iter().map(...).collect() and does
not invoke it again until the sequential drain loop, so there are no intermediate progress updates
during parsing.

src/indexer/mod.rs[285-343]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Progress reporting does not advance during the rayon `par_iter()` parsing stage; only a single snapshot with phase `"parsing files"` is emitted.

## Issue Context
The sequential drain still reports per-candidate progress, but users won’t see incremental progress while the potentially expensive parse stage runs.

## Fix Focus Areas
- src/indexer/mod.rs[285-343]

## Suggested fix
Consider adding coarse-grained progress for parsing without calling the callback from rayon threads (since callback thread-safety is unknown). Options:
- Chunk candidates and parse each chunk in parallel, emitting a progress snapshot between chunks.
- Or accumulate an atomic parsed-count from workers and have a lightweight periodic ticker on the main thread that reads it and calls `progress`.
Keep determinism by preserving `collect()` order; only the reporting cadence changes.
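The first option (chunked parsing with progress between chunks) might look like the sketch below. It uses `std::thread::scope` in place of rayon to stay dependency-free and spawns one thread per file for brevity; `parse_file` and `parse_with_progress` are hypothetical names, and the callback signature `(done, total)` is an assumption.

```rust
use std::thread;

/// Hypothetical pure parse step; returns a dummy per-file result.
fn parse_file(path: &str) -> usize {
    path.len()
}

/// Parses `candidates` chunk by chunk, reporting progress from the calling
/// thread only. Results stay in candidate order.
pub fn parse_with_progress<F: FnMut(usize, usize)>(
    candidates: &[String],
    chunk_len: usize,
    mut progress: F,
) -> Vec<usize> {
    let total = candidates.len();
    let mut out = Vec::with_capacity(total);
    for chunk in candidates.chunks(chunk_len.max(1)) {
        // Parse one chunk in parallel; join in spawn order to keep ordering.
        let parsed: Vec<usize> = thread::scope(|scope| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|p| scope.spawn(move || parse_file(p)))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("parse worker panicked"))
                .collect()
        });
        out.extend(parsed);
        // The callback never runs on a worker thread, so it need not be Sync.
        progress(out.len(), total);
    }
    out
}
```

The trade-off is a synchronisation barrier at each chunk boundary, so a larger `chunk_len` costs progress granularity and a smaller one costs parallel efficiency.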


3. Misleading manifest error comment 🐞 Bug ⚙ Maintainability

Description

The comment in parse_file says a manifest parse error aborts the index “exactly as the previous
inline ? did,” but the new two-stage design defers all store writes until after parsing, changing
when/what gets written on error. This can mislead future maintainers about failure semantics.

Code

src/indexer/mod.rs[R196-202]

Evidence

parse_file can return an error during stage 1; stage 2 immediately propagates it (work?) and the
store write (write_files_batch) only happens after the drain loop, so an error prevents any writes
in the new structure.

src/indexer/mod.rs[196-204]
src/indexer/mod.rs[316-381]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A comment claims manifest parse errors abort indexing exactly like the prior inline `?` behavior, but the new pipeline defers writes until after parsing/drain, so errors now abort before any `write_files_batch`/manifest writes occur.

## Issue Context
This is primarily a documentation/maintainer correctness issue.

## Fix Focus Areas
- src/indexer/mod.rs[196-204]
- src/indexer/mod.rs[316-381]

## Suggested fix
Rewrite the comment to describe the new behavior accurately (e.g., parse errors abort the run before the batched write stage), and optionally highlight that this is a more atomic failure mode than the prior write-as-you-go loop.
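The failure mode the comment should describe can be sketched in a few lines. All names here (`parse`, `index`, the `.bad` extension trigger, the `Vec<String>` store) are hypothetical; the point is only that the first parse error propagates before the batched write runs.

```rust
/// Hypothetical parse step that fails for files ending in `.bad`.
fn parse(path: &str) -> Result<String, String> {
    if path.ends_with(".bad") {
        Err(format!("cannot parse {path}"))
    } else {
        Ok(format!("node:{path}"))
    }
}

/// Two-stage index: parse everything, then write once.
pub fn index(paths: &[&str], store: &mut Vec<String>) -> Result<usize, String> {
    // Stage 1: parse every candidate (in parallel in the real pipeline).
    let parsed: Vec<Result<String, String>> = paths.iter().map(|p| parse(p)).collect();

    // Stage 2: drain in order; the first error aborts here via `?`.
    let mut batch = Vec::with_capacity(parsed.len());
    for work in parsed {
        batch.push(work?);
    }

    // Stage 3: the batched write is reached only if every parse succeeded.
    store.extend(batch);
    Ok(store.len())
}
```

In a write-as-you-go loop the files before the failing one would already be in the store; here the store is left untouched on error.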


prom3theu5 merged commit 695f2c1 into main Jun 2, 2026
1 check passed

prom3theu5 deleted the perf/index-resolve-and-parallel-parse branch June 2, 2026 10:29

prom3theu5 mentioned this pull request Jun 2, 2026

fix(index): live progress during parallel parse and resolve phases #9

Merged
