feat(scalar): implement CacheCodec for FM index #7347

Draft
wjones127 wants to merge 1 commit into lance-format:main from wjones127:feat/fmindex-cache-codec

@wjones127

Contributor

Previously the FM index ignored the session cache entirely: every open re-parsed each partition's metadata, and the only memoization was the in-memory Arc<dyn ScalarIndex>, which a serializable cache backend cannot persist. This implements CacheCodec for the FM index so an opened index survives in a node-agnostic, restart-surviving backend (the #7160 goal).

No on-disk format change — this is pure in-memory memoization layered over the existing read path, serialized through the stable LCE1 envelope.

What's cached

Two entry types, both CacheCodecImpl wired via sized CacheKeys:

  • FMIndexPartitionState: one per partition. The skeleton needed to rebuild a LazyFMIndex without re-reading metadata: Huffman codes, tree topology, c_table, row ids, doc starts, sampled SA, and per-node prefix ranks. Body: FmIndexStateHeader proto + raw blobs.
  • WaveletNodeWords — one per wavelet node. That node's bitvector words, loaded lazily and shared so neighbouring blocks come from a single read. Body: a single raw blob.
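
To make the "single raw blob" body concrete, here is a minimal round-trip sketch for a list of bitvector words. It is not the PR's codec: `NodeWordsSketch`, `encode`, and `decode` are hypothetical names, the little-endian `u64` layout is an assumption, and the LCE1 envelope that wraps the real body is left out entirely.

```rust
/// Hypothetical stand-in for a wavelet-node cache entry (not the PR's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeWordsSketch {
    words: Vec<u64>,
}

impl NodeWordsSketch {
    /// Serialize the words as one contiguous little-endian byte blob.
    /// The byte order is an assumption made for this sketch.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.words.len() * 8);
        for w in &self.words {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }

    /// Rebuild the entry from a blob. Returns `None` when the blob length
    /// is not a whole number of 8-byte words.
    fn decode(blob: &[u8]) -> Option<Self> {
        if blob.len() % 8 != 0 {
            return None;
        }
        let words = blob
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        Some(Self { words })
    }
}

fn main() {
    let entry = NodeWordsSketch {
        words: vec![0, u64::MAX, 0x0123_4567_89AB_CDEF],
    };
    let blob = entry.encode();
    assert_eq!(blob.len(), 24);
    // A truncated blob is rejected rather than silently padded.
    assert_eq!(NodeWordsSketch::decode(&blob[..5]), None);
    // The full blob round-trips to an equal entry.
    assert_eq!(NodeWordsSketch::decode(&blob), Some(entry));
}
```

A round-trip assertion of this shape is the kind of property the codec tests listed under Tests are described as checking.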

load_partition fetches the skeleton via get_or_insert_with_key, and LazyRankBitVec loads whole-node words through the cache. Each node has one shared Arc with contiguous global bit-indexing, which also simplified rank/get. The default in-memory Arc<dyn ScalarIndex> memoization remains the fast in-session layer; the sized entries are the durable layer that the load/query path consults.
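
The get-or-insert pattern and the contiguous bit-indexing described above can be sketched with the standard library alone. Everything here is illustrative: `SketchCache`, its `(partition, node)` key, and the free functions `get` and `rank1` are hypothetical and do not mirror the Lance cache API or `LazyRankBitVec`. The sketch also assumes LSB-first bit order within each `u64`, and it scans words linearly instead of using the per-node prefix ranks that the skeleton stores.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical minimal cache keyed by (partition id, node id).
struct SketchCache {
    entries: Mutex<HashMap<(u32, u32), Arc<Vec<u64>>>>,
}

impl SketchCache {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Return the cached words for `key`, or run `load` once and store the
    /// result. Holding the lock during the load is a simplification.
    fn get_or_insert_with<F>(&self, key: (u32, u32), load: F) -> Arc<Vec<u64>>
    where
        F: FnOnce() -> Vec<u64>,
    {
        let mut map = self.entries.lock().unwrap();
        map.entry(key).or_insert_with(|| Arc::new(load())).clone()
    }
}

/// Bit at global position `i`, LSB-first within each word (assumed order).
fn get(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}

/// Number of set bits in positions `[0, i)`.
fn rank1(words: &[u64], i: usize) -> usize {
    let full = i / 64;
    let mut count: usize = words[..full].iter().map(|w| w.count_ones() as usize).sum();
    let rem = i % 64;
    if rem > 0 {
        // Mask off the bits at or above position `rem` in the partial word.
        count += (words[full] & ((1u64 << rem) - 1)).count_ones() as usize;
    }
    count
}

fn main() {
    let cache = SketchCache::new();
    let mut loads = 0;
    let a = cache.get_or_insert_with((0, 3), || {
        loads += 1;
        vec![0b1011, u64::MAX]
    });
    let b = cache.get_or_insert_with((0, 3), || {
        loads += 1;
        Vec::new()
    });
    assert_eq!(loads, 1); // the second call is served from the cache
    assert!(Arc::ptr_eq(&a, &b)); // both callers share one allocation
    assert!(get(&a, 0) && !get(&a, 2));
    assert_eq!(rank1(&a, 4), 3);
    assert_eq!(rank1(&a, 70), 9);
}
```

Because every caller receives a clone of the same `Arc`, neighbouring blocks of a node read from one buffer, and a global bit position maps to a word with a single division.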

Granularity rationale

  • Skeleton entries are large and uniform (~hundreds of KB/partition), so one per partition.
  • Wavelet-node entries are bounded by node count (≤ ~511/partition), avoiding the small-entry explosion that motivated FTS posting-list grouping; full blocks are 32 KB by construction.
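
The sizes quoted above can be checked with a little arithmetic. The sketch below rests on two assumptions that the description does not state: that bitvector words are `u64`, and that the ~511 bound comes from a binary tree over a 256-symbol byte alphabet (a full binary tree with n leaves has 2n - 1 nodes). The constant and function names are illustrative only.

```rust
/// Block size stated in the description: 32 KB.
const BLOCK_BYTES: usize = 32 * 1024;
/// Assumed word type for the bitvector: u64 (8 bytes).
const WORD_BYTES: usize = std::mem::size_of::<u64>();
/// Words and bits held by one full block under that assumption.
const WORDS_PER_BLOCK: usize = BLOCK_BYTES / WORD_BYTES;
const BITS_PER_BLOCK: usize = BLOCK_BYTES * 8;

/// Node count of a full binary tree with `leaves` leaves (leaves >= 1).
const fn full_binary_tree_nodes(leaves: usize) -> usize {
    2 * leaves - 1
}

fn main() {
    assert_eq!(WORDS_PER_BLOCK, 4096);
    assert_eq!(BITS_PER_BLOCK, 262_144);
    // One plausible reading of the ~511 bound: 256 byte symbols as leaves.
    assert_eq!(full_binary_tree_nodes(256), 511);
    println!(
        "{} words/block, {} bits/block, {} nodes",
        WORDS_PER_BLOCK,
        BITS_PER_BLOCK,
        full_binary_tree_nodes(256)
    );
}
```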

Note: search prewarms all partitions per query, so block-level laziness yields no I/O saving on the search path today. The wins are that skeleton metadata reads are avoided and that blocks survive on a persistent backend.

Tests

Four new tests pass (codec round-trips for both entry types, real-cache populate-and-reuse with key-presence assertions, and an empty index through the cache), along with the existing 24 fmindex tests. cargo fmt and cargo clippy -p lance-index --tests -- -D warnings are clean.

Closes #7277

🤖 Generated with Claude Code

The FM index ignored the session cache entirely: every open re-parsed each
partition's metadata and the only memoization was the in-memory
`Arc<dyn ScalarIndex>`, which a serializable backend cannot persist. This adds
two cache entries that round-trip through the stable cache codec, so an opened
FM index survives in a node-agnostic, restart-surviving backend:

- `FMIndexPartitionState`: one per partition; the skeleton needed to rebuild a
  `LazyFMIndex` without re-reading metadata (Huffman codes, tree topology,
  c_table, row ids, doc starts, sampled SA, per-node prefix ranks).
- `WaveletNodeWords`: one per wavelet node; the node's bitvector words, loaded
  lazily and shared so neighbouring blocks come from a single read.

`load_partition` now fetches the skeleton via `get_or_insert_with_key`, and
`LazyRankBitVec` loads whole-node words through the cache. No on-disk format
change. The in-memory `Arc` memoization remains the fast in-session layer.

Closes lance-format#7277
@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 18, 2026
@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 90.34335% with 45 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/fmindex.rs | 90.34% | 4 Missing and 41 partials ⚠️ |


Development

Successfully merging this pull request may close these issues.

Implement CacheCodec for FM Index
