feat(scalar): implement CacheCodec for FM index by wjones127 · Pull Request #7347 · lance-format/lance

wjones127 · 2026-06-18T00:00:27Z

Previously the FM index ignored the session cache entirely: every open re-parsed each partition's metadata, and the only memoization was the in-memory Arc<dyn ScalarIndex>, which a serializable cache backend cannot persist. This implements CacheCodec for the FM index so an opened index survives in a node-agnostic, restart-surviving backend (the #7160 goal).

No on-disk format change — this is pure in-memory memoization layered over the existing read path, serialized through the stable LCE1 envelope.

What's cached

Two entry types, both CacheCodecImpl wired via sized CacheKeys:

- `FMIndexPartitionState`: one per partition. The skeleton needed to rebuild a `LazyFMIndex` without re-reading metadata: huffman codes, tree topology, c_table, row ids, doc starts, sampled SA, and per-node prefix ranks. Body: `FmIndexStateHeader` proto + raw blobs.
- `WaveletNodeWords`: one per wavelet node. That node's bitvector words, loaded lazily and shared so neighbouring blocks come from a single read. Body: a single raw blob.
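
To illustrate the "single raw blob" body, here is a minimal round-trip sketch. It is not the PR's `CacheCodecImpl` and does not model the LCE1 envelope; the type `NodeWords`, its methods, and the little-endian layout are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a per-node cache entry: the bitvector
/// words of one wavelet node. Not the actual `WaveletNodeWords` type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeWords {
    pub words: Vec<u64>,
}

impl NodeWords {
    /// Serialize as one raw blob: each word as 8 little-endian bytes
    /// (the byte order is an assumption of this sketch).
    pub fn to_blob(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.words.len() * 8);
        for w in &self.words {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }

    /// Rebuild the entry from a blob, rejecting any length that is
    /// not a whole number of 8-byte words.
    pub fn from_blob(blob: &[u8]) -> Result<Self, String> {
        if blob.len() % 8 != 0 {
            return Err(format!("blob length {} is not a multiple of 8", blob.len()));
        }
        let words = blob
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        Ok(Self { words })
    }
}
```

A codec round-trip test of the kind listed under Tests would then assert that `from_blob(&entry.to_blob())` returns a value equal to `entry`.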

load_partition fetches the skeleton via get_or_insert_with_key, and LazyRankBitVec loads whole-node words through the cache (one shared Arc per node, with contiguous global bit-indexing — which also simplified rank/get). The default in-memory Arc<dyn ScalarIndex> memoization remains the fast in-session layer; the sized entries are the durable layer the load/query path consults.
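
The get-or-insert pattern and the contiguous bit-indexing described above can be sketched as follows. This is a toy model under stated assumptions: `NodeCache`, `NodeKey`, and `rank1` are hypothetical names, the cache is a plain `Mutex<HashMap>` rather than Lance's session cache, and the rank is a linear scan rather than one accelerated by per-node prefix ranks.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache key: (partition id, wavelet node id).
type NodeKey = (u32, u32);

/// Toy in-memory cache standing in for the session cache.
#[derive(Default)]
pub struct NodeCache {
    entries: Mutex<HashMap<NodeKey, Arc<Vec<u64>>>>,
}

impl NodeCache {
    /// Return the cached words for `key`, running `load` only on a miss.
    /// Every caller gets a clone of the same `Arc`, so a node's words
    /// are read once and shared. The lock is held while loading, which
    /// keeps the sketch simple but would serialize real I/O.
    pub fn get_or_insert_with<F>(&self, key: NodeKey, load: F) -> Arc<Vec<u64>>
    where
        F: FnOnce() -> Vec<u64>,
    {
        let mut map = self.entries.lock().unwrap();
        map.entry(key).or_insert_with(|| Arc::new(load())).clone()
    }
}

/// Count set bits in positions [0, pos) of a node's bitvector, where
/// global bit i lives in word i / 64 at offset i % 64.
/// Panics if `pos` exceeds the number of bits in `words`.
pub fn rank1(words: &[u64], pos: usize) -> usize {
    let full = pos / 64;
    let rem = pos % 64;
    let mut count: usize = words[..full].iter().map(|w| w.count_ones() as usize).sum();
    if rem > 0 {
        let mask = (1u64 << rem) - 1;
        count += (words[full] & mask).count_ones() as usize;
    }
    count
}
```

With whole-node words behind one `Arc`, a rank query needs only a global bit position and no per-block offset translation, which is one way to read the "simplified rank/get" remark.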

Granularity rationale

- Skeleton entries are large and uniform (~hundreds of KB/partition), so one per partition.
- Wavelet-node entries are bounded by node count (≤ ~511/partition), avoiding the small-entry explosion that motivated FTS posting-list grouping; full blocks are 32 KB by construction.
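
The numbers above can be checked with a little arithmetic. The sketch below assumes that "32 KB" means 32 × 1024 bytes and that the ~511 bound comes from a binary tree over a 256-symbol byte alphabet (2 × 256 − 1 nodes); neither assumption is confirmed by the PR text, and all names are hypothetical.

```rust
/// Assumed block size in bytes (32 KiB).
pub const BLOCK_BYTES: usize = 32 * 1024;
/// 64-bit words held by one full block.
pub const WORDS_PER_BLOCK: usize = BLOCK_BYTES / 8;
/// Bits held by one full block.
pub const BITS_PER_BLOCK: usize = BLOCK_BYTES * 8;

/// Total node count (leaves plus internal nodes) of a full binary
/// tree with `leaves` leaves; zero leaves means an empty tree.
pub const fn max_tree_nodes(leaves: usize) -> usize {
    if leaves == 0 {
        0
    } else {
        2 * leaves - 1
    }
}

/// Number of blocks needed to hold `bits` bits, rounding up.
pub const fn blocks_for_bits(bits: usize) -> usize {
    (bits + BITS_PER_BLOCK - 1) / BITS_PER_BLOCK
}
```

Under these assumptions a full block holds 4,096 words (262,144 bits), and a partition contributes at most 511 wavelet-node entries regardless of how many blocks each node spans.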

Note: search prewarms all partitions per query, so block-level laziness yields no I/O saving on the search path today — the wins are skeleton metadata-read avoidance and block survival on a persistent backend.

Tests

4 new tests (codec round-trips for both entry types, real-cache populate-and-reuse with key-presence assertions, empty-index through cache) plus the existing 24 fmindex tests pass. cargo fmt and cargo clippy -p lance-index --tests -- -D warnings clean.

Closes #7277

🤖 Generated with Claude Code

The FM index ignored the session cache entirely: every open re-parsed each partition's metadata and the only memoization was the in-memory `Arc<dyn ScalarIndex>`, which a serializable backend cannot persist.

This adds two cache entries that round-trip through the stable cache codec, so an opened FM index survives in a node-agnostic, restart-surviving backend:

- `FMIndexPartitionState`: one per partition; the skeleton needed to rebuild a `LazyFMIndex` without re-reading metadata (huffman codes, tree topology, c_table, row ids, doc starts, sampled SA, per-node prefix ranks).
- `WaveletNodeWords`: one per wavelet node; the node's bitvector words, loaded lazily and shared so neighbouring blocks come from a single read.

`load_partition` now fetches the skeleton via `get_or_insert_with_key`, and `LazyRankBitVec` loads whole-node words through the cache.

No on-disk format change. The in-memory `Arc` memoization remains the fast in-session layer.

Closes lance-format#7277

github-actions · 2026-06-18T00:00:36Z

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

Start a vote following the Lance community voting process.
Format specification modifications need 3 binding +1 votes (excluding the
proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
Once the vote passes, link the completed vote in this PR. It should not be
merged until the vote is linked.

codecov · 2026-06-18T00:38:45Z

Codecov Report

❌ Patch coverage is 90.34335% with 45 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/fmindex.rs	90.34%	4 Missing and 41 partials ⚠️


github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 18, 2026

wjones127 wants to merge 1 commit into lance-format:main from wjones127:feat/fmindex-cache-codec