feat(scalar): implement CacheCodec for FM index #7347
Draft
wjones127 wants to merge 1 commit into
The FM index ignored the session cache entirely: every open re-parsed each partition's metadata and the only memoization was the in-memory `Arc<dyn ScalarIndex>`, which a serializable backend cannot persist.

This adds two cache entries that round-trip through the stable cache codec, so an opened FM index survives in a node-agnostic, restart-surviving backend:

- `FMIndexPartitionState`: one per partition; the skeleton needed to rebuild a `LazyFMIndex` without re-reading metadata (huffman codes, tree topology, c_table, row ids, doc starts, sampled SA, per-node prefix ranks).
- `WaveletNodeWords`: one per wavelet node; the node's bitvector words, loaded lazily and shared so neighbouring blocks come from a single read.

`load_partition` now fetches the skeleton via `get_or_insert_with_key`, and `LazyRankBitVec` loads whole-node words through the cache. No on-disk format change. The in-memory `Arc` memoization remains the fast in-session layer.

Closes lance-format#7277
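As a rough illustration of the get-or-insert pattern described above, the sketch below memoizes a per-partition skeleton behind a keyed cache. It is not the Lance session cache API: `SketchCache`, `PartitionState`, the key format, and this `load_partition` signature are hypothetical names invented for the example, and the real `FMIndexPartitionState` carries more fields.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the per-partition skeleton (assumption:
/// only two of the fields named in the description are modelled here).
#[derive(Debug, PartialEq)]
struct PartitionState {
    c_table: Vec<u64>,
    doc_starts: Vec<u64>,
}

/// Hypothetical keyed cache; NOT the Lance session cache.
#[derive(Default)]
struct SketchCache {
    entries: Mutex<HashMap<String, Arc<PartitionState>>>,
}

impl SketchCache {
    /// Return the entry stored under `key`, or run `load` and store its result.
    /// The lock is held while `load` runs, which keeps the sketch simple.
    fn get_or_insert_with<F>(&self, key: &str, load: F) -> Arc<PartitionState>
    where
        F: FnOnce() -> PartitionState,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(key) {
            return Arc::clone(hit);
        }
        let value = Arc::new(load());
        entries.insert(key.to_string(), Arc::clone(&value));
        value
    }
}

/// Fetch a partition skeleton through the cache. `loads` counts how often
/// the (simulated) metadata parse actually runs.
fn load_partition(cache: &SketchCache, partition_id: u32, loads: &mut u32) -> Arc<PartitionState> {
    let key = format!("fm-partition-{partition_id}"); // hypothetical key format
    cache.get_or_insert_with(&key, || {
        *loads += 1;
        PartitionState { c_table: vec![0, 3, 7], doc_starts: vec![0, 128] }
    })
}

fn main() {
    let cache = SketchCache::default();
    let mut loads = 0;
    let first = load_partition(&cache, 0, &mut loads);
    let second = load_partition(&cache, 0, &mut loads);
    // The second open reuses the cached skeleton instead of re-parsing.
    assert!(Arc::ptr_eq(&first, &second));
    assert_eq!(loads, 1);
}
```

In this sketch a second open of the same partition returns the same `Arc` without re-running the loader, which is the metadata-read avoidance the description refers to. A serializable backend would additionally need the entry to encode to bytes, which this example leaves out.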
Previously the FM index ignored the session cache entirely: every open re-parsed each partition's metadata, and the only memoization was the in-memory `Arc<dyn ScalarIndex>`, which a serializable cache backend cannot persist. This implements `CacheCodec` for the FM index so an opened index survives in a node-agnostic, restart-surviving backend (the #7160 goal).

No on-disk format change: this is pure in-memory memoization layered over the existing read path, serialized through the stable `LCE1` envelope.

## What's cached
Two entry types, both `CacheCodecImpl` wired via sized `CacheKey`s:

- `FMIndexPartitionState`: one per partition. The skeleton needed to rebuild a `LazyFMIndex` without re-reading metadata: huffman codes, tree topology, c_table, row ids, doc starts, sampled SA, and per-node prefix ranks. Body: `FmIndexStateHeader` proto + raw blobs.
- `WaveletNodeWords`: one per wavelet node. That node's bitvector words, loaded lazily and shared so neighbouring blocks come from a single read. Body: a single raw blob.

`load_partition` fetches the skeleton via `get_or_insert_with_key`, and `LazyRankBitVec` loads whole-node words through the cache (one shared `Arc` per node, with contiguous global bit-indexing, which also simplified rank/get). The default in-memory `Arc<dyn ScalarIndex>` memoization remains the fast in-session layer; the sized entries are the durable layer the load/query path consults.

## Granularity rationale
Note: `search` prewarms all partitions per query, so block-level laziness yields no I/O saving on the search path today; the wins are skeleton metadata-read avoidance and block survival on a persistent backend.

## Tests
4 new tests (codec round-trips for both entry types, real-cache populate-and-reuse with key-presence assertions, empty-index through cache) plus the existing 24 fmindex tests pass.
`cargo fmt` and `cargo clippy -p lance-index --tests -- -D warnings` clean.

Closes #7277
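For readers unfamiliar with the shape of a codec round-trip test, here is a minimal sketch of what round-tripping one node's bitvector words through a single raw blob could look like, together with rank/get over contiguous global bit positions. Everything here is an assumption for illustration: `NodeWords`, the little-endian 8-bytes-per-word layout, and the LSB-first bit order are hypothetical and are not taken from the PR's `WaveletNodeWords` or the `LCE1` envelope. The rank is a naive scan and does not use the per-node prefix ranks mentioned in the description.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for one wavelet node's bitvector words,
/// shared behind a single `Arc` as the description suggests.
#[derive(Debug, PartialEq)]
struct NodeWords {
    words: Arc<[u64]>,
}

impl NodeWords {
    /// Encode as one raw blob: each word as 8 little-endian bytes (assumed layout).
    fn to_blob(&self) -> Vec<u8> {
        self.words.iter().flat_map(|w| w.to_le_bytes()).collect()
    }

    /// Decode a blob produced by `to_blob`; rejects lengths that are not a multiple of 8.
    fn from_blob(blob: &[u8]) -> Option<Self> {
        if blob.len() % 8 != 0 {
            return None;
        }
        let words: Vec<u64> = blob
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        Some(Self { words: words.into() })
    }

    /// Bit at global position `i` (bit 0 is the least significant bit of word 0).
    fn get(&self, i: usize) -> bool {
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of set bits in positions `[0, i)`, computed by a plain scan.
    fn rank1(&self, i: usize) -> usize {
        let (full, rem) = (i / 64, i % 64);
        let mut count: usize = self.words[..full]
            .iter()
            .map(|w| w.count_ones() as usize)
            .sum();
        if rem > 0 {
            let mask = (1u64 << rem) - 1;
            count += (self.words[full] & mask).count_ones() as usize;
        }
        count
    }
}

fn main() {
    let node = NodeWords { words: vec![0b1011, u64::MAX, 0].into() };
    // Round-trip through the raw blob and check the decoded entry is identical.
    let restored = NodeWords::from_blob(&node.to_blob()).expect("valid blob");
    assert_eq!(restored, node);
    // Global bit indexing crosses word boundaries without per-block bookkeeping.
    assert!(restored.get(0) && !restored.get(2));
    assert_eq!(restored.rank1(4), 3);
    assert_eq!(restored.rank1(64 + 10), 13);
}
```

Keeping the whole node as one contiguous word slice means a bit position maps to a word with a single division, which is one plausible reading of why the description says global bit-indexing simplified rank/get.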
🤖 Generated with Claude Code