feat: add RLE v2 run length widths #7376
Draft
Xuanwo wants to merge 2 commits into
Important: This PR touches the Lance format specification.
Summary
Adds RLE v2 run-length widths so newly created datasets can write RLE pages with `u16` or `u32` run lengths instead of splitting every run at 255 values. The capability is recorded as a reader feature flag and is only enabled when a new dataset is created with `WriteParams::enable_rle_v2`; existing unflagged datasets reject attempts to turn it on mid-stream.

Closes #7327.
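To illustrate why the run-length width matters, here is a minimal sketch of run-length encoding with a configurable cap on run length. It is not the encoder from this PR: the `Run` struct and `rle_encode` function are hypothetical names, and the on-disk page layout is not modeled. It only shows that a cap of 255 (`u8::MAX`) forces long runs to be split into several entries, while a cap of 65,535 (`u16::MAX`) does not.

```rust
/// One encoded run: a value and how many times it repeats.
/// Hypothetical type for illustration; not the Lance page layout.
#[derive(Debug, PartialEq)]
struct Run {
    value: u32,
    length: u64,
}

/// Run-length encode `values`, starting a new entry whenever the value
/// changes or the current run has reached `max_run_len`.
fn rle_encode(values: &[u32], max_run_len: u64) -> Vec<Run> {
    assert!(max_run_len > 0, "max_run_len must be positive");
    let mut runs: Vec<Run> = Vec::new();
    for &value in values {
        if let Some(last) = runs.last_mut() {
            // Extend the current run only if it matches and still has room.
            if last.value == value && last.length < max_run_len {
                last.length += 1;
                continue;
            }
        }
        runs.push(Run { value, length: 1 });
    }
    runs
}
```

Under these assumptions, encoding 1,000 repeats of the same value with `max_run_len = 255` produces four entries (255 + 255 + 255 + 235), while `max_run_len = 65_535` produces a single entry.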
Benchmark
Ran on `xuanwo-lance-lazy-metadata-bench` with a #6941-style sorted low-cardinality `asset_id` workload.

| Workload | Before | After | Reduction |
|---|---|---|---|
| 150M rows / 5k assets / random 5 value | 167.36 MiB | 164.57 MiB | 1.67% |
| 150M rows / 5k assets / by-asset 5 value | 7.62 MiB | 2.03 MiB | 73.34% |

The first row keeps the random low-cardinality value column from the issue-like workload, which dominates total size. The second row isolates the long-run case RLE2 targets.
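As a rough back-of-the-envelope check on the long-run case, the sketch below estimates how many run entries a single long run needs under each width. It assumes rows are spread evenly across assets (150M / 5k = 30,000 rows per asset) and that each asset forms one uninterrupted run; neither assumption is stated in the benchmark, and page boundaries and value storage are ignored. The helper `entries_per_run` is a hypothetical name.

```rust
/// Number of run entries needed to cover `run_len` repeated values when a
/// single entry can hold at most `max_run_len` repeats (ceiling division).
fn entries_per_run(run_len: u64, max_run_len: u64) -> u64 {
    assert!(max_run_len > 0, "max_run_len must be positive");
    (run_len + max_run_len - 1) / max_run_len
}

/// Assumed average run length: 150M rows spread evenly over 5k assets.
fn assumed_rows_per_asset() -> u64 {
    150_000_000 / 5_000
}
```

With these assumptions, `entries_per_run(assumed_rows_per_asset(), 255)` is 118, while the same run fits in one entry once the cap is 65,535. This only illustrates the direction of the effect; it is not a model of the measured sizes in the table.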
Validation
Validated with focused RLE2 tests and full Rust clippy before publishing.