perf: speed up cold read up to 8x by lazily load column metadata#7375
perf: speed up cold read up to 8x by lazily load column metadata#7375Xuanwo wants to merge 5 commits into
Conversation
|
😆 I was just talking to Pedro and Xiaoxuan from Meta and they were surprised we haven't had to do this yet. I guess the time has finally come 🎉 |
westonpace
left a comment
There was a problem hiding this comment.
My primary complaint is a naming complaint. If we want to proceed with FileDataReader then I can live with that. However, can we name it something like ProjectedFileReader so that it implies it is a reader that is scoped to a particular projection?
Attaching Claude's review below
Here is my review of PR #7375:
PR #7375 — perf: speed up cold read up to 8x by lazily load column metadata
Author: Xuanwo | Files changed: 4 | +1771 / -484
Overview
This PR introduces a two-tier metadata loading strategy for Lance V2.1+ files. Instead of eagerly decoding all column metadata upfront (the current behavior), it adds a FileMetadataIndex that reads only the CMO (column metadata offset) table, then fetches and decodes only the column metadata entries needed for the current projection. A new FileDataReader type wraps this path. The heuristic to engage lazy loading is selected_columns * 4 < total_columns (i.e., less than 25% of columns selected). The reported speedups (8x local, 2.5x S3) are compelling.
Correctness Concerns
P0 — storage_stats panics instead of returning an error:
fn storage_stats(&self) -> Vec<(u32, u64)> {
let Some(file_statistics) = self.reader.file_statistics() else {
panic!("storage_stats requires full file metadata");
};This panics at runtime with an unhelpful error if called through a code path that doesn't guarantee full metadata. The method doesn't return Result, so adding ? isn't trivial, but a debug_assert! plus a proper Result return would be much safer. Currently it only gets called from calculate_data_stats which calls open_readers_with_full_metadata first, but the invariant is fragile.
P1 — Cache key uses file_size_bytes as file identity proxy:
fn key(&self) -> Cow<'_, str> {
Cow::Owned(format!("column_metadata/{}/{}", self.file_size_bytes, self.column_index))
}LanceCache::with_key_prefix(path) already scopes the cache to the file path, so file_size_bytes in the key is redundant. It won't cause bugs now, but indicates a design confusion. Two files at the same path (after an overwrite) with the same size could theoretically collide. File modification time or a content hash would be safer, or just remove the size from the key and rely on path scoping alone.
P1 — Potential regression in Full path: prepare_projection returns all column_infos unfiltered:
Self::Full(metadata) => Ok(PreparedProjection {
column_infos: metadata.column_infos.clone(), // all columns, always
...
}),The previous collect_columns_from_projection filtered to only the projected columns before passing to the decoder. It's not clear from the diff whether the downstream do_read_range/do_take_rows/do_read_ranges still filter based on decoder_projection, or whether they now receive and process all column_infos redundantly. If the latter, wide-table full-metadata reads will be slower due to the extra clone. This should be confirmed.
Design / Code Quality
FileDataReadernaming is unclear — bothFileReaderandFileDataReaderread data.LazyFileReaderorIndexedMetadataFileReaderwould signal intent at the call site.require_indexed_base_projectionerror message ("indexed column metadata reader requires an explicit base projection") surfaces implementation-internal language to users.- Duplicated
read_range/take_rows/read_rangesdispatch inFileReadCoreis repetitive; a small macro or helper could compress it. - The
BoxFutureboxing inopen_reader_implandexecute_uncommitted_implis a correct fix for the recursive-async-fn limitation.
Test Coverage
The new tests are solid, especially test_lazy_reader_loads_only_requested_column_metadata (byte-level I/O measurement) and the fragment-level IOPS test. Gaps:
- No test for cache hit path (second read on the same column should issue zero column-metadata I/O).
- No test for the new validation in
decode_cmo_table(out-of-range offsets on malformed files). storage_statspanic path is not tested and is currently unreachable only by convention.
Minor Positives
- The bounds check added to
decode_column_metadatas(validatescolumn_metadata_bytes.len() >= cmo_table_size) is a good defensive correctness improvement. - The
overflow onposition.checked_add(length)guard indecode_cmo_table` is correct. - Statistics regression test (
test_calculate_data_stats_after_dropping_wide_dataset_columns) catches the drop-column edge case cleanly. - Double-gate heuristic (check before and after
FileMetadataIndexload) is harmless.
Summary
The core algorithm is sound and the perf numbers are strong. Before merging, I'd address: (1) the storage_stats panic, (2) confirm the Full path column_infos clone doesn't regress wide-table reads, and (3) clarify or simplify the cache key. A cache hit test would be valuable.
| /// and column metadata offset table. Unlike [`CachedFileMetadata`], it does not | ||
| /// hold decoded metadata for every column. | ||
| #[derive(Debug)] | ||
| pub struct FileMetadataIndex { |
There was a problem hiding this comment.
Can we derive DeepSize instead of a custom impl? I don't see anything in here missing a DeepSize impl.
| let retained_buffers_size: usize = self | ||
| .retained_global_buffers | ||
| .values() | ||
| .map(|buf| buf.len()) | ||
| .sum(); |
There was a problem hiding this comment.
This is slightly inaccurate (probably not enough to matter) because it doesn't include the BTreeMap's buffer.
| + std::mem::size_of_val(column_metadatas) | ||
| } | ||
|
|
||
| fn column_info_deep_size(column_info: &ColumnInfo) -> usize { |
There was a problem hiding this comment.
Why not impl (or derive) DeepSizeOf on ColumnInfo?
| /// A data reader for Lance files. | ||
| /// | ||
| /// This reader can use either full file metadata or indexed per-column metadata | ||
| /// internally. It intentionally does not expose APIs that require synchronous | ||
| /// access to full file metadata. |
There was a problem hiding this comment.
This description doesn't really explain why a caller would prefer this over FileReader (and they have very similar names and APIs)
We should explain that this struct establishes a base projection at construction time that cannot change. This means all reads are limited by the base projection but it also means that we only need to read the column metadata for the fields in the base projection.
Summary
This PR adds a lightweight file metadata index and a data-reader path that can load column metadata on demand for supported narrow projections. Dataset fragment reads now use this path for sparse top-level V2.1+ physical column projections and continue to fall back to full metadata when the projection is unsupported or the caller needs file-wide metadata.
The problem is that the current full-metadata path decodes every column's metadata before reading data. On ultra-wide files, a one-column scan or point lookup pays metadata cost proportional to the total column count.
Performance
Benchmarked on EC2 instance
xuanwo-lance-lazy-metadata-benchwith a 65,000-column Lance file and random single-column point lookups. The full-metadata reader is theorigin/mainbehavior. Results below are mean latency.Hot reopen reads are effectively unchanged because metadata/data page reads are already cached: Local EBS was 10.4 ms -> 9.6 ms, and S3 was 36.3 ms -> 35.7 ms.
The S3 cold path now issues only one additional request versus full metadata, not two, because dataset-known schema and row count are passed into the metadata-index reader instead of re-reading the schema global buffer.
Validation
Validated with targeted lazy metadata tests, the dropped-column statistics regression,
cargo fmt --all, and full Rust clippy with tests and benches enabled.