GH-3598: Expose getRowRanges(int) #3599
getRowRanges(int) and add getCompressedBytesForRowRanges
…RowRanges

### Rationale for this change

Opening up APIs needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need (a) the column-index-derived row ranges that may pass the configured filter for a row group, and (b) a metadata-only estimate of the on-disk compressed bytes those ranges correspond to for the currently requested columns, so they can plan I/O without reading column data.

### What changes are included in this PR?

- `getRowRanges(int blockIndex)`: made public; returns row ranges that may pass the configured filter. With no filter, shortcuts to all rows of the row group.
- `getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges)`: metadata-only sum of compressed page sizes for the reader's currently requested columns whose pages overlap the given row ranges. Dictionary pages are not represented in OffsetIndex and are therefore excluded.

### Are these changes tested?

Yes. `TestParquetFileReaderRowRanges` covers: no-filter row ranges cover all rows, empty ranges short-circuit to 0, full ranges equal the per-page OffsetIndex sum and are strictly less than the column-chunk total (proving dictionary-page exclusion), and partial ranges fall between 0 and the full total.

### Are there any user-facing changes?

No.

Closes apache#3598

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
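To make the metadata-only byte estimate described above concrete, here is a minimal single-column sketch of the idea: sum the compressed sizes of the pages whose rows overlap the given ranges. `Page`, `Range`, and `compressedBytes` are hypothetical stand-ins invented for this illustration; they are not Parquet classes and not the code from this PR, which works on the real `OffsetIndex` and `RowRanges` types and sums across all requested columns.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: Page and Range are hypothetical stand-ins, NOT Parquet classes.
final class RowRangeBytesSketch {
    /** One data page as an offset index might describe it. */
    static final class Page {
        final long firstRowIndex;
        final int compressedPageSize;

        Page(long firstRowIndex, int compressedPageSize) {
            this.firstRowIndex = firstRowIndex;
            this.compressedPageSize = compressedPageSize;
        }
    }

    /** An inclusive range of row indexes within a row group. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Sums the compressed sizes of the pages whose rows overlap any range.
     * Assumes pages are sorted by firstRowIndex; a page ends where the next
     * one begins, and the last page ends at rowCount.
     */
    static long compressedBytes(List<Page> pages, long rowCount, List<Range> ranges) {
        long total = 0;
        for (int i = 0; i < pages.size(); i++) {
            long first = pages.get(i).firstRowIndex;
            long next = (i + 1 < pages.size()) ? pages.get(i + 1).firstRowIndex : rowCount;
            long last = next - 1;
            for (Range r : ranges) {
                if (r.from <= last && r.to >= first) {
                    total += pages.get(i).compressedPageSize;
                    break; // count each page at most once
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(new Page(0, 100), new Page(10, 200), new Page(20, 300));
        // Rows 5..12 touch the first two pages only.
        System.out.println(compressedBytes(pages, 30, Arrays.asList(new Range(5, 12)))); // prints 300
        // No ranges means no pages are needed.
        System.out.println(compressedBytes(pages, 30, Collections.<Range>emptyList())); // prints 0
    }
}
```

Because only page-level entries are summed, anything not listed per page (such as a dictionary page) never enters the total, which matches the exclusion noted in the description.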
Force-pushed from dc0e426 to 6d3427b
       * @throws ColumnIndexStore.MissingOffsetIndexException if any requested column lacks an
       *     offset index
       */
      public long getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges) {
Do we really want to maintain this? All the methods it uses appear to be publicly available, so application code can implement this on its own side.
I'm in favor of keeping them as separate PRs.
Title changed from "…getRowRanges(int) and add getCompressedBytesForRowRanges" to "…getRowRanges(int)"
Pull request overview
This PR exposes ParquetFileReader#getRowRanges(int) as a public API so external readers (e.g., Spark scanners) can compute column-index-derived row ranges for a row group using footer metadata only. It also defines behavior when no record filter is configured by returning row ranges covering the entire row group, and adds a regression test for that case.
Changes:
- Promotes `getRowRanges(int blockIndex)` from private to public and documents its metadata-only behavior.
- Adds a no-filter fast path that returns `RowRanges` covering all rows in the row group.
- Adds `TestParquetFileReaderRowRanges` to verify the no-filter behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java | Exposes getRowRanges(int) publicly and adds the no-filter behavior/documentation. |
| parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileReaderRowRanges.java | Adds a test ensuring getRowRanges covers all rows when no filter is configured. |
      public RowRanges getRowRanges(int blockIndex) {
        if (!FilterCompat.isFilteringRequired(options.getRecordFilter())) {
          return RowRanges.createSingle(blocks.get(blockIndex).getRowCount());
        }
        RowRanges rowRanges = blockRowRanges.get(blockIndex);
        if (rowRanges == null) {
Rationale for this change
This PR is based on @mbutrovich's previous work.
Opening up an API needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need the column-index-derived row ranges that may pass the configured filter for a row group, so they can plan which pages to read without reading column data themselves.
`getRowRanges(int)` already existed as a private helper; this change makes it public and gives it a well-defined behavior when no filter is configured.
What changes are included in this PR?
- `getRowRanges(int blockIndex)`: made public; returns the row ranges within the row group that may pass the configured filter. The computation is metadata-only (consults the column index from the footer; no column data is read). With no filter configured, it shortcuts to a `RowRanges` covering all rows of the row group rather than asserting that a filter is present.
Are these changes tested?
Yes.
`TestParquetFileReaderRowRanges` verifies that, with no filter configured, `getRowRanges` returns ranges covering all rows of the row group.
Are there any user-facing changes?
No.
Closes #3598
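For readers who want to see the no-filter shortcut in isolation, the following is a rough, self-contained model of the control flow described above: return a range covering the whole row group when no filter is configured, otherwise compute the ranges from metadata once per row group and reuse them. Every name here (`NoFilterRowRangesSketch`, `Range`, `MetadataFilter`) is hypothetical, and the per-row-group caching is an assumption inferred from the truncated diff snippet. The real `RowRanges` can hold several disjoint ranges, whereas this sketch simplifies to one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration only; not the ParquetFileReader implementation.
final class NoFilterRowRangesSketch {
    /** Inclusive row range [from, to]; a simplification of a multi-range result. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Stand-in for a filter evaluated against column-index metadata only. */
    interface MetadataFilter {
        Range evaluate(int blockIndex, long rowCount);
    }

    private final long[] rowCounts;      // row count of each row group
    private final MetadataFilter filter; // null means "no filter configured"
    private final Map<Integer, Range> cache = new HashMap<>();

    NoFilterRowRangesSketch(long[] rowCounts, MetadataFilter filter) {
        this.rowCounts = rowCounts;
        this.filter = filter;
    }

    Range getRowRanges(int blockIndex) {
        long rowCount = rowCounts[blockIndex];
        if (filter == null) {
            // Fast path: every row of the row group may match.
            return new Range(0, rowCount - 1);
        }
        // Assumed behavior: compute lazily, then reuse the cached result.
        return cache.computeIfAbsent(blockIndex, i -> filter.evaluate(i, rowCount));
    }

    public static void main(String[] args) {
        NoFilterRowRangesSketch noFilter = new NoFilterRowRangesSketch(new long[] {100, 50}, null);
        Range all = noFilter.getRowRanges(1);
        System.out.println(all.from + ".." + all.to); // prints 0..49
    }
}
```

With a non-null filter, repeated calls for the same row group hit the cache, so a caller planning I/O can query ranges freely without re-evaluating the filter.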