GH-3598: Expose `getRowRanges(int)` by peter-toth · Pull Request #3599 · apache/parquet-java

peter-toth · 2026-06-05T13:57:19Z

Rationale for this change

This PR is based on @mbutrovich's previous work.

Opening up an API needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need the column-index-derived row ranges that may pass the configured filter for a row group, so they can plan which pages to read without reading column data themselves.

getRowRanges(int) already existed as a private helper; this change makes it public and gives it a well-defined behavior when no filter is configured.

What changes are included in this PR?

getRowRanges(int blockIndex): made public; returns the row ranges within the row group that may pass the configured filter. The computation is metadata-only (consults the column index from the footer; no column data is read). With no filter configured, it shortcuts to a RowRanges covering all rows of the row group rather than asserting that a filter is present.
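The no-filter shortcut described above can be illustrated with a small stand-alone model. This is only a sketch: `RowRangesSketch`, `Range`, `allRows`, and `coveredRows` are hypothetical names invented for the illustration, not parquet-java classes, and the `filteringRequired` flag stands in for the reader's check on the configured record filter.

```java
import java.util.ArrayList;
import java.util.List;

public class RowRangesSketch {
  /** Hypothetical stand-in for a closed interval of row indexes [from, to]. */
  static final class Range {
    final long from;
    final long to;

    Range(long from, long to) {
      this.from = from;
      this.to = to;
    }
  }

  /** No filter means every row may pass: one range spanning 0..rowCount-1. */
  static List<Range> allRows(long rowCount) {
    List<Range> ranges = new ArrayList<>();
    if (rowCount > 0) {
      ranges.add(new Range(0, rowCount - 1));
    }
    return ranges;
  }

  /**
   * Mirrors the described control flow: shortcut when no filtering is required,
   * otherwise return ranges that were derived elsewhere from the column index.
   */
  static List<Range> getRowRanges(
      boolean filteringRequired, long rowCount, List<Range> fromColumnIndex) {
    return filteringRequired ? fromColumnIndex : allRows(rowCount);
  }

  /** Total number of rows covered by the given ranges. */
  static long coveredRows(List<Range> ranges) {
    long total = 0;
    for (Range r : ranges) {
      total += r.to - r.from + 1;
    }
    return total;
  }

  public static void main(String[] args) {
    // A row group of 1000 rows with no filter configured.
    List<Range> ranges = getRowRanges(false, 1000, null);
    System.out.println(coveredRows(ranges)); // prints 1000
  }
}
```

The property the model demonstrates is the one the PR's test checks: with no filter, the returned ranges cover exactly the row group's row count.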

Are these changes tested?

Yes. TestParquetFileReaderRowRanges verifies that, with no filter configured, getRowRanges returns ranges covering all rows of the row group.

Are there any user-facing changes?

No.

Closes #3598

…RowRanges

### Rationale for this change

Opening up APIs needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need (a) the column-index-derived row ranges that may pass the configured filter for a row group, and (b) a metadata-only estimate of the on-disk compressed bytes those ranges correspond to for the currently requested columns, so they can plan I/O without reading column data.

### What changes are included in this PR?

- `getRowRanges(int blockIndex)`: made public; returns row ranges that may pass the configured filter. With no filter, shortcuts to all rows of the row group.
- `getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges)`: metadata-only sum of compressed page sizes for the reader's currently requested columns whose pages overlap the given row ranges. Dictionary pages are not represented in OffsetIndex and are therefore excluded.

### Are these changes tested?

Yes. `TestParquetFileReaderRowRanges` covers: no-filter row ranges cover all rows, empty ranges short-circuit to 0, full ranges equal the per-page OffsetIndex sum and are strictly less than the column-chunk total (proving dictionary-page exclusion), and partial ranges fall between 0 and the full total.

### Are there any user-facing changes?

No.

Closes apache#3598

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>

wgtmac · 2026-06-08T02:30:44Z

+   * @throws ColumnIndexStore.MissingOffsetIndexException if any requested column lacks an
+   *         offset index
+   */
+  public long getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges) {


Do we really want to maintain this? It seems that all methods used in it are publicly available. Application code can just write this up there.

@wgtmac, I can move this part to Spark.
What's your take on the getRowRanges() change in this PR and the other PR? Shall I combine them into one?

I'm in favor of keeping them as separate PRs.
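For context on the suggestion that application code can compute this itself, the following is a rough stand-alone sketch of the idea: sum the compressed sizes of the pages whose row span overlaps any requested row range. `CompressedBytesSketch`, `Page`, and `Range` are hypothetical types invented for this illustration; in real caller code the per-page row spans and compressed sizes would presumably come from each column's offset index, and dictionary pages would not be counted because they are not listed there.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CompressedBytesSketch {
  /** Hypothetical stand-in for one offset-index entry: row span and compressed size. */
  static final class Page {
    final long firstRow;
    final long lastRow;
    final int compressedSize;

    Page(long firstRow, long lastRow, int compressedSize) {
      this.firstRow = firstRow;
      this.lastRow = lastRow;
      this.compressedSize = compressedSize;
    }
  }

  /** Hypothetical stand-in for a closed interval of row indexes [from, to]. */
  static final class Range {
    final long from;
    final long to;

    Range(long from, long to) {
      this.from = from;
      this.to = to;
    }
  }

  /** True if the page's row span intersects at least one of the ranges. */
  static boolean overlaps(Page page, List<Range> ranges) {
    for (Range r : ranges) {
      if (r.from <= page.lastRow && page.firstRow <= r.to) {
        return true;
      }
    }
    return false;
  }

  /** Sums compressed sizes of overlapping pages across all requested columns. */
  static long compressedBytes(List<List<Page>> columns, List<Range> ranges) {
    long total = 0;
    for (List<Page> pages : columns) {
      for (Page page : pages) {
        if (overlaps(page, ranges)) {
          total += page.compressedSize;
        }
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // One column with three data pages of 100 rows each (made-up sizes).
    List<Page> column =
        Arrays.asList(new Page(0, 99, 400), new Page(100, 199, 350), new Page(200, 299, 500));
    List<List<Page>> columns = Collections.singletonList(column);

    // Rows 50..120 touch the first two pages only.
    List<Range> partial = Collections.singletonList(new Range(50, 120));
    System.out.println(compressedBytes(columns, partial)); // prints 750
  }
}
```

Because whole pages are counted, the result is an upper bound on the bytes needed for the selected rows, which matches the "estimate" framing in the original description.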

Copilot

Pull request overview

This PR exposes ParquetFileReader#getRowRanges(int) as a public API so external readers (e.g., Spark scanners) can compute column-index-derived row ranges for a row group using footer metadata only. It also defines behavior when no record filter is configured by returning row ranges covering the entire row group, and adds a regression test for that case.

Changes:

Promotes getRowRanges(int blockIndex) from private to public and documents its metadata-only behavior.
Adds a no-filter fast path that returns RowRanges covering all rows in the row group.
Adds TestParquetFileReaderRowRanges to verify the no-filter behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java	Exposes `getRowRanges(int)` publicly and adds the no-filter behavior/documentation.
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileReaderRowRanges.java	Adds a test ensuring `getRowRanges` covers all rows when no filter is configured.


peter-toth · 2026-06-09T12:16:29Z

+  public RowRanges getRowRanges(int blockIndex) {
+    if (!FilterCompat.isFilteringRequired(options.getRecordFilter())) {
+      return RowRanges.createSingle(blocks.get(blockIndex).getRowCount());
+    }
    RowRanges rowRanges = blockRowRanges.get(blockIndex);
    if (rowRanges == null) {


Fixed in b04cbb8.

peter-toth changed the title ~~GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges~~ GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges Jun 5, 2026

peter-toth force-pushed the GH-3598-parquet-reader-row-range-apis branch from dc0e426 to 6d3427b Compare June 5, 2026 14:05

wgtmac reviewed Jun 8, 2026


remove getCompressedBytesForRowRanges()

1bffbc2

peter-toth changed the title ~~GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges~~ GH-3598: Expose getRowRanges(int) Jun 8, 2026

wgtmac requested a review from Copilot June 9, 2026 06:11

Copilot started reviewing on behalf of wgtmac June 9, 2026 06:11

Copilot AI reviewed Jun 9, 2026


address review finding

b04cbb8

GH-3598: Expose `getRowRanges(int)` #3599

peter-toth wants to merge 3 commits into apache:master from peter-toth:GH-3598-parquet-reader-row-range-apis
