
GH-7686: [Parquet] Fix int96 min/max stats#7687

Merged
alamb merged 28 commits into apache:main from rahulketch:add-tests-for-int-96-stats
Jul 22, 2025

@rahulketch

@rahulketch rahulketch commented Jun 17, 2025

Contributor

Which issue does this PR close?

Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect.

What changes are included in this PR?

  1. Fix the int96 stats
  2. Add round-trip test to verify the behavior

Not included in this PR:

  1. Read stats only from known good writers. This will be implemented after a new arrow-rs release.

Are there any user-facing changes?

The int96 min/max statistics will be different and correct.
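To illustrate the ordering problem, here is a minimal sketch, not the code changed in this PR. An INT96 timestamp is 12 little-endian bytes: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day. A chronological comparison therefore has to look at the day first and the nanoseconds second. `RawInt96`, `chrono_cmp`, and `min_max` are hypothetical names used only for this example, and treating the day word as a signed 32-bit value is an assumption.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for an INT96 value: three little-endian 32-bit words.
/// Words 0 and 1 hold the nanoseconds within the day, word 2 the Julian day.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RawInt96([u32; 3]);

impl RawInt96 {
    /// Julian day number (assumed here to be a signed 32-bit value).
    fn julian_day(&self) -> i32 {
        self.0[2] as i32
    }

    /// Nanoseconds within the day, rebuilt from the low and high words.
    fn nanos_of_day(&self) -> u64 {
        ((self.0[1] as u64) << 32) | self.0[0] as u64
    }

    /// Chronological order: day first, then nanoseconds within the day.
    fn chrono_cmp(&self, other: &Self) -> Ordering {
        self.julian_day()
            .cmp(&other.julian_day())
            .then(self.nanos_of_day().cmp(&other.nanos_of_day()))
    }
}

/// Min/max under the chronological order; `None` for an empty slice.
fn min_max(values: &[RawInt96]) -> Option<(RawInt96, RawInt96)> {
    let min = values.iter().copied().min_by(|a, b| a.chrono_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.chrono_cmp(b))?;
    Some((min, max))
}
```

Comparing the three words in storage order instead would look at the low nanosecond word first, so a timestamp on a later day with fewer nanoseconds could sort before an earlier one, which is one way min/max statistics can end up wrong.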

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 17, 2025
@rahulketch rahulketch changed the title GH-7686 [Parquet] Fix int96 min/max stats GH-7686: [Parquet] Fix int96 min/max stats Jun 17, 2025
Comment thread parquet/src/data_type.rs
@rahulketch rahulketch force-pushed the add-tests-for-int-96-stats branch 3 times, most recently from ede2b9a to 63a5fd5 Compare June 17, 2025 15:55
@rahulketch rahulketch force-pushed the add-tests-for-int-96-stats branch from 63a5fd5 to 6036398 Compare June 17, 2025 16:07
@etseidl

etseidl commented Jun 17, 2025

Contributor

I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for INT96 is currently undefined so statistics should be ignored. I think we need changes to the Parquet spec before proceeding with this.

@etseidl etseidl added enhancement Any new improvement worthy of a entry in the changelog api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Jun 17, 2025
@rahulketch

Contributor Author

@etseidl I have pushed the changes which I believe address the open concerns.

There is still the task of adding an allow/deny list to only accept the statistics from known good writers. See the corresponding change in parquet-java.

My suggestion for that is:

  1. Merge this change and make a new arrow-rs release 56.0.0
  2. After the new release add a check which allows reading the stats from:
  • parquet-java 1.15.0+
  • arrow-rs 56.0.0+
  • photon

What do you think?

PS: I will be on vacation 4th July - 15th July, so I'll only be able to make more changes after that.
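The allow-list proposed above could look roughly like the following sketch. It is not part of this PR. The function names are hypothetical, and the assumed `created_by` layout (`<writer> version <major>.<minor>...`) and the writer names `parquet-mr` and `parquet-rs` are assumptions for illustration. Photon is left out because its `created_by` string is not given in this thread.

```rust
/// Parse the leading "<major>.<minor>" of a version string.
/// Suffixes such as "-SNAPSHOT" after the minor digits are ignored.
fn parse_major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor_raw = parts.next().unwrap_or("0");
    let digits: String = minor_raw
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Hypothetical check: trust INT96 min/max statistics only when the file's
/// `created_by` string names a writer and version on the allow-list.
fn int96_stats_trusted(created_by: &str) -> bool {
    let mut tokens = created_by.split_whitespace();
    // Assumed layout: "<writer> version <major>.<minor>.<patch> ..."
    let (Some(writer), Some("version"), Some(version)) =
        (tokens.next(), tokens.next(), tokens.next())
    else {
        return false;
    };
    let Some(version) = parse_major_minor(version) else {
        return false;
    };
    match writer {
        "parquet-mr" => version >= (1, 15), // parquet-java 1.15.0+
        "parquet-rs" => version >= (56, 0), // arrow-rs 56.0.0+
        _ => false, // unknown writers: ignore their INT96 statistics
    }
}
```

Defaulting to `false` for anything unrecognised keeps the reader conservative: statistics from unknown or older writers are ignored rather than trusted.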

@etseidl

etseidl commented Jul 3, 2025

Contributor

Thanks @rahulketch! I've taken the liberty of fixing the remaining lint errors.

There is still the task of adding an allow/deny list to only accept the statistics from known good writers.

I agree that this is a follow-on task and can be done after this PR merges. I think it will need to be part of a larger review of statistics handling to see whether other types with undefined sort orders are also ignored.

@etseidl etseidl marked this pull request as ready for review July 3, 2025 13:54

@etseidl etseidl left a comment

Contributor

I think this is good to go now. @alamb @emkornfield can you take another look? 🙏

@alamb alamb left a comment

Contributor

Looks good to me -- thank you @rahulketch and @etseidl

Comment thread parquet/src/data_type.rs Outdated
Co-authored-by: Alkis Evlogimenos <alkis@evlogimenos.com>
@alamb alamb requested a review from emkornfield July 14, 2025 15:29
@alamb

alamb commented Jul 14, 2025

Contributor

@emkornfield this PR looks good to merge from my perspective: it has several approvals and, I think, addresses your feedback.

If you would like more time to review, please let us know; otherwise I plan to merge this PR tomorrow.

@emkornfield

Contributor

At the last sync we went back and forth on what was required for this from a spec perspective.

I think general consensus is we should update the spec (I think there is a PR open for this).

Beyond that, there was discussion on whether we should have a new sort order or just rely on versions. I'll start a thread, but #7909 is pertinent to a discussion on how much effort adding a SortOrder would be.

@alamb

alamb commented Jul 14, 2025

Contributor

@emkornfield are you opposed to merging this PR? I can't quite tell from the comments.

I think general consensus is we should update the spec (I think there is a PR open for this).

I did make a PR here to clarify the spec about Int96 stats (with a recommendation, not with any actual change):

Beyond that, there was discussion on whether we should have a new sort order or just rely on versions. I'll start a thread, but #7909 is pertinent to a discussion on how much effort adding a SortOrder would be.

In my opinion, there is no reason to hold up this PR (which improves compatibility for an "implementation defined" part of the spec) on actually changing the spec. We can proceed with defining a SortOrder and fixing the spec in parallel.

@emkornfield

Contributor

Yes, I'm fine to merge this PR. It seems like a strict improvement.

old_format,
),
Type::INT96 => {
// INT96 statistics may not be correct, because comparison is signed

Contributor

I don't think we should remove this until we have a filter on known good statistics.

Contributor

Suggested change
// INT96 statistics may not be correct, because comparison is signed
// INT96 statistics may not be correct, because comparison is signed

Contributor

I forgot to merge this one, so I made a follow-on PR:

@alamb alamb requested a review from emkornfield July 22, 2025 22:01
@alamb alamb dismissed emkornfield’s stale review July 22, 2025 22:03

Per comment, Micah is satisfied with this PR
#7687 (comment)

@alamb alamb merged commit dff67c9 into apache:main Jul 22, 2025
16 checks passed
@alamb

alamb commented Jul 22, 2025

Contributor

Thanks again everyone for all your help and patience

alamb added a commit that referenced this pull request Jul 23, 2025
# Which issue does this PR close?


- Follow on to #7687

# Rationale for this change

I merged #7687 without addressing one of @emkornfield's suggestions:
https://github.com/apache/arrow-rs/pull/7687/files#r2205393903

# What changes are included in this PR?

Implement the suggestion (restore a comment).

# Are these changes tested?

By CI

# Are there any user-facing changes?

No

Labels

api-change: Changes to the arrow API
enhancement: Any new improvement worthy of a entry in the changelog
next-major-release: the PR has API changes and it waiting on the next major version
parquet: Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Parquet: Incorrect min/max stats for int96 columns

7 participants