[GH-583]: Introduce chronological ordering for INT96 timestamps by divjotarora · Pull Request #584 · apache/parquet-format

divjotarora · 2026-06-09T19:00:31Z

Rationale for this change

When writing INT96 timestamp columns, writers either omit stats altogether or emit stats using the TYPE_ORDER ordering. However, some writers were incorrectly emitting stats via bytewise comparisons, which does not result in chronological INT96 ordering. These stats are incorrect and must be ignored by readers. The goal of this change is to introduce a mechanism for readers to correctly determine the validity of an INT96 column's statistics and ignore them if they are potentially incorrect.
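To make the bytewise problem concrete, here is a minimal sketch. It assumes the commonly used INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number). The helper names `encode_int96` and `chronological_key` are hypothetical and are not taken from any Parquet implementation.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes: nanos-of-day (8 bytes LE), then Julian day (4 bytes LE).

    Assumed layout for illustration only.
    """
    return struct.pack("<qi", nanos_of_day, julian_day)


def chronological_key(raw: bytes) -> tuple:
    """Decode the 12 bytes and return (julian_day, nanos_of_day) for chronological comparison."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# Two timestamps on consecutive days; `earlier` precedes `later` in time.
earlier = encode_int96(2_460_000, 500)  # 500 ns after midnight
later = encode_int96(2_460_001, 100)    # 100 ns after midnight on the next day

# Bytewise comparison looks at the low byte of nanos-of-day first,
# so it can rank the earlier timestamp above the later one.
bytewise_says_earlier_is_greater = earlier > later

# Comparing the day first, then nanos-of-day, gives the chronological result.
chronological_says_earlier_is_smaller = chronological_key(earlier) < chronological_key(later)
```

With these two values the bytewise comparison and the chronological comparison disagree, so min/max statistics computed bytewise over such a column could exclude values that are actually present.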

What changes are included in this PR?

This PR specifies a new INT96_TIMESTAMP_ORDER sort order specifically used for INT96 timestamp statistics. Additionally, it suggests writers use this ordering when emitting INT96 stats and that readers ignore stats for INT96 TYPE_ORDER'd columns.
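The reader-side rule described above can be sketched as follows. This is an illustration only: the `ColumnOrder` enum and the `can_use_int96_stats` function are hypothetical names, not the Thrift definitions from this PR or the API of any Parquet library.

```python
from enum import Enum
from typing import Optional


class ColumnOrder(Enum):
    """Hypothetical stand-in for the column orders discussed in this PR."""
    TYPE_ORDER = "TYPE_ORDER"
    INT96_TIMESTAMP_ORDER = "INT96_TIMESTAMP_ORDER"


def can_use_int96_stats(column_order: Optional[ColumnOrder]) -> bool:
    """Return True only when INT96 min/max stats were written with the new order.

    Stats written under TYPE_ORDER (or with no column order at all) may have
    been produced by a bytewise comparison, so this sketch ignores them.
    """
    return column_order is ColumnOrder.INT96_TIMESTAMP_ORDER
```

A reader following this sketch would fall back to scanning the data whenever the function returns False, rather than pruning on statistics it cannot trust.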

Do these changes have PoC implementations?

In-progress

Closes #583

wgtmac · 2026-06-10T02:35:41Z

I have some general questions about this proposal. A new column order requires a change on the writer side. If we need to change the writer code anyway, isn't it simpler to just let writers emit INT64 values for timestamps? If the goal is to avoid breaking old readers that currently consume INT96 values and are unable to upgrade, we have to make sure that they do not fail because of the unknown column order.

emkornfield

The wording looks fine to me. I agree with @etseidl on the clarification. We should also try to figure out whether there is a way to address @wgtmac's concerns.

etseidl · 2026-06-10T06:21:43Z

I don't think the intent here is to support old readers. Instead, this is to aid new readers that want to be able to trust INT96 statistics generated by new writers.

We've already established that most old readers should be able to tolerate an unknown column order, with arrow-rs pre-57.0 being a notable exception.

wgtmac · 2026-06-10T06:33:07Z

> this is to aid new readers that want to be able to trust INT96 statistics generated by new writers

What is the point of this? If both readers and writers are new, why not use INT64-typed timestamps instead? INT96 has been marked as deprecated for years.

divjotarora · 2026-06-10T09:03:52Z

> What is the point of this? If both readers and writers are new, why not use INT64-typed timestamps instead? INT96 has been marked as deprecated for years.

@wgtmac This is a fair point, but some large engines like Spark still default to writing INT96 timestamps, and having a way to signal that stats are definitively correct is useful there. I'm also following up with Spark folks to see how we can make progress and move towards INT64 timestamps by default, but in the short term I still feel this change is targeted and useful.

> If the goal is to avoid breaking old readers that currently consume INT96 values and are unable to upgrade, we have to make sure that they do not fail because of the unknown column order.

This change is no more dangerous / breaking for old readers than the recent change to add the IEEE floating point order. IIRC there was a message in the mailing list about some older versions of arrow-rs failing on unknown sort orders, but I believe that's been fixed as well.

etseidl

Thanks, changes look good 👍

BTW I have a rust PoC nearly ready to go.

etseidl · 2026-06-10T17:20:29Z

I still need to add tests, but the Rust PoC is up: apache/arrow-rs#10106

emkornfield · 2026-06-10T17:38:18Z

> What is the point of this? If both readers and writers are new, why not use INT64-typed timestamps instead? INT96 has been marked as deprecated for years.

> @wgtmac This is a fair point, but some large engines like Spark still default to writing INT96 timestamps, and having a way to signal that stats are definitively correct is useful there. I'm also following up with Spark folks to see how we can make progress and move towards INT64 timestamps by default, but in the short term I still feel this change is targeted and useful.

There is also no current alternative for nano timestamps that need to span the SQL time range (years 0001-9999).
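A quick back-of-the-envelope check of that range limitation, assuming the alternative is a signed 64-bit count of nanoseconds from the Unix epoch (1970) and using the mean Gregorian year length as an approximation:

```python
NANOS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year, an approximation

# Largest value a signed 64-bit integer can hold.
max_int64 = 2**63 - 1

# How many years of nanoseconds fit on each side of the epoch.
span_years = max_int64 / NANOS_PER_SECOND / SECONDS_PER_YEAR

# Approximate earliest and latest representable years around 1970.
earliest_year = 1970 - span_years
latest_year = 1970 + span_years
```

Under these assumptions the span is a little over 292 years in each direction, which is far short of years 0001 through 9999.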

CurtHagenlocher · 2026-06-10T17:41:57Z

> There is also no current alternative for nano timestamps that need to span the SQL time range (years 0001-9999).

As someone not really in the Spark world, doesn't the persistence of this type imply there's a need that Parquet isn't filling without it? Is it worth reconsidering its deprecation? I agree with @wgtmac that it's a bit weird to say "don't use this type! but if you do, then change your writers to emit this new sort order."

etseidl · 2026-06-10T18:50:37Z

For some (sometimes contentious) historical context, see

Together, they read like the stages of grief. I think this PR is "acceptance".

divjotarora · 2026-06-11T13:59:40Z

parquet-java reference implementation: apache/parquet-java#3610

divjotarora force-pushed the int96-sort-order branch from 877629a to de95300 on June 9, 2026 19:01

Introduce chronological ordering for INT96 timestamps

98d8747

divjotarora force-pushed the int96-sort-order branch from de95300 to 98d8747 on June 9, 2026 19:03

etseidl reviewed Jun 9, 2026

Comment thread src/main/thrift/parquet.thrift Outdated

emkornfield approved these changes Jun 10, 2026

clarify int96 stats are optional

e248175

divjotarora requested a review from etseidl June 10, 2026 09:39

etseidl approved these changes Jun 10, 2026

This was referenced Jun 10, 2026

Implement PoC for Parquet GH-583: Introduce chronological ordering for INT96 timestamps apache/arrow-rs#10105

Open

[PoC] Implement Parquet GH-583 INT96 timestamp ColumnOrder apache/arrow-rs#10106

Draft

fix typo

d86d8c7

This was referenced Jun 11, 2026

Add new sort order for int96 timestamps apache/parquet-java#3609

Open

[GH-3609]: Add new sort order for int96 timestamps apache/parquet-java#3610

Open

divjotarora wants to merge 3 commits into apache:master from divjotarora:int96-sort-order

5 participants
