[PoC] Implement Parquet GH-583 INT96 timestamp ColumnOrder #10106
Draft
etseidl wants to merge 49 commits into
Which issue does this PR close?
parquet::basic::ColumnOrder::sort_order_for_type #10104 (which depends on Implement PARQUET-2249: Introduce IEEE 754 total order #9619)
Rationale for this change
Spark continues to use INT96 timestamps, even though INT96 was marked as deprecated in 2018. Query engines want valid statistics so that they can reliably prune on INT96 columns. apache/parquet-format#584 adds a new ColumnOrder variant that can be used to signal compliance with the only known use of INT96 (a 4-byte Julian day plus 8 bytes of nanoseconds within that day).
What changes are included in this PR?
Adds support for the new enum variant, and writes the appropriate value in the FileMetaData.column_orders field. This builds on changes introduced in #7687.
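For reviewers less familiar with the INT96 layout, here is a minimal, standalone sketch of the kind of comparison a timestamp-aware column order implies. It is not the code in this PR and does not use the parquet crate. The helper names (`decode_int96`, `encode_int96`, `compare_int96_timestamps`) are hypothetical, and the sketch assumes the usual little-endian layout (8 bytes of nanoseconds within the day, then a 4-byte Julian day read as unsigned) and that ordering compares the day first and the nanoseconds second.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: split a 12-byte INT96 value into (julian_day, nanos_of_day).
/// Assumed layout: bytes 0..8 hold nanoseconds within the day (little-endian),
/// bytes 8..12 hold the Julian day (little-endian, read here as unsigned).
fn decode_int96(bytes: &[u8; 12]) -> (u32, u64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&bytes[0..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&bytes[8..12]);
    (u32::from_le_bytes(day), u64::from_le_bytes(nanos))
}

/// Hypothetical helper: build the 12-byte representation from its two parts.
fn encode_int96(julian_day: u32, nanos_of_day: u64) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    out[8..12].copy_from_slice(&julian_day.to_le_bytes());
    out
}

/// Chronological comparison: Julian day first, then nanoseconds within the day.
/// Tuples compare lexicographically, so (day, nanos) gives exactly that order.
fn compare_int96_timestamps(a: &[u8; 12], b: &[u8; 12]) -> Ordering {
    decode_int96(a).cmp(&decode_int96(b))
}
```

Because the nanoseconds occupy the leading bytes, comparing the raw 12 bytes lexicographically can disagree with chronological order. For example, the last nanosecond of Julian day 2440588 sorts after the first nanosecond of day 2440589 byte-wise, while `compare_int96_timestamps` orders it before.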
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, this adds new variants to public enums (ColumnOrder::INT96_TIMESTAMP_ORDER, SortOrder::INT96_TIMESTAMP).