Parquet Tool Interface for File-Level Operations in Clustering #17958
suryaprasanna
started this conversation in
General Discussions
Replies: 1 comment 1 reply
+1 generally. One question here: clustering is all about rewriting data with the same schema, so curious why column pruning is in scope here. Is it related to schema evolution?
Context:
I'd like to restart the discussion around adding a Parquet tools interface for file-level operations during clustering. I previously opened PR #9006, which implements this capability, and I believe this feature would be valuable for the Hudi community.
Problem Statement
Currently, Hudi's clustering strategies operate on a record-by-record basis. For certain use cases, such as column pruning, encryption, or selective column preservation, this approach is inefficient. These operations don't require reading and deserializing individual records; they can be performed much more efficiently at the file level using parquet-tools.
Proposed Solution
The PR introduces a ParquetToolsExecutionStrategy that enables efficient file-level operations during clustering. The implementation:
This interface would be particularly beneficial for: