Parquet Tool Interface for File-Level Operations in Clustering #17958
suryaprasanna
started this conversation in
General Discussions
Replies: 1 comment 1 reply
+1 generally. One question here: clustering is all about rewriting data with the same schema, so curious why column pruning is in scope here. Is it related to schema evolution?
Context:
I'd like to restart the discussion around adding a Parquet tools interface for file-level operations during clustering. I previously opened PR #9006, which implements this capability, and I believe this feature would be valuable for the Hudi community.
Problem Statement
Currently, Hudi's clustering strategies operate on a record-by-record basis. For certain use cases, such as column pruning, encryption, or selective column preservation, this approach is inefficient. These operations don't require reading and deserializing individual records; they can be performed much more efficiently at the file level using parquet-tools.
Proposed Solution
The PR introduces a ParquetToolsExecutionStrategy that enables efficient file-level operations during clustering. The implementation:
This interface would be particularly beneficial for: