Skip to content

fix(table): enable Parquet dictionary encoding by default#931

Open
twuebi wants to merge 1 commit into
apache:mainfrom
twuebi:tp/enable-dict-write-parquet
Open

fix(table): enable Parquet dictionary encoding by default#931
twuebi wants to merge 1 commit into
apache:mainfrom
twuebi:tp/enable-dict-write-parquet

Conversation

@twuebi

@twuebi twuebi commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

The Parquet writer hardcoded WithDictionaryDefault(false), overriding the arrow-go writer's own default (true) with no way for users to opt back in.

Remove the override so the library default applies.

This matches Java Iceberg (which removed its equivalent table property in apache/iceberg#7665) and parquet-java/parquet-cpp defaults.

@twuebi twuebi requested a review from zeroshade as a code owner April 22, 2026 15:49
@twuebi

twuebi commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

closing for now, for high cardinality columns this is a regression since the dictionary is not discarded on overflow

@twuebi twuebi closed this Apr 22, 2026
@zeroshade

Copy link
Copy Markdown
Member

@twuebi what do you mean? With the high-cardinality case it would fallback to plain encoding instead of dictionary, or is there a bug there?

@twuebi

twuebi commented Apr 23, 2026

Copy link
Copy Markdown
Contributor Author

@zeroshade yes, there seems to be a bug in arrow-go where both dictionary and plain ends up getting written for that page

@C-Loftus

Copy link
Copy Markdown
Contributor

Hi all, in light of this being closed, I was just curious to clarify the current status, so does iceberg-go currently not support dictionary encoding? If so, would it be reasonable to make a PR to enable opting in but not need to make it enabled by default (i.e. to prevent any issues with high cardinality as mentioned above)

Thank you very much for all your hard work on this library!

@twuebi twuebi reopened this May 13, 2026
@twuebi

twuebi commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

Hi @C-Loftus, thanks for the poke, we've fixed the double encoding in arrow-go, once arrow-go makes another release, and we bump the dependency here, we should be able to allow dictionary encoding. Until then, enabling it would mean a regression for high cardinality columns.

@C-Loftus

Copy link
Copy Markdown
Contributor

Makes sense, thank you for the update and that context!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants