[C++] When all values of a column in a Parquet row group are null, Arrow uses dictionary encoding while parquet-java uses plain encoding #50062

@lifulong

Describe the bug, including details regarding any error messages, version, and platform.

https://github.com/apache/parquet-java/blob/b8f33308534e990003e48a2e66036ccc83fd5db4/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L81
https://github.com/apache/parquet-java/blob/b8f33308534e990003e48a2e66036ccc83fd5db4/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L129

parquet-java checks the raw data size before keeping dictionary encoding. If the raw data size is zero, the check `(encodedSize + dictionaryByteSize) < rawSize` cannot hold, so it falls back to plain encoding.

Arrow, in contrast, still uses dictionary encoding in this case.
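The size comparison described above can be sketched in C++ as follows. This is a minimal illustration of the heuristic quoted from parquet-java, not code from either project: the names `IsCompressionSatisfying`, `ChooseEncoding`, and `Encoding` are hypothetical and chosen only for this example.

```cpp
#include <cstdint>

// Hypothetical enum for this sketch; not the Arrow or parquet-java type.
enum class Encoding { kDictionary, kPlain };

// Mirrors the comparison quoted in the issue:
//   (encodedSize + dictionaryByteSize) < rawSize
// Dictionary encoding is kept only when it is strictly smaller than the raw data.
constexpr bool IsCompressionSatisfying(std::int64_t raw_size,
                                       std::int64_t encoded_size,
                                       std::int64_t dictionary_byte_size) {
  return (encoded_size + dictionary_byte_size) < raw_size;
}

// Assumed fallback rule: use plain encoding whenever the check fails.
constexpr Encoding ChooseEncoding(std::int64_t raw_size,
                                  std::int64_t encoded_size,
                                  std::int64_t dictionary_byte_size) {
  return IsCompressionSatisfying(raw_size, encoded_size, dictionary_byte_size)
             ? Encoding::kDictionary
             : Encoding::kPlain;
}
```

With an all-null column chunk the raw size is zero, so `ChooseEncoding(0, 0, 0)` yields `Encoding::kPlain`: no non-negative encoded size can be strictly smaller than zero.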

The official Parquet specification states that NULL values are not encoded in the data at all:
https://parquet.apache.org/docs/file-format/nulls/
Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
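To make the quoted example concrete, here is a small sketch of collapsing definition levels into runs. It is a simplification: the actual Parquet format uses an RLE/bit-packing hybrid with its own byte layout, and `RunLengthEncode` is a hypothetical helper written only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Collapse a sequence of definition levels into (level, run length) pairs.
// Simplified model only; it does not produce Parquet's on-disk bytes.
std::vector<std::pair<std::int16_t, std::size_t>> RunLengthEncode(
    const std::vector<std::int16_t>& levels) {
  std::vector<std::pair<std::int16_t, std::size_t>> runs;
  for (std::int16_t level : levels) {
    if (!runs.empty() && runs.back().first == level) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.emplace_back(level, 1);  // start a new run
    }
  }
  return runs;
}
```

For a non-nested optional column holding 1000 NULLs, every definition level is 0, so the input collapses to the single run (0, 1000) and there are no values left to encode in the data pages.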

Component(s)

C++
