Describe the bug, including details regarding any error messages, version, and platform.
https://github.com/apache/parquet-java/blob/b8f33308534e990003e48a2e66036ccc83fd5db4/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L81
https://github.com/apache/parquet-java/blob/b8f33308534e990003e48a2e66036ccc83fd5db4/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L129
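parquet-java checks the raw data size of the first page before committing to dictionary encoding. It keeps dictionary encoding only if `(encodedSize + dictionaryByteSize) < rawSize`. When the raw data size is zero (as for a column containing only NULLs), this condition cannot hold, so the writer falls back to plain encoding.
Arrow C++, by contrast, still uses dictionary encoding in this case.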
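To make the check concrete, here is a minimal Python sketch of the condition quoted above. It is a paraphrase for illustration, not parquet-java code: the function names `is_compression_satisfying` and `choose_encoding` and the string labels are hypothetical, and only the inequality itself is taken from the linked source.

```python
def is_compression_satisfying(raw_size: int, encoded_size: int,
                              dictionary_byte_size: int) -> bool:
    """Condition quoted in this issue: dictionary output must be strictly smaller."""
    return (encoded_size + dictionary_byte_size) < raw_size


def choose_encoding(raw_size: int, encoded_size: int,
                    dictionary_byte_size: int) -> str:
    """Hypothetical helper: pick an encoding label from the first-page sizes."""
    if is_compression_satisfying(raw_size, encoded_size, dictionary_byte_size):
        return "DICTIONARY"
    # Fall back when the dictionary-encoded page is not smaller than the raw data.
    return "PLAIN"


# With raw_size == 0 the strict inequality can never hold for non-negative
# sizes, so the sketch always falls back to plain encoding.
all_null_choice = choose_encoding(raw_size=0, encoded_size=0, dictionary_byte_size=0)
```

Because the comparison is strict, a zero raw size falls back even when the encoded page and the dictionary are both empty.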
The official Parquet specification says that NULL values are not encoded in the data at all:
https://parquet.apache.org/docs/file-format/nulls/
Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
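As a rough illustration of the quoted example, the sketch below run-length encodes the definition levels of an all-NULL column. It is a simplified model, assuming definition level 0 means NULL in an optional, non-nested column; Parquet's actual level encoding is an RLE/bit-packing hybrid, and the helper name `run_length_encode` is hypothetical.

```python
from itertools import groupby


def run_length_encode(levels):
    """Collapse consecutive equal levels into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(levels)]


# Optional, non-nested column with 1000 NULLs: every definition level is 0.
all_null_levels = [0] * 1000
runs = run_length_encode(all_null_levels)

# No value bytes are produced for NULLs, so the raw data size is zero.
raw_data_size = 0
```

Here `runs` is a single pair, `(0, 1000)`, and nothing is written for the values themselves, which is the zero raw data size described in this report.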
Component(s)
C++