feat(parquet): Support config to store decimal as integer for Parquet write format #16941
Conversation
std::optional<int64_t> dataPageSize;
std::optional<int64_t> dictionaryPageSizeLimit;
std::optional<bool> enableDictionary;
/// If unset, Writer uses true (INT32/INT64 for short DECIMAL); false forces

Suggested doc comment:

/// Controls how DECIMAL values are stored by the Writer.
/// - If unset, the Writer defaults to storing as integer (true),
///   using INT32/INT64 for short DECIMAL precisions.
/// - If set to false, DECIMAL values are stored as FIXED_LEN_BYTE_ARRAY,
///   regardless of precision.
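The rule in the suggested comment can be sketched as a small helper. This is illustrative only, not the actual Velox writer API: the function name and string return type are assumptions, and the precision thresholds follow the Parquet DECIMAL specification (INT32 holds up to precision 9, INT64 up to 18).

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical sketch: pick the Parquet physical type for a
// DECIMAL(precision, scale) column given the storeDecimalAsInteger option.
std::string decimalPhysicalType(
    uint8_t precision,
    std::optional<bool> storeDecimalAsInteger) {
  // Unset behaves like true: prefer integer storage for short decimals.
  const bool asInteger = storeDecimalAsInteger.value_or(true);
  if (asInteger) {
    if (precision <= 9) {
      return "INT32"; // fits in a 32-bit signed integer
    }
    if (precision <= 18) {
      return "INT64"; // fits in a 64-bit signed integer
    }
  }
  // Explicit false, or precision too large for an integer.
  return "FIXED_LEN_BYTE_ARRAY";
}
```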
}

if (!storeDecimalAsInteger) {
  storeDecimalAsInteger = isParquetStoreDecimalAsInteger(

Use a similar format to toParquetEnableDictionary: refactor toParquetEnableDictionary, rename it to toBoolConfigValue, and reuse it here.
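A hedged sketch of what the suggested toBoolConfigValue helper might look like. The exact signature is an assumption, and the plain map below is a stand-in for velox::config::ConfigBase so the sketch stays self-contained.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Stand-in for the real config class; illustrative only.
using FakeConfig = std::unordered_map<std::string, std::string>;

// Generic reader for an optional boolean config entry, reusable for both
// enableDictionary and storeDecimalAsInteger as the reviewer suggests.
std::optional<bool> toBoolConfigValue(
    const FakeConfig& config,
    const std::string& key) {
  auto it = config.find(key);
  if (it == config.end()) {
    return std::nullopt; // unset: the caller applies its own default
  }
  return it->second == "true";
}
```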
    "hive.parquet.writer.batch_size";
static constexpr const char* kParquetCreatedBy =
    "hive.parquet.writer.created_by";
static constexpr const char* kParquetSessionStoreDecimalAsInteger =

In Gluten, we only need the WriteOptions member and don't need to set the writer config; let the community decide whether to also add the session config.
PingLiuPing left a comment:

Do you need to add a test verifying that the decimal is written as FIXED_LEN_BYTE_ARRAY instead of as an integer?
    "hive.parquet.writer.batch_size";
static constexpr const char* kParquetCreatedBy =
    "hive.parquet.writer.created_by";
static constexpr const char* kParquetSessionStoreDecimalAsInteger =

There is something wrong with the base main branch. I'm checking with Meta folks on how to proceed. Once confirmed, this code should be moved to other files.
const config::ConfigBase emptySession({});

{
  std::unordered_map<std::string, std::string> connectorMap = {

The code repeats too much; consider extracting a lambda.
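The reviewer's suggestion could look like this sketch, where a lambda folds the repeated build-map/parse/check steps into one call per case. Here parseBool and the config key are illustrative stand-ins for the real helper under test (e.g. isParquetStoreDecimalAsInteger) and its session property name.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

using ConfigMap = std::unordered_map<std::string, std::string>;

// Stand-in for the config helper under test; illustrative only.
inline std::optional<bool> parseBool(
    const ConfigMap& m,
    const std::string& key) {
  auto it = m.find(key);
  return it == m.end() ? std::nullopt
                       : std::optional<bool>(it->second == "true");
}

inline void runStoreDecimalAsIntegerCases() {
  const std::string kKey = "example.store_decimal_as_integer"; // illustrative
  // One lambda replaces the repeated blocks: build the map, parse, compare.
  auto check = [&](const ConfigMap& connectorMap,
                   std::optional<bool> expected) {
    assert(parseBool(connectorMap, kKey) == expected);
  };
  check({}, std::nullopt);         // unset
  check({{kKey, "true"}}, true);   // explicit true
  check({{kKey, "false"}}, false); // explicit false
}
```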
Spark supports controlling how Parquet decimal fields are written via the parameter spark.sql.parquet.writeLegacyFormat. When set to true, it forces decimal columns to be stored as FIXED_LEN_BYTE_ARRAY.
When Spark or Flink reads Hive data written through ParquetHiveSerDe (as declared in Hive CREATE TABLE statements), especially with older Hive versions such as Hive 2.1, forcing decimal fields to be stored as FIXED_LEN_BYTE_ARRAY resolves compatibility issues; otherwise, exceptions are thrown.
By default, Velox chooses the physical type based on precision. For example, a decimal(8,4) column is stored as an int64.
This PR depends on the Gluten PR apache/gluten#11839.