feat(parquet): Support config store decimal as integer for write parquet format#16941

Open
lifulong wants to merge 1 commit intofacebookincubator:mainfrom
lifulong:support_config_parquet_store_decimal_as_integer

@lifulong lifulong commented Mar 27, 2026

Spark controls how Parquet decimal fields are written via the parameter spark.sql.parquet.writeLegacyFormat. When set to true, it forces decimal columns to be stored as FIXED_LEN_BYTE_ARRAY.

When Spark or Flink reads Hive tables whose CREATE TABLE statement uses ParquetHiveSerDe, especially with older Hive versions such as Hive 2.1, storing decimal fields as FIXED_LEN_BYTE_ARRAY resolves compatibility issues. Otherwise, exceptions are thrown.

By default, Velox chooses the Parquet physical type based on precision. For example, a decimal(8,4) is stored as an integer (INT32/INT64 for short decimals) rather than as FIXED_LEN_BYTE_ARRAY.
Related Gluten PR: apache/gluten#11839
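
For reviewers less familiar with the Parquet decimal layouts, here is a rough sketch of the choice this flag controls. It is not the Velox or Arrow implementation; `PhysicalType`, `choosePhysicalType`, and `minBytesForPrecision` are hypothetical names used only for illustration. The thresholds (precision up to 9 fits INT32, up to 18 fits INT64) follow the Parquet format's decimal rules.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical names for illustration only; not Velox or Arrow identifiers.
enum class PhysicalType { kInt32, kInt64, kFixedLenByteArray };

// Smallest byte width whose signed two's-complement range can hold every
// unscaled value with `precision` decimal digits.
// Requires 10^precision <= 2^(8n - 1), so n >= (precision * log2(10) + 1) / 8.
inline int32_t minBytesForPrecision(int32_t precision) {
  return static_cast<int32_t>(
      std::ceil((precision * std::log2(10.0) + 1.0) / 8.0));
}

// With the flag on, short decimals use integer physical types.
// With the flag off, every decimal uses FIXED_LEN_BYTE_ARRAY.
inline PhysicalType choosePhysicalType(
    int32_t precision,
    bool storeDecimalAsInteger) {
  if (storeDecimalAsInteger) {
    if (precision <= 9) {
      return PhysicalType::kInt32;
    }
    if (precision <= 18) {
      return PhysicalType::kInt64;
    }
  }
  return PhysicalType::kFixedLenByteArray;
}
```

Under these assumptions, a decimal(8,4) maps to INT32 with the flag on and to a 4-byte FIXED_LEN_BYTE_ARRAY with the flag off.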

@lifulong lifulong requested a review from majetideepak as a code owner March 27, 2026 07:43

netlify bot commented Mar 27, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 46cb210
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69c63b5d061c3d0008c575f7

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 27, 2026
@lifulong lifulong force-pushed the support_config_parquet_store_decimal_as_integer branch 2 times, most recently from b4c58a0 to a132b5f Compare March 27, 2026 07:54
@lifulong lifulong force-pushed the support_config_parquet_store_decimal_as_integer branch from a132b5f to e2953d8 Compare March 27, 2026 08:05
@lifulong lifulong force-pushed the support_config_parquet_store_decimal_as_integer branch from e2953d8 to 46cb210 Compare March 27, 2026 08:10
std::optional<int64_t> dataPageSize;
std::optional<int64_t> dictionaryPageSizeLimit;
std::optional<bool> enableDictionary;
/// If unset, Writer uses true (INT32/INT64 for short DECIMAL); false forces

/// Controls how DECIMAL values are stored by the Writer.
/// - If unset, the Writer defaults to storing as integer (true),
/// using INT32/INT64 for short DECIMAL precisions.
/// - If set to false, DECIMAL values are stored as FIXED_LEN_BYTE_ARRAY,
/// regardless of precision.
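
A minimal sketch of the semantics described in the suggested comment, assuming the option is a `std::optional<bool>` like its neighbours. `WriterOptionsSketch` and `effectiveStoreDecimalAsInteger` are hypothetical names, not the actual Velox types.

```cpp
#include <optional>

// Hypothetical stand-in for the writer options struct under review.
struct WriterOptionsSketch {
  // Unset means "use the Writer default", which the comment states is true.
  std::optional<bool> storeDecimalAsInteger;
};

// Resolves the effective value: an explicit setting wins, otherwise true.
inline bool effectiveStoreDecimalAsInteger(const WriterOptionsSketch& options) {
  return options.storeDecimalAsInteger.value_or(true);
}
```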

}

if (!storeDecimalAsInteger) {
storeDecimalAsInteger = isParquetStoreDecimalAsInteger(

Follow the same pattern as toParquetEnableDictionary: refactor toParquetEnableDictionary and rename it to toBoolConfigValue so both options can share it.
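
One possible shape for such a shared helper, as a sketch only. The real signature of toParquetEnableDictionary is not shown in this thread, so everything below is an assumption: `ConfigMap` stands in for the Velox config classes, the key names are placeholders, and the session-before-connector lookup order is assumed rather than confirmed.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Stand-in for the real config classes; an assumption for this sketch.
using ConfigMap = std::unordered_map<std::string, std::string>;

// Hypothetical generic helper: returns nullopt when the key is missing or
// the value is not a recognizable boolean.
inline std::optional<bool> toBoolConfigValue(
    const ConfigMap& config,
    const std::string& key) {
  auto it = config.find(key);
  if (it == config.end()) {
    return std::nullopt;
  }
  std::string value = it->second;
  std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (value == "true") {
    return true;
  }
  if (value == "false") {
    return false;
  }
  return std::nullopt;
}

// Assumed lookup order: session first, then connector.
inline std::optional<bool> resolveBoolConfig(
    const ConfigMap& session,
    const std::string& sessionKey,
    const ConfigMap& connector,
    const std::string& connectorKey) {
  if (auto fromSession = toBoolConfigValue(session, sessionKey)) {
    return fromSession;
  }
  return toBoolConfigValue(connector, connectorKey);
}
```

With a helper like this, the dictionary and decimal options would differ only in the keys they pass in.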

"hive.parquet.writer.batch_size";
static constexpr const char* kParquetCreatedBy =
"hive.parquet.writer.created_by";
static constexpr const char* kParquetSessionStoreDecimalAsInteger =

In Gluten, we only need the WriteOptions member and don't need to set the writer config. Let the community decide whether to also add the session config.

@PingLiuPing PingLiuPing left a comment

Do you need to add a test to verify that the decimal is written as FIXED_LEN_BYTE_ARRAY instead of an integer?

"hive.parquet.writer.batch_size";
static constexpr const char* kParquetCreatedBy =
"hive.parquet.writer.created_by";
static constexpr const char* kParquetSessionStoreDecimalAsInteger =

Something is wrong with the base main branch. I'm checking with the Meta folks to see how to proceed. Once confirmed, this code should be moved to other files.

const config::ConfigBase emptySession({});

{
std::unordered_map<std::string, std::string> connectorMap = {

The code is too repetitive; consider adding a lambda.
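
Something along these lines might work, as a sketch rather than the actual test. `resolveFlag` is a hypothetical stand-in for the function under test, `kExampleKey` is a placeholder key, and the precedence it encodes (session over connector, default true) is an assumption.

```cpp
#include <string>
#include <unordered_map>

using ConfigMap = std::unordered_map<std::string, std::string>;

// Placeholder key; the real config key is not shown in this thread.
constexpr const char* kExampleKey = "example.flag";

// Hypothetical stand-in for the function under test.
// Assumed precedence: session, then connector, then default true.
inline bool resolveFlag(const ConfigMap& connector, const ConfigMap& session) {
  if (auto it = session.find(kExampleKey); it != session.end()) {
    return it->second == "true";
  }
  if (auto it = connector.find(kExampleKey); it != connector.end()) {
    return it->second == "true";
  }
  return true;
}

// Each case becomes a single line once the setup lives in a lambda.
inline int countFailures() {
  int failures = 0;
  auto expectFlag = [&failures](
                        const ConfigMap& connector,
                        const ConfigMap& session,
                        bool expected) {
    if (resolveFlag(connector, session) != expected) {
      ++failures;
    }
  };

  expectFlag({}, {}, true);
  expectFlag({{kExampleKey, "false"}}, {}, false);
  expectFlag({{kExampleKey, "true"}}, {{kExampleKey, "false"}}, false);
  expectFlag({{kExampleKey, "false"}}, {{kExampleKey, "true"}}, true);
  return failures;
}
```

In the real test, the lambda body would build the config objects and call the assertion macro instead of counting failures.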
