Skip to content

[C++][Parquet] Saturate ApplicationVersion components instead of atoi UB#50013

Open
rootvector2 wants to merge 1 commit into
apache:mainfrom
rootvector2:parquet-version-atoi-overflow
Open

[C++][Parquet] Saturate ApplicationVersion components instead of atoi UB#50013
rootvector2 wants to merge 1 commit into
apache:mainfrom
rootvector2:parquet-version-atoi-overflow

Conversation

@rootvector2
Copy link
Copy Markdown

What

When reading a Parquet file, ApplicationVersion::ApplicationVersion(const std::string& created_by) parses the writer-supplied created_by metadata string into three integer fields (version.major, version.minor, version.patch). The current implementation in cpp/src/parquet/metadata.cc does this with std::atoi:

application_version_.version.major = atoi(version_major_string.c_str());  // line 1444
application_version_.version.minor = atoi(version_minor_string.c_str());  // line 1469
application_version_.version.patch = atoi(version_patch_string.c_str());  // line 1487

The substrings here are already filtered to be digit-only, but their length is not bounded. The created_by field is fully attacker-controlled in any untrusted Parquet file, so an input like

parquet-mr version 99999999999999999999.0.0 (build abcd)

reaches atoi with a value far outside the range of int. Per the C++ standard, std::atoi has undefined behavior when the result cannot be represented as an int ([c.strtoint]). In practice the value is unspecified, and the negative or otherwise garbage value that comes out then drives VersionLt / HasCorrectStatistics, controlling whether the reader chooses to trust the column statistics block of the file.

How to observe

A small reproducer in the same C++17 mode as Arrow:

#include <cstdio>
#include <cstdlib>
#include <string>
int main() {
    std::string s = "99999999999999999999";
    std::printf("atoi -> %d\n", std::atoi(s.c_str())); // UB; observed -1 here
}

UBSan flags the overflow inside the underlying strtol. The existing tests in metadata_test.cc only cover small inputs, which is why this has not been caught.

The fix

Replace the three atoi call sites with a small ParseUnsignedVersionComponent helper inside the anonymous namespace in metadata.cc. The helper uses std::strtoul (which is defined to set errno = ERANGE and return ULONG_MAX on overflow rather than invoking UB) and saturates to std::numeric_limits<int>::max() when the parsed value would not fit in int. The fix is local to the callee – callers (the constructor of ApplicationVersion, called from FileMetaData's thrift deserializer) are unchanged.

A new ApplicationVersion.VersionComponentOverflow test exercises:

  • very long all-digit components (saturate to INT_MAX),
  • exactly INT_MAX (representable),
  • INT_MAX + 1 (saturates).

Build/test evidence

Built parquet_objlib and parquet-internals-test locally; all 18 ApplicationVersion.* tests pass, including the new overflow case:

[----------] 18 tests from ApplicationVersion (0 ms total)
[  PASSED  ] 18 tests.

The version components in a Parquet file's `created_by` string are
attacker-controlled. std::atoi exhibits undefined behavior when the
parsed value overflows int (see [c.strtoint]). Replace the three
atoi calls in ApplicationVersionParser with a saturating helper that
uses std::strtoul and clamps to INT_MAX when the parsed value would
not fit in int. Add a regression test exercising overflow inputs.
@rootvector2 rootvector2 requested a review from wgtmac as a code owner May 21, 2026 12:36
@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Parquet created_by parsing in C++ by removing undefined behavior when version components overflow int, preventing attacker-controlled metadata from producing garbage version values that could affect reader decisions (e.g., statistics trust).

Changes:

  • Replaced std::atoi parsing of major/minor/patch with a helper that uses std::strtoul and saturates to INT_MAX on overflow.
  • Added a regression test covering very large version components and INT_MAX boundary conditions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cpp/src/parquet/metadata.cc Adds a safe, saturating parser for version components and wires it into ApplicationVersion parsing.
cpp/src/parquet/metadata_test.cc Adds a new test ensuring overflow inputs clamp to INT_MAX and checks boundary behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +750 to +757
// Boundary cases: INT_MAX is representable, INT_MAX+1 saturates.
ApplicationVersion at_max("parquet-mr version 2147483647.2147483647.2147483647");
ASSERT_EQ(std::numeric_limits<int>::max(), at_max.version.major);
ASSERT_EQ(std::numeric_limits<int>::max(), at_max.version.minor);
ASSERT_EQ(std::numeric_limits<int>::max(), at_max.version.patch);

ApplicationVersion just_over("parquet-mr version 2147483648.2147483648.2147483648");
ASSERT_EQ(std::numeric_limits<int>::max(), just_over.version.major);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants