GH-3466: Improve RunLengthBitPackingHybridDecoder.readNext to avoid per-call buffer allocation and DataInputStream wrapping#3467
Conversation
RunLengthBitPackingHybridDecoder.readNext to avoid per-call buffer allocation and DataInputStream wrappingRunLengthBitPackingHybridDecoder.readNext to avoid per-call buffer allocation and DataInputStream wrapping
75979e3 to
0081ccf
Compare
| if (packedBytes == null || packedBytes.length < bytesNeeded) { | ||
| packedBytes = new byte[bytesNeeded]; | ||
| } | ||
| int bytesRead = in.readNBytes(packedBytes, 0, bytesNeeded); |
There was a problem hiding this comment.
Nice, now we're at Java 11 :)
Fokko
left a comment
There was a problem hiding this comment.
I've left one fix that addresses a concern is that currentBuffer retains stale values from a previous larger packed run in positions beyond currentCount. While those positions are never accessed today (the indexing is bounded by currentBufferLength), it's a latent risk. Let's add defensive zeroing, consistent with how packedBytes is zeroed on short reads.
…avoid per-call buffer allocation and `DataInputStream` wrapping
Thank you for the review. I applied the code changes as suggested. |
Rationale for this change
RunLengthBitPackingHybridDecoder.readNext()allocates a newint[]andbyte[]on every PACKED-mode call. The existing code even acknowledges this with a// TODO: reuse a buffercomment at line 94. In workloads that decode many bit-packed runs (definition levels, repetition levels, RLE-encoded integers), these allocations become a significant source of GC pressure.There are two issues:
Per-call buffer allocation.
currentBuffer = new int[currentCount]andbyte[] bytes = new byte[numGroups * bitWidth]are allocated fresh on every PACKED-modereadNext(). The individual allocations are modest (8–128 ints per run) but occur thousands of times per column chunk. In a custom JFR-profiled benchmark merging 60 Parquet files (180M rows, 10 columns), these two sites were the number 2 and number 6 allocation hotspots respectively: 2,402 out of 12,711 total allocation samples (18.9%), accounting for ~7.2 GB of allocation per operation.Per-call
DataInputStreamwrapping.new DataInputStream(in).readFully(bytes, 0, bytesToRead)creates a wrapper object on every PACKED-mode call just to accessreadFully().What changes are included in this PR?
Three changes to
RunLengthBitPackingHybridDecoder:currentBufferpromoted from local to field, reused acrossreadNext()calls with a grow-only strategy, only reallocated when the next run requires a larger buffer than currently held.byte[] bytespromoted to a fieldpackedBytes, same grow-only reuse strategy.new DataInputStream(in).readFully(...)replaced with a privatereadFully()method that reads directly from the underlyingInputStream, eliminating the per-call wrapper allocation and virtual dispatch.Custom JFR-profiled benchmark results (merge of 60 Parquet files, 180M rows, 10 columns, JDK 26, macOS aarch64):
Are these changes tested?
Yes, changes are tested. We added a regression test in
TestRunLengthBitPackingHybridEncodercalledtestTruncatedPackedRunAfterFullPackedRunDoesNotReuseStaleBytesto cover a truncated packed-run edge case after buffer reuse. We also ran the fullTestRunLengthBitPackingHybridEncoderclass andRunLengthBitPackingHybridIntegrationTestand all tests passed.Are there any user-facing changes?
No. This is a transparent performance improvement internal to the RLE decoder. Decoded values are identical. No API changes, no configuration changes, no behavioral changes.
Closes #3466