Skip to content

perf(parquet/pqarrow): cap RecordReader batch size to actual row count#817

Open
paveon wants to merge 1 commit into
apache:mainfrom
paveon:paveon/perf(parquet-pqarrow)
Open

perf(parquet/pqarrow): cap RecordReader batch size to actual row count#817
paveon wants to merge 1 commit into
apache:mainfrom
paveon:paveon/perf(parquet-pqarrow)

Conversation

@paveon
Copy link
Copy Markdown

@paveon paveon commented May 18, 2026

Rationale for this change

GetRecordReader passes BatchSize directly to the internal recordReader
without capping it to the actual number of rows. When BatchSize is configured
to a large value (e.g. 131072) but the file or requested row groups contain
few rows (e.g. 10), leafReader.LoadBatch calls Reserve(131072) which
pre-allocates definition/repetition level buffers and value buffers sized for
the full batch. For a 200-column int64 table with 10 rows this wastes ~250 MB
of allocations.

What changes are included in this PR?

Cap batchSize to NextPowerOf2(nrows) when a BatchSize is explicitly
configured. The power-of-2 rounding keeps allocations aligned with the
downstream updateCapacity logic that already rounds to powers of two,
avoiding a redundant reallocation on the first read.

Are these changes tested?

Existing tests pass. The change is on the allocation-sizing path only —
read correctness is unaffected since LoadBatch already stops reading
when rows are exhausted.

Are there any user-facing changes?

No

@paveon paveon requested a review from zeroshade as a code owner May 18, 2026 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant