perf(parquet/pqarrow): cap RecordReader batch size to actual row count by paveon · Pull Request #817 · apache/arrow-go

paveon · 2026-05-18T09:23:44Z

Rationale for this change

GetRecordReader passes BatchSize directly to the internal recordReader
without capping it to the actual number of rows. When BatchSize is configured
to a large value (e.g. 131072) but the file or requested row groups contain
few rows (e.g. 10), leafReader.LoadBatch calls Reserve(131072) which
pre-allocates definition/repetition level buffers and value buffers sized for
the full batch. For a 200-column int64 table with 10 rows this wastes ~250 MB
of allocations.
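
As a rough sanity check on the figure above, the sketch below redoes the arithmetic. The PR does not give a breakdown, so the per-slot layout is an assumption: 8 bytes for each int64 value plus one 2-byte definition level. `estimateReservedBytes` is a hypothetical helper, not part of arrow-go.

```go
package main

import "fmt"

// estimateReservedBytes is a hypothetical helper (not an arrow-go API).
// It estimates the bytes reserved up front when every column pre-allocates
// `slots` entries of valueWidth + levelWidth bytes each.
func estimateReservedBytes(numCols, slots, valueWidth, levelWidth int64) int64 {
	return numCols * slots * (valueWidth + levelWidth)
}

func main() {
	const (
		numCols    = 200
		batchSize  = 131072
		numRows    = 10
		valueWidth = 8 // int64 value
		levelWidth = 2 // assumption: one int16 definition level per slot
	)

	reserved := estimateReservedBytes(numCols, batchSize, valueWidth, levelWidth)
	needed := estimateReservedBytes(numCols, numRows, valueWidth, levelWidth)

	fmt.Printf("reserved for full batch: %d bytes (%.0f MiB)\n", reserved, float64(reserved)/(1<<20))
	fmt.Printf("needed for %d rows: %d bytes\n", numRows, needed)
}
```

Under these assumptions the reservation comes to about 250 MiB, against roughly 20 KB actually needed for 10 rows. Real allocations will differ with nullability, nesting, and allocator rounding.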

What changes are included in this PR?

Cap batchSize to NextPowerOf2(nrows) when a BatchSize is explicitly
configured. The power-of-2 rounding keeps allocations aligned with the
downstream updateCapacity logic that already rounds to powers of two,
avoiding a redundant reallocation on the first read.
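
The capping rule can be sketched as follows. This is an illustration rather than the PR's actual diff: `capBatchSize` and `nextPowerOf2` are hypothetical stand-ins, and treating a non-positive batch size as "not explicitly configured" is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPowerOf2 returns the smallest power of two >= n, and 1 for n <= 1.
// It is a local stand-in for the rounding helper named in the PR.
func nextPowerOf2(n int64) int64 {
	if n <= 1 {
		return 1
	}
	return int64(1) << uint(bits.Len64(uint64(n-1)))
}

// capBatchSize is a hypothetical helper: it limits an explicitly configured
// batch size to the row count rounded up to a power of two.
func capBatchSize(batchSize, nrows int64) int64 {
	if batchSize <= 0 {
		// Assumption: a non-positive value means "not configured"; leave it alone.
		return batchSize
	}
	if capped := nextPowerOf2(nrows); capped < batchSize {
		return capped
	}
	return batchSize
}

func main() {
	fmt.Println(capBatchSize(131072, 10))      // few rows: capped to 16
	fmt.Println(capBatchSize(131072, 1000000)) // many rows: stays at 131072
}
```

With 10 rows the effective batch size drops from 131072 to 16, so each column reserves 16 slots instead of 131072. When the file holds more rows than the configured batch size, the configured value is used unchanged.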

Are these changes tested?

Existing tests pass. The change touches only the allocation-sizing path;
read correctness is unaffected because LoadBatch already stops reading
once the rows are exhausted.

Are there any user-facing changes?

No

Cap batchSize by row count

c880256

paveon requested a review from zeroshade as a code owner May 18, 2026 09:23

paveon wants to merge 1 commit into apache:main from paveon:paveon/perf(parquet-pqarrow)
