
perf: stdlib allocation optimizations — foldl while-loop, join pre-sized, flatten two-pass, reverse direct#695

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/stdlib-allocation-opt

Conversation


@He-Pin He-Pin commented Apr 6, 2026

Motivation

Several hot-path stdlib functions (std.flattenArrays, std.reverse, std.foldl, std.join) use Scala collection patterns that allocate unnecessary intermediate objects: ArrayBuilder with rough size hints, .reverse creating new collections, for-comprehension iterator/lambda allocations, and un-sized StringBuilder.

These functions are called millions of times in benchmarks like foldl, reverse, comparison2, and realistic2. Reducing per-call allocation overhead yields measurable throughput gains.

Key Design Decisions

  • Two-pass pre-sizing: For flattenArrays and join (empty separator), count total elements first, then allocate the exact-sized result array and fill with System.arraycopy. Eliminates ArrayBuilder resize/copy cycles entirely.
  • While-loop conversion: Replace for (x <- arr) and .foreach with index-based while loops to avoid iterator/lambda allocation overhead.
  • Local caching: Hoist pos.noOffset out of tight loops into a local variable to avoid repeated method dispatch.
  • StringBuilder pre-sizing: Estimate output size as arr.length * (separator.length + 8) to avoid StringBuilder growth copies.
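
The two-pass pre-sizing idea in the first bullet can be sketched as a standalone function (a simplified illustration, not the actual sjsonnet code — `flattenTwoPass` is a hypothetical name, and sjsonnet operates on its own `Val` types rather than `Int`):

```scala
// Pass 1 counts total elements; pass 2 fills an exact-sized array with
// System.arraycopy, so no ArrayBuilder resize/copy cycle ever happens.
def flattenTwoPass(arrays: Array[Array[Int]]): Array[Int] = {
  // Pass 1: count the total number of elements.
  var total = 0
  var i = 0
  while (i < arrays.length) {
    total += arrays(i).length
    i += 1
  }
  // Pass 2: allocate once, then bulk-copy each sub-array.
  val out = new Array[Int](total)
  var offset = 0
  i = 0
  while (i < arrays.length) {
    val sub = arrays(i)
    System.arraycopy(sub, 0, out, offset, sub.length)
    offset += sub.length
    i += 1
  }
  out
}
```

The single up-front allocation replaces the geometric grow-and-copy sequence an `ArrayBuilder` performs when its size hint is only approximate.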

Modification

ArrayModule.scala:

  • FlattenArrays: Two-pass (count → pre-sized array + System.arraycopy)
  • FlattenDeepArrays: foreach → while-loop for initial deque fill
  • Reverse: .reverse → manual reverse-copy while-loop
  • Foldl (array path): for-loop → while-loop + cache pos.noOffset
  • Foldl (string path): Cache pos.noOffset in local
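
The for-loop → while-loop rewrite for the foldl array path follows this shape (a minimal sketch; `foldlWhile` is a hypothetical name, and in the real code a field such as `pos.noOffset` would additionally be hoisted into a local before the loop):

```scala
// An index-based while loop avoids the Iterator and closure objects
// that `for (x <- arr)` or `arr.foreach(...)` allocate per call.
def foldlWhile[A, B](arr: Array[A], zero: B)(f: (B, A) => B): B = {
  var acc = zero
  var i = 0
  while (i < arr.length) {
    acc = f(acc, arr(i))
    i += 1
  }
  acc
}
```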

StringModule.scala:

  • Join (string path): Pre-size StringBuilder
  • Join (array path, empty separator): Two-pass pre-sized with System.arraycopy
  • Join (array path, non-empty separator): for-loop → while-loop + better sizeHint
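
The pre-sized StringBuilder join looks roughly like this (a hedged sketch using the capacity heuristic described above; `joinPreSized` is a hypothetical name, and the `+ 8` is a guessed average element length, not a measured constant):

```scala
// Pre-sizing the builder to arr.length * (sep.length + 8) means its
// internal buffer rarely needs to grow and re-copy mid-join.
def joinPreSized(arr: Array[String], sep: String): String = {
  val sb = new java.lang.StringBuilder(arr.length * (sep.length + 8))
  var i = 0
  while (i < arr.length) {
    if (i > 0) sb.append(sep)
    sb.append(arr(i))
    i += 1
  }
  sb.toString
}
```

The estimate only has to be in the right ballpark: overshooting wastes a little memory for the duration of the call, while undershooting merely falls back to the normal growth path.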

Tests: Added stdlib_alloc_opt.jsonnet covering flattenArrays, reverse, join (both paths), and foldl edge cases (empty arrays, nulls, single elements).

Benchmark Results

JMH (1 fork, 1 warmup, 1 iteration — full regression suite):

| Benchmark   | Master (ms/op) | This PR (ms/op) | Delta      |
|-------------|----------------|-----------------|------------|
| foldl       | 9.611          | 9.103           | -5.3%      |
| reverse     | 10.813         | 10.404          | -3.8%      |
| comparison2 | 69.731         | 66.307          | -4.9%      |
| bench.04    | 33.338         | 32.595          | -2.2%      |
| setDiff     | 0.448          | 0.423           | -5.6%      |
| setInter    | 0.389          | 0.369           | -5.1%      |
| setUnion    | 0.715          | 0.677           | -5.3%      |
| realistic2  | 67.137         | 66.743 ± 0.937  | neutral ✅ |

No regressions across all 35 benchmarks.

Targeted realistic2 (5 iterations with error bars): 66.743 ± 0.937 ms/op — confirms no regression.

Analysis

  • foldl -5.3%: While-loop eliminates iterator allocation + closure boxing per element. pos.noOffset caching avoids repeated Position.noOffset dispatch in tight loop.
  • reverse -3.8%: Manual reverse-copy avoids Scala's .reverse which allocates intermediate ArraySeq + copies.
  • comparison2 -4.9%: Benefits from foldl/flattenArrays improvements in nested comprehension workloads.
  • set operations -5.1% to -5.6%: Benefit from reverse optimization (sets use sorted arrays internally).
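
The manual reverse-copy described in the reverse bullet can be sketched as follows (a simplified illustration; `reverseCopy` is a hypothetical name for what the patched `Reverse` builtin does over its own value arrays):

```scala
import scala.reflect.ClassTag

// Write arr(i) into the mirrored slot of a freshly allocated array,
// instead of calling Scala's `.reverse`, which first builds an
// intermediate collection and then copies it.
def reverseCopy[A: ClassTag](arr: Array[A]): Array[A] = {
  val out = new Array[A](arr.length)
  var i = 0
  while (i < arr.length) {
    out(arr.length - 1 - i) = arr(i)
    i += 1
  }
  out
}
```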

Result

All 140 tests pass. Consistent improvements across stdlib-heavy benchmarks with no regressions.

perf: stdlib allocation optimizations — foldl while-loop, join pre-sized, flatten two-pass, reverse direct

Optimize hot-path stdlib functions to reduce allocation and improve throughput:

- flattenArrays: Two-pass approach counting total elements first, then
  using System.arraycopy for each sub-array (eliminates ArrayBuilder resizing)
- flattenDeepArray: while-loop instead of foreach for initial fill
- reverse: Direct reverse-copy into new array instead of .reverse
- foldl (array path): Convert for-loop to while-loop, cache pos.noOffset
- foldl (string path): Cache pos.noOffset in local
- join (string path): Pre-size StringBuilder based on estimated element length
- join (array path, empty separator): Two-pass pre-sized with arraycopy
- join (array path, non-empty separator): while-loop with better sizeHint

Add targeted regression tests for flattenArrays, reverse, join, and foldl
covering edge cases (empty arrays, nulls, single elements).

Upstream: 4fa535fb

He-Pin commented Apr 6, 2026

Closing: superseded by consolidated stdlib optimization effort. The base64DecodeBytes unsigned byte fix has been extracted to #705. Performance optimizations from this PR will be resubmitted in a consolidated PR with comprehensive native benchmarks.

@He-Pin He-Pin closed this Apr 6, 2026