
perf: stdlib allocation optimizations — foldl while-loop, join pre-sized, flatten two-pass, reverse direct#695

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/stdlib-allocation-opt

Conversation


@He-Pin He-Pin commented Apr 6, 2026

Motivation

Several hot-path stdlib functions (std.flattenArrays, std.reverse, std.foldl, std.join) use Scala collection patterns that allocate unnecessary intermediate objects: ArrayBuilder with rough size hints, .reverse creating new collections, for-comprehension iterator/lambda allocations, and un-sized StringBuilder.

These functions are called millions of times in benchmarks like foldl, reverse, comparison2, and realistic2. Reducing per-call allocation overhead yields measurable throughput gains.

Key Design Decisions

  • Two-pass pre-sizing: For flattenArrays and join (empty separator), count total elements first, then allocate the exact-sized result array and fill with System.arraycopy. Eliminates ArrayBuilder resize/copy cycles entirely.
  • While-loop conversion: Replace for (x <- arr) and .foreach with index-based while loops to avoid iterator/lambda allocation overhead.
  • Local caching: Hoist pos.noOffset out of tight loops into a local variable to avoid repeated method dispatch.
  • StringBuilder pre-sizing: Estimate output size as arr.length * (separator.length + 8) to avoid StringBuilder growth copies.
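
The two-pass pre-sizing idea in the first bullet can be sketched as a standalone function (a simplified illustration, not the actual sjsonnet code — `flattenTwoPass` is a hypothetical name, and sjsonnet operates on its own `Val` types rather than `Int`):

```scala
// Pass 1 counts total elements; pass 2 fills an exact-sized array with
// System.arraycopy, so no ArrayBuilder resize/copy cycle ever happens.
def flattenTwoPass(arrays: Array[Array[Int]]): Array[Int] = {
  // Pass 1: count the total number of elements.
  var total = 0
  var i = 0
  while (i < arrays.length) {
    total += arrays(i).length
    i += 1
  }
  // Pass 2: allocate once, then bulk-copy each sub-array.
  val out = new Array[Int](total)
  var offset = 0
  i = 0
  while (i < arrays.length) {
    val sub = arrays(i)
    System.arraycopy(sub, 0, out, offset, sub.length)
    offset += sub.length
    i += 1
  }
  out
}
```

The single up-front allocation replaces the geometric grow-and-copy sequence an `ArrayBuilder` performs when its size hint is only approximate.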

Modification

ArrayModule.scala:

  • FlattenArrays: Two-pass (count → pre-sized array + System.arraycopy)
  • FlattenDeepArrays: foreach → while-loop for initial deque fill
  • Reverse: .reverse → manual reverse-copy while-loop
  • Foldl (array path): for-loop → while-loop + cache pos.noOffset
  • Foldl (string path): Cache pos.noOffset in local
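
The for-loop → while-loop rewrite for the foldl array path follows this shape (a minimal sketch; `foldlWhile` is a hypothetical name, and in the real code a field such as `pos.noOffset` would additionally be hoisted into a local before the loop):

```scala
// An index-based while loop avoids the Iterator and closure objects
// that `for (x <- arr)` or `arr.foreach(...)` allocate per call.
def foldlWhile[A, B](arr: Array[A], zero: B)(f: (B, A) => B): B = {
  var acc = zero
  var i = 0
  while (i < arr.length) {
    acc = f(acc, arr(i))
    i += 1
  }
  acc
}
```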

StringModule.scala:

  • Join (string path): Pre-size StringBuilder
  • Join (array path, empty separator): Two-pass pre-sized with System.arraycopy
  • Join (array path, non-empty separator): for-loop → while-loop + better sizeHint
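
The pre-sized StringBuilder join looks roughly like this (a hedged sketch using the capacity heuristic described above; `joinPreSized` is a hypothetical name, and the `+ 8` is a guessed average element length, not a measured constant):

```scala
// Pre-sizing the builder to arr.length * (sep.length + 8) means its
// internal buffer rarely needs to grow and re-copy mid-join.
def joinPreSized(arr: Array[String], sep: String): String = {
  val sb = new java.lang.StringBuilder(arr.length * (sep.length + 8))
  var i = 0
  while (i < arr.length) {
    if (i > 0) sb.append(sep)
    sb.append(arr(i))
    i += 1
  }
  sb.toString
}
```

The estimate only has to be in the right ballpark: overshooting wastes a little memory for the duration of the call, while undershooting merely falls back to the normal growth path.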

Tests: Added stdlib_alloc_opt.jsonnet covering flattenArrays, reverse, join (both paths), and foldl edge cases (empty arrays, nulls, single elements).

Benchmark Results

JMH (1 fork, 1 warmup, 1 iteration — full regression suite):

| Benchmark   | Master (ms/op) | This PR (ms/op) | Delta      |
|-------------|----------------|-----------------|------------|
| foldl       | 9.611          | 9.103           | -5.3%      |
| reverse     | 10.813         | 10.404          | -3.8%      |
| comparison2 | 69.731         | 66.307          | -4.9%      |
| bench.04    | 33.338         | 32.595          | -2.2%      |
| setDiff     | 0.448          | 0.423           | -5.6%      |
| setInter    | 0.389          | 0.369           | -5.1%      |
| setUnion    | 0.715          | 0.677           | -5.3%      |
| realistic2  | 67.137         | 66.743 ± 0.937  | neutral ✅ |

No regressions across all 35 benchmarks.

Targeted realistic2 (5 iterations with error bars): 66.743 ± 0.937 ms/op — confirms no regression.

Analysis

  • foldl -5.3%: While-loop eliminates iterator allocation + closure boxing per element. pos.noOffset caching avoids repeated Position.noOffset dispatch in tight loop.
  • reverse -3.8%: Manual reverse-copy avoids Scala's .reverse which allocates intermediate ArraySeq + copies.
  • comparison2 -4.9%: Benefits from foldl/flattenArrays improvements in nested comprehension workloads.
  • set operations -5.1% to -5.6%: Benefit from reverse optimization (sets use sorted arrays internally).
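
The manual reverse-copy described in the reverse bullet can be sketched as follows (a simplified illustration; `reverseCopy` is a hypothetical name for what the patched `Reverse` builtin does over its own value arrays):

```scala
import scala.reflect.ClassTag

// Write arr(i) into the mirrored slot of a freshly allocated array,
// instead of calling Scala's `.reverse`, which first builds an
// intermediate collection and then copies it.
def reverseCopy[A: ClassTag](arr: Array[A]): Array[A] = {
  val out = new Array[A](arr.length)
  var i = 0
  while (i < arr.length) {
    out(arr.length - 1 - i) = arr(i)
    i += 1
  }
  out
}
```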

Result

All 140 tests pass. Consistent improvements across stdlib-heavy benchmarks with no regressions.

perf: stdlib allocation optimizations — foldl while-loop, join pre-sized, flatten two-pass, reverse direct

Optimize hot-path stdlib functions to reduce allocation and improve throughput:

- flattenArrays: Two-pass approach counting total elements first, then
  using System.arraycopy for each sub-array (eliminates ArrayBuilder resizing)
- flattenDeepArray: while-loop instead of foreach for initial fill
- reverse: Direct reverse-copy into new array instead of .reverse
- foldl (array path): Convert for-loop to while-loop, cache pos.noOffset
- foldl (string path): Cache pos.noOffset in local
- join (string path): Pre-size StringBuilder based on estimated element length
- join (array path, empty separator): Two-pass pre-sized with arraycopy
- join (array path, non-empty separator): while-loop with better sizeHint

Add targeted regression tests for flattenArrays, reverse, join, and foldl
covering edge cases (empty arrays, nulls, single elements).

Upstream: 4fa535fb

He-Pin commented Apr 6, 2026

Closing: superseded by consolidated stdlib optimization effort. The base64DecodeBytes unsigned byte fix has been extracted to #705. Performance optimizations from this PR will be resubmitted in a consolidated PR with comprehensive native benchmarks.

@He-Pin He-Pin closed this Apr 6, 2026