perf: comprehension fuse scope+eval and inline BinaryOp(ValidId,ValidId) fast path by He-Pin · Pull Request #686 · databricks/sjsonnet

He-Pin · 2026-04-05T04:25:27Z

Motivation

Comprehensions (array and object) are among the most performance-critical loops in Jsonnet evaluation. Every iteration currently involves:

Scope allocation: Creating a new ValScope for each iteration to bind the loop variable
Expression dispatch: Full visitExpr dispatch for the body, even when the body is a simple binary operation on two local variables
Virtual call overhead: Multiple levels of indirection through pattern matching and method dispatch

For workloads like comparison2 (which runs millions of comprehension iterations with simple comparison bodies), these overheads dominate execution time.

Key Design Decision

Two complementary optimizations target the comprehension inner loop:

Scope+Eval Fusion: Instead of first building a scope (extendBy) and then evaluating the body as separate steps, fuse them into a single operation. This removes one intermediate method call per iteration and may let the JIT keep loop state in registers.
Inline BinaryOp(ValidId, ValidId) Fast Path: When the comprehension body is a binary operation on two local variables (e.g., x > y, a + b), bypass visitExpr entirely and directly:
- Read both values from the scope array by index
- Dispatch to the binary operator
- Return the result
This removes the expression dispatch overhead for this common comprehension pattern.
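
As a rough illustration of the fast path described above, here is a minimal sketch in Scala. It is not the code from this PR: the Expr, ValidId, Num and BinaryOp types below are simplified, numbers-only stand-ins, and applyOp and evalBody are hypothetical names chosen for the example.

```scala
object FastPathSketch {
  // Simplified, hypothetical stand-ins for the evaluator's AST (numbers only).
  sealed trait Expr
  final case class ValidId(index: Int) extends Expr   // local variable, addressed by scope slot
  final case class Num(value: Double) extends Expr    // numeric literal
  final case class BinaryOp(lhs: Expr, rhs: Expr, op: Char) extends Expr

  // Applies a binary operator to two already-evaluated operands.
  def applyOp(op: Char, l: Double, r: Double): Double = op match {
    case '+' => l + r
    case '-' => l - r
    case '*' => l * r
    case other => throw new IllegalArgumentException(s"unsupported op: $other")
  }

  // General path: recursive dispatch on every node of the body.
  def visitExpr(e: Expr, scope: Array[Double]): Double = e match {
    case ValidId(i)         => scope(i)
    case Num(v)             => v
    case BinaryOp(l, r, op) => applyOp(op, visitExpr(l, scope), visitExpr(r, scope))
  }

  // Fast path: if the body is BinaryOp(ValidId, ValidId), read both slots
  // directly and apply the operator; anything else takes the general path.
  def evalBody(body: Expr, scope: Array[Double]): Double = body match {
    case BinaryOp(ValidId(l), ValidId(r), op) => applyOp(op, scope(l), scope(r))
    case other                                => visitExpr(other, scope)
  }
}
```

In this sketch the fast path saves the two recursive visitExpr calls for the operands. Unrecognized bodies still produce the same result through the fallback case.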

Modification

Evaluator.scala: Added visitCompInline method with pattern matching on body expression:
- BinaryOp(ValidId(lhsIdx), ValidId(rhsIdx), op) → direct scope read + op dispatch
- Falls back to standard visitExpr for other body patterns
- Uses mutable scope slot for iteration variable to avoid repeated scope allocation
Test: Added comprehension_binop_types.jsonnet covering:
- Arithmetic: +, -, *, /, %
- Comparison: <, >, <=, >=, ==, !=
- Boolean: &&, ||
- String concatenation: + on strings
- Mixed-type operations
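
The "mutable scope slot" item under Evaluator.scala above can be pictured with the following sketch. It assumes a scope is a flat array of numbers and that the body is evaluated eagerly; both are simplifications for illustration, and perIterationScopes and reusedSlot are hypothetical names, not methods from this PR.

```scala
object ScopeSlotSketch {
  // Baseline: copy the outer bindings into a fresh array on every iteration.
  def perIterationScopes(outer: Array[Double],
                         items: Array[Double],
                         body: Array[Double] => Double): Array[Double] =
    items.map { item =>
      val scope = java.util.Arrays.copyOf(outer, outer.length + 1)
      scope(outer.length) = item // bind the loop variable in the extra slot
      body(scope)
    }

  // Reuse: allocate the extended scope once and overwrite the loop-variable slot.
  def reusedSlot(outer: Array[Double],
                 items: Array[Double],
                 body: Array[Double] => Double): Array[Double] = {
    val scope = java.util.Arrays.copyOf(outer, outer.length + 1)
    val slot = outer.length
    val out = new Array[Double](items.length)
    var i = 0
    while (i < items.length) {
      scope(slot) = items(i) // rebind the loop variable in place
      out(i) = body(scope)   // evaluate eagerly, before the slot changes again
      i += 1
    }
    out
  }
}
```

Reusing the slot is only sound when nothing retains the scope beyond the current iteration. A body that captures the scope for later, lazy evaluation would see the slot overwritten, so it would need the per-iteration copy.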

Benchmark Results

JMH (JVM, 3 iterations)

Benchmark	Master (ms/op)	This PR (ms/op)	Change
bench.02	50.427 ± 38.906	47.258 ± 4.861	-6.3%
comparison2	85.854 ± 188.657	38.386 ± 13.591	-55.3% 🔥
realistic2	73.458 ± 66.747	67.243 ± 12.009	-8.5%

Hyperfine (Scala Native, 10 runs, vs master)

Benchmark	Master (ms)	This PR (ms)	Speedup
bench.02	75.1 ± 1.8	72.1 ± 1.1	1.04x faster
comparison2	183.8 ± 5.8	83.6 ± 1.5	2.20x faster 🔥
realistic2	302.8 ± 3.7	305.0 ± 4.1	neutral
reverse	51.5 ± 2.6	52.4 ± 1.5	neutral

Hyperfine (Scala Native, vs jrsonnet)

Benchmark	sjsonnet (ms)	jrsonnet (ms)	Speedup
comparison2	83.6 ± 1.5	212.4 ± 3.3	sjsonnet 2.54x faster 🔥

Analysis

comparison2 is the primary beneficiary: a comprehension with a comparison body is exactly the optimized pattern
-55% on JVM, -54% on Native: a consistent improvement across both platforms
2.54x faster than jrsonnet (Rust) on the comparison2 benchmark
No regressions on other benchmarks: on Scala Native, bench.02 is 1.04x faster while realistic2 and reverse are neutral; on the JVM, the bench.02 and realistic2 deltas fall within the JMH error margins
The optimization is safe: unrecognized body patterns fall through to standard evaluation
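
For reference, the percentage and speedup figures quoted above can be re-derived from the mean values in the benchmark tables. The helper names below are hypothetical and exist only for this check.

```scala
object BenchMath {
  // Relative change in percent; negative means the PR is faster.
  def pctChange(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  // Speedup factor; above 1.0 means the second measurement is faster.
  def speedup(before: Double, after: Double): Double =
    before / after

  val jvmComparison2: Double    = pctChange(85.854, 38.386) // about -55.3 (JMH, ms/op)
  val nativeComparison2: Double = pctChange(183.8, 83.6)    // about -54.5 (hyperfine, ms)
  val nativeSpeedup: Double     = speedup(183.8, 83.6)      // about 2.20x vs master
  val vsJrsonnet: Double        = speedup(212.4, 83.6)      // about 2.54x vs jrsonnet
}
```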

References

Upstream exploration: he-pin/sjsonnet jit branch commits 71545ba8, 230ae9d1
Pattern: similar to JIT compiler peephole optimization for hot inner loops

Result

Large performance improvement for comprehension-heavy workloads with simple bodies (comparisons, arithmetic). On Scala Native, comparison2 drops from 183.8 ms to 83.6 ms, which makes sjsonnet 2.54x faster than jrsonnet (212.4 ms) on that benchmark.

Fuse comprehension scope building with body evaluation, eliminating intermediate scope array allocation. For nested comprehensions like [x+y for x in arr for y in arr if x==y], this avoids allocating O(n²) intermediate scopes; only the O(n) matching results are materialized.

When the innermost body is BinaryOp(ValidId,ValidId), inline scope lookups and numeric binary-op dispatch to avoid 3× visitExpr overhead per iteration. Falls back to general visitExpr for non-numeric types.

Key changes:
- visitCompFused: recursive fused scope+eval loop with ArrayBuilder
- evalBinaryOpNumNum: @switch-dispatched Num×Num fast path
- Non-numeric fallback uses existing visitExpr (no code duplication)

Upstream: jit branch commits 3466461 (fuse) + 71545ba (inline)
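
To make the commit message concrete, here is a minimal sketch of a fused, recursive comprehension loop for [x+y for x in arr for y in arr if x==y]. It borrows the names visitCompFused and evalBinaryOpNumNum from the commit message, but the signatures, the Clause types, and the integer op codes are assumptions made for this example, not the actual sjsonnet code, and it handles only numbers.

```scala
import scala.annotation.switch
import scala.collection.mutable.ArrayBuilder

object FusedCompSketch {
  // Hypothetical op codes; the real evaluator's constants may differ.
  final val OpAdd = 0
  final val OpSub = 1
  final val OpMul = 2

  // Num x Num fast path; @switch asks the compiler to emit a jump table.
  def evalBinaryOpNumNum(op: Int, l: Double, r: Double): Double = (op: @switch) match {
    case OpAdd => l + r
    case OpSub => l - r
    case OpMul => l * r
    case _     => throw new IllegalArgumentException(s"unsupported op code: $op")
  }

  // Simplified comprehension clauses: `for slot in values` and `if lhs == rhs`.
  sealed trait Clause
  final case class ForIn(slot: Int, values: Array[Double]) extends Clause
  final case class IfEq(lhsSlot: Int, rhsSlot: Int) extends Clause

  // Walks the clauses recursively, rebinding slots of one shared scope array
  // and appending results as they are produced, so no list of intermediate
  // scopes is ever built.
  def visitCompFused(clauses: List[Clause], scope: Array[Double],
                     lhs: Int, op: Int, rhs: Int,
                     out: ArrayBuilder[Double]): Unit = clauses match {
    case ForIn(slot, values) :: rest =>
      var i = 0
      while (i < values.length) {
        scope(slot) = values(i)
        visitCompFused(rest, scope, lhs, op, rhs, out)
        i += 1
      }
    case IfEq(l, r) :: rest =>
      if (scope(l) == scope(r)) visitCompFused(rest, scope, lhs, op, rhs, out)
    case Nil =>
      out += evalBinaryOpNumNum(op, scope(lhs), scope(rhs))
  }

  // [x+y for x in arr for y in arr if x==y], with x in slot 0 and y in slot 1.
  def run(arr: Array[Double]): Array[Double] = {
    val out = new ArrayBuilder.ofDouble
    val clauses = List(ForIn(0, arr), ForIn(1, arr), IfEq(0, 1))
    visitCompFused(clauses, new Array[Double](2), 0, OpAdd, 1, out)
    out.result()
  }
}
```

The loop still visits all n² (x, y) pairs, but the only memory that grows is the result builder, which holds the pairs that pass the filter.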

He-Pin mentioned this pull request Apr 5, 2026

performance optimization #666

Open

He-Pin marked this pull request as ready for review April 5, 2026 09:44

He-Pin mentioned this pull request Apr 5, 2026

perf: optimize std.range allocation and add staticNull singleton #669

Open

He-Pin force-pushed the perf/comprehension-binop-inline branch from 9b5caef to 62c6ef6 Compare April 6, 2026 05:30

He-Pin mentioned this pull request Apr 6, 2026

perf: fuse comprehension scope building and body evaluation #675

Closed

He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/comprehension-binop-inline
