perf: comprehension fuse scope+eval and inline BinaryOp(ValidId,ValidId) fast path #686

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/comprehension-binop-inline

He-Pin commented Apr 5, 2026

Motivation

Array and object comprehensions are among the most performance-critical loops in Jsonnet evaluation. Each iteration currently involves:

  1. Scope allocation: Creating a new ValScope for each iteration to bind the loop variable
  2. Expression dispatch: Full visitExpr dispatch for the body, even when the body is a simple binary operation on two local variables
  3. Virtual call overhead: Multiple levels of indirection through pattern matching and method dispatch

For workloads like comparison2 (which runs millions of comprehension iterations with simple comparison bodies), these overheads dominate execution time.

Key Design Decision

Two complementary optimizations target the comprehension inner loop:

  1. Scope+Eval Fusion: Instead of first building a scope (extendBy) and then evaluating the body as separate steps, fuse them into a single operation. This removes one intermediate method call per iteration and gives the JIT compiler a better chance of keeping loop variables in registers.

  2. Inline BinaryOp(ValidId, ValidId) Fast Path: When the comprehension body is a binary operation on two local variables (e.g., x > y, a + b), bypass visitExpr entirely and directly:

    • Read both values from the scope array by index
    • Dispatch to the binary operator
    • Return the result

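    This removes the visitExpr dispatch for this body shape, which is the most common comprehension pattern. Per the commit message, the inlined path handles numeric operands; other operand types fall back to visitExpr.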
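
The two optimizations above can be sketched on a toy evaluator. This is a minimal illustration, not sjsonnet's code: the `Expr` hierarchy, the `Array[Double]` scope, and the names `compUnfused` and `compFused` are assumptions made for the example, and values are strict doubles rather than Jsonnet values.

```scala
// Toy model only: these are NOT sjsonnet's real Expr / Val / ValScope classes.
object FusionSketch {
  sealed trait Expr
  final case class ValidId(idx: Int) extends Expr        // variable read by scope slot index
  final case class Num(value: Double) extends Expr       // numeric literal
  final case class BinaryOp(lhs: Expr, op: Char, rhs: Expr) extends Expr

  def applyOp(op: Char, l: Double, r: Double): Double = op match {
    case '+'   => l + r
    case '-'   => l - r
    case '*'   => l * r
    case other => throw new IllegalArgumentException(s"unsupported op: $other")
  }

  // General path: full recursive dispatch over the expression tree.
  def visitExpr(e: Expr, scope: Array[Double]): Double = e match {
    case ValidId(i)         => scope(i)
    case Num(v)             => v
    case BinaryOp(l, op, r) => applyOp(op, visitExpr(l, scope), visitExpr(r, scope))
  }

  // Baseline: build a fresh extended scope for every iteration, then evaluate the body.
  def compUnfused(body: Expr, outer: Array[Double], items: Array[Double]): Array[Double] =
    items.map { item =>
      val extended = outer :+ item // one new array per iteration
      visitExpr(body, extended)
    }

  // Fused: one scope array is allocated up front and the loop variable's slot is
  // overwritten in place. The body shape is inspected once, outside the loop.
  def compFused(body: Expr, outer: Array[Double], items: Array[Double]): Array[Double] = {
    val scope = java.util.Arrays.copyOf(outer, outer.length + 1)
    val slot  = outer.length
    val out   = new Array[Double](items.length)
    var i     = 0
    body match {
      case BinaryOp(ValidId(l), op, ValidId(r)) =>
        // Fast path: two indexed scope reads plus the operator, no visitExpr call.
        while (i < items.length) {
          scope(slot) = items(i)
          out(i) = applyOp(op, scope(l), scope(r))
          i += 1
        }
      case _ =>
        // Any other body shape goes through the general evaluator.
        while (i < items.length) {
          scope(slot) = items(i)
          out(i) = visitExpr(body, scope)
          i += 1
        }
    }
    out
  }
}
```

Reusing one slot is safe in this sketch because every value is computed eagerly. An evaluator with lazy values would additionally have to make sure that no deferred computation captures the shared scope and reads the slot after it has been overwritten.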

Modification

  • Evaluator.scala: Added visitCompInline method with pattern matching on body expression:

    • BinaryOp(ValidId(lhsIdx), ValidId(rhsIdx), op) → direct scope read + op dispatch
    • Falls back to standard visitExpr for other body patterns
    • Uses mutable scope slot for iteration variable to avoid repeated scope allocation
  • Test: Added comprehension_binop_types.jsonnet covering:

    • Arithmetic: +, -, *, /, %
    • Comparison: <, >, <=, >=, ==, !=
    • Boolean: &&, ||
    • String concatenation: + on strings
    • Mixed-type operations

Benchmark Results

JMH (JVM, 3 iterations)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
| --- | --- | --- | --- |
| bench.02 | 50.427 ± 38.906 | 47.258 ± 4.861 | -6.3% |
| comparison2 | 85.854 ± 188.657 | 38.386 ± 13.591 | -55.3% 🔥 |
| realistic2 | 73.458 ± 66.747 | 67.243 ± 12.009 | -8.5% |

Hyperfine (Scala Native, 10 runs, vs master)

| Benchmark | Master (ms) | This PR (ms) | Speedup |
| --- | --- | --- | --- |
| bench.02 | 75.1 ± 1.8 | 72.1 ± 1.1 | 1.04x faster |
| comparison2 | 183.8 ± 5.8 | 83.6 ± 1.5 | 2.20x faster 🔥 |
| realistic2 | 302.8 ± 3.7 | 305.0 ± 4.1 | neutral |
| reverse | 51.5 ± 2.6 | 52.4 ± 1.5 | neutral |

Hyperfine (Scala Native, vs jrsonnet)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Speedup |
| --- | --- | --- | --- |
| comparison2 | 83.6 ± 1.5 | 212.4 ± 3.3 | sjsonnet 2.54x faster 🔥 |

Analysis

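  • comparison2 is the primary beneficiary: a comprehension with a comparison body is exactly the pattern being optimized
  • -55% on JVM and -54% on Native: a consistent improvement across both platforms, although the JVM figure comes from a 3-iteration JMH run with wide error margins (85.854 ± 188.657 ms/op on master)
  • 2.54x faster than jrsonnet (Rust) on the comparison2 benchmark
  • No regressions on the other benchmarks: on Native, realistic2 and reverse are neutral and bench.02 is 1.04x faster; on JVM, the bench.02 and realistic2 changes (-6.3%, -8.5%) fall within the reported error margins
  • The optimization is safe: unrecognized body patterns fall through to standard evaluation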
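
For reference, the percentages and speedup factors quoted above follow from the table means. The helper below is a small sketch of that arithmetic; the object and method names are made up for this example, and it uses only the mean values, ignoring the ± margins.

```scala
object BenchMath {
  // Relative change in percent; a negative value means the new time is lower.
  def percentChange(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  // Speedup factor; a value above 1.0 means the new time is lower.
  def speedup(before: Double, after: Double): Double =
    before / after

  def main(args: Array[String]): Unit = {
    // Mean values copied from the benchmark tables in this PR.
    println(s"JVM comparison2 change (%): ${percentChange(85.854, 38.386)}")
    println(s"Native comparison2 change (%): ${percentChange(183.8, 83.6)}")
    println(s"Native comparison2 speedup: ${speedup(183.8, 83.6)}")
    println(s"sjsonnet vs jrsonnet speedup: ${speedup(212.4, 83.6)}")
  }
}
```

Rounded, these come out to -55.3%, -54.5%, 2.20x and 2.54x, matching the figures in the tables.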

References

  • Upstream exploration: he-pin/sjsonnet jit branch commits 71545ba8, 230ae9d1
  • Pattern: similar to JIT compiler peephole optimization for hot inner loops

Result

Massive performance improvement for comprehension-heavy workloads with simple bodies (comparisons, arithmetic). On Scala Native, comparison2 drops from 183.8 ms to 83.6 ms, which makes sjsonnet 2.54x faster than jrsonnet (212.4 ms) on that benchmark.

He-Pin marked this pull request as ready for review April 5, 2026 09:44

Fuse comprehension scope building with body evaluation, eliminating
intermediate scope array allocation. For nested comprehensions like
[x+y for x in arr for y in arr if x==y], this avoids allocating O(n²)
intermediate scopes — only the O(n) matching results are materialized.

When the innermost body is BinaryOp(ValidId,ValidId), inline scope
lookups and numeric binary-op dispatch to avoid 3× visitExpr overhead
per iteration. Falls back to general visitExpr for non-numeric types.

Key changes:
- visitCompFused: recursive fused scope+eval loop with ArrayBuilder
- evalBinaryOpNumNum: @switch-dispatched Num×Num fast path
- Non-numeric fallback uses existing visitExpr (no code duplication)

Upstream: jit branch commits 3466461 (fuse) + 71545ba (inline)
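
The shape described in this commit message can be sketched as follows. The names `visitCompFused` and `evalBinaryOpNumNum` are taken from the message, but the bodies, the opcode constants, the `Any`-typed values, and the `evalGeneral` fallback are assumptions made for illustration; they are not the code in this PR. The sketch evaluates a comprehension such as `[x+y for x in arr for y in arr if x==y]` with a single scope array, so the O(n²) iterations still run but no per-iteration scope is allocated.

```scala
import scala.annotation.switch
import scala.collection.mutable.ArrayBuilder

// Toy model: values are plain Scala Any (Double, Boolean, String), not sjsonnet Vals.
object NestedFusedSketch {
  // Hypothetical opcodes; sjsonnet's real operator constants may differ.
  final val OpAdd = 0
  final val OpSub = 1
  final val OpMul = 2
  final val OpEq  = 3
  final val OpLt  = 4

  // Num x Num fast path, dispatched with a switch on the opcode.
  def evalBinaryOpNumNum(op: Int, l: Double, r: Double): Any = (op: @switch) match {
    case OpAdd => l + r
    case OpSub => l - r
    case OpMul => l * r
    case OpEq  => l == r
    case OpLt  => l < r
    case _     => throw new IllegalArgumentException(s"unsupported op: $op")
  }

  // Stand-in for the general evaluator that non-numeric operands fall back to.
  def evalGeneral(op: Int, l: Any, r: Any): Any = (l, r) match {
    case (a: String, b: String) if op == OpAdd => a + b
    case _ if op == OpEq                       => l == r
    case _ => throw new IllegalArgumentException(s"unsupported operands for op $op")
  }

  // Evaluates [scope(lhs) op scope(rhs) for v0 in arrays(0) for v1 in arrays(1) ... if cond].
  // Loop variable k lives in scope slot k; the same scope array is reused throughout.
  def visitCompFused(arrays: Array[Array[Any]], cond: Array[Any] => Boolean,
                     lhs: Int, op: Int, rhs: Int): Array[Any] = {
    val scope = new Array[Any](arrays.length)
    val out: ArrayBuilder[Any] = Array.newBuilder[Any]

    def loop(depth: Int): Unit =
      if (depth == arrays.length) {
        // Innermost level: only iterations that pass the filter produce a result.
        if (cond(scope)) {
          val l = scope(lhs)
          val r = scope(rhs)
          if (l.isInstanceOf[Double] && r.isInstanceOf[Double])
            out += evalBinaryOpNumNum(op, l.asInstanceOf[Double], r.asInstanceOf[Double])
          else
            out += evalGeneral(op, l, r)
        }
      } else {
        val arr = arrays(depth)
        var i = 0
        while (i < arr.length) {
          scope(depth) = arr(i) // bind this level's loop variable in place
          loop(depth + 1)
          i += 1
        }
      }

    loop(0)
    out.result()
  }
}
```

With `arr = Array[Any](1.0, 2.0, 3.0)`, calling `visitCompFused(Array(arr, arr), s => s(0) == s(1), 0, OpAdd, 1)` visits all nine pairs but appends only the three where both variables are equal.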