
perf(vm): fast-path Numeric.to_signed_int64 for in-range integers#227

Merged: davydog187 merged 3 commits into main from perf/inline-numeric-narrowing on May 21, 2026

@davydog187

Inline to_signed_int64/1 for the in-range fast path

Plan: .agents/plans/B8-inline-numeric-narrowing.md

Goal

Lua.VM.Numeric.to_signed_int64/1 is called on every integer arithmetic result
to wrap into signed 64-bit per Lua 5.3 §3.4.1. In the fib(22) tprof profile it
accounts for 3.82% of total time across 85,968 calls. For the overwhelming
common case where the result is already in [-2^63, 2^63 - 1], the masking and
conditional subtraction are wasted work. Adding a guarded fast-path clause that
returns the input as-is when it's already in range short-circuits the cost on
that branch, and @compile {:inline, ...} lets the BEAM inline both clauses at
intra-module call sites.
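
A minimal sketch of the two-clause shape described above, not the code in lib/lua/vm/numeric.ex. The module name NumericSketch, the attribute names, and the exact guard expression are assumptions made for illustration; the real module may express the range check differently (for example through signed?/1).

```elixir
defmodule NumericSketch do
  # Hypothetical stand-in for Lua.VM.Numeric; all names here are assumptions.
  import Bitwise

  # Ask the compiler to inline calls made from inside this module.
  @compile {:inline, to_signed_int64: 1}

  @min_int bsl(-1, 63)
  @max_int bsl(1, 63) - 1
  @mask bsl(1, 64) - 1
  @two_64 bsl(1, 64)

  # Fast path: the value already fits in signed 64 bits, so return it untouched.
  def to_signed_int64(n) when is_integer(n) and n >= @min_int and n <= @max_int, do: n

  # Slow path: keep the low 64 bits, then reinterpret the top bit as the sign.
  def to_signed_int64(n) when is_integer(n) do
    masked = band(n, @mask)
    if masked > @max_int, do: masked - @two_64, else: masked
  end
end
```

With this sketch, NumericSketch.to_signed_int64(42) takes the first clause, while NumericSketch.to_signed_int64(9223372036854775807 + 1) falls through to the masking clause and wraps to -9223372036854775808.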

Success criteria

  • to_signed_int64/1 has a guard-clause fast path for inputs already in
    the signed 64-bit range — verified in lib/lua/vm/numeric.ex.
  • signed?/1 is marked @compile {:inline, signed?: 1} so the fast-path guard
    is cheap; applied alongside to_signed_int64: 1.
  • mix test passes — 1692 tests, 51 properties, 55 doctests, 0 failures.
  • mix test --only lua53 does not regress — 29 tests, 0 failures (matches
    main).
  • Profile after merge: Numeric.to_signed_int64 self-time drops on fib(22).
    Measured: 3.82% → 3.38% (12% relative drop). The plan's stretch target of
    < 1.5% relied on cross-module inlining, which @compile {:inline, ...}
    does not perform; the realized win comes from the guard short-circuit only.
  • Microbenchmarks: the target was a fib improvement of ≥ 1% (floor) or 3%
    (stretch). Measured fib(30) wall clock: lua (chunk) 873.4ms → 844.8ms
    (-3.3%), lua (eval) 876.7ms → 852.2ms (-2.8%). Luerl (control) 730.9ms →
    731.8ms (unchanged). Deviation is about ±0.5% on both lua runs, so the
    deltas are well outside noise.
  • No regression on overflow-heavy tests. Spot-checked:
    9223372036854775807 + 1 == -9223372036854775808,
    -9223372036854775808 - 1 == 9223372036854775807,
    0xFFFFFFFFFFFFFFFF == -1. All correct.

Changes

 lib/lua/vm/numeric.ex                            | 6 ++++++
 .agents/plans/B8-inline-numeric-narrowing.md     | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

The behavior is bit-for-bit identical; the fast path is purely a guard-tested
return-as-is. The slow path (already-out-of-range integers needing wrap-around)
is unchanged.

Discoveries

  • @compile {:inline, ...} only inlines within the same module. Cross-module
    callers in Lua.VM.Executor and Lua.VM.Value still trip a function
    boundary on every call. This caps the win below the plan's stretch target —
    the realized improvement comes entirely from the guard short-circuit, not
    from inlining at the dispatch sites.
  • tprof call count stayed at 85,968 before/after, confirming no inlining
    happened at the cross-module callers. The wall-clock improvement is real
    (luerl as control did not move), so the per-call win is genuine even if
    modest.

Verification

mix format
mix compile --warnings-as-errors
mix test            # 1692 tests, 0 failures
mix test --only lua53   # 29 tests, 0 failures, 23 skipped

Benchmark (fib(30), 10s benchee runs, 2s warmup):

Name                ips      average     deviation  median
lua (chunk)        1.18    844.76 ms     ±0.42%    845.56 ms   (was 873.36ms, -3.3%)
lua (eval)         1.17    852.21 ms     ±0.43%    851.97 ms   (was 876.74ms, -2.8%)
luerl              1.37    731.78 ms     ±0.50%    732.24 ms   (was 730.87ms, unchanged - control)
C Lua (luaport)   36.66     27.28 ms     ±4.38%     26.86 ms

Profile (fib(22), mix profile.tprof):

Lua.VM.Numeric.to_signed_int64/1   85968 calls   3.82% -> 3.38%
Lua.VM.Executor.do_execute/8      802388 calls  50.63% -> 52.09% (no change, just relative shift)
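
The percentages quoted for the benchmark and profile can be rechecked from the raw numbers above. This is only an arithmetic sketch; DeltaSketch and pct_change/2 are hypothetical names, not part of the repository.

```elixir
defmodule DeltaSketch do
  # Relative change from `old` to `new`, in percent, rounded to one decimal.
  def pct_change(old, new), do: Float.round((new - old) / old * 100, 1)
end

IO.inspect(DeltaSketch.pct_change(873.36, 844.76)) # lua (chunk): -3.3
IO.inspect(DeltaSketch.pct_change(876.74, 852.21)) # lua (eval):  -2.8
IO.inspect(DeltaSketch.pct_change(730.87, 731.78)) # luerl:        0.1
IO.inspect(DeltaSketch.pct_change(3.82, 3.38))     # tprof share: -11.5
```

The last line is the relative drop in to_signed_int64/1 self-time, which the success criteria round to 12%.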

Out of scope (intentional)

  • Bypassing to_signed_int64/1 calls entirely at the executor level — that
    is B5 territory (compiling prototypes to Erlang).
  • Changing Lua wrap-around semantics. Behavior is identical.
  • Turning call sites into inline arithmetic. That would tangle with B5 and is
    not the right place.

The Lua 5.3 wrap-around mask runs on every integer arithmetic result, but
the overwhelming common case is an input already in [-2^63, 2^63 - 1],
which passes through unchanged. Adding a guarded clause that returns
the input as-is short-circuits the masking on that branch.

`@compile {:inline, ...}` lets the BEAM inline both clauses at intra-module
call sites; cross-module callers still trip a function boundary but the
guarded clause's match cost is lower than the band+compare body.

On fib(22), Numeric.to_signed_int64 self-time drops 3.82% -> 3.38% under
tprof. On fib(30) wall clock, lua (chunk) improves 873.4ms -> 844.8ms
(-3.3%), comfortably outside the run-to-run deviation band. Luerl (the
control) does not move. Overflow tests (max_int + 1, min_int - 1,
0xFFFF...) still wrap correctly.

Plan: .agents/plans/B8-inline-numeric-narrowing.md
Records PR #227, captures the discovery that @compile {:inline, ...}
does not cross module boundaries (so the fast path's win comes from
the guard short-circuit only, not from call-site inlining), and the
wall-clock fib(30) delta of -3.3%.
davydog187 merged commit 297eadd into main on May 21, 2026
4 checks passed
davydog187 deleted the perf/inline-numeric-narrowing branch on May 21, 2026 21:43
davydog187 added a commit that referenced this pull request May 22, 2026
Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at
3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on
table_build. The real table-workload bottlenecks live inside
Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%)
and in :erlang.setelement (17.5% on table writes, 20.9% on OOP).
Those are B7's targets, not B6's.

B6's projected wall-clock win is now below 1%, inside benchee's
deviation band on every measured workload. Audit cleanup may still be
worth doing later as a refactor, but not as a perf plan and not
before B7.
davydog187 added a commit that referenced this pull request May 22, 2026