perf(vm): fast-path Numeric.to_signed_int64 for in-range integers by davydog187 · Pull Request #227 · tv-labs/lua

davydog187 · 2026-05-21T21:28:46Z

Inline `to_signed_int64/1` for the in-range fast path

Plan: .agents/plans/B8-inline-numeric-narrowing.md

Goal

Lua.VM.Numeric.to_signed_int64/1 is called on every integer arithmetic result
to wrap it into the signed 64-bit range per Lua 5.3 §3.4.1. In the fib(22) tprof
profile it accounts for 3.82% of total time across 85,968 calls. In the
overwhelmingly common case where the result is already in [-2^63, 2^63 - 1],
the masking and conditional subtraction are wasted work. A guarded fast-path
clause that returns the input as-is when it is already in range skips that
cost, and @compile {:inline, ...} lets the compiler inline both clauses at
intra-module call sites.
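
For illustration, the shape described above might look like the following minimal sketch. The module name `NarrowingSketch` and the attribute names are hypothetical, and the slow-path body is an assumption based on the "masking and conditional subtraction" wording; the actual code in lib/lua/vm/numeric.ex may differ.

```elixir
defmodule NarrowingSketch do
  # Hypothetical stand-in for Lua.VM.Numeric; not the code from the PR.
  import Bitwise

  @max_int64 0x7FFFFFFFFFFFFFFF
  @min_int64 0 - 0x8000000000000000
  @mask_64 0xFFFFFFFFFFFFFFFF
  @two_pow_64 0x10000000000000000

  # Ask the compiler to inline this function at call sites in this module.
  @compile {:inline, to_signed_int64: 1}

  # Fast path: the value already fits in signed 64 bits, return it untouched.
  def to_signed_int64(n) when is_integer(n) and n >= @min_int64 and n <= @max_int64, do: n

  # Slow path: keep the low 64 bits, then reinterpret bit 63 as the sign.
  def to_signed_int64(n) when is_integer(n) do
    masked = band(n, @mask_64)
    if masked > @max_int64, do: masked - @two_pow_64, else: masked
  end
end
```

With this sketch, an in-range input such as `NarrowingSketch.to_signed_int64(42)` only pays for the two guard comparisons, while `NarrowingSketch.to_signed_int64(9223372036854775807 + 1)` falls through to the masking clause and wraps to the minimum 64-bit integer.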

Success criteria

- to_signed_int64/1 has a guard-clause fast path for inputs already in
  the signed 64-bit range. Verified in lib/lua/vm/numeric.ex.
- signed?/1 is marked @compile {:inline, signed?: 1} so the fast-path guard
  is cheap. Applied alongside to_signed_int64: 1.
- mix test passes: 1692 tests, 51 properties, 55 doctests, 0 failures.
- mix test --only lua53 does not regress: 29 tests, 0 failures (matches
  main).
- Profile after merge: Numeric.to_signed_int64 self-time drops on fib(22).
  Measured: 3.82% → 3.38% (12% relative drop). The plan's stretch target of
  < 1.5% relied on cross-module inlining, which @compile {:inline, ...}
  does not perform; the realized win comes from the guard short-circuit only.
- Microbenchmarks: fib improves by ≥ 1% floor / 3% stretch. Measured fib(30)
  wall clock: lua (chunk) 873.4ms → 844.8ms (-3.3%), lua (eval) 876.7ms
  → 852.2ms (-2.8%). Luerl (control) 730.9ms → 731.8ms (unchanged).
  Both runs show roughly ±0.5% deviation, so the deltas are well outside noise.
- No regression on overflow-heavy tests. Spot-checked:
  9223372036854775807 + 1 == -9223372036854775808,
  -9223372036854775808 - 1 == 9223372036854775807,
  0xFFFFFFFFFFFFFFFF == -1. All correct.

Changes

 lib/lua/vm/numeric.ex                            | 6 ++++++
 .agents/plans/B8-inline-numeric-narrowing.md     | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

The behavior is bit-for-bit identical; the fast path is purely a guard-tested
return-as-is. The slow path (already-out-of-range integers needing wrap-around)
is unchanged.
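
The "bit-for-bit identical" claim can be spot-checked by comparing a mask-only wrap against a guarded one over boundary values. This is a minimal sketch: `EquivalenceSketch`, `wrap_always/1`, and `wrap_guarded/1` are hypothetical names, and the sample list is illustrative rather than the project's test suite.

```elixir
defmodule EquivalenceSketch do
  # Hypothetical helper; compares an unconditional wrap with a guarded one.
  import Bitwise

  @max_int64 0x7FFFFFFFFFFFFFFF
  @min_int64 0 - 0x8000000000000000

  # Always mask to 64 bits, then reinterpret bit 63 as the sign.
  def wrap_always(n) when is_integer(n) do
    masked = band(n, 0xFFFFFFFFFFFFFFFF)
    if masked > @max_int64, do: masked - 0x10000000000000000, else: masked
  end

  # Guarded variant: in-range inputs are returned as-is, the rest are wrapped.
  def wrap_guarded(n) when is_integer(n) and n >= @min_int64 and n <= @max_int64, do: n
  def wrap_guarded(n) when is_integer(n), do: wrap_always(n)

  # Boundary values on both sides of the signed 64-bit range.
  def samples do
    [0, 1, -1, 42, @max_int64, @min_int64, @max_int64 + 1, @min_int64 - 1,
     0xFFFFFFFFFFFFFFFF, -0x10000000000000000]
  end

  # True when both variants return the same integer for every sample.
  def agree?, do: Enum.all?(samples(), fn n -> wrap_always(n) == wrap_guarded(n) end)
end
```

For in-range inputs the mask and conditional subtraction reproduce the input exactly, which is why skipping them does not change any result.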

Discoveries

- @compile {:inline, ...} only inlines within the same module. Cross-module
  callers in Lua.VM.Executor and Lua.VM.Value still trip a function
  boundary on every call. This caps the win below the plan's stretch target:
  the realized improvement comes entirely from the guard short-circuit, not
  from inlining at the dispatch sites.
- tprof call count stayed at 85,968 before/after, confirming no inlining
  happened at the cross-module callers. The wall-clock improvement is real
  (luerl as control did not move), so the per-call win is genuine even if
  modest.
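
The difference between the two kinds of call site can be sketched as follows. The module names `InlineSketch.Numeric` and `InlineSketch.Executor`, and the `narrow/1` and `add/2` functions, are hypothetical stand-ins; only the general rule (the `:inline` compile option applies to local calls within one module) is taken from the discovery above.

```elixir
defmodule InlineSketch.Numeric do
  # Hypothetical stand-in for the module that defines the narrowing function.
  import Bitwise

  @max_int64 0x7FFFFFFFFFFFFFFF
  @min_int64 0 - 0x8000000000000000

  @compile {:inline, narrow: 1}

  def narrow(n) when is_integer(n) and n >= @min_int64 and n <= @max_int64, do: n

  def narrow(n) when is_integer(n) do
    masked = band(n, 0xFFFFFFFFFFFFFFFF)
    if masked > @max_int64, do: masked - 0x10000000000000000, else: masked
  end

  # Local call: the compiler is allowed to expand narrow/1 in place here.
  def add(a, b), do: narrow(a + b)
end

defmodule InlineSketch.Executor do
  # Remote call: this stays a real call into the other module, so the
  # :inline option on narrow/1 has no effect at this call site.
  def add(a, b), do: InlineSketch.Numeric.narrow(a + b)
end
```

Both `add/2` functions return the same results; the only difference is whether the call to `narrow/1` can be eliminated at compile time, which is consistent with the unchanged tprof call count.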

Verification

mix format
mix compile --warnings-as-errors
mix test            # 1692 tests, 0 failures
mix test --only lua53   # 29 tests, 0 failures, 23 skipped

Benchmark (fib(30), 10s benchee runs, 2s warmup):

Name                ips      average     deviation  median
lua (chunk)        1.18    844.76 ms     ±0.42%    845.56 ms   (was 873.36ms, -3.3%)
lua (eval)         1.17    852.21 ms     ±0.43%    851.97 ms   (was 876.74ms, -2.8%)
luerl              1.37    731.78 ms     ±0.50%    732.24 ms   (was 730.87ms, unchanged - control)
C Lua (luaport)   36.66     27.28 ms     ±4.38%     26.86 ms

Profile (fib(22), mix profile.tprof):

Lua.VM.Numeric.to_signed_int64/1   85968 calls   3.82% -> 3.38%
Lua.VM.Executor.do_execute/8      802388 calls  50.63% -> 52.09% (no change, just relative shift)
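
The percentage deltas quoted for the benchmark and profile figures above can be reproduced with a small helper. `DeltaSketch` and `relative_change/2` are hypothetical names used only for this check; the inputs are the before/after numbers from the tables.

```elixir
defmodule DeltaSketch do
  # Relative change from `before` to `after_value`, as a percentage.
  # Negative results mean the value went down.
  def relative_change(before, after_value) do
    (after_value - before) / before * 100
  end
end

# Wall clock, fib(30), averages in ms.
chunk = DeltaSketch.relative_change(873.36, 844.76)
eval = DeltaSketch.relative_change(876.74, 852.21)
luerl = DeltaSketch.relative_change(730.87, 731.78)

# Self-time share of to_signed_int64/1 on fib(22).
profile = DeltaSketch.relative_change(3.82, 3.38)

IO.inspect({chunk, eval, luerl, profile})
```

Rounded to one decimal place, the wall-clock changes come out at about -3.3% and -2.8%, with luerl at about +0.1%. The profile share changes by about -11.5%, which the description rounds to a 12% relative drop.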

Out of scope (intentional)

- Bypassing to_signed_int64/1 calls entirely at the executor level. That
  is B5 territory (compiling prototypes to Erlang).
- Changing Lua wrap-around semantics. Behavior is identical.
- Turning call sites into inline arithmetic. That would tangle with B5 and is
  not the right place.

The Lua 5.3 wrap-around mask runs on every integer arithmetic result, but the overwhelmingly common case is an input already in [-2^63, 2^63 - 1], which passes through unchanged. Adding a guarded clause that returns the input as-is short-circuits the masking on that branch. `@compile {:inline, ...}` lets the compiler inline both clauses at intra-module call sites; cross-module callers still trip a function boundary, but the guarded clause's match cost is lower than the band+compare body.

On fib(22), Numeric.to_signed_int64 self-time drops 3.82% -> 3.38% under tprof. On fib(30) wall clock, lua (chunk) improves 873.4ms -> 844.8ms (-3.3%), comfortably outside the run-to-run deviation band. Luerl (the control) does not move. Overflow tests (max_int + 1, min_int - 1, 0xFFFF...) still wrap correctly.

Plan: .agents/plans/B8-inline-numeric-narrowing.md

Records PR #227, captures the discovery that @compile {:inline, ...} does not cross module boundaries (so the fast path's win comes from the guard short-circuit only, not from call-site inlining), and the wall-clock fib(30) delta of -3.3%.

Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at 3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on table_build. The real table-workload bottlenecks live inside Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%) and in :erlang.setelement (17.5% on table writes, 20.9% on OOP). Those are B7's targets, not B6's. B6's projected wall-clock win is now below 1%, inside benchee's deviation band on every measured workload. Audit cleanup may still be worth doing later as a refactor, but not as a perf plan and not before B7.

davydog187 added 3 commits May 21, 2026 14:24

chore(B8): start plan

853e51b

chore(B8): mark plan as review

74f1d23

davydog187 merged commit 297eadd into main May 21, 2026
4 checks passed

davydog187 deleted the perf/inline-numeric-narrowing branch May 21, 2026 21:43

davydog187 mentioned this pull request May 21, 2026

perf(vm): split table storage into array + hash parts #229

Closed

8 tasks

davydog187 added a commit that referenced this pull request May 22, 2026

chore(B8): mark plan merged via PR #227

f25da3f

davydog187 mentioned this pull request May 22, 2026

docs(roadmap): consolidate B-series findings (B4, B6, B7, B8 + harness) #234

Merged
