perf(vm): compile Lua prototypes to BEAM modules #235

Open: davydog187 wants to merge 3 commits into main from perf/erlang-codegen-foundation

@davydog187
Contributor

Plan: B5a — Erlang codegen foundation

Plan: .agents/plans/B5a-erlang-codegen-foundation.md
Parent strategic plan: .agents/plans/B5-compile-prototypes-to-erlang.md

Goal

Land the foundation for compiling Lua prototypes to BEAM modules
via :compile.forms/2. A compiled prototype's call goes through a
new {:compiled_closure, mod, fun, upvalues, proto} value type,
bypassing the interpreter's register-tuple construction and per-opcode
dispatch loop entirely. This first PR covers arithmetic, comparison,
logical ops, conditional :test, single-result :call,
single-value :return, and the common _ENV.name lookup path.
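As a rough illustration of the mechanism described above, the sketch below builds Erlang abstract forms by hand, compiles them with :compile.forms/2, loads the result, and dispatches through a tuple shaped like the new value type. The module name `:lua_sketch_add`, the `add/2` function, and the `call` helper are hypothetical; the PR's real calling convention (how upvalues and VM state are passed) is not shown in this description.

```elixir
# Abstract forms equivalent to the Erlang source:
#   -module(lua_sketch_add).
#   -export([add/2]).
#   add(A, B) -> A + B.
forms = [
  {:attribute, 1, :module, :lua_sketch_add},
  {:attribute, 1, :export, [add: 2]},
  {:function, 1, :add, 2,
   [
     {:clause, 1, [{:var, 1, :A}, {:var, 1, :B}], [],
      [{:op, 1, :+, {:var, 1, :A}, {:var, 1, :B}}]}
   ]}
]

# Compile the forms to a BEAM binary and load it into the running VM.
{:ok, mod, binary} = :compile.forms(forms, [])
{:module, ^mod} = :code.load_binary(mod, ~c"nofile", binary)

# Hypothetical dispatch: the tuple shape follows the PR text, but passing
# the Lua arguments straight to the function is an assumption of this sketch.
closure = {:compiled_closure, mod, :add, [], nil}
call = fn {:compiled_closure, m, f, _upvalues, _proto}, args -> apply(m, f, args) end

IO.inspect(call.(closure, [2, 3]))
```

Once loaded, the call is an ordinary BEAM function call, which is why no register tuple or per-opcode dispatch is involved on this path.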

Scope

Supported in this PR:

  • Constants and moves: :load_constant, :load_boolean, :load_nil,
    :move, :source_line, :scope
  • Upvalues + globals: :get_upvalue, :get_open_upvalue,
    :load_env, :get_global
  • _ENV.name field access: :get_field with binary literal name
    (inlines the no-metatable fast path; metatable case delegates to
    Executor.index_value/6)
  • Arithmetic with integer fast path: :add, :subtract, :multiply;
    slow-path-only for :divide, :floor_divide, :modulo, :power,
    :negate
  • Comparison with number fast path: :less_than, :less_equal,
    :greater_than, :greater_equal; slow-path-only for :equal,
    :not_equal
  • Logical :not
  • Conditional :test and :test_true — restricted to branches that
    terminate via :return (no SSA-merging in B5a)
  • :call with single-result returns; routes through
    call_function_with_position which bridges native-callback position
    tracking but no-ops for pure Lua-to-Lua calls.
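The "integer fast path" idea in the list above can be sketched as a guarded clause that handles the common case inline and delegates everything else. This is a minimal sketch, not the PR's generated code: `FastPathSketch` and `slow_add/2` are hypothetical stand-ins (the real slow path is Executor.apply_arith_op/6, whose arguments are not shown here), and Lua's 64-bit integer wraparound and string coercion are deliberately not modeled.

```elixir
defmodule FastPathSketch do
  # Fast path: both operands are integers, so the addition is done inline.
  def add(a, b) when is_integer(a) and is_integer(b), do: a + b

  # Anything else is handed to the slow path.
  def add(a, b), do: slow_add(a, b)

  # Hypothetical slow path: handles floats and mixed numbers, and raises
  # for non-numeric operands.
  defp slow_add(a, b) when is_number(a) and is_number(b), do: a + b

  defp slow_add(a, b) do
    raise ArgumentError,
          "attempt to perform arithmetic on #{inspect(a)} and #{inspect(b)}"
  end
end

IO.inspect(FastPathSketch.add(1, 2))
IO.inspect(FastPathSketch.add(1, 2.5))
```

Opcodes listed as slow-path-only would skip the first clause and always take the delegated route.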

Out of scope (deliberately falling back to interpreter):

  • Tables → B5c
  • Closures, varargs, multi-return → B5d
  • Error position fidelity inside compiled raises → B5e
  • :goto/:label, loops (:numeric_for, :while_loop, etc.)

All-or-nothing per prototype: a prototype containing any unsupported
opcode falls back to interpretation in its entirety. Sub-prototypes
compile independently.
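The all-or-nothing rule can be pictured as a per-prototype check that recurses into children independently. The sketch below assumes a simplified prototype shape (a map with `:name`, `:opcodes`, and `:children`) and a small illustrative subset of supported opcodes; neither is the PR's actual data structure or full opcode set.

```elixir
defmodule AllOrNothingSketch do
  # Illustrative subset of the supported opcodes listed in the PR.
  @supported MapSet.new([:load_constant, :move, :add, :less_than, :test, :call, :return])

  # Returns a flat list of {name, mode} decisions. A prototype is compiled
  # only if every one of its own opcodes is supported; each child prototype
  # is decided separately.
  def decide(%{name: name, opcodes: opcodes, children: children}) do
    mode =
      if Enum.all?(opcodes, &MapSet.member?(@supported, &1)),
        do: :compiled,
        else: :interpreted

    [{name, mode} | Enum.flat_map(children, &decide/1)]
  end
end

outer = %{
  name: "outer",
  opcodes: [:move, :numeric_for, :return],
  children: [%{name: "inner", opcodes: [:add, :return], children: []}]
}

IO.inspect(AllOrNothingSketch.decide(outer))
```

Here the outer prototype contains a loop opcode and is interpreted, while its nested function still compiles.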

Success criteria

  • Lua.Compiler.Erlang.compile/1 exists and returns
    {:ok, proto_with_compiled_module_set} for covered prototypes
  • Lua.VM.CompiledModule value type wired through
    Executor.call_function/3 and the :call opcode dispatch
  • Every covered opcode lowered in Lua.Compiler.Erlang.Opcodes
  • Uncovered opcodes trigger fallback — never crash
  • Closure construction (:closure) emits :compiled_closure
    when the nested prototype compiled, else :lua_closure
  • mix test: 1705 tests + 51 properties + 55 doctests, 0
    failures
  • mix test --only lua53: 29 tests, 0 failures
  • fib(25) beats Luerl by ≥5x — not met (achieves ~1.1x).
    The throw/catch overhead on non-tail returns and the
    register-tuple setelement/3 churn dominate; B5b/B5c/B5d will
    close the gap as more opcodes inline.
  • No workload regresses

Perf

fib(30), full mode:

| Implementation | Mean | vs main | vs Luerl |
|---|---|---|---|
| main | ~970 ms | 1.00x | 0.74x (slower) |
| B5a (this PR) | ~670 ms | 1.45x faster | 1.07x faster |
| Luerl | ~720 ms | 1.35x faster | baseline |
| C Lua (luaport) | ~27 ms | 36x faster | 27x faster |
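The ratio columns follow from the approximate means by dividing the baseline's mean by the implementation's mean. A short sketch of that arithmetic, using only the rounded figures reported above (the variable names are illustrative):

```elixir
# Approximate mean runtimes in milliseconds, taken from the figures above.
means_ms = %{main: 970, b5a: 670, luerl: 720, c_lua: 27}

# Speedup of `impl` relative to `baseline`: values above 1.0 mean faster.
speedup = fn baseline, impl ->
  Float.round(means_ms[baseline] / means_ms[impl], 2)
end

for impl <- [:main, :b5a, :luerl, :c_lua] do
  IO.puts("#{impl}: #{speedup.(:main, impl)}x vs main, #{speedup.(:luerl, impl)}x vs Luerl")
end
```

Because the inputs are rounded means, the derived ratios are only as precise as those inputs.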

The compiled path beats Luerl modestly today. The 5x stretch target
is held back primarily by:

  • throw/catch for non-tail returns (~8% of CPU). This PR
    compiles the function-tail :return to a natural return; returns
    inside :test branches still throw. B5e (error fidelity) will
    revisit the throw/catch shape.
  • setelement/3 per opcode write (~22% of CPU). Equivalent to
    the interpreter's register-tuple cost. Register promotion to SSA
    Erlang variables (deferred follow-up) eliminates this.
  • apply_arith_op / index_value calls when the inline fast path
    doesn't fire. B5c adds table-opcode coverage, which inlines more
    paths.
Changes

  • lib/lua/compiler/erlang.ex — top-level compile/load orchestration
  • lib/lua/compiler/erlang/codegen.ex — abstract-forms generation
  • lib/lua/compiler/erlang/opcodes.ex — per-opcode lowering
  • lib/lua/compiler/erlang/runtime.ex — generated-code runtime helpers
  • lib/lua/compiler/prototype.ex: adds the compiled_module field
  • lib/lua/compiler.ex — wire codegen into Lua.Compiler.compile/2
  • lib/lua/vm.ex — top-level execute dispatches to compiled module
  • lib/lua/vm/executor.ex: adds :compiled_closure clauses in
    call_function/3 and the :call opcode; apply_arith_op/6,
    apply_unary_op/5, apply_compare_op/6,
    call_function_with_position/5 public helpers; index_value/6
    promoted to public
  • lib/lua/vm/value.ex, lib/lua/util.ex, lib/lua/api.ex,
    lib/lua/vm/display.ex, lib/lua/vm/stdlib*.ex, lib/lua.ex:
    add :compiled_closure clauses everywhere :lua_closure was
    pattern-matched

Verification

mix format
mix compile --warnings-as-errors
mix test                       # 1705 tests, 0 failures
mix test --only lua53          # 29 tests, 0 failures
MIX_ENV=benchmark mix run benchmarks/fibonacci.exs

Known limitations (followed up in B5b–B5e)

  • Every prototype gets a fresh module name; loaded modules persist
    until BEAM exit. B5b introduces the content-addressable
    ref-counted cache.
  • :get_field with non-binary name, all other table opcodes, and
    closures fall back. B5c and B5d cover them.
  • Errors raised from compiled code carry the codegen-time :source_line
    but not full position fidelity. B5e adds try/catch with
    pc_to_line tables.
  • One :erl_lint :unsafe_var warning has been observed in the logs
    (not a failure) for prototypes in which a :test branch writes a
    register and then continues. The prototype safely falls back to
    the interpreter in that case.

Splits B5 into five sequential plans (B5a foundation, B5b lifecycle,
B5c tables, B5d closures, B5e error fidelity) after three pre-flight
spikes confirmed the dispatch-loop hypothesis:

- Stripped fib(25):  278x faster than interpreter (BEAMASM ceiling)
- Faithful fib(25):  12.4x faster than interpreter, 10.4x vs Luerl
- Faithful table_sum: 2.1x faster than interpreter (modest by design)

Spike benchmarks land permanently under benchmarks/b5_spike*.exs so
each follow-on plan can re-measure against the same baseline.

Plan: B5a (foundation)
Introduces Lua.Compiler.Erlang — a codegen that translates supported
%Prototype{} values into Erlang abstract forms via :compile.forms/2,
loaded as fresh BEAM modules at runtime. The dispatch path through
{:compiled_closure, mod, fun, upvalues, proto} bypasses the interpreter's
register-tuple construction and per-opcode dispatch loop entirely.

Coverage in this PR (B5a — foundation):
- arithmetic, comparison, logical ops (with integer fast paths)
- control flow: :test (terminating branches), :test_true, early return
- upvalues: :get_upvalue, :get_open_upvalue, :load_env, :get_global
- :get_field on _ENV (inline no-metatable fast path; metatable case
  delegates to Executor.index_value/6)
- :call with single-result returns; routes through
  call_function_with_position which bridges native-callback position
  tracking but no-ops for Lua-to-Lua calls.
- :scope (transparent block inlining)
- :move, :load_constant, :load_nil, :load_boolean, :source_line

Out of scope (B5c/B5d/B5e):
- table opcodes (:new_table, :get_table, :set_table, :set_list,
  :set_field, non-env :get_field)
- closure construction (:closure), upvalue mutation
  (:set_upvalue, :set_open_upvalue), varargs, multi-value returns
- error position fidelity for raises inside compiled code
- :goto/:label, loops (:numeric_for, :while_loop, :repeat_loop,
  :generic_for, :break)

The all-or-nothing rule applies per prototype: if any opcode in a
prototype is unsupported, that prototype falls back to interpretation.
Sub-prototypes compile or fall back independently, and the :closure
opcode emits the appropriate value type per child.

Suite: 1705 tests + 51 properties + 55 doctests, 0 failures.
       29 lua53 tests, 0 failures.

Perf (fib(30)):
- main:           ~970 ms
- with B5a:       ~670 ms (1.45x faster than main, 1.07x faster than Luerl)

The 5x-vs-Luerl stretch target from the plan is not met by this PR
alone — most of the remaining gap is throw/catch overhead on the
non-tail :return forms, register-tuple setelement churn, and the
Process.put bridge on calls. Each closes incrementally as B5b through
B5e land.

Plan: B5a