From ab375c677bd9b1b91b7b8eb3b1311ef01ee22c47 Mon Sep 17 00:00:00 2001 From: Dave Lucia Date: Fri, 22 May 2026 08:47:59 -0700 Subject: [PATCH 1/3] chore(plan): scope B5a-B5e and record spike benchmarks Splits B5 into five sequential plans (B5a foundation, B5b lifecycle, B5c tables, B5d closures, B5e error fidelity) after three pre-flight spikes confirmed the dispatch-loop hypothesis: - Stripped fib(25): 278x faster than interpreter (BEAMASM ceiling) - Faithful fib(25): 12.4x faster than interpreter, 10.4x vs Luerl - Faithful table_sum: 2.1x faster than interpreter (modest by design) Spike benchmarks land permanently under benchmarks/b5_spike*.exs so each follow-on plan can re-measure against the same baseline. Plan: B5a (foundation) --- .../plans/B5-compile-prototypes-to-erlang.md | 341 +++++++++++++++- .../plans/B5a-erlang-codegen-foundation.md | 363 ++++++++++++++++++ .agents/plans/B5b-module-lifecycle.md | 236 ++++++++++++ .agents/plans/B5c-table-opcodes.md | 216 +++++++++++ .agents/plans/B5d-closures-and-varargs.md | 197 ++++++++++ .agents/plans/B5e-error-position-fidelity.md | 171 +++++++++ benchmarks/b5_spike.exs | 126 ++++++ benchmarks/b5_spike_faithful.exs | 279 ++++++++++++++ benchmarks/b5_spike_tables.exs | 206 ++++++++++ 9 files changed, 2128 insertions(+), 7 deletions(-) create mode 100644 .agents/plans/B5a-erlang-codegen-foundation.md create mode 100644 .agents/plans/B5b-module-lifecycle.md create mode 100644 .agents/plans/B5c-table-opcodes.md create mode 100644 .agents/plans/B5d-closures-and-varargs.md create mode 100644 .agents/plans/B5e-error-position-fidelity.md create mode 100644 benchmarks/b5_spike.exs create mode 100644 benchmarks/b5_spike_faithful.exs create mode 100644 benchmarks/b5_spike_tables.exs diff --git a/.agents/plans/B5-compile-prototypes-to-erlang.md b/.agents/plans/B5-compile-prototypes-to-erlang.md index 3de7800..a93ea34 100644 --- a/.agents/plans/B5-compile-prototypes-to-erlang.md +++ 
b/.agents/plans/B5-compile-prototypes-to-erlang.md @@ -3,20 +3,39 @@ id: B5 title: Compile Lua prototypes to Erlang functions (executor JIT) issue: null pr: null -branch: perf/compile-to-erlang +branch: n/a (split into B5a-B5e) base: main -status: blocked +status: split direction: B unlocks: - sub-Luerl latency on tight numeric/call workloads - "perf parity with Luerl, ±10%" 1.0 commitment headroom --- -## Blocked on +## Status: split into B5a–B5e -- B4 — the flat instruction stream is the natural intermediate - representation to translate into Erlang. Trying to JIT directly from - the list-of-tuples shape would mix two structural changes in one PR. +After three pre-flight spikes (recorded in `## Discoveries` below) +the work was split into five sequential plans, each shippable as +one PR per the `ship-a-plan` contract: + +- **B5a** — Erlang codegen foundation; covers fib + arithmetic + + control flow. Falls back on tables and closures. +- **B5b** — Module lifecycle (cache + ref-counted purging). + Immediately after B5a; required before more opcodes ship. +- **B5c** — Table opcodes. +- **B5d** — Closures, varargs, multi-return. +- **B5e** — Error position fidelity. + +This parent plan stays as the strategic record: spike data, +architectural decisions, and what was decided out of scope. Read +the child plans for what gets implemented. + +## Blocked on (historical) + +- B4 — the flat instruction stream was assumed to be the natural + intermediate representation. The B4 spike disproved this; B5 + proceeds directly from the existing list-of-tuples shape. See + the B4 plan's Discoveries for why. 
## Goal @@ -251,4 +270,312 @@ IO.inspect(:code.all_loaded() |> length(), label: "loaded modules after 1000 eva ## Discoveries -(populated during implementation) +### Pre-flight spike (perf/b5-spike-fib, May 2026) + +Before committing to the multi-month build, a vertical-slice spike hand- +wrote what `compile:forms/2` would emit for the fib prototype's hot +path and compared it against the interpreter, native Elixir (BEAMASM +ceiling), Luerl, and C Lua (via luaport). Spike source: +`benchmarks/b5_spike.exs`. + +**fib(25), full mode:** + +| Implementation | Mean | Memory | vs interpreter | +|---|---|---|---| +| native elixir | 0.27 ms | 0 B | 325x faster | +| compiled erlang | 0.89 ms | 0 B | **98x faster** | +| C Lua (luaport) | 2.35 ms | 184 B | 37x faster | +| luerl | 65.4 ms | 238 MB | 1.34x faster | +| lua (chunk) | 87.7 ms | 705 MB | baseline | + +**fib(30), quick mode:** + +| Implementation | Mean | vs interpreter | +|---|---|---| +| native elixir | 3.30 ms | 294x faster | +| compiled erlang | 9.67 ms | **100x faster** | +| C Lua (luaport) | 26.8 ms | 36x faster | +| luerl | 726 ms | 1.34x faster | +| lua (chunk) | 970 ms | baseline | + +Ratios are stable across n; the result is not a small-n artefact. + +### What the spike shows + +- The compiled-erlang path is two orders of magnitude faster than the + interpreter on fib's hot path and is the only path that beats luerl + by more than a constant factor. The exit condition for going ahead + with B5 (≥30% win on fib(30)) is met by ~33x. +- Memory is the more dramatic signal: 705 MB → 0 B on fib(25). The + interpreter's register-tuple churn (`setelement/3` at 25% of self- + time in the main-branch profile) **disappears completely** when the + prototype compiles to a module that uses Erlang variables instead of + a tuple. This validates that the `setelement/3` ceiling identified + in the B-series consolidation is not a wall — it is a property of + the interpreter's data shape, not of the BEAM. 
+- The 3.3x gap between native elixir and compiled erlang is the + realistic ceiling for B5: cross-module inlining and constant-time + call resolution that runtime-loaded modules don't get. B5 should be + scoped against the compiled-erlang column, not the native-elixir + column. + +### Caveats the spike does not address + +1. **fib is the friendliest possible workload.** Pure integer math, no + tables, no metamethods, no strings, no upvalue mutations. The OOP + and table_ops benchmarks exercise costs the spike does not touch. + B5 may deliver smaller (still meaningful) wins on those. +2. **The spike strips Lua semantics.** No register tuple, no `_ENV.fib` + lookup, no metamethod dispatch path on `<` or `+`. Each of those + reintroduces overhead a real B5 codegen must respect. The first PR + should validate that a faithful translation (register tuple + + `get_upvalue` + `get_field` for the recursive call) still clears + the original plan's success bar. +3. **Module load cost is not amortised.** Compiled once outside the + benchmark. Content-addressable module cache (already in the plan) + handles repeated runs; one-shot scripts may be net slower. + +### Adjustments to the plan + +- Success criterion "fib(25) parity with Luerl ±5%" is too conservative + given the spike numbers. Update to "fib(25) beats Luerl by ≥20x" or + similar, set on the basis of the faithful-translation prototype, not + the stripped spike. +- Option (1) (keep registers as tuple, eat `setelement/3`) is the right + first move. The spike showed the dispatch-loop win is overwhelming + even before register promotion; SSA promotion (`B5c`) can be deferred + without losing the bulk of the win. +- The faithful-translation prototype (next step) should land as a + second spike before the full plan implementation begins. If a + faithful fib compiled module loses more than 5x against the stripped + spike, the Lua-semantics overhead is bigger than expected and the + plan needs another pass. 
+ +### Spike artefact + +Branch `perf/b5-spike-fib`, file `benchmarks/b5_spike.exs`. Reproduce +with `MIX_ENV=benchmark mix run benchmarks/b5_spike.exs` +(or with `LUA_BENCH_MODE=full` / `FIB_N=30`). + +### Faithful follow-up spike (perf/b5-spike-fib, May 2026) + +The stripped spike answered "is there headroom?" Yes. This second +spike answered "how much survives when we add back the Lua-VM +machinery a real B5 codegen could not skip?" + +The faithful spike compiles fib via `compile:forms/2` and then, for +its recursive call, looks up `_ENV` through the upvalue cell, fetches +`_ENV.fib` from the globals table, and re-enters via +`Lua.VM.Executor.call_function/3`. State threads through both calls. +Args are boxed in a list, results unbox from a list — all the same +protocol the interpreter uses. + +Source: `benchmarks/b5_spike_faithful.exs`. Required a small +additive change in `lib/lua/vm/executor.ex` to register a +`:compiled_closure` value type that dispatches to a BEAM module +without building a callee register tuple (this is the win condition — +the spike measures call cost when the dispatch shape itself is +collapsed to a BEAM function call). The change is tagged as spike- +only in comments; full test suite (1705 tests + 51 properties + 55 +doctests) still passes with it in place. + +**fib(25), full mode:** + +| Implementation | Mean | Memory | vs interpreter | vs Luerl | +|---|---|---|---|---| +| compiled-stripped | 0.28 ms | 0 MB | 278x | 232x | +| native elixir | 0.32 ms | 0 MB | 243x | 210x | +| C Lua (luaport) | 2.34 ms | 184 B | 33x | 28x | +| **compiled-faithful** | **6.27 ms** | **13.0 MB** | **12.4x** | **10.4x** | +| luerl | 64.8 ms | 227 MB | 1.2x | baseline | +| lua (interpreter) | 77.7 ms | 673 MB | baseline | 1.2x slower | + +### What the faithful spike shows + +- B5 still clears its bar by a wide margin: **12.4x faster than the + interpreter, 10.4x faster than Luerl, 22x slower than the BEAMASM + ceiling**. 
The 22x gap between stripped and faithful is the real + cost of preserving Lua semantics during the recursive call (upvalue + cell lookup, get_field on `_ENV`, two `call_function/3` invocations + per frame, state threading, args/result list boxing). +- Memory is the standout signal: 673 MB → 13 MB on fib(25), a 50x + reduction *even with* the call-protocol overhead intact. The + register-tuple `setelement/3` churn that consumed 25% of fib's + self-time on main is gone — the compiled function uses Erlang + variables, and the register tuple never enters the picture for + the compiled prototype itself. +- Risks #5 in the original plan — "the 1.13x current gap may not be + reachable from here alone" — is falsified. The plan assumed + `setelement/3` was a floor. The spike shows it is a property of + the interpreter's data shape, not the BEAM. + +### What the faithful spike unlocks for the plan + +1. **The biggest remaining cost is the call protocol, not dispatch.** + That changes B5's phasing. A v1 that just collapses dispatch (the + plan's headline) gets most of the win. A follow-up that adds a + direct-call edge for compiled-to-compiled invocations (skipping + list boxing on args and results) would buy another large chunk + — likely a B5d or B5e plan. + +2. **`_ENV.fib` static-resolution is a real follow-up lever.** Every + recursive call re-resolves `fib` through `_ENV`. The interpreter + pays this. B5 codegen can prove (in the common case) that the + binding is stable across calls and emit a direct call. This is + peephole/escape-analysis work — defer to a follow-up plan. + +3. **Register-tuple `setelement/3` is not the ceiling.** This was + the dominant concern in the B-series consolidation (ROADMAP.md + §"What we learned"). The spike shows compiling out of the + register-tuple representation entirely (Option 1 in the plan, + ironically the conservative one) eliminates the cost completely + on prototypes that fit in BEAM registers. 
SSA promotion (`B5c`) + was scoped as the lever for this — it can be deferred without + losing the bulk of the win. + +### Revised success criteria + +Replace the plan's "fib(25) parity with Luerl ±5%" with: + +- Floor: fib(25) beats Luerl by ≥5x. +- Target: fib(25) beats Luerl by ≥8x. +- Stretch: fib(25) beats Luerl by ≥10x. + +The faithful spike hit 10.4x; even a halving of that gap for real- +codegen overhead clears the floor comfortably. + +### What the spike did not prove + +- **Other workloads.** fib is pure integer math. The OOP, table_ops, + closures, and string_ops benchmarks exercise costs the spike does + not touch. A faithful spike on at least one table-heavy workload + should follow before B5 commits to a phasing — if (say) the + table_ops loop only wins 2-3x faithful, the plan's per-opcode + migration order may need to lead with table ops rather than + arithmetic. +- **Compile-and-load amortisation.** Spike loads modules once outside + the loop. `Lua.VM.CodeCache` work in the plan stands. +- **Module purging.** Spike never cleans up. + +### Spike artefacts + +- `benchmarks/b5_spike.exs` — stripped spike (no Lua semantics). +- `benchmarks/b5_spike_faithful.exs` — faithful spike (full call + protocol). +- `lib/lua/vm/executor.ex` — additive `:compiled_closure` dispatch + (spike-only, two clauses; see in-line comments). + +All on branch `perf/b5-spike-fib`. Reproduce: + +``` +MIX_ENV=benchmark mix run benchmarks/b5_spike_faithful.exs +LUA_BENCH_MODE=full MIX_ENV=benchmark mix run benchmarks/b5_spike_faithful.exs +``` + +### Table-heavy spike (perf/b5-spike-fib, May 2026) + +The first two spikes measured fib — pure integer arithmetic, the +friendliest possible workload. Open question after the faithful +spike: does the win generalise to table-heavy code? Tables exercise +costs B5 cannot eliminate (`Table.put/3` building a new map per +mutation, `state.tables` updates per write). 
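To make that allocation cost concrete, here is a minimal Elixir sketch of a table-sum loop over an immutable map of maps. The module name `TableSumSketch`, the `%{tables: ...}` state shape, and the `put/4` and `get/3` helpers are assumptions for illustration; they are not the library's `Table.put/3` or its real state struct.

```elixir
defmodule TableSumSketch do
  # Hypothetical stand-in for a table write: state.tables is modelled as
  # %{ref => %{key => value}}. Every write builds a new inner map and a
  # new outer map, because both levels are immutable.
  def put(state, ref, key, value) do
    update_in(state, [:tables, ref], &Map.put(&1, key, value))
  end

  # Hypothetical stand-in for a table read: walks the same two levels.
  def get(state, ref, key), do: get_in(state, [:tables, ref, key])

  # Populate t[1..n] = i, then sum the entries back out.
  def run(n) do
    state = %{tables: %{t: %{}}}
    state = fill(1, n, state)
    sum(1, n, state, 0)
  end

  # First loop: one write per iteration, written as a recursive helper.
  defp fill(i, n, state) when i > n, do: state
  defp fill(i, n, state), do: fill(i + 1, n, put(state, :t, i, i))

  # Second loop: one read per iteration, accumulating the total.
  defp sum(i, n, _state, acc) when i > n, do: acc
  defp sum(i, n, state, acc), do: sum(i + 1, n, state, acc + get(state, :t, i))
end

IO.inspect(TableSumSketch.run(100))
# prints 5050
```

Removing the dispatch loop around `put/4` leaves the two map rebuilds per write untouched, which is the cost this spike sets out to measure.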
+ +Third spike compiles `run_table_sum(n)` from +`benchmarks/table_ops.exs` — two tight `:numeric_for` loops, one +populating a 1..n table, one summing it. Every iteration of the +first loop hits `:set_table`; every iteration of the second hits +`:get_table`. Same `:compiled_closure` dispatch as the second spike. + +Source: `benchmarks/b5_spike_tables.exs`. The compiled function is +written in Elixir rather than via `:compile.forms/2` — the second +spike already proved `:compile.forms` output runs at near-native +Elixir speed (1.13x slower in the worst case), and writing two +recursive loop helpers as abstract forms would add ~200 lines without +changing what's measured. + +**run_table_sum(n), full mode:** + +| n | Interpreter | Compiled | Luerl | C Lua | vs interp | vs Luerl | vs C Lua | +|---|---|---|---|---|---|---|---| +| 100 | 23.0 μs | 10.9 μs | 41.9 μs | 9.6 μs | **2.1x** | **3.8x** | 1.1x slower | +| 500 | 125 μs | 56.4 μs | 146 μs | 14.1 μs | **2.2x** | **2.6x** | 4.0x slower | +| 1000 | 274 μs | 131 μs | 272 μs | 20.1 μs | **2.1x** | **2.1x** | 6.6x slower | + +Memory at n=1000: interpreter 2.45 MB → compiled 0.59 MB (4.2x less). + +### What the table spike shows + +- **The compiled-vs-interpreter ratio is stable at ~2.1x across all + n.** Per-op interpreter dispatch is a constant cost per opcode, so + B5 saves a constant fraction. The ratio does not scale with n + because the dominant cost (table mutation allocation via + `Table.put/3` + state.tables update) is unchanged. +- **The compiled-vs-C-Lua gap widens with n.** At n=100 we + essentially match C Lua. At n=1000 we are 6.6x slower. This is + allocation churn — every `t[i] = i` allocates a new `:data` map + and a new `state.tables` map. PUC-Lua mutates in place; we cannot + because tables are immutable maps. Same constraint that defeated + B7 (see ROADMAP.md §"What we learned"). +- **B5's win on tables is ~6x smaller than on fib.** fib's win was + 12.4x faithful; tables is 2.1x faithful. 
Why: fib eliminates the + register-tuple `setelement/3` (25% of its self-time) entirely. + table_sum cannot escape the `Table.put` cost because that lives + in `state.tables`, not in registers — B5 saves dispatch around + the mutation, not the mutation itself. + +### What this changes about B5 phasing + +The plan's per-opcode phasing (arithmetic + control flow first, +tables next, then metamethods, then native calls) is correct. What +changes is the *expected return per phase*: + +- **Phase 1 (arithmetic + control flow):** the big win. Numeric + workloads go from 1.2x slower than Luerl (today) to ~10x faster. + fib-style code is the primary beneficiary. This is where most of + the headline performance numbers will come from. +- **Phase 2 (table ops):** smaller win (~2x). Worth doing, but + table-heavy workloads will not see numbers that look like Phase 1. +- **Phase 3+ (metamethods, native calls):** unmeasured. Each needs + a pre-flight spike if/when scoped. + +A Phase 1-only v1 could legitimately ship — fib-style workloads get +the big bump immediately, table workloads stay at interpreter speed +until Phase 2 lands. The release notes need to be honest about which +workloads benefit when. + +### Refined success criteria + +Replace the single fib target with per-workload targets: + +- **Numeric workloads (fib, math.*):** floor 5x faster than Luerl, + target 8x, stretch 10x. +- **Table workloads (table_sum, OOP, etc.):** floor 1.5x faster than + Luerl, target 2x. PUC-Lua parity is unreachable on BEAM for + table-heavy code — the third spike puts a hard number on this + (6.6x slower at n=1000 with the dispatch loop eliminated). The + remaining gap is allocation cost in immutable maps. Drop any + aspiration of PUC-Lua parity on table workloads. + +### Implication: parallel investigation worth scoping later + +Most of the table-workload allocation cost comes from `state.tables` +being a map of maps — every mutation walks two levels. 
If a future +plan changed table storage to something mutable from inside the BEAM +(`:ets`, `:atomics`, or a per-state mutable structure with explicit +GC integration), it would compose multiplicatively with B5: B5 saves +dispatch, that change saves allocation. Together they could close +the C-Lua gap meaningfully on table workloads. + +Not in scope for B5. Worth keeping in the back pocket as a B-series +follow-up once B5 v1 has shipped and the data shape is the obvious +remaining ceiling. + +### Third spike artefact + +`benchmarks/b5_spike_tables.exs`. Reuses the `:compiled_closure` +dispatch from the second spike. Reproduce: + +``` +MIX_ENV=benchmark mix run benchmarks/b5_spike_tables.exs +LUA_BENCH_MODE=full MIX_ENV=benchmark mix run benchmarks/b5_spike_tables.exs +``` diff --git a/.agents/plans/B5a-erlang-codegen-foundation.md b/.agents/plans/B5a-erlang-codegen-foundation.md new file mode 100644 index 0000000..c11e930 --- /dev/null +++ b/.agents/plans/B5a-erlang-codegen-foundation.md @@ -0,0 +1,363 @@ +--- +id: B5a +title: Erlang codegen foundation — compile arithmetic + control flow prototypes to BEAM modules +issue: null +pr: null +branch: perf/erlang-codegen-foundation +base: main +status: in-progress +direction: B +unlocks: + - B5b (lifecycle), B5c (tables), B5d (closures), B5e (errors) + - ~10x speedup over Luerl on numeric workloads (fib, math.*) + - ~2x speedup over Luerl on control-flow-heavy code +--- + +## Goal + +Land the foundation for compiling Lua `%Prototype{}` values to BEAM +modules via `:compile.forms/2`. The compiled module gets dispatched +through a new `:compiled_closure` value type that bypasses the +interpreter's register-tuple construction and per-opcode dispatch +loop entirely. 
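As a rough illustration of the compile-and-load step, the sketch below hand-writes Erlang abstract forms for a one-clause `execute/3`, compiles them with `:compile.forms/2`, and loads the result with `:code.load_binary/3`. The module name `lua_proto_sketch`, the wrapper `FormsSketch`, and the function body (add 1 to the first argument) are assumptions for illustration only; they are not output of the planned codegen.

```elixir
defmodule FormsSketch do
  # Builds, compiles, and loads a tiny module equivalent to:
  #   execute([R0 | _], _Upvalues, State0) -> {[R0 + 1], State0}.
  def load do
    name = :lua_proto_sketch

    forms = [
      {:attribute, 1, :module, name},
      {:attribute, 1, :export, [execute: 3]},
      {:function, 1, :execute, 3,
       [
         {:clause, 1,
          # Patterns: args list, upvalues tuple, state.
          [
            {:cons, 1, {:var, 1, :R0}, {:var, 1, :_}},
            {:var, 1, :_Upvalues},
            {:var, 1, :State0}
          ],
          # No guards.
          [],
          # Body: {[R0 + 1], State0}
          [
            {:tuple, 1,
             [
               {:cons, 1, {:op, 1, :+, {:var, 1, :R0}, {:integer, 1, 1}}, {nil, 1}},
               {:var, 1, :State0}
             ]}
          ]}
       ]}
    ]

    {:ok, ^name, binary} = :compile.forms(forms, [:return_errors])
    # Drop any old version first so repeated loads of the same name succeed.
    :code.purge(name)
    {:module, ^name} = :code.load_binary(name, ~c"lua_proto_sketch", binary)
    name
  end
end

mod = FormsSketch.load()
IO.inspect(mod.execute([41], {}, :state))
# prints {[42], :state}
```

The loaded function takes the same three arguments the plan describes (args list, upvalues, state) and returns a results list paired with the state.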
+ +This first PR covers every opcode **except tables and closures**: +arithmetic, comparison, control flow (including loops and goto), +bitwise ops, string concat/length, source-line tracking, calls, +single-value returns, and upvalue reads (read-only, since closures +ship in B5d). If a prototype contains a table or closure opcode the +whole prototype falls back to the interpreter (all-or-nothing per +prototype — mixed-mode interpret-from-pc is explicitly out of scope). + +## Why now + +Three pre-flight spikes (recorded under `## Discoveries` in +`.agents/plans/B5-compile-prototypes-to-erlang.md`, branch +`perf/b5-spike-fib`) measured the headroom against today's +interpreter: + +- **Stripped fib(25):** 278x faster than interpreter (BEAMASM ceiling). +- **Faithful fib(25):** 12.4x faster than interpreter, 10.4x faster + than Luerl. Memory 673 MB → 13 MB. +- **Faithful run_table_sum(1000):** 2.1x faster than interpreter, + 2.1x faster than Luerl. + +The dispatch-loop hypothesis from the parent plan is confirmed. The +spike branch demonstrated the `:compiled_closure` dispatch shape; +this plan productionises it. + +The library is pre-release and there is no flag — every prototype +the codegen can handle goes through compilation. That's the bet. + +## Out of scope + +- Module lifecycle (cache, ref-counting, purging). Every prototype + gets a fresh module per compile in this PR. **Leaks. B5b fixes + this immediately after merge.** +- Tables (`:new_table`, `:get_table`, `:set_table`, `:set_list`, + `:get_field` full path, `:set_field`). Falls back to interpreter. + B5c. +- Closures (`:closure`, `:set_upvalue`, `:get_open_upvalue`, + `:set_open_upvalue`, `:vararg`, `:return_vararg`, `:return` count + > 1). Falls back to interpreter. B5d. +- Error position fidelity for compiled code (line/source in raise + sites). B5e. +- Mixed-mode (compiled prototype calls interpreter for one missing + opcode and resumes). All-or-nothing per prototype. +- SSA / register promotion. 
Registers stay in a tuple in compiled + code — same shape as the interpreter. This is the conservative + option from the parent plan; the spike showed the dispatch win + alone justifies the work. + +## Success criteria + +- [ ] `Lua.Compiler.Erlang` module exists and converts a covered + `%Prototype{}` into Erlang abstract forms. +- [ ] `Lua.VM.CompiledModule` value type exists and is dispatched + by `Executor.call_function/3` and the `:call` opcode. + Carries `{:compiled_closure, module_name, function_name, + upvalues_tuple}`. +- [ ] `Lua.Compiler.compile/1,2` returns prototypes that have been + compiled to BEAM modules where the codegen accepts them. + Prototypes containing any uncovered opcode are returned as + plain interpreted prototypes (current behaviour). +- [ ] Opcode coverage in this PR (everything except tables, + closures, varargs, multi-return, generic_for, tail_call, + self): + `:load_constant`, `:load_boolean`, `:load_nil`, `:move`, + `:source_line`, `:scope`, `:get_upvalue`, `:get_global`, + `:set_global`, `:load_env`, `:get_field` (env-lookup form + only — uses the same fast path as the interpreter's + `get_field` when reading from an upvalue-loaded register + holding `_ENV`), `:add`, `:subtract`, `:multiply`, `:divide`, + `:floor_divide`, `:modulo`, `:power`, `:negate`, + `:bitwise_and`, `:bitwise_or`, `:bitwise_xor`, `:shift_left`, + `:shift_right`, `:bitwise_not`, `:less_than`, `:less_equal`, + `:greater_than`, `:greater_equal`, `:equal`, `:not_equal`, + `:not`, `:length`, `:concatenate`, `:test`, `:test_true`, + `:test_and`, `:test_or`, `:goto`, `:label`, `:numeric_for`, + `:while_loop`, `:repeat_loop`, `:break`, `:call`, `:return` + (count = 1). 
+ Out of scope and falling back: + `:new_table`/`:get_table`/`:set_table`/`:set_list`/ + `:set_field`/non-env-form `:get_field` (→ B5c), + `:closure`/`:set_upvalue`/`:get_open_upvalue`/ + `:set_open_upvalue`/`:vararg`/`:return_vararg`/ + `:return` count > 1/`:generic_for`/`:self`/`:tail_call` + (→ B5d). +- [ ] `mix test` passes; 1705 tests + 51 properties + 55 doctests. +- [ ] `mix test --only lua53` does not regress. +- [ ] fib(25) beats Luerl by ≥5x in `mix run benchmarks/fibonacci.exs`. + Stretch: ≥8x. +- [ ] No workload regresses on the existing benchmark suite by more + than 5% (within noise). +- [ ] Compiled-mode failures (codegen bugs) fall back gracefully to + interpretation — never crash. Logged via Logger.warning. + +## Implementation notes + +### Strategy + +`Lua.Compiler.Erlang.compile/1` takes a `%Prototype{}` and returns +either `{:ok, compiled_prototype}` or `:fallback` if any opcode is +uncovered. The codegen walks the instruction stream once, building +Erlang abstract forms, then calls `:compile.forms/2` and +`:code.load_binary/3`. + +Module names in this PR: `lua_proto_`. Real +content-addressable naming and lifecycle is B5b's job. Yes this +leaks; one PR of leak is acceptable for the integration period. + +### Codegen shape + +The compiled function signature mirrors the spike's faithful path: + +```elixir +@spec execute([term()], tuple(), State.t()) :: + {[term()], State.t()} +def execute(args, upvalues, state) do + # body +end +``` + +`args` is the call args as a list (matches `Executor.call_function/3`'s +`:lua_closure` clause). `upvalues` is the upvalue cell-ref tuple +threaded by the caller. `state` threads through. + +Inside the function: + +- A register variable for each register slot: `R0`, `R1`, … + Single-assignment Erlang variables. Reassigning `R3` becomes + `R3_1`, `R3_2`, … using a per-codegen-pass counter. +- The parameters land in `R0..R{param_count-1}` from the args list + via pattern matching at the function head. 
+- State is threaded as `State_0`, `State_1`, … through any opcode + that can mutate it. (`:call` and `:get_global` for upvalue + resolution can.) Most arithmetic is state-pure. + +### Control flow + +`:numeric_for`, `:while_loop`, `:repeat_loop` compile to +**recursive Erlang helper functions** inside the generated module. +This is the BEAM-native loop idiom and what `:compile.forms` +produces for any Erlang `case`-based loop. Each loop gets a fresh +helper named `loop_/N` where N covers the loop variable, +limit, step, and any captured live variables. + +`:goto` + `:label` resolve at codegen time to a function call into +a helper. The interpreter's `find_label/2` linear scan is replaced +by a compile-time label-to-helper map. + +`:break` becomes an early return from the loop helper. + +### Opcode lowering + +Each covered opcode lowers to a fixed snippet of Erlang abstract +forms. Strategy: + +- **Arithmetic/comparison** that already has integer fast paths in + the executor (the work from PR #223 et al.): inline a guard + clause for the integer-integer case, fall through to a helper + call (`Lua.VM.Numeric.add/2` etc.) for the slow path. This + preserves the exact semantics the interpreter delivers including + metamethod dispatch — the helper calls back through + `Executor.try_binary_metamethod/5`. +- **`:test`**: compile to an Erlang `case` over `Value.truthy?/1`, + with the two branches inlined as instruction sequences. This is + why we need control flow first — `:test` is everywhere. +- **`:call`**: dispatch to `Executor.call_function/3`. Args list is + materialized from the relevant register range; results unbox into + the right register slots. Pays the same call-protocol cost the + faithful spike measured. +- **`:return` count = 1**: returns `{[elem(regs, base)], state}` — + the standard CPS-frame-pop shape, but since this is the entry + function not a continuation, it just returns to whoever called + `Executor.call_function/3`. 
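The arithmetic strategy above (a guarded integer fast path, with everything else routed to a helper) can be sketched in plain Elixir as follows. `AddLoweringSketch` and `slow_add/2` are hypothetical names; `slow_add/2` only stands in for the real slow-path helper and does not implement string coercion or metamethod dispatch. The sketch also ignores Lua 5.3's 64-bit integer wraparound, since BEAM integers are arbitrary precision.

```elixir
defmodule AddLoweringSketch do
  # Fast path: the guard the codegen would inline when both operands
  # are integers. No helper call, no state threading.
  def add(a, b) when is_integer(a) and is_integer(b), do: a + b

  # Anything else falls through to the slow path.
  def add(a, b), do: slow_add(a, b)

  # Hypothetical stand-in for the slow-path helper. Floats are added
  # directly; other operand types are only tagged here, where the real
  # helper would attempt coercion and metamethod dispatch.
  defp slow_add(a, b) when is_number(a) and is_number(b), do: a + b
  defp slow_add(a, b), do: {:slow_path_unhandled, a, b}
end

IO.inspect(AddLoweringSketch.add(1, 2))
# prints 3
IO.inspect(AddLoweringSketch.add(1, 2.5))
# prints 3.5
```

Clause order matters here: the guarded integer clause comes first, so the common case never reaches the helper.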
+ +### Dispatch wiring + +`Lua.VM.Executor.call_function/3` learns a new clause: + +```elixir +def call_function({:compiled_closure, mod, fun, upvalues}, args, state) do + apply(mod, fun, [args, upvalues, state]) +end +``` + +The `:call` opcode dispatch learns the same shortcut: bypass +register-tuple construction, materialize args list, call +`apply(mod, fun, ...)`. This is the spike's `:compiled_closure` +clause promoted to production. The spike already added these +clauses to `lib/lua/vm/executor.ex` on this branch — verify they +stay in place, are properly tested, and are no longer flagged as +"spike-only" in comments. + +### Falling back + +`Lua.Compiler.compile/2` (the existing entry) is changed to: + +```elixir +def compile(source, opts \\ []) do + proto = existing_compile_path(source, opts) + case Lua.Compiler.Erlang.compile(proto) do + {:ok, compiled} -> compiled + :fallback -> proto + end +end +``` + +`Lua.Compiler.Erlang.compile/1` walks the instructions and returns +`:fallback` on the first uncovered opcode. Sub-prototypes (nested +function definitions) recurse; if any sub-prototype falls back, the +parent does too (avoids the mixed-mode complexity of mixing call +shapes between parent and child). + +### Where prototypes live after compile + +`%Prototype{}` gains a new optional field `compiled_module :: +{atom(), atom()} | nil` — module name and function name. When set, +all execution sites that currently see `{:lua_closure, proto, +upvalues}` use `{:compiled_closure, mod, fun, upvalues}` instead. +The conversion happens at closure-creation time +(`:closure` opcode, `Lua.Compiler.compile_to_closure`, and the +top-level entry in `Lua.VM.execute/2`). + +### Files + +- `lib/lua/compiler/erlang.ex` (new) — abstract-forms generator. + Public API: `compile/1`. Internal: per-opcode lowering helpers. +- `lib/lua/compiler/erlang/opcodes.ex` (new) — pure functions mapping + each covered opcode to its Erlang form. 
Kept separate so opcode + tables are easy to extend in later plans. +- `lib/lua/compiler/prototype.ex` — add `compiled_module` field. +- `lib/lua/compiler.ex` — wire the codegen into the public compile + path. Fallback handling. +- `lib/lua/vm/executor.ex` — add `:compiled_closure` clauses to + `call_function/3` and the `:call` opcode. Update closure-creation + sites to emit `:compiled_closure` when `proto.compiled_module` is + set. +- `lib/lua/vm.ex` — update entry point to dispatch the top-level + prototype through the compiled module if present. +- `test/lua/compiler/erlang_test.exs` (new) — fixed-input prototype + golden tests: every covered opcode in isolation, assert compiled + result == interpreted result. +- `test/lua/compiler/erlang_fallback_test.exs` (new) — every + uncovered opcode triggers `:fallback`. Sub-prototype fallback + cascades to parent. + +### Error fidelity (placeholder, full fix in B5e) + +For this PR: runtime errors raised from compiled code carry the +line at codegen time of the originating opcode (already in the +`:source_line` opcodes). Source filename comes from the prototype. +This is good enough for most tests; B5e adds full position +threading via try/catch. + +If a test asserts a specific stack trace shape that the compiled +path breaks, that test moves to an explicit `compiled: false` fixture +override **only after** confirming the assertion is about the +interpreter's stack trace specifically, not user-facing behaviour. +Track any such overrides in `## Discoveries`. + +### Benchmarks + +The spike benchmarks `benchmarks/b5_spike*.exs` ship as part of +this PR. They serve a dual purpose: + +1. **Regression tests for the dispatch shape.** They exercise the + `:compiled_closure` value type with hand-built modules, + independent of the codegen. If a later plan breaks the + dispatch protocol they fail loudly. +2. **Comparison baseline for codegen output.** The faithful spike + represents what a hand-tuned compile would look like. 
The real + codegen running through `Lua.Compiler.Erlang` should be within + ~2x of the faithful spike on fib. Diverging from that means + the codegen has room to optimise. + +The spikes are kept as `benchmarks/b5_spike{,_faithful,_tables}.exs` +rather than renamed, to make their origin explicit. + +## Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test +mix test --only lua53 + +# fib parity check (the main success criterion). +LUA_BENCH_MODE=full mix run benchmarks/fibonacci.exs + +# Confirm other workloads don't regress. +LUA_BENCH_MODE=full mix run benchmarks/closures.exs +LUA_BENCH_MODE=full mix run benchmarks/oop.exs +LUA_BENCH_MODE=full mix run benchmarks/table_ops.exs +LUA_BENCH_MODE=full mix run benchmarks/string_ops.exs + +# Confirm fallback path: every uncovered opcode triggers fallback, +# never a crash. (Tests cover this; this is the manual smoke.) +mix run -e ' +{:ok, _, _} = Lua.eval(Lua.new(), "local t = {1,2,3}; return t[2]") +IO.puts("table fallback OK") +' +``` + +## Risks + +- **`compile:forms/2` is slow (hundreds of microseconds per + module).** For embedders that one-shot `Lua.eval!` of short + scripts, compilation could be net slower than interpretation. + Acceptable for this PR — B5b's content-addressable cache makes + repeated evals of the same source share a module. If the + one-shot cost is too high in real usage, B5b's cache can be + extended to memoise by source-hash rather than only prototype- + hash. Defer the call. +- **The compiled module path differs subtly from the interpreter + on edge cases.** Float-to-integer coercion, NaN comparisons, + string-to-number coercion in arithmetic. Mitigation: opcode-by- + opcode golden tests in `erlang_test.exs` assert byte-for-byte + result equality with the interpreter on a battery of inputs + including the nasty corners (NaN, inf, -0.0, max_int + 1, "3" + 2). +- **BEAM atom table pressure.** Every prototype this PR compiles + creates a unique module name. 
Run-once embedders that compile + unique source forever could exhaust the atom table. Concrete + ceiling: ~1M atoms in default BEAM config. This PR's leak is + bounded for the integration period because nobody runs production + for hours between B5a and B5b — but it's a real footgun if B5b + slips. Mitigation: if B5b takes longer than a week to ship, add + a hard cap here that disables further compilation past N modules. +- **Module loading is not crash-safe across hot reload.** If `mix + test` recompiles `lib/` mid-run, compiled prototypes referencing + old function definitions raise. Mitigation: regenerate prototypes + at `Application.start/2` boot in the test env, and include the + application boot hash in the module name. This is the same approach the + parent plan (`B5`) calls for in Risks #3. +- **Some interpreter tests will fail by assertion of internal state** + — e.g. tests that count instruction-list reductions, or compare + inspectability of a `:lua_closure`. Track these in Discoveries + and either update the assertion to be representation-agnostic or + add a fixture override. This should be a small number. + +## Discoveries + +(populated during implementation) diff --git a/.agents/plans/B5b-module-lifecycle.md b/.agents/plans/B5b-module-lifecycle.md new file mode 100644 index 0000000..d89b231 --- /dev/null +++ b/.agents/plans/B5b-module-lifecycle.md @@ -0,0 +1,236 @@ +--- +id: B5b +title: Module lifecycle — content-addressable cache + ref-counted purging +issue: null +pr: null +branch: perf/erlang-codegen-lifecycle +base: main +status: ready +direction: B +unlocks: + - B5c (tables) and later phases can ship without compounding the leak + - Production-safe deployment of the codegen path +--- + +## Blocked on + +- B5a — there's nothing to manage the lifecycle of until the codegen + is producing modules. + +## Goal + +Make B5a not leak. Every compiled prototype currently allocates a +fresh `lua_proto_` module that lives forever in the +BEAM code server.
After B5a merges this would saturate the atom +table within hours of real use. + +This PR introduces `Lua.VM.CodeCache`, a content-addressable +ref-counted registry. Identical prototypes (same instruction stream, +same upvalue descriptors) share a module. When the last reference +to a compiled prototype drops, the module is purged. + +## Why now + +B5a ships the codegen with leak-by-design as a known limitation. +The leak is bounded for the integration period (no production +deployment between B5a and B5b) but compounds rapidly the moment a +real user hits the codegen. Every PR that adds opcodes (B5c, B5d) +makes the leak worse because more prototypes are eligible for +compilation. Fix it now, before the surface area grows. + +## Out of scope + +- Adding more opcodes (B5c, B5d). +- Cross-prototype optimization or whole-program compilation. +- Persistent compilation caches (on-disk). Memory cache only. +- Changes to the codegen output. The cache wraps codegen calls; + it doesn't rewrite the modules themselves. + +## Success criteria + +- [ ] `Lua.VM.CodeCache` GenServer exists. Started under + `Lua.Application` supervision tree. +- [ ] Module names become `lua_proto_`. Two + prototypes with byte-identical instruction streams + upvalue + descriptors share a module. +- [ ] Per-module ref count tracks live closures referencing it. + Each `{:compiled_closure, mod, fun, upvalues}` value + increments on creation, decrements on collection. +- [ ] When ref count reaches zero, the cache schedules + `:code.purge/1` + `:code.delete/1`. Scheduled, not immediate + — running code may still be executing the module on another + scheduler. +- [ ] Hard cap on loaded modules (default 4096, configurable via + `Lua.Compiler.Erlang.cache_size/0`). LRU eviction when the + cap is hit. +- [ ] Build hash in module names (`lua_proto__`). + A code-server module loaded from a previous build is rejected + on lookup and recompiled. Prevents stale references across + `mix test` hot-reload. 
+- [ ] Stress test: 10,000 unique prototypes compiled and dropped in + sequence. `:code.all_loaded() |> length()` stays within + cache_size + a small buffer for the duration. +- [ ] Stress test: 10,000 *identical* prototypes compiled. Only one + module loaded. +- [ ] `mix test` passes. No regression. +- [ ] No measurable performance regression on + `mix run benchmarks/fibonacci.exs` — the cache hit path adds + one ETS lookup per call to `compile`, which should be + ~hundreds of nanoseconds. + +## Implementation notes + +### Architecture + +- `Lua.VM.CodeCache` is a GenServer holding an ETS table + (`:lua_code_cache`) plus an LRU access list. +- ETS keyed by `{build_hash, content_hash}` → `{module_name, + function_name, ref_count, last_accessed}`. +- `Lua.Compiler.Erlang.compile/1` consults the cache before + invoking `:compile.forms/2`. Cache hit returns the existing + module; miss compiles, loads, inserts, returns. +- Ref-counting: + - Increment when a `{:compiled_closure, mod, fun, upvalues}` + value is created (closure construction, prototype top-level + compile). + - Decrement when… (see below — this is the hard part). + +### Ref-count decrement strategy + +Closures in this codebase are plain Elixir values. They get +garbage-collected by the BEAM with no callback. So "decrement when +collected" cannot be implemented with `:erlang.monitor`. + +Two viable approaches: + +1. **Periodic GC sweep.** Every N seconds, walk every live state's + tables, collect the set of referenced `(mod, fun)` pairs, mark + the cache. Anything not referenced for K sweeps is purged. This + is what Luerl's equivalent layer does. +2. **Resource tracking via NIF resource.** Wrap the module + reference in a NIF-allocated resource whose destructor + decrements the count. Requires a NIF, which we currently don't + ship. + +Recommend (1) for this PR. Simpler, no NIF, doesn't bound when +modules are purged (they linger until the next sweep) but that's +acceptable for the cap-and-LRU policy. 
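The periodic sweep in option (1) can be sketched as a pure function over the cache's bookkeeping. This is a minimal sketch under stated assumptions, not the planned `Lua.VM.CodeCache` implementation: `SweepSketch`, the per-module miss counter, and `@max_misses` (standing in for K) are hypothetical names, and the walk over live states that builds `referenced` is assumed to happen elsewhere.

```elixir
defmodule SweepSketch do
  @moduledoc false

  # Hypothetical value for K: a module is purged once it has gone
  # unreferenced for this many consecutive sweeps.
  @max_misses 2

  # `cache` maps module name => consecutive-miss count.
  # `referenced` is a MapSet of modules found while walking live states.
  # Returns `{kept_cache, purge_list}`.
  def sweep(cache, referenced) do
    Enum.reduce(cache, {%{}, []}, fn {mod, misses}, {kept, purge} ->
      cond do
        # Still referenced: keep it and reset the miss counter.
        MapSet.member?(referenced, mod) ->
          {Map.put(kept, mod, 0), purge}

        # Unreferenced for K sweeps in a row: schedule for purging.
        misses + 1 >= @max_misses ->
          {kept, [mod | purge]}

        # Unreferenced, but not for long enough yet: bump the counter.
        true ->
          {Map.put(kept, mod, misses + 1), purge}
      end
    end)
  end
end
```

Keeping the sweep decision pure would make it easy to unit-test separately from the GenServer; the returned purge list could then be handed to whatever purge policy the cache settles on.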
+ +Sweep cadence: every 30 seconds. Configurable. + +LRU eviction provides a hard upper bound regardless of sweep +correctness — if the cap is hit, the least-recently-accessed +module is purged immediately, ref-count be damned. This prevents +unbounded growth if the sweep logic has a bug. + +### Build hash + +`@build_hash` is computed at compile time from the app's +`:application.get_key(:lua, :vsn)` plus a hash of the codegen +module's source. Embedded in module names. On lookup, if the +module's name doesn't match the current build hash, treat as a +miss and recompile. The stale module is purged by the LRU as it +ages out. + +This handles two cases: + +- Production: a host application doing a rolling deploy may keep + old compiled modules in memory referenced by older state values + that survived the upgrade. The new compiled prototypes use new + module names; the old ones age out. +- Dev: `mix test` recompiles `lib/`. Compiled prototypes from a + previous test run reference old internal helpers; reject them + and recompile. + +### Content hash + +`:erlang.phash2/2` over `{instructions, upvalue_descriptors, +param_count, is_vararg}`. Truncated to 12 hex chars. Collision +probability is negligible at the scales we care about, but we +verify by storing the full pre-hash key alongside the hash in ETS +and asserting equality on lookup. + +### Files + +- `lib/lua/vm/code_cache.ex` (new) — the GenServer + ETS interface. +- `lib/lua/application.ex` — supervise the new GenServer. +- `lib/lua/compiler/erlang.ex` (modified) — replace the + unique-integer module naming with `CodeCache.module_for/1`. +- `test/lua/vm/code_cache_test.exs` (new) — the unit tests + + stress tests listed in Success criteria. + +### Edge cases + +- **Module name collisions with non-Lua code.** Mitigation: the + `lua_proto_` prefix is reserved. Document in `Lua.VM.CodeCache`'s + moduledoc. 
+- **GenServer crash.** If the cache GenServer dies (shouldn't, but + defense in depth), the supervisor restarts it with an empty ETS + table. Every prototype recompiles. Performance penalty, not a + correctness failure. +- **Cache poisoned by a compile error.** If `:compile.forms/2` + raises mid-load, the ETS entry must roll back. Use a + `try`-`rescue` in `CodeCache.handle_call`. + +## Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test +mix test test/lua/vm/code_cache_test.exs + +# Stress test: 10k unique prototypes +mix run -e ' +for i <- 1..10_000 do + src = "function f_#{i}(n) return n + #{i} end f_#{i}(42)" + {_, _} = Lua.eval!(Lua.new(), src) +end +:erlang.garbage_collect() +Process.sleep(35_000) +count = :code.all_loaded() |> Enum.count(fn {m, _} -> + to_string(m) |> String.starts_with?("lua_proto_") +end) +IO.puts("loaded after sweep: #{count}") +# Should be ≤ cache_size (default 4096). +' + +# Stress test: 10k *identical* prototypes +mix run -e ' +src = "function f(n) return n + 1 end f(42)" +for _ <- 1..10_000, do: Lua.eval!(Lua.new(), src) +count = :code.all_loaded() |> Enum.count(fn {m, _} -> + to_string(m) |> String.starts_with?("lua_proto_") +end) +IO.puts("identical compiles → loaded count: #{count}") +# Should be 1. +' +``` + +## Risks + +- **Sweep cadence vs allocation rate.** If a host app compiles + faster than the sweep can clean up, the LRU evicts. If the LRU + evicts a module that's still in use by a long-running state, + next call into that closure raises (module not found). + Mitigation: defer LRU eviction of modules with ref_count > 0 + until they age past a hard limit (10x cache_size, say). + Compromise: under extreme pressure, the cache exceeds the soft + cap; only when ref counts drop does it shrink. Acceptable + trade-off — we'd rather use 2x memory than crash. +- **The sweep is O(states × refs).** For a deployment with tens of + thousands of live Lua states this could be measurable. 
Profile + during this PR; if it shows up, partition the sweep across + cycles or push the work into a dedicated scheduler. +- **`:code.purge/1` blocks if any process is currently executing + the module on another scheduler.** Use `:code.soft_purge/1` + first; if that fails, defer to next sweep rather than blocking. + Document the policy. +- **NIF resource alternative might be necessary post-launch.** If + the sweep approach proves too imprecise (modules sticking around + too long, memory pressure), the NIF-resource approach can be a + later plan. Don't pre-commit to it now. + +## Discoveries + +(populated during implementation) diff --git a/.agents/plans/B5c-table-opcodes.md b/.agents/plans/B5c-table-opcodes.md new file mode 100644 index 0000000..02c986e --- /dev/null +++ b/.agents/plans/B5c-table-opcodes.md @@ -0,0 +1,216 @@ +--- +id: B5c +title: Compile table opcodes — make table-heavy workloads bypass the interpreter +issue: null +pr: null +branch: perf/erlang-codegen-tables +base: main +status: ready +direction: B +unlocks: + - ~2x speedup on table_ops benchmarks + - the full OOP benchmark workload (depends on tables + closures) +--- + +## Blocked on + +- B5a (foundation) +- B5b (lifecycle) — required before adding more opcodes to the + codegen, otherwise the cache pressure scales with surface area. + +## Goal + +Extend `Lua.Compiler.Erlang` to lower the table opcode family: +`:new_table`, `:get_table`, `:set_table`, `:set_list`, `:get_field` +(full path, not just env lookup), `:set_field`. After this PR, +prototypes that touch tables compile end-to-end and stay out of +the interpreter fallback path. + +The third spike measured **2.1x faster than interpreter** on +run_table_sum(1000). This PR delivers that. + +## Why now + +Once tables compile, the OOP benchmark and most real-world Lua +code stops falling back to the interpreter. 
The win is smaller per +opcode than fib's (3.8x vs 12.4x at faithful), but it removes a +large class of fallback cases — the dominant blocker after B5a. + +## Out of scope + +- Closures (`:closure`, upvalue mutation). B5d. +- Error position fidelity. B5e. +- Optimising table data shape (this is a B-series follow-up that + was deferred: B6/B7). B5 saves dispatch around the table + mutation, not the mutation itself. + +## Success criteria + +- [ ] Opcodes added to the codegen: `:new_table`, `:get_table`, + `:set_table`, `:set_list`, `:get_field` (full path), + `:set_field`. +- [ ] `mix test` passes; no regression in unit, suite, or property + tests. +- [ ] `LUA_BENCH_MODE=full mix run benchmarks/table_ops.exs`: + `lua (chunk)` beats Luerl by ≥1.5x on `Table Iterate/Sum` + and `Table Map + Reduce` at n=500 and n=1000. Stretch: ≥2x. +- [ ] `mix run benchmarks/oop.exs`: no regression now that more + of the OOP path is compiled. Stretch: measurable improvement + once `:closure` lands in B5d. +- [ ] No regression on numeric benchmarks (fibonacci, etc.) — the + shared codegen pieces don't slow down what B5a already won. + +## Implementation notes + +### Lowering each opcode + +The interpreter's table opcodes already have fast paths (PR #223 +and follow-ups). The compiled lowering mirrors them inline rather +than calling back into the interpreter helpers, **except** when the +slow path is hit (metamethod dispatch, type errors). The slow +paths delegate to `Lua.VM.Executor` helpers that already exist. + +#### `:new_table` + +```erlang +{Tref0, State0} = 'Elixir.Lua.VM.State':alloc_table(State_in), +R_dest = Tref0, +State_out = State0 +``` + +State threads through. + +#### `:get_table` + +Two cases. 
Integer or binary key on a `{:tref, _}`: inline the +fast path from `executor.ex:1300-1323`: + +```erlang +TableVal = R_table, +Key = R_key, +case TableVal of + {tref, Id} when is_integer(Key); is_binary(Key) -> + Table = erlang:map_get(Id, maps:get(tables, State_in)), + case erlang:map_get(data, Table) of + #{Key := Value} -> + R_dest = Value, + State_out = State_in; + _ -> + case erlang:map_get(metatable, Table) of + nil -> + R_dest = nil, + State_out = State_in; + _ -> + {Value, State1} = 'Elixir.Lua.VM.Executor':index_value( + TableVal, Key, State_in, Line, Source, NameHint), + R_dest = Value, + State_out = State1 + end + end; + _ -> + {Value, State1} = 'Elixir.Lua.VM.Executor':index_value( + TableVal, Key, State_in, Line, Source, NameHint), + R_dest = Value, + State_out = State1 +end +``` + +`index_value/6` needs to be promoted from `defp` to `def` in the +executor so the compiled module can call it. Add `@doc false` to +keep it out of the public API surface. + +#### `:set_table` + +```erlang +case R_table of + {tref, _} -> + State_out = 'Elixir.Lua.VM.Executor':table_newindex( + R_table, R_key, R_value, State_in); + _ -> + 'Elixir.Lua.VM.Executor':raise_index_type_error( + R_table, Line, Source, NameHint) +end +``` + +`table_newindex/4` is already `def` (executor.ex:1919). +`raise_index_type_error/4` needs promoting. + +#### `:set_list` + +Iterates over a register range and calls `table_newindex` per +entry. Compile as a recursive helper (same pattern as +`:numeric_for` from B5a). + +#### `:get_field`, `:set_field` + +B5a already covers `:get_field` for env lookups. Generalise: the +fast path uses the table's `:data` map with the literal binary +key. Falls through to `index_value` / `table_newindex` for +metatable cases. + +### Promoting helpers + +The executor's table helpers that the compiled code calls into: + +- `Lua.VM.Executor.table_newindex/4` — already `def`. +- `Lua.VM.Executor.index_value/6` — currently `defp`. Promote to + `def` with `@doc false`. 
+- `Lua.VM.Executor.raise_index_type_error/4` — currently `defp`. + Promote. + +The `@doc false` keeps these from showing up in the user-facing +documentation but lets the compiled module call them by their +fully-qualified `'Elixir.Lua.VM.Executor':function(...)` form. + +### Files + +- `lib/lua/compiler/erlang/opcodes.ex` — add lowering clauses for + the table family. +- `lib/lua/compiler/erlang.ex` — remove table opcodes from the + fallback set; allow them in the codegen. +- `lib/lua/vm/executor.ex` — promote `index_value/6` and + `raise_index_type_error/4` to public. +- `test/lua/compiler/erlang_test.exs` — golden tests for each table + opcode (compiled vs interpreted result equality on a battery of + inputs including metatable cases). + +## Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test +mix test --only lua53 + +LUA_BENCH_MODE=full mix run benchmarks/table_ops.exs +LUA_BENCH_MODE=full mix run benchmarks/oop.exs +LUA_BENCH_MODE=full mix run benchmarks/fibonacci.exs # no regression +``` + +## Risks + +- **Metatable semantics are subtle.** `__index` and `__newindex` + can recurse through long chains. The compiled fast path skips + metatable dispatch only when `metatable == nil` on the table. + Any non-nil metatable falls through to the existing + `index_value` / `table_newindex` helpers, which already handle + the chains. Risk is limited to "is the fast-path predicate + right" — covered by golden tests. +- **`set_list` codegen is the most complex per-opcode lowering.** + It needs to compile a register-range loop into a recursive + helper that's careful about register aliasing. Test with both + short ranges (typical: `{1, 2, 3}` table constructor) and long + ranges. +- **Promoting `defp` to `def` widens the executor's public API.** + `@doc false` mitigates discoverability. The executor's + `@moduledoc` should mention that these are runtime helpers used + by compiled modules and should not be called directly by user + code. 
+- **The third spike's 2.1x was measured on the faithful spike, not real + codegen.** Real codegen has overheads the spike skipped (full + opcode coverage means more dispatch within the compiled + function). The success-criteria floor (≥1.5x) accommodates this. + +## Discoveries + +(populated during implementation) diff --git a/.agents/plans/B5d-closures-and-varargs.md b/.agents/plans/B5d-closures-and-varargs.md new file mode 100644 index 0000000..0fba57f --- /dev/null +++ b/.agents/plans/B5d-closures-and-varargs.md @@ -0,0 +1,197 @@ +--- +id: B5d +title: Compile closures, varargs, and multi-return — every opcode has a compiled path +issue: null +pr: null +branch: perf/erlang-codegen-closures +base: main +status: ready +direction: B +unlocks: + - 100% opcode coverage in the codegen (no more fallbacks except for diagnostics) + - the closures benchmark workload + - the OOP benchmark workload now fully compiled +--- + +## Blocked on + +- B5a (foundation), B5b (lifecycle), B5c (tables). + +## Goal + +Cover the remaining opcodes. After this PR, no prototype falls +back to the interpreter for opcode-coverage reasons. Every opcode +has a lowering in the codegen. + +Opcodes added: + +- `:closure` — closure construction with upvalue capture. +- `:set_upvalue` — mutate a captured upvalue cell. +- `:get_open_upvalue`, `:set_open_upvalue` — open-cell access for + upvalues that still reference live caller registers. +- `:vararg`, `:return_vararg` — varargs. +- `:return` with count > 1 — multi-return. +- `:generic_for` — the `for k, v in pairs(t)` family. + +## Why now + +After B5c, table-heavy code compiles. After this PR, closure-heavy +code does too — which is the dominant remaining real-world Lua +idiom. From here on, additional B5 work is about polish (error +fidelity, B5e) and the wider B-series mutable-data follow-up that +B5 itself does not address. + +## Out of scope + +- Mixed-mode interpret-from-pc (still all-or-nothing per prototype).
+- Cross-prototype optimisation (inlining one Lua function into + another). +- Error position fidelity. B5e. + +## Success criteria + +- [ ] Opcodes added: `:closure`, `:set_upvalue`, + `:get_open_upvalue`, `:set_open_upvalue`, `:vararg`, + `:return_vararg`, `:return` (count > 1), `:generic_for`. +- [ ] After this PR, the codegen's `:fallback` cases are only: + genuinely unrecognised opcode shapes (programmer error) or + explicit opt-outs added by future plans. No production-Lua + opcode falls back. +- [ ] `mix test` passes; no regression. +- [ ] `mix test --only lua53` does not regress. +- [ ] `LUA_BENCH_MODE=full mix run benchmarks/closures.exs`: lua + (chunk) beats Luerl by ≥2x. +- [ ] `LUA_BENCH_MODE=full mix run benchmarks/oop.exs`: lua (chunk) + beats Luerl by ≥1.5x. (OOP is a mix of closures + tables; + both contribute.) +- [ ] No regression on numeric or table workloads. + +## Implementation notes + +### Closure construction (`:closure`) + +`:closure` creates a `{:lua_closure, sub_proto, captured_upvalues}` +value in the interpreter. The compiled version creates either: + +- `{:compiled_closure, mod, fun, captured_upvalues}` if the + sub-prototype itself compiled. +- `{:lua_closure, sub_proto, captured_upvalues}` if the + sub-prototype fell back to interpretation. + +The codegen checks `sub_proto.compiled_module` at codegen time. +This works because sub-prototypes are compiled in a separate +codegen pass (bottom-up) before the parent. + +Upvalue capture: the parent prototype's `:closure` opcode +specifies which upvalue descriptors to populate from which parent +registers / parent upvalues. In the compiled module this becomes +a fresh upvalue tuple constructed inline. Open cells get a fresh +reference (`make_ref/0`) and state.open_upvalues entry; closed +cells inherit from the parent upvalues tuple. 
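The choice between the two closure shapes described above can be illustrated with a small sketch. It is illustrative only: `ClosureSketch`, its nested `Proto` struct, and the `entry` field are hypothetical stand-ins for the real prototype struct, and the decision is written as a plain function here even though the plan makes it at codegen time.

```elixir
defmodule ClosureSketch do
  @moduledoc false

  # Hypothetical stand-in for the real prototype struct. Only
  # `compiled_module` comes from the plan; `entry` is assumed.
  defmodule Proto do
    defstruct compiled_module: nil, entry: :execute
  end

  # Sub-prototype fell back to interpretation: keep the interpreter's shape.
  def make_closure(%Proto{compiled_module: nil} = proto, upvalues),
    do: {:lua_closure, proto, upvalues}

  # Sub-prototype compiled: point the closure at the loaded module.
  def make_closure(%Proto{compiled_module: mod, entry: fun}, upvalues),
    do: {:compiled_closure, mod, fun, upvalues}
end
```

Because sub-prototypes are compiled bottom-up, the parent's codegen could resolve this branch once and emit only the matching tuple constructor.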
+ +### Upvalue mutation (`:set_upvalue`) + +Mirrors the interpreter (`executor.ex:362-367`): + +```erlang +CellRef = element(Index + 1, Upvalues), +Value = R_source, +NewUpvalueCells = maps:put(CellRef, Value, + maps:get(upvalue_cells, State_in)), +State_out = setelement(StateUpvalueCellsIdx, State_in, NewUpvalueCells) +``` + +Updating a struct field at runtime via `setelement` works because +the State struct's field positions are stable. +`StateUpvalueCellsIdx` is determined at codegen time from +`%State{}`'s field order. + +### Open upvalues (`:get_open_upvalue`, `:set_open_upvalue`) + +These read/write a cell ref but resolve to either a register (if +the cell is still open) or `state.upvalue_cells` (if closed). The +compiled version mirrors `executor.ex:367-401` directly, including +the open-cell fast path that avoids touching state for the common +case. + +### `:vararg`, `:return_vararg` + +Vararg storage is on `proto.varargs`. In the compiled function, +this is just a closure-time-captured argument list. The codegen +adds an extra parameter to the compiled function (or threads +varargs through state, depending on what's cleaner; the +interpreter currently uses `proto.varargs`, which works because +proto is a runtime value). + +### Multi-return `:return` (count > 1) + +B5a covered count = 1. For count > 1, the compiled function +returns `{Values, State}` where Values is a list of length `count` +constructed from the register range. `continue_after_call/11` +unpacks the list into the caller's registers. + +For the `{:multi, _}` count form (caller wants all available +returns), the compiled function returns `{Values, State}` with +exactly the multi-return values; the caller's `:call` opcode +handles slot expansion. + +### Generic for (`:generic_for`) + +Like `:numeric_for` (B5a) but the loop helper calls the iterator +function on every iteration via `Executor.call_function/3` rather +than incrementing a counter. 
The CPS frame logic from the +interpreter (executor.ex:518-547) translates cleanly to a tail- +recursive Erlang helper. + +### Files + +- `lib/lua/compiler/erlang/opcodes.ex` — lowering for every + remaining opcode. +- `lib/lua/compiler/erlang.ex` — remove these from the fallback + set. +- `test/lua/compiler/erlang_test.exs` — golden tests per opcode. +- `test/lua/compiler/erlang_closures_test.exs` (new) — focused + tests on closure construction + upvalue lifecycle, since these + are the trickiest to get right. + +## Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test +mix test --only lua53 + +LUA_BENCH_MODE=full mix run benchmarks/closures.exs +LUA_BENCH_MODE=full mix run benchmarks/oop.exs +LUA_BENCH_MODE=full mix run benchmarks/fibonacci.exs # no regression +LUA_BENCH_MODE=full mix run benchmarks/table_ops.exs # no regression +``` + +## Risks + +- **Open upvalue lifetime is the trickiest concept in the VM.** + Cells move from "open" (still referencing a live register) to + "closed" (value snapshotted into `state.upvalue_cells`) when + the owning frame returns. The compiled version must replicate + this transition. The existing `close_open_upvalues_at_or_above/2` + helper handles it for the interpreter; the compiled `:return` + opcode needs to call it (promote to `def` if currently `defp`). +- **`:closure` with a fall-through sub-prototype.** A parent + prototype that compiled but contains an uncompiled sub-prototype + produces a `:lua_closure` value for the inner function. Mixed- + mode-in-the-value-graph is fine; mixed-mode-within-a-prototype + is what we ruled out. +- **Stress test: upvalue chains.** A closure capturing a closure + capturing a closure tests the upvalue-descriptor walking + exhaustively. Existing tests in + `test/lua/compiler/integration_test.exs` cover this; rerun + against compiled mode. 
+- **Multi-return with `{:multi, fixed_count}`.** Codegen has to + match the exact slot-counting the interpreter does for + expressions like `return f(), g()` where g returns N values. + Test against the existing multi-return tests. + +## Discoveries + +(populated during implementation) diff --git a/.agents/plans/B5e-error-position-fidelity.md b/.agents/plans/B5e-error-position-fidelity.md new file mode 100644 index 0000000..e199036 --- /dev/null +++ b/.agents/plans/B5e-error-position-fidelity.md @@ -0,0 +1,171 @@ +--- +id: B5e +title: Error position fidelity for compiled prototypes +issue: null +pr: null +branch: perf/erlang-codegen-errors +base: main +status: ready +direction: B +unlocks: + - parity with interpreter on every error message test + - removes the only remaining semantic gap between compiled and + interpreted execution +--- + +## Blocked on + +- B5a (foundation), B5b (lifecycle), B5c (tables), B5d (closures). + Error fidelity is the last piece — easier to do once every + opcode has a compiled lowering. + +## Goal + +Make compiled prototypes raise exceptions with the same `line:`, +`source:`, and stack-trace information the interpreter raises with. +After this PR, no error-message test can distinguish a compiled +prototype from an interpreted one. + +## Why now + +Earlier B5 plans (B5a through B5d) ship a placeholder: compiled +prototypes raise errors carrying the line of the last `:source_line` +opcode they passed through. This is approximately right but misses +detail: a raise from inside a metamethod calls back through the +interpreter, which already threads the right position via the +process dictionary — but pure-compiled raises don't. Tests that +assert specific line numbers in raises may pin the compiled path to +a slightly different line than the interpreter. 
+ +This PR uses the parent plan's recommended try/catch approach (B5 +plan line 191-198): pay nothing on the success path, restore line +info on the failure path from a pc-to-line table that lives on the +prototype. + +## Out of scope + +- Improving the interpreter's error positions. Already done in A18 + and A19. +- Adding error-context tracking that the interpreter doesn't have. + This is fidelity, not enhancement. + +## Success criteria + +- [ ] `Lua.Compiler.Prototype` gains a `pc_to_line` field (or + similar) mapping the compiled function's internal label + structure back to source lines. Populated at codegen time. +- [ ] Every codegen lowering wraps potentially-raising operations + (arithmetic on non-numeric, index into non-table, call of + non-callable, etc.) with a try/catch that, on raise, + re-throws with corrected `line:` / `source:` info from the + pc_to_line table. +- [ ] Every error-message test in + `test/lua/error_message_test.exs` and similar passes against + a compiled prototype with the same line numbers as the + interpreter produces. +- [ ] No measurable performance regression — the try/catch costs + nothing on the success path. +- [ ] `mix test` passes; no regression. +- [ ] `mix test --only lua53` does not regress (suite has many + error-position tests). + +## Implementation notes + +### Strategy + +For each potentially-raising opcode, wrap the call site: + +```erlang +try + %% opcode lowering as usual +catch + error:Reason:Stack -> + Line = maps:get(PcOrLabel, PcToLine), + Source = proto:source(), + erlang:raise(error, augment_reason(Reason, Line, Source), Stack) +end +``` + +`augment_reason/3` updates the exception struct's `line:` and +`source:` fields. For raises that already include line info +(e.g. those that came from `Lua.VM.Executor.index_value/6`), this +is a no-op. For raises from purely-compiled code (e.g. an `:add` +on two non-numeric registers), this is where the position is +attached. 
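The behaviour described for `augment_reason/3` might look roughly like the sketch below. The exception shape is an assumption: `FakeLuaError` is a hypothetical stand-in, and the real `Lua.RuntimeException` fields may differ.

```elixir
defmodule AugmentSketch do
  @moduledoc false

  # Hypothetical exception with `line` and `source` fields; not the
  # library's actual exception struct.
  defmodule FakeLuaError do
    defexception message: "runtime error", line: nil, source: nil
  end

  # An exception that carries no position yet: attach the one recovered
  # from the pc-to-line table.
  def augment_reason(%{__exception__: true, line: nil, source: _} = reason, line, source),
    do: %{reason | line: line, source: source}

  # Already positioned, or not an exception struct at all: leave untouched.
  def augment_reason(reason, _line, _source), do: reason
end
```

Treating already-positioned exceptions as a no-op means errors that pass through interpreter helpers keep whatever position those helpers attached.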
+ +The try/catch lives **per loop body**, not per opcode. Erlang's +JIT optimises try/catch well at function-scope granularity but +penalises tight per-statement nesting. One try around each +recursive helper body, one try around the main function body. + +### pc_to_line table + +A map from "codegen-time label" to source line. Built during +codegen as it walks the instruction stream. Stored on +`%Prototype{}` as `pc_to_line :: %{atom() => non_neg_integer()}`. + +Each `:source_line` opcode in the instruction stream becomes the +authoritative line for every subsequent opcode until the next +`:source_line`. The codegen tracks this. + +### Stack trace shape + +Compiled modules show up in stack traces as +`:lua_proto_.execute/3`. This is noise from a user's +perspective. `Lua.RuntimeException`'s stack pruning +(`lib/lua/runtime_exception.ex:prune_internal_frames/1` — +introduced in A20/A21) already trims known internal frames. Extend +the prune list to include any module starting with +`lua_proto__`. Frames stay informative (the calling +`Lua.eval!/2` is still visible) without exposing compilation +internals. + +### Files + +- `lib/lua/compiler/prototype.ex` — add `pc_to_line` field. +- `lib/lua/compiler/erlang.ex` — emit try/catch wrappers around + loop bodies; populate `pc_to_line` during codegen walk. +- `lib/lua/compiler/erlang/errors.ex` (new) — `augment_reason/3` + and friends. Pure functions, no state. +- `lib/lua/runtime_exception.ex` — extend prune list. +- `test/lua/compiler/erlang_errors_test.exs` (new) — golden tests + asserting that compiled raises produce identical line/source to + interpreted raises. + +## Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test +mix test --only lua53 +mix test test/lua/error_message_test.exs + +# Confirm zero perf cost on success path. 
+LUA_BENCH_MODE=full mix run benchmarks/fibonacci.exs # no regression +LUA_BENCH_MODE=full mix run benchmarks/table_ops.exs # no regression +``` + +## Risks + +- **try/catch granularity.** Per-statement try/catch tanks + performance. Per-function is fine. There's a middle ground (per + loop body) that may be necessary if function-scope try/catch + proves too coarse for correct attribution. Profile during + implementation; adjust. +- **Stack-trace pruning could hide useful info.** If the prune + list accidentally trims user code, debugging gets harder. Test + with a stack trace that contains user code + compiled code + + stdlib; assert user code is preserved. +- **Hot-reload may produce stale stack-trace prune patterns.** + Build-hash already in module names from B5b; this stays + consistent across reloads as long as B5b's build-hash logic is + correct. +- **Some Lua 5.3 suite tests assert specific error messages + including line numbers.** These should all match the interpreter + after this PR. If they don't, it means the codegen has a subtle + line-tracking bug; fix the bug, don't change the test. + +## Discoveries + +(populated during implementation) diff --git a/benchmarks/b5_spike.exs b/benchmarks/b5_spike.exs new file mode 100644 index 0000000..fb74b71 --- /dev/null +++ b/benchmarks/b5_spike.exs @@ -0,0 +1,126 @@ +## B5 spike — does compiling fib to a BEAM module beat interpreting it? +## +## Compares, on identical fib(N) work: +## +## 1. lua (chunk) — current interpreter (baseline) +## 2. native elixir — hand-written Elixir; BEAMASM ceiling, no Lua +## semantics overhead. Establishes the upper bound for what +## BEAM-side optimisation can possibly buy. +## 3. compiled erlang — Erlang module generated at runtime via +## :compile.forms/2, called from the VM. This is the realistic +## proxy for what B5's codegen could plausibly emit, modulo Lua +## semantics that the spike strips out. +## 4. 
luerl — Erlang-based Lua 5.3 (reference for the +## Direction B "perf parity with Luerl ±10%" target). +## 5. C Lua via luaport — out-of-process; included for context. +## +## The point is to bound the win. If (3) is close to (2) we know the +## BEAM JIT path delivers most of its theoretical headroom and B5 is +## worth its multi-month build. If (3) is closer to (1) than to (2), +## the BEAM doesn't actually optimise this kind of generated code +## meaningfully, and the strategic story changes. + +Code.require_file("helpers.exs", __DIR__) + +Application.ensure_all_started(:luerl) + +n = String.to_integer(System.get_env("FIB_N") || "25") + +fib_def = """ +function fib(n) + if n < 2 then return n end + return fib(n-1) + fib(n-2) +end +""" + +call_fib = "return fib(#{n})" + +# --- 1. Interpreter --- +lua = Lua.new() +{_, lua} = Lua.eval!(lua, fib_def) +{fib_chunk, _} = Lua.load_chunk!(lua, call_fib) + +# --- 2. Native Elixir (BEAMASM ceiling) --- +defmodule SpikeFib do + def fib(n) when n < 2, do: n + def fib(n), do: fib(n - 1) + fib(n - 2) +end + +# --- 3. Compiled Erlang via compile:forms/2 --- +# We hand-write the abstract forms for: +# +# -module(spike_fib_compiled). +# -export([fib/1]). +# fib(N) when N < 2 -> N; +# fib(N) -> fib(N-1) + fib(N-2). +# +# This is structurally what B5's codegen would produce for the fib +# prototype if it stripped Lua tagging (no register tuple, no upvalue +# lookup, no get_field on _ENV). The interesting question is whether +# the BEAM treats this as well as it treats the same code written +# directly in Elixir. 
+forms = [ + {:attribute, 1, :module, :spike_fib_compiled}, + {:attribute, 2, :export, [{:fib, 1}]}, + {:function, 3, :fib, 1, + [ + {:clause, 3, [{:var, 3, :N}], [[{:op, 3, :<, {:var, 3, :N}, {:integer, 3, 2}}]], + [{:var, 3, :N}]}, + {:clause, 4, [{:var, 4, :N}], [], + [ + {:op, 4, :+, + {:call, 4, {:atom, 4, :fib}, [{:op, 4, :-, {:var, 4, :N}, {:integer, 4, 1}}]}, + {:call, 4, {:atom, 4, :fib}, [{:op, 4, :-, {:var, 4, :N}, {:integer, 4, 2}}]}} + ]} + ]} +] + +{:ok, mod_name, bin, _warnings} = :compile.forms(forms, [:return]) +{:module, ^mod_name} = :code.load_binary(mod_name, ~c"spike_fib_compiled.beam", bin) + +# Sanity: all three give the same answer. +expected = SpikeFib.fib(n) +{[interp_result], _} = Lua.eval!(lua, call_fib) +^expected = round(interp_result) +^expected = :spike_fib_compiled.fib(n) +IO.puts("All implementations agree: fib(#{n}) = #{expected}\n") + +# --- 4. Luerl --- +luerl_state = :luerl.init() +{:ok, _, luerl_state} = :luerl.do(fib_def, luerl_state) + +# --- 5. C Lua via luaport (optional) --- +{c_lua_benchmarks, c_lua_cleanup} = + case Application.ensure_all_started(:luaport) do + {:ok, _} -> + scripts_dir = Path.join(__DIR__, "scripts") + {:ok, port_pid, _} = :luaport.spawn(:b5_spike_bench, to_charlist(scripts_dir)) + :luaport.load(port_pid, fib_def) + + benchmarks = %{ + "C Lua (luaport)" => fn -> :luaport.call(port_pid, :fib, [n]) end + } + + {benchmarks, fn -> :luaport.despawn(:b5_spike_bench) end} + + {:error, reason} -> + IO.puts("luaport not available (#{inspect(reason)}) — skipping C Lua benchmarks") + {%{}, fn -> :ok end} + end + +Bench.banner("b5 spike: fib(#{n})") + +Benchee.run( + Map.merge( + %{ + "lua (chunk)" => fn -> Lua.eval!(lua, fib_chunk) end, + "native elixir" => fn -> SpikeFib.fib(n) end, + "compiled erlang" => fn -> :spike_fib_compiled.fib(n) end, + "luerl" => fn -> :luerl.do(call_fib, luerl_state) end + }, + c_lua_benchmarks + ), + Bench.opts() +) + +c_lua_cleanup.() diff --git a/benchmarks/b5_spike_faithful.exs 
b/benchmarks/b5_spike_faithful.exs new file mode 100644 index 0000000..6ed8060 --- /dev/null +++ b/benchmarks/b5_spike_faithful.exs @@ -0,0 +1,279 @@ +## B5 spike — *faithful* translation +## +## Companion to benchmarks/b5_spike.exs. That first spike answered "is +## there headroom?" with a stripped-down fib that called itself directly +## as `:spike_fib_compiled.fib/1`. This one answers the follow-up: +## **how much of that headroom survives once we add back the Lua-VM +## machinery a real B5 codegen could not skip?** +## +## What "faithful" means here. The compiled fib module: +## +## 1. Receives `(args :: [number()], upvalues :: tuple(), state)` and +## returns `{results :: [number()], state}` — the same shape as a +## :lua_closure interpreted call. +## 2. Performs the recursive call via the actual VM dispatch path: +## look up `_ENV` through the upvalue cell, fetch `_ENV.fib` from +## the globals table's `:data` map, then call +## `Lua.VM.Executor.call_function/3` with the resolved callable. +## That callable is `{:compiled_closure, ...}` (itself), so it +## re-enters the same path the `:call` opcode uses on Lua closures. +## 3. Threads `state` through both recursive calls — the same +## mutable-state ABI Luerl and our interpreter use. +## 4. Returns a result list `[value]`, not a bare number — matching +## the call protocol used by `continue_after_call/11`. +## +## What it does *not* model (in scope for B5 proper, out of scope for +## the spike): +## +## - Integer overflow narrowing (`Numeric.narrow_if_integer/1`). +## - Metamethod fallbacks for `<` and `+`. +## - Line/source threading for runtime errors. +## - Open-upvalue close on return. +## +## A real B5 codegen would either inline guards for the common integer +## path (avoiding the fallback cost) or emit conditional dispatch. The +## fib hot path uses the integer fast path on every iteration, so +## omitting these costs reflects the *intended* B5 fast path, not a +## cheat. 
+ +Code.require_file("helpers.exs", __DIR__) + +Application.ensure_all_started(:luerl) + +n = String.to_integer(System.get_env("FIB_N") || "25") + +fib_def = """ +function fib(n) + if n < 2 then return n end + return fib(n-1) + fib(n-2) +end +""" + +call_fib = "return fib(#{n})" + +# --- Interpreter baseline --- +lua = Lua.new() +{_, lua} = Lua.eval!(lua, fib_def) +{fib_chunk, _} = Lua.load_chunk!(lua, call_fib) + +# --- Native Elixir (BEAMASM ceiling, no Lua semantics) --- +defmodule SpikeFib do + def fib(n) when n < 2, do: n + def fib(n), do: fib(n - 1) + fib(n - 2) +end + +# --- Stripped compiled erlang (from the first spike, for reference) --- +stripped_forms = [ + {:attribute, 1, :module, :spike_fib_stripped}, + {:attribute, 2, :export, [{:fib, 1}]}, + {:function, 3, :fib, 1, + [ + {:clause, 3, [{:var, 3, :N}], [[{:op, 3, :<, {:var, 3, :N}, {:integer, 3, 2}}]], + [{:var, 3, :N}]}, + {:clause, 4, [{:var, 4, :N}], [], + [ + {:op, 4, :+, + {:call, 4, {:atom, 4, :fib}, [{:op, 4, :-, {:var, 4, :N}, {:integer, 4, 1}}]}, + {:call, 4, {:atom, 4, :fib}, [{:op, 4, :-, {:var, 4, :N}, {:integer, 4, 2}}]}} + ]} + ]} +] + +{:ok, stripped_mod, stripped_bin, _} = :compile.forms(stripped_forms, [:return]) +{:module, ^stripped_mod} = + :code.load_binary(stripped_mod, ~c"spike_fib_stripped.beam", stripped_bin) + +# --- Faithful compiled erlang --- +# +# Hand-rolled abstract forms equivalent to this Erlang source: +# +# -module(spike_fib_faithful). +# -export([fib/3]). +# +# fib([N | _], Upvalues, State) when N < 2 -> +# {[N], State}; +# fib([N | _], Upvalues, State) -> +# %% _ENV.fib lookup — what {:get_upvalue, ...} + {:get_field, ...} +# %% do in the interpreter. 
+# EnvCellRef = element(1, Upvalues), +# EnvRef = maps:get(EnvCellRef, maps:get(upvalue_cells, State)), +# {tref, EnvId} = EnvRef, +# EnvTable = maps:get(EnvId, maps:get(tables, State)), +# FibCallable = maps:get(<<"fib">>, maps:get(data, EnvTable)), +# %% Recursive calls back through the VM call protocol. +# {R1List, S1} = 'Elixir.Lua.VM.Executor':call_function( +# FibCallable, [N - 1], State), +# {R2List, S2} = 'Elixir.Lua.VM.Executor':call_function( +# FibCallable, [N - 2], S1), +# [V1 | _] = R1List, +# [V2 | _] = R2List, +# {[V1 + V2], S2}. +# +# %Lua.VM.State{} is an Elixir struct, i.e. a map, so its fields +# (state.tables and state.upvalue_cells) are read with maps:get/2 +# rather than element/2. The interpreter uses the same pattern +# (`Map.get/2` / `:erlang.map_get/2`). +# +# To keep the abstract forms small, the forms below do not inline the +# _ENV.fib lookup sketched above. They call a tiny Elixir helper +# (SpikeFib.Helpers.resolve_env_fib/2) that reads the two maps from the +# state struct and returns the callable. That adds one remote call per +# recursive step, but the lookup work matches the interpreter's +# `state.upvalue_cells` and `state.tables` reads, so it does not change +# the cost story. The recursive call path is what we care about +# measuring. 
+ +defmodule SpikeFib.Helpers do + @moduledoc false + + # Returns the resolved `_ENV.fib` callable from current state. + # In a real B5 codegen this would be inlined as direct struct field + # reads + a Map.get/2 — same cost as the interpreter's + # {:get_upvalue, ...} + {:get_field, ...} pair. + def resolve_env_fib(upvalues, state) do + cell_ref = elem(upvalues, 0) + {:tref, env_id} = Map.fetch!(state.upvalue_cells, cell_ref) + env = :erlang.map_get(env_id, state.tables) + :erlang.map_get("fib", :erlang.map_get(:data, env)) + end +end + +faithful_forms = [ + {:attribute, 1, :module, :spike_fib_faithful}, + {:attribute, 2, :export, [{:fib, 3}]}, + {:function, 3, :fib, 3, + [ + # Base case: fib([N | _], _, State) when N < 2 -> {[N], State}. + {:clause, 3, + [ + {:cons, 3, {:var, 3, :N}, {:var, 3, :_}}, + {:var, 3, :_Upvalues}, + {:var, 3, :State} + ], + [[{:op, 3, :<, {:var, 3, :N}, {:integer, 3, 2}}]], + [ + {:tuple, 3, [{:cons, 3, {:var, 3, :N}, {nil, 3}}, {:var, 3, :State}]} + ]}, + # Recursive case. + {:clause, 4, + [ + {:cons, 4, {:var, 4, :N}, {:var, 4, :_}}, + {:var, 4, :Upvalues}, + {:var, 4, :State} + ], + [], + [ + # Fib = Elixir.SpikeFib.Helpers:resolve_env_fib(Upvalues, State). + {:match, 4, {:var, 4, :Fib}, + {:call, 4, {:remote, 4, {:atom, 4, :"Elixir.SpikeFib.Helpers"}, {:atom, 4, :resolve_env_fib}}, + [{:var, 4, :Upvalues}, {:var, 4, :State}]}}, + # {R1, S1} = Elixir.Lua.VM.Executor:call_function(Fib, [N-1], State). + {:match, 5, {:tuple, 5, [{:var, 5, :R1}, {:var, 5, :S1}]}, + {:call, 5, {:remote, 5, {:atom, 5, :"Elixir.Lua.VM.Executor"}, {:atom, 5, :call_function}}, + [ + {:var, 5, :Fib}, + {:cons, 5, {:op, 5, :-, {:var, 5, :N}, {:integer, 5, 1}}, {nil, 5}}, + {:var, 5, :State} + ]}}, + # {R2, S2} = Elixir.Lua.VM.Executor:call_function(Fib, [N-2], S1). 
+ {:match, 6, {:tuple, 6, [{:var, 6, :R2}, {:var, 6, :S2}]}, + {:call, 6, {:remote, 6, {:atom, 6, :"Elixir.Lua.VM.Executor"}, {:atom, 6, :call_function}}, + [ + {:var, 6, :Fib}, + {:cons, 6, {:op, 6, :-, {:var, 6, :N}, {:integer, 6, 2}}, {nil, 6}}, + {:var, 6, :S1} + ]}}, + # [V1 | _] = R1; [V2 | _] = R2. + {:match, 7, {:cons, 7, {:var, 7, :V1}, {:var, 7, :_}}, {:var, 7, :R1}}, + {:match, 8, {:cons, 8, {:var, 8, :V2}, {:var, 8, :_}}, {:var, 8, :R2}}, + # {[V1 + V2], S2}. + {:tuple, 9, + [ + {:cons, 9, {:op, 9, :+, {:var, 9, :V1}, {:var, 9, :V2}}, {nil, 9}}, + {:var, 9, :S2} + ]} + ]} + ]} +] + +{:ok, faithful_mod, faithful_bin, _} = :compile.forms(faithful_forms, [:return]) +{:module, ^faithful_mod} = + :code.load_binary(faithful_mod, ~c"spike_fib_faithful.beam", faithful_bin) + +# --- Install the compiled fib into the Lua state --- +# +# We grab the existing `:lua_closure` value bound to `fib` in _G, +# extract its upvalues tuple, and rebind `fib` to a `:compiled_closure` +# that uses the same upvalues. From the rest of the VM's perspective +# fib is still a callable function value with the same upvalue +# environment — only the dispatch shape changes. + +state = lua.state +{:tref, g_id} = state.g_ref +g_table = :erlang.map_get(g_id, state.tables) +{:lua_closure, _proto, fib_upvalues} = :erlang.map_get("fib", g_table.data) + +compiled_fib = {:compiled_closure, :spike_fib_faithful, :fib, fib_upvalues} + +new_g_data = :maps.put("fib", compiled_fib, g_table.data) +new_g_table = %{g_table | data: new_g_data} +new_tables = :maps.put(g_id, new_g_table, state.tables) +state = %{state | tables: new_tables} +lua_compiled = %{lua | state: state} + +# Sanity: faithful, stripped, native, and luerl all agree on the result. 
+expected = SpikeFib.fib(n) +{[interp_result], _} = Lua.eval!(lua, call_fib) +^expected = round(interp_result) +^expected = :spike_fib_stripped.fib(n) +{[faithful_result], _} = Lua.eval!(lua_compiled, call_fib) +^expected = round(faithful_result) +IO.puts("All implementations agree: fib(#{n}) = #{expected}\n") + +# --- Luerl reference --- +luerl_state = :luerl.init() +{:ok, _, luerl_state} = :luerl.do(fib_def, luerl_state) + +# --- C Lua via luaport (optional) --- +{c_lua_benchmarks, c_lua_cleanup} = + case Application.ensure_all_started(:luaport) do + {:ok, _} -> + scripts_dir = Path.join(__DIR__, "scripts") + {:ok, port_pid, _} = :luaport.spawn(:b5_faithful_bench, to_charlist(scripts_dir)) + :luaport.load(port_pid, fib_def) + + {%{"C Lua (luaport)" => fn -> :luaport.call(port_pid, :fib, [n]) end}, + fn -> :luaport.despawn(:b5_faithful_bench) end} + + {:error, reason} -> + IO.puts("luaport not available (#{inspect(reason)}) — skipping") + {%{}, fn -> :ok end} + end + +Bench.banner("b5 faithful spike: fib(#{n})") + +Benchee.run( + Map.merge( + %{ + "lua (interpreter)" => fn -> Lua.eval!(lua, fib_chunk) end, + "lua (compiled-faithful)" => fn -> Lua.eval!(lua_compiled, fib_chunk) end, + "lua (compiled-stripped)" => fn -> :spike_fib_stripped.fib(n) end, + "native elixir" => fn -> SpikeFib.fib(n) end, + "luerl" => fn -> :luerl.do(call_fib, luerl_state) end + }, + c_lua_benchmarks + ), + Bench.opts() +) + +c_lua_cleanup.() diff --git a/benchmarks/b5_spike_tables.exs b/benchmarks/b5_spike_tables.exs new file mode 100644 index 0000000..fa0e0eb --- /dev/null +++ b/benchmarks/b5_spike_tables.exs @@ -0,0 +1,206 @@ +## B5 spike — table-heavy workload +## +## Third spike in the series. The first two answered "is there headroom?" +## (yes, 100x stripped) and "how much survives Lua semantics?" +## (12x faithful, on fib). Both used pure integer arithmetic. +## +## This spike answers: does the win generalise to table-heavy code? 
+## fib is the friendliest possible benchmark: no allocation, no map +## traversal, no metamethod dispatch path. Real Lua programs spend +## significant time in `t[i] = v` and `t[i]` operations, which +## between them go through: +## +## - Allocation (`State.alloc_table` -> new map in state.tables) +## - Lookup (table struct -> :data map -> key fetch) +## - Mutation (table struct -> new :data map -> new state.tables map) +## +## Allocation and mutation both build new maps, so a Lua program that +## fills a 1000-entry table allocates a comparable number of +## intermediate maps. Removing the interpreter's register-tuple churn, +## which is where B5 wins on fib, does *not* remove this cost: it lives +## in the state struct's :tables field, not in registers. +## +## The workload: `run_table_sum(n)` from benchmarks/table_ops.exs. +## Builds a 1..n table via `:set_table` in a `:numeric_for` loop, then +## sums it via `:get_table` in a second `:numeric_for`. Two loops, one +## table operation per iteration in each, no recursion. +## +## What "faithful" means here, same shape as the second spike: +## +## - Receives `(args, upvalues, state)`, returns `{results, state}`. +## - `:new_table` -> `State.alloc_table(state)` (full allocation cost). +## - `:set_table` -> `Executor.table_newindex/4` (the public path the +## interpreter takes; includes metatable check and Table.put). +## - `:get_table` -> inlined fast-path: `:erlang.map_get(:data, table)` +## then map fetch. Matches the interpreter's fast path verbatim. +## - Loops compiled to recursive helpers (the BEAM-native loop +## idiom; this is what `:compile.forms/2` would emit for a Lua +## `:numeric_for` once it knows the bounds). +## - State threads through every operation that mutates it (allocation, +## set_table). A real codegen would also thread state through +## read-only ops (get_table on a stable table), because :get_table +## is permitted to call __index via a metamethod. The sum loop here +## reads state without threading it back out, since the spike's +## table has no metatable. 
+## +## What it does *not* model (out of scope; same caveats as second spike): +## +## - Integer overflow narrowing. +## - Metamethod fallbacks for `__newindex` / `__index`. +## - Line/source threading for runtime errors. +## +## The compiled function is written in Elixir, not :compile.forms-built +## Erlang. Justification: the BEAM compiles Elixir modules with the same +## BEAMASM JIT that processes `:compile.forms/2` output. The second spike +## verified `:compile.forms` output runs at near-native Elixir speed (1.13x +## slower). Writing this spike in Elixir saves ~200 lines of abstract-forms +## machinery and isolates the question to "compiled vs interpreted dispatch +## of the same opcodes", which is what we care about. If you want to verify +## the equivalence claim, compare the second spike's compiled-stripped vs +## native-elixir columns: 1.08x in quick mode, 1.13x in full mode. + +Code.require_file("helpers.exs", __DIR__) + +Application.ensure_all_started(:luerl) + +table_def = """ +function run_table_sum(n) + local t = {} + for i = 1, n do + t[i] = i + end + local sum = 0 + for j = 1, n do + sum = sum + t[j] + end + return sum +end +""" + +# --- Interpreter baseline --- +lua = Lua.new() +{_, lua} = Lua.eval!(lua, table_def) + +# --- Compiled run_table_sum --- +# +# Equivalent to the Lua source above. Structurally what B5 codegen +# would emit for the prototype's instruction stream, with each +# interpreter opcode lowered to a direct call. +defmodule SpikeTableSum do + @moduledoc false + + alias Lua.VM.Executor + alias Lua.VM.State + + # Entry point. Matches the call protocol used by :compiled_closure + # in lib/lua/vm/executor.ex. 
+ @spec run([number()], tuple(), State.t()) :: {[number()], State.t()} + def run([n | _], _upvalues, state) do + # local t = {} + {tref, state} = State.alloc_table(state) + + # for i = 1, n do t[i] = i end + state = build_loop(1, n, tref, state) + + # local sum = 0; for j = 1, n do sum = sum + t[j] end + sum = sum_loop(1, n, tref, state, 0) + + {[sum], state} + end + + # First numeric_for: t[i] = i, i=1..n. + defp build_loop(i, n, _tref, state) when i > n, do: state + defp build_loop(i, n, tref, state) do + state = Executor.table_newindex(tref, i, i, state) + build_loop(i + 1, n, tref, state) + end + + # Second numeric_for: sum = sum + t[j], j=1..n. + # State is read-only here (no metatable, no __index), so it's not + # threaded back out — but we still have to dereference it on every + # iteration to fetch the current table. That's the realistic cost. + defp sum_loop(j, n, _tref, _state, sum) when j > n, do: sum + defp sum_loop(j, n, {:tref, id} = tref, state, sum) do + table = :erlang.map_get(id, state.tables) + value = :erlang.map_get(j, :erlang.map_get(:data, table)) + sum_loop(j + 1, n, tref, state, sum + value) + end +end + +# --- Install the compiled run_table_sum into _G --- +state = lua.state +{:tref, g_id} = state.g_ref +g = :erlang.map_get(g_id, state.tables) +{:lua_closure, _proto, rts_upvalues} = :erlang.map_get("run_table_sum", g.data) + +compiled = {:compiled_closure, SpikeTableSum, :run, rts_upvalues} + +new_g_data = :maps.put("run_table_sum", compiled, g.data) +new_g = %{g | data: new_g_data} +new_tables = :maps.put(g_id, new_g, state.tables) +state = %{state | tables: new_tables} +lua_compiled = %{lua | state: state} + +# --- Luerl reference --- +luerl_state = :luerl.init() +{:ok, _, luerl_state} = :luerl.do(table_def, luerl_state) + +# --- C Lua via luaport (optional) --- +{c_lua_call, c_lua_cleanup} = + case Application.ensure_all_started(:luaport) do + {:ok, _} -> + scripts_dir = Path.join(__DIR__, "scripts") + {:ok, port_pid, _} = 
:luaport.spawn(:b5_tables_bench, to_charlist(scripts_dir)) + :luaport.load(port_pid, table_def) + + {fn n -> :luaport.call(port_pid, :run_table_sum, [n]) end, + fn -> :luaport.despawn(:b5_tables_bench) end} + + {:error, reason} -> + IO.puts("luaport not available (#{inspect(reason)}) — skipping") + {nil, fn -> :ok end} + end + +# --- Pre-build chunks for each n --- +sizes = + case System.get_env("LUA_BENCH_MODE") do + "full" -> [{"small (n=100)", 100}, {"medium (n=500)", 500}, {"large (n=1000)", 1000}] + _ -> [{"medium (n=500)", 500}] + end + +inputs = + Map.new(sizes, fn {label, n} -> + call_str = "return run_table_sum(#{n})" + {chunk, _} = Lua.load_chunk!(lua, call_str) + {label, {chunk, call_str, n}} + end) + +# --- Sanity --- +for {label, {chunk, call_str, n}} <- inputs do + expected = div(n * (n + 1), 2) + {[interp_result], _} = Lua.eval!(lua, chunk) + ^expected = round(interp_result) + {[compiled_result], _} = Lua.eval!(lua_compiled, chunk) + ^expected = round(compiled_result) + IO.puts("#{label}: all implementations agree (sum = #{expected})") + _ = call_str +end + +IO.puts("") + +Bench.banner("b5 tables spike: run_table_sum") + +jobs = %{ + "lua (interpreter)" => fn {chunk, _, _} -> Lua.eval!(lua, chunk) end, + "lua (compiled)" => fn {chunk, _, _} -> Lua.eval!(lua_compiled, chunk) end, + "luerl" => fn {_, call_str, _} -> :luerl.do(call_str, luerl_state) end +} + +jobs = + if c_lua_call do + Map.put(jobs, "C Lua (luaport)", fn {_, _, n} -> c_lua_call.(n) end) + else + jobs + end + +Benchee.run(jobs, [{:inputs, inputs} | Bench.opts()]) + +c_lua_cleanup.() From 74090fc66f22e06186893831c7ddf604a7f7c711 Mon Sep 17 00:00:00 2001 From: Dave Lucia Date: Fri, 22 May 2026 08:48:31 -0700 Subject: [PATCH 2/3] perf(vm): compile Lua prototypes to BEAM modules MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduces Lua.Compiler.Erlang — a codegen that translates supported %Prototype{} values into Erlang abstract forms via 
:compile.forms/2, loaded as fresh BEAM modules at runtime. The dispatch path through {:compiled_closure, mod, fun, upvalues, proto} bypasses the interpreter's register-tuple construction and per-opcode dispatch loop entirely. Coverage in this PR (B5a — foundation): - arithmetic, comparison, logical ops (with integer fast paths) - control flow: :test (terminating branches), :test_true, early return - upvalues: :get_upvalue, :get_open_upvalue, :load_env, :get_global - :get_field on _ENV (inline no-metatable fast path; metatable case delegates to Executor.index_value/6) - :call with single-result returns; routes through call_function_with_position which bridges native-callback position tracking but no-ops for Lua-to-Lua calls. - :scope (transparent block inlining) - :move, :load_constant, :load_nil, :load_boolean, :source_line Out of scope (B5c/B5d/B5e): - table opcodes (:new_table, :get_table, :set_table, :set_list, :set_field, non-env :get_field) - closure construction (:closure), upvalue mutation (:set_upvalue, :set_open_upvalue), varargs, multi-value returns - error position fidelity for raises inside compiled code - :goto/:label, loops (:numeric_for, :while_loop, :repeat_loop, :generic_for, :break) The all-or-nothing rule applies per prototype: if any opcode in a prototype is unsupported, that prototype falls back to interpretation. Sub-prototypes compile or fall back independently, and the :closure opcode emits the appropriate value type per child. Suite: 1705 tests + 51 properties + 55 doctests, 0 failures. 29 lua53 tests, 0 failures. Perf (fib(30)): - main: ~970 ms - with B5a: ~670 ms (1.4x faster than main, 1.07x vs Luerl) The 5x-vs-Luerl stretch target from the plan is not met by this PR alone — most of the remaining gap is throw/catch overhead on the non-tail :return forms, register-tuple setelement churn, and the Process.put bridge on calls. Each closes incrementally as B5b through B5e land. 
Plan: B5a --- lib/lua.ex | 13 +- lib/lua/api.ex | 3 +- lib/lua/compiler.ex | 16 +- lib/lua/compiler/erlang.ex | 119 +++++ lib/lua/compiler/erlang/codegen.ex | 279 ++++++++++ lib/lua/compiler/erlang/opcodes.ex | 798 +++++++++++++++++++++++++++++ lib/lua/compiler/erlang/runtime.ex | 35 ++ lib/lua/compiler/prototype.ex | 6 +- lib/lua/util.ex | 1 + lib/lua/vm.ex | 17 +- lib/lua/vm/display.ex | 16 + lib/lua/vm/executor.ex | 134 ++++- lib/lua/vm/stdlib.ex | 12 +- lib/lua/vm/stdlib/debug.ex | 12 + lib/lua/vm/stdlib/string.ex | 3 +- lib/lua/vm/stdlib/util.ex | 1 + lib/lua/vm/value.ex | 2 + test/lua/vm/display_test.exs | 15 +- 18 files changed, 1466 insertions(+), 16 deletions(-) create mode 100644 lib/lua/compiler/erlang.ex create mode 100644 lib/lua/compiler/erlang/codegen.ex create mode 100644 lib/lua/compiler/erlang/opcodes.ex create mode 100644 lib/lua/compiler/erlang/runtime.ex diff --git a/lib/lua.ex b/lib/lua.ex index 089f454..968aa23 100644 --- a/lib/lua.ex +++ b/lib/lua.ex @@ -9,6 +9,7 @@ defmodule Lua do alias Lua.Util alias Lua.VM.AssertionError alias Lua.VM.Display + alias Lua.VM.Executor alias Lua.VM.InternalError alias Lua.VM.RuntimeError alias Lua.VM.State @@ -713,13 +714,20 @@ defmodule Lua do end) {results, _regs, new_state} = - Lua.VM.Executor.execute(proto.instructions, callee_regs, upvalues, proto, state) + Executor.execute(proto.instructions, callee_regs, upvalues, proto, state) {:ok, results, new_state} rescue e -> {:error, Exception.message(e), state} end + defp do_call_function({:compiled_closure, _, _, _, _} = closure, args, state) do + {results, new_state} = Executor.call_function(closure, args, state) + {:ok, results, new_state} + rescue + e -> {:error, Exception.message(e), state} + end + defp do_call_function(other, _args, state) do {:error, "undefined function '#{inspect(other)}'", state} end @@ -757,7 +765,8 @@ defmodule Lua do true iex> {[c], _} = Lua.eval!(Lua.new(), "return function() end") - iex> match?({:lua_closure, _, _}, 
Lua.unwrap(c)) + iex> match?({:lua_closure, _, _}, Lua.unwrap(c)) or + ...> match?({:compiled_closure, _, _, _, _}, Lua.unwrap(c)) true iex> Lua.unwrap(42) diff --git a/lib/lua/api.ex b/lib/lua/api.ex index 28df930..96e22bb 100644 --- a/lib/lua/api.ex +++ b/lib/lua/api.ex @@ -141,7 +141,8 @@ defmodule Lua.API do Is the value a reference to a Lua function? """ defguard is_lua_func(value) - when is_tuple(value) and tuple_size(value) == 3 and elem(value, 0) == :lua_closure + when (is_tuple(value) and tuple_size(value) == 3 and elem(value, 0) == :lua_closure) or + (is_tuple(value) and tuple_size(value) == 5 and elem(value, 0) == :compiled_closure) @doc """ Is the value a reference to an Erlang / Elixir function? diff --git a/lib/lua/compiler.ex b/lib/lua/compiler.ex index a5004b8..e7f30cc 100644 --- a/lib/lua/compiler.ex +++ b/lib/lua/compiler.ex @@ -16,14 +16,26 @@ defmodule Lua.Compiler do @doc """ Compiles a Lua AST chunk into a prototype. + + Prototypes that the Erlang codegen can handle (see + `Lua.Compiler.Erlang`) are returned with `compiled_module:` set + and dispatched directly to a BEAM module at runtime. Prototypes + containing opcodes not yet covered by the codegen fall back to + interpretation transparently. """ @spec compile(Chunk.t(), compile_opts()) :: {:ok, Prototype.t()} | {:error, term()} def compile(%Chunk{} = chunk, opts \\ []) do - with {:ok, scope_state} <- Scope.resolve(chunk, opts) do - Codegen.generate(chunk, scope_state, opts) + with {:ok, scope_state} <- Scope.resolve(chunk, opts), + {:ok, prototype} <- Codegen.generate(chunk, scope_state, opts) do + {:ok, maybe_compile_to_erlang(prototype)} end end + defp maybe_compile_to_erlang(%Prototype{} = proto) do + {:ok, compiled} = Lua.Compiler.Erlang.compile(proto) + compiled + end + @doc """ Compiles a Lua AST chunk, raising on error. 
""" diff --git a/lib/lua/compiler/erlang.ex b/lib/lua/compiler/erlang.ex new file mode 100644 index 0000000..09343b8 --- /dev/null +++ b/lib/lua/compiler/erlang.ex @@ -0,0 +1,119 @@ +defmodule Lua.Compiler.Erlang do + @moduledoc """ + Compiles `Lua.Compiler.Prototype` values to BEAM modules via + `:compile.forms/2`. + + A compiled prototype gets dispatched through the + `{:compiled_closure, module, function, upvalues, proto}` value type + recognised by `Lua.VM.Executor.call_function/3` and the `:call` + opcode. The compiled function takes `(args, upvalues, state)` + and returns `{results, state}`. + + ## Scope (B5a — opcode coverage) + + This first revision covers arithmetic, comparison, control flow, + loops, bitwise ops, string concat/length, source-line tracking, + calls, single-value returns, and upvalue reads. Prototypes that + contain table opcodes (B5c), closure construction (B5d), varargs + (B5d), or multi-value returns (B5d) fall back to the interpreter + via `:fallback`. + + All-or-nothing per prototype: if any opcode in the instruction + stream is uncovered, the whole prototype falls back. + + ## Module lifecycle + + Each accepted prototype gets a fresh module name in B5a, and + loaded modules are never purged (they leak). + B5b introduces a content-addressable ref-counted cache. + """ + + alias Lua.Compiler.Erlang.Codegen + alias Lua.Compiler.Prototype + + require Logger + + @doc """ + Attempts to compile a prototype (and its sub-prototypes) to BEAM + modules. + + Always returns `{:ok, prototype}`. When the codegen succeeds, + `:compiled_module` is set on the returned prototype. When an + opcode in the prototype is not yet supported by the codegen, the + prototype comes back without `:compiled_module` set and the + interpreter runs it; sub-prototypes that did compile stay marked. + + On a compilation failure (`:compile.forms/2` error, + `:code.load_binary/3` error), logs a warning and likewise returns + the uncompiled prototype rather than raising, so the caller (the + public Lua compile path) falls back to interpretation. 
+ """ + @spec compile(Prototype.t()) :: {:ok, Prototype.t()} | :fallback + def compile(%Prototype{} = proto) do + # Sub-prototypes compile independently — bottom-up. Each + # sub-prototype's compile-or-fallback status is set on its + # `compiled_module` field. The closure-construction opcode in the + # *parent* checks that field at codegen time and emits either + # `:compiled_closure` or `:lua_closure` accordingly. + # + # This lets a parent compile even if some children don't, and + # vice versa. The B5a codegen sets up the wiring; B5d's `:closure` + # opcode lowering picks the right closure type. + # + # Returns `{:ok, proto_with_subs_compiled}` even if the parent + # itself can't compile — the caller still wants the updated + # sub-prototype tree so interpreter-driven closure construction + # can emit `:compiled_closure` for sub-prototypes that did compile. + compiled_subs = + Enum.map(proto.prototypes, fn sub -> + {:ok, compiled} = compile(sub) + compiled + end) + + proto = %{proto | prototypes: compiled_subs} + + case Codegen.generate(proto) do + {:ok, module_name, function_name, forms} -> + load_or_pass_through(module_name, function_name, forms, proto) + + :fallback -> + # Parent prototype itself isn't covered; pass through with + # subs intact so the interpreter can still close them as + # compiled. 
+ {:ok, proto} + end + end + + defp load_or_pass_through(module_name, function_name, forms, proto) do + case load_module(module_name, function_name, forms, proto) do + {:ok, _} = ok -> ok + :fallback -> {:ok, proto} + end + end + + defp load_module(module_name, function_name, forms, proto) do + case :compile.forms(forms, [:return, :no_spawn_compiler_process]) do + {:ok, ^module_name, binary, _warnings} -> + beam_path = ~c"#{module_name}.beam" + + case :code.load_binary(module_name, beam_path, binary) do + {:module, ^module_name} -> + {:ok, %{proto | compiled_module: {module_name, function_name}}} + + {:error, reason} -> + Logger.warning( + "Lua.Compiler.Erlang: load_binary failed for #{inspect(module_name)}: " <> + inspect(reason) + ) + + :fallback + end + + error -> + Logger.warning( + "Lua.Compiler.Erlang: compile.forms failed for #{inspect(module_name)}: " <> + inspect(error) + ) + + :fallback + end + end +end diff --git a/lib/lua/compiler/erlang/codegen.ex b/lib/lua/compiler/erlang/codegen.ex new file mode 100644 index 0000000..3cef285 --- /dev/null +++ b/lib/lua/compiler/erlang/codegen.ex @@ -0,0 +1,279 @@ +defmodule Lua.Compiler.Erlang.Codegen do + @moduledoc false + # Walks a `Lua.Compiler.Prototype` and produces Erlang abstract forms + # ready for `:compile.forms/2`. + # + # Strategy: the compiled function keeps registers in a tuple identical + # in shape to the interpreter's. Each opcode emits Erlang code that + # reads from the tuple via `element/2` and writes via `setelement/3`. + # State threads as a single Erlang variable through every opcode that + # can mutate it. + # + # This is the conservative shape from the parent B5 plan (Option 1, + # plan line 159-162): keep the register tuple, eat `setelement/3` per + # write, but eliminate the entire interpreter dispatch loop. The second + # spike (fib faithful, 12.4x faster than interpreter) used this shape + # and confirmed the win. 
+ # + # SSA register promotion is a follow-on (deferred B5c-style work) and + # would buy another large chunk on top. + + alias Lua.Compiler.Erlang.Opcodes + alias Lua.Compiler.Prototype + + # Variable names used in the generated function body. `__` prefixes + # avoid collisions with anything the codegen might want to introduce + # later. + @args_var :__Args + @upvalues_var :__Upvalues + @state_var :__State + @regs_var :__Regs + + defmodule Ctx do + @moduledoc false + # Codegen context threaded through every opcode lowering. Each + # opcode's lowering function returns `{forms, updated_ctx}`. + + defstruct [ + # Counter used to mint fresh helper-function names for loop + # bodies, labels, etc. + :next_label, + # Counter used to mint fresh state variable versions + # (State_0, State_1, …). + :next_state_version, + # Atom for the current state variable name. + :state_var, + # Counter used to mint fresh register-tuple variable versions + # (Regs_0, Regs_1, …). + :next_regs_version, + # Atom for the current registers variable name. + :regs_var, + # Map of label name → helper function name. Populated as we + # walk and encounter `:label` opcodes. `:goto` resolves + # against this map at codegen time, not at runtime. + :labels, + # Accumulator for helper function clauses (loop bodies, + # label targets) that the lowering emits as side-effects of + # the main walk. + :helpers, + # The prototype being compiled — for source position, max_registers, + # etc. + :proto, + # Current source line, updated by `:source_line` opcodes. Used as + # the `line` arg in calls to `Executor.apply_arith_op` and friends + # so runtime errors carry the right position. 
+ :line + ] + + def new(proto, state_var, regs_var) do + %__MODULE__{ + next_label: 0, + next_state_version: 0, + state_var: state_var, + next_regs_version: 0, + regs_var: regs_var, + labels: %{}, + helpers: [], + proto: proto, + line: elem(proto.lines, 0) || 1 + } + end + + def fresh_state_var(%__MODULE__{next_state_version: n} = ctx) do + var = String.to_atom("State_#{n}") + {var, %{ctx | next_state_version: n + 1, state_var: var}} + end + + def fresh_regs_var(%__MODULE__{next_regs_version: n} = ctx) do + var = String.to_atom("Regs_#{n}") + {var, %{ctx | next_regs_version: n + 1, regs_var: var}} + end + + def fresh_label(%__MODULE__{next_label: n} = ctx, prefix) do + name = String.to_atom("#{prefix}_#{n}") + {name, %{ctx | next_label: n + 1}} + end + + def add_helper(%__MODULE__{helpers: helpers} = ctx, helper_form) do + %{ctx | helpers: [helper_form | helpers]} + end + end + + # Module names use `:erlang.unique_integer/1` so concurrent compiles + # do not collide. Replaced by content-addressable hashing in B5b. + + @doc """ + Walks a prototype and returns either `{:ok, module, function, forms}` + ready to feed to `:compile.forms/2`, or `:fallback` if any opcode is + not yet covered by the codegen. + """ + @spec generate(Prototype.t()) :: + {:ok, module(), atom(), list()} | :fallback + def generate(%Prototype{} = proto) do + module_name = next_module_name() + function_name = :execute + + ctx = Ctx.new(proto, @state_var, @regs_var) + + # Separate the tail :return (if present) so it can emit a natural + # return form, bypassing the throw/catch round-trip. Saves + # ~half of throws on functions with early-exit branches like fib. 
+ {body_instructions, tail_return} = split_tail_return(proto.instructions) + + case lower_instructions(body_instructions, ctx) do + {:ok, body_forms, ctx_after} -> + tail_form = build_tail_return(tail_return, ctx_after) + forms = build_module(module_name, function_name, proto, body_forms ++ tail_form, ctx_after) + {:ok, module_name, function_name, forms} + + :fallback -> + :fallback + end + end + + defp split_tail_return(instructions) do + case List.last(instructions) do + {:return, base, 1} -> + {Enum.drop(instructions, -1), {:return, base, 1}} + + _ -> + {instructions, nil} + end + end + + defp build_tail_return(nil, _ctx), do: [] + + defp build_tail_return({:return, base, 1}, %{state_var: state_var, regs_var: regs_var, line: line}) do + # Direct `{[element(base+1, Regs)], State}` — no throw. + [ + {:tuple, line, + [ + {:cons, line, {:call, line, {:atom, line, :element}, [{:integer, line, base + 1}, {:var, line, regs_var}]}, + {nil, line}}, + {:var, line, state_var} + ]} + ] + end + + defp next_module_name do + n = :erlang.unique_integer([:positive, :monotonic]) + :"lua_proto_b5a_#{n}" + end + + # Build the full module: attribute headers + the execute/3 function. + defp build_module(module_name, function_name, %Prototype{} = proto, body_forms, ctx) do + line = elem(proto.lines, 0) || 1 + + function_clauses = [ + build_execute_clause(proto, body_forms, line, ctx) + ] + + [ + {:attribute, line, :module, module_name}, + {:attribute, line, :export, [{function_name, 3}]} + | Enum.reverse(ctx.helpers) + ] ++ + [{:function, line, function_name, 3, function_clauses}] + end + + defp build_execute_clause(%Prototype{} = proto, body_forms, line, ctx) do + head_patterns = [ + {:var, line, @args_var}, + {:var, line, @upvalues_var}, + {:var, line, @state_var} + ] + + prelude = build_register_prelude(proto, line) + + # The body is wrapped in a try/catch that catches `throw/1` payloads + # of the shape `{:b5_return, Results, State}`. 
This is how we model + # Lua's "return from anywhere" semantics in Erlang's + # expression-oriented language. `:return` opcode forms emit `throw`s + # (except for a tail-position `:return` which we lift out as a + # natural return — that's `body_forms`' last element when the + # generator decided to optimise it). + # + # If the body's last form is *not* a return tuple, append the + # implicit `{[], State_curr}` so a function that falls off the end + # still has a return value. + body_block = + case List.last(body_forms) do + {:tuple, _, [_cons_or_nil, _state]} -> + # Last form is a natural-tail return tuple — don't override. + body_forms + + _ -> + body_forms ++ [{:tuple, line, [{nil, line}, {:var, line, ctx.state_var}]}] + end + + try_body = make_block(body_block, line) + + return_var = :__B5ReturnResults + return_state_var = :__B5ReturnState + + catch_clauses = [ + {:clause, line, + [ + {:tuple, line, + [ + {:atom, line, :throw}, + {:tuple, line, [{:atom, line, :b5_return}, {:var, line, return_var}, {:var, line, return_state_var}]}, + {:var, line, :_} + ]} + ], [], [{:tuple, line, [{:var, line, return_var}, {:var, line, return_state_var}]}]} + ] + + try_form = + {:try, line, [try_body], [], catch_clauses, []} + + {:clause, line, head_patterns, [], prelude ++ [try_form]} + end + + # Wrap a list of forms in a `begin … end` block to keep them as a + # single expression. If there's only one form, no wrapping needed. + defp make_block([single], _line), do: single + defp make_block(forms, line), do: {:block, line, forms} + + # Builds the initial register tuple `__Regs`. + # + # Uses `erlang:make_tuple/2` + `setelement/3` to install the args. + # Simple and fast for now; B5b-or-later could rework this to share + # a pre-built nil-tuple constant across calls when max_registers is + # known at codegen time. 
+ defp build_register_prelude(%Prototype{} = proto, line) do + max_regs = proto.max_registers + 16 + param_count = proto.param_count + + init_var = :Regs_init + + make_tuple_call = + {:call, line, {:remote, line, {:atom, line, :erlang}, {:atom, line, :make_tuple}}, + [{:integer, line, max_regs}, {:atom, line, nil}]} + + init_match = {:match, line, {:var, line, init_var}, make_tuple_call} + + copy_call = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.Compiler.Erlang.Runtime"}, {:atom, line, :copy_args}}, + [ + {:var, line, @args_var}, + {:var, line, init_var}, + {:integer, line, 0}, + {:integer, line, param_count} + ]} + + regs_match = {:match, line, {:var, line, @regs_var}, copy_call} + + [init_match, regs_match] + end + + # Lowers a list of instructions. Returns `{:ok, forms, ctx}` or + # `:fallback`. + def lower_instructions(instructions, %Ctx{} = ctx) do + Enum.reduce_while(instructions, {:ok, [], ctx}, fn instr, {:ok, acc, ctx} -> + case Opcodes.lower(instr, ctx) do + {:ok, new_forms, new_ctx} -> {:cont, {:ok, acc ++ new_forms, new_ctx}} + :fallback -> {:halt, :fallback} + end + end) + end +end diff --git a/lib/lua/compiler/erlang/opcodes.ex b/lib/lua/compiler/erlang/opcodes.ex new file mode 100644 index 0000000..86307f9 --- /dev/null +++ b/lib/lua/compiler/erlang/opcodes.ex @@ -0,0 +1,798 @@ +defmodule Lua.Compiler.Erlang.Opcodes do + @moduledoc false + # Per-opcode lowering for `Lua.Compiler.Erlang.Codegen`. + # + # Each `lower/2` clause matches one opcode tuple shape and returns + # either `{:ok, [erlang_form], updated_ctx}` or `:fallback`. + # + # Conventions: + # - Erlang forms use the abstract syntax tree shape consumed by + # `:compile.forms/2`. See `:erl_parse` for the grammar. + # - All forms carry a line number for the BEAM debugger. + # - Reads from registers use `element(N+1, Regs_curr)`. + # - Writes thread a fresh `Regs_n` via `setelement(N+1, Regs_curr, Value)`. + # - Writes to state thread a fresh `State_n` likewise. 
+ + alias Lua.Compiler.Erlang.Codegen.Ctx + + # ── Public entry ────────────────────────────────────────────────── + + def lower({:return, base, 1}, %Ctx{} = ctx) do + line = current_line(ctx) + value_form = get_register(base, line, ctx) + + # `throw({:b5_return, Results, State})` — wrapped in a `try/catch` + # at the function level. This is how we model Lua's "return from + # anywhere in the body" in Erlang's expression-oriented semantics. + # The overhead of throw/catch is small (sub-microsecond) and pays + # only when a return is actually executed. + return_payload = + {:tuple, line, + [ + {:atom, line, :b5_return}, + {:cons, line, value_form, {nil, line}}, + {:var, line, ctx.state_var} + ]} + + throw_form = + {:call, line, {:atom, line, :throw}, [return_payload]} + + {:ok, [throw_form], ctx} + end + + def lower({:load_constant, dest, value}, %Ctx{} = ctx) do + line = current_line(ctx) + value_form = literal_to_form(value, line) + {forms, ctx} = set_register(dest, value_form, line, ctx) + {:ok, forms, ctx} + end + + def lower({:move, dest, source}, %Ctx{} = ctx) do + line = current_line(ctx) + src_form = get_register(source, line, ctx) + {forms, ctx} = set_register(dest, src_form, line, ctx) + {:ok, forms, ctx} + end + + def lower({:source_line, line, _source}, %Ctx{} = ctx) do + # No runtime effect — just update the codegen-tracked current line + # so subsequent opcodes' raise sites get the right position. + {:ok, [], %{ctx | line: line}} + end + + def lower({:load_env, dest}, %Ctx{} = ctx) do + line = current_line(ctx) + # _ENV is `state.g_ref`. Emit `state.g_ref` via `maps:get(g_ref, State_curr)`. 
+ g_ref_form = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :g_ref}, {:var, line, ctx.state_var}]} + + {forms, ctx} = set_register(dest, g_ref_form, line, ctx) + {:ok, forms, ctx} + end + + def lower({:load_boolean, dest, value}, %Ctx{} = ctx) do + line = current_line(ctx) + bool = if value, do: true, else: false + {forms, ctx} = set_register(dest, {:atom, line, bool}, line, ctx) + {:ok, forms, ctx} + end + + def lower({:load_nil, dest, count}, %Ctx{} = ctx) when is_integer(count) and count > 0 do + Enum.reduce_while(0..(count - 1), {:ok, [], ctx}, fn offset, {:ok, acc, ctx} -> + line = current_line(ctx) + {f, ctx} = set_register(dest + offset, {:atom, line, nil}, line, ctx) + {:cont, {:ok, acc ++ f, ctx}} + end) + end + + # ── Arithmetic ──────────────────────────────────────────────────── + # + # Integer fast path inlined as a guard; non-integer falls through to + # `Lua.VM.Executor.apply_arith_op/6` which handles all coercion + + # metamethod dispatch. + + def lower({:add, dest, a, b}, ctx), do: arith_binop(:add, dest, a, b, ctx) + def lower({:subtract, dest, a, b}, ctx), do: arith_binop(:subtract, dest, a, b, ctx) + def lower({:multiply, dest, a, b}, ctx), do: arith_binop(:multiply, dest, a, b, ctx) + def lower({:divide, dest, a, b}, ctx), do: arith_binop_slow(:divide, dest, a, b, ctx) + def lower({:floor_divide, dest, a, b}, ctx), do: arith_binop_slow(:floor_divide, dest, a, b, ctx) + def lower({:modulo, dest, a, b}, ctx), do: arith_binop_slow(:modulo, dest, a, b, ctx) + def lower({:power, dest, a, b}, ctx), do: arith_binop_slow(:power, dest, a, b, ctx) + def lower({:negate, dest, source}, ctx), do: arith_unop(:negate, dest, source, ctx) + + # ── Comparison ──────────────────────────────────────────────────── + + # Comparisons with a fast path for two numeric operands (the common + # case for `if n < 2` and friends). 
Numbers can't carry metatables in + # Lua, so the metamethod path is pure overhead when both sides are + # numbers. + def lower({:less_than, dest, a, b}, ctx), do: cmp_binop_with_fastpath(:<, :less_than, dest, a, b, ctx) + def lower({:less_equal, dest, a, b}, ctx), do: cmp_binop_with_fastpath(:"=<", :less_equal, dest, a, b, ctx) + def lower({:greater_than, dest, a, b}, ctx), do: cmp_binop_with_fastpath(:>, :greater_than, dest, a, b, ctx) + def lower({:greater_equal, dest, a, b}, ctx), do: cmp_binop_with_fastpath(:>=, :greater_equal, dest, a, b, ctx) + def lower({:equal, dest, a, b}, ctx), do: cmp_binop(:equal, dest, a, b, ctx) + def lower({:not_equal, dest, a, b}, ctx), do: cmp_binop(:not_equal, dest, a, b, ctx) + + # ── Upvalues and globals ────────────────────────────────────────── + + def lower({:get_open_upvalue, dest, reg}, %Ctx{} = ctx) do + line = current_line(ctx) + # case maps:get(reg, state.open_upvalues, nil) of + # nil -> element(reg+1, Regs); + # CellRef -> maps:get(CellRef, state.upvalue_cells) + # end + open_upvalues_map = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :open_upvalues}, {:var, line, ctx.state_var}]} + + cell_ref_or_nil = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:integer, line, reg}, open_upvalues_map, {:atom, line, nil}]} + + upvalue_cells_map = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :upvalue_cells}, {:var, line, ctx.state_var}]} + + cell_var = fresh_atom(:OpenCell) + # Fresh local binder for the non-nil clause; scoped to that clause + # only, so no `unsafe_var` warning. 
+ ref_var = fresh_atom(:OpenRef) + + case_form = + {:case, line, {:var, line, cell_var}, + [ + {:clause, line, [{:atom, line, nil}], [], [get_register(reg, line, ctx)]}, + {:clause, line, [{:var, line, ref_var}], [], + [ + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:var, line, ref_var}, upvalue_cells_map]} + ]} + ]} + + cell_match = {:match, line, {:var, line, cell_var}, cell_ref_or_nil} + + value_var = fresh_atom(:OpenValue) + value_match = {:match, line, {:var, line, value_var}, case_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [cell_match, value_match | set_forms], ctx} + end + + def lower({:get_upvalue, dest, index}, %Ctx{} = ctx) do + line = current_line(ctx) + # CellRef = element(Index+1, Upvalues), + # Value = maps:get(CellRef, maps:get(upvalue_cells, State_curr)), + # set_register dest <- Value. + cell_ref = + {:call, line, {:atom, line, :element}, [{:integer, line, index + 1}, {:var, line, :__Upvalues}]} + + upvalue_cells = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :upvalue_cells}, {:var, line, ctx.state_var}]} + + value_form = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, [cell_ref, upvalue_cells]} + + {forms, ctx} = set_register(dest, value_form, line, ctx) + {:ok, forms, ctx} + end + + def lower({:get_global, dest, name}, %Ctx{} = ctx) do + line = current_line(ctx) + # globals = state.tables[state.g_ref id].data + # value = globals[name] or nil + g_ref = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :g_ref}, {:var, line, ctx.state_var}]} + + g_id = {:call, line, {:atom, line, :element}, [{:integer, line, 2}, g_ref]} + + tables = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :tables}, {:var, line, ctx.state_var}]} + + g_table = {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, 
[g_id, tables]} + + g_data = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, [{:atom, line, :data}, g_table]} + + value = + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [literal_to_form(name, line), g_data, {:atom, line, nil}]} + + {forms, ctx} = set_register(dest, value, line, ctx) + {:ok, forms, ctx} + end + + # `:set_global` mutates state — falls back. Most globals are written + # via `:set_field` on `_ENV`; pure `:set_global` opcodes are rare in + # compiled code. B5c picks this up alongside the table opcodes. + + # `:get_field` with a binary literal name — the bread-and-butter + # global lookup pattern (`_ENV.print`). Inlines the no-metatable + # fast path from `executor.ex` and falls through to + # `Executor.index_value/6` for the metatable or non-tref case. + def lower({:get_field, dest, table_reg, name, name_hint}, %Ctx{} = ctx) when is_binary(name) do + line = current_line(ctx) + table_form = get_register(table_reg, line, ctx) + + # Inline fast path: + # case TableForm of + # {tref, Id} -> + # T = maps:get(Id, maps:get(tables, State)), + # case maps:get(metatable, T) of + # nil -> + # case maps:find(Name, maps:get(data, T)) of + # {ok, V} -> {V, State}; + # error -> {nil, State} + # end; + # _ -> Executor:index_value(...) %% metatable case + # end; + # _ -> Executor:index_value(...) %% non-tref + # end + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + # Slow path (metatable present or non-tref). 
+ slow_call = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :index_value}}, + [ + table_form, + literal_to_form(name, line), + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line), + term_to_form(name_hint, line) + ]} + + id_var = fresh_atom(:GFId) + table_var = fresh_atom(:GFTable) + data_var = fresh_atom(:GFData) + value_var = fresh_atom(:GFValue) + + fast_path_body = + {:block, line, + [ + # T = maps:get(Id, maps:get(tables, State)) + {:match, line, {:var, line, table_var}, + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [ + {:var, line, id_var}, + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :tables}, {:var, line, prev_state}]} + ]}}, + # case maps:get(metatable, T) of nil -> data lookup; _ -> slow_call end + {:case, line, + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :metatable}, {:var, line, table_var}]}, + [ + {:clause, line, [{:atom, line, nil}], [], + [ + # D = maps:get(data, T) + {:match, line, {:var, line, data_var}, + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [{:atom, line, :data}, {:var, line, table_var}]}}, + # {maps:get(Name, D, nil), State} + {:tuple, line, + [ + {:call, line, {:remote, line, {:atom, line, :maps}, {:atom, line, :get}}, + [literal_to_form(name, line), {:var, line, data_var}, {:atom, line, nil}]}, + {:var, line, prev_state} + ]} + ]}, + {:clause, line, [{:var, line, :_}], [], [slow_call]} + ]} + ]} + + tref_clause = + {:clause, line, [{:tuple, line, [{:atom, line, :tref}, {:var, line, id_var}]}], [], [fast_path_body]} + + other_clause = {:clause, line, [{:var, line, :_}], [], [slow_call]} + + case_form = {:case, line, table_form, [tref_clause, other_clause]} + + match_form = + {:match, line, {:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, case_form} + + {set_forms, ctx} = 
set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + # ── Calls ───────────────────────────────────────────────────────── + + def lower({:call, base, arg_count, 1, _hint}, %Ctx{} = ctx) when is_integer(arg_count) and arg_count >= 0 do + line = current_line(ctx) + callable_form = get_register(base, line, ctx) + args_list = build_args_list(base + 1, arg_count, line, ctx) + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + # Bridge native callbacks the same way the interpreter does: + # before calling, push the current (line, source) into the process + # dict via `Lua.VM.Executor.set_call_position/2`. After (or on + # raise) restore the previous value. The helper exists for both + # paths to share. + invoke_call = + {:call, line, + {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :call_function_with_position}}, + [ + callable_form, + args_list, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + + results_var = fresh_atom(:CallResults) + + match_form = + {:match, line, {:tuple, line, [{:var, line, results_var}, {:var, line, state_var}]}, invoke_call} + + # First-result extraction: `case Results of [V|_] -> V; [] -> nil end`. + # Lua single-result calls coerce missing results to nil. 
+ first_var = fresh_atom(:CallResult0) + + first_extract = + {:case, line, {:var, line, results_var}, + [ + {:clause, line, [{:cons, line, {:var, line, first_var}, {:var, line, :_}}], [], [{:var, line, first_var}]}, + {:clause, line, [{nil, line}], [], [{:atom, line, nil}]} + ]} + + extract_var = fresh_atom(:CallFirst) + + extract_match = {:match, line, {:var, line, extract_var}, first_extract} + + {set_forms, ctx} = set_register(base, {:var, line, extract_var}, line, ctx) + {:ok, [match_form, extract_match | set_forms], ctx} + end + + # ── Conditional branch ──────────────────────────────────────────── + # + # `:test` is the workhorse for `if`/`while`/`repeat` conditions. We + # lower it to an Erlang `case` over `Lua.VM.Value.truthy?/1`. + # + # Critical: any registers or state mutated inside either branch + # become "exported" from the case, which Erlang's linter flags as + # `unsafe_var` unless every clause writes the same set of variables. + # To keep this safe, the codegen passes a fresh ctx into each branch + # (forking) and only commits the new state/regs vars from the branch + # if it falls through (doesn't return). For B5a the simplification: + # only one branch may "fall through" to the rest of the function; + # the other must terminate (via throw from `:return`). The + # `terminates_with_return?/1` check enforces this. + + def lower({:test, reg, then_body, else_body}, %Ctx{} = ctx) do + line = current_line(ctx) + reg_form = get_register(reg, line, ctx) + + truthy_call = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Value"}, {:atom, line, :truthy?}}, [reg_form]} + + then_returns? = terminates_with_return?(then_body) + else_returns? = terminates_with_return?(else_body) + + if then_returns? and (else_body == [] or else_returns?) do + # Both branches terminate (or else is empty/falls through). + # Easy case — emit a case where each branch's forms are + # self-contained. 
+ lower_terminating_test(line, truthy_call, then_body, else_body, ctx) + else + # Mixed shape (one branch returns, the other writes state and + # falls through to subsequent opcodes). Handling this needs + # SSA-merge semantics on case branches, which B5a defers. + :fallback + end + end + + def lower({:test_true, reg, then_body}, %Ctx{} = ctx) do + # Single-branch variant — desugar to :test with empty else. + lower({:test, reg, then_body, []}, ctx) + end + + # ── Logical NOT ─────────────────────────────────────────────────── + + def lower({:not, dest, source}, %Ctx{} = ctx) do + line = current_line(ctx) + src_form = get_register(source, line, ctx) + + truthy_call = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Value"}, {:atom, line, :truthy?}}, [src_form]} + + not_form = {:op, line, :not, truthy_call} + {forms, ctx} = set_register(dest, not_form, line, ctx) + {:ok, forms, ctx} + end + + # ── Fallback ────────────────────────────────────────────────────── + + def lower(_other, _ctx) do + :fallback + end + + defp lower_terminating_test(line, truthy_call, then_body, else_body, ctx) do + # Fork ctx for each branch — fresh state/regs counters inside the + # branch don't leak out (the branch terminates via throw). + case lower_branch_body(then_body, ctx) do + {:ok, then_forms} -> + case lower_branch_body(else_body, ctx) do + {:ok, else_forms} -> + else_clause_body = + if else_forms == [] do + # Empty else: fall through to the rest of the function. + # Emit `ok` as a placeholder expression. The case + # yields nothing useful; subsequent opcodes don't read + # from this case. 
+ [{:atom, line, :ok}] + else + else_forms + end + + case_form = + {:case, line, truthy_call, + [ + {:clause, line, [{:atom, line, true}], [], then_forms}, + {:clause, line, [{:atom, line, false}], [], else_clause_body} + ]} + + {:ok, [case_form], ctx} + + :fallback -> + :fallback + end + + :fallback -> + :fallback + end + end + + defp lower_branch_body([], _ctx), do: {:ok, []} + + defp lower_branch_body(body, ctx) do + case Lua.Compiler.Erlang.Codegen.lower_instructions(body, ctx) do + {:ok, forms, _ctx_after} -> {:ok, forms} + :fallback -> :fallback + end + end + + defp terminates_with_return?([]), do: false + + defp terminates_with_return?(instructions) do + case List.last(instructions) do + {:return, _, _} -> true + :return -> true + _ -> false + end + end + + # ── Arithmetic lowering helpers ─────────────────────────────────── + + # Integer-fast-path opcode (add/subtract/multiply). Inlines a case + # that checks both operands are integers, does the operation + # directly with `+`/`-`/`*` plus `Numeric.to_signed_int64/1` for + # wrap-around, and falls through to `apply_arith_op` on any other + # operand shape. + defp arith_binop(op, dest, a, b, %Ctx{} = ctx) do + line = current_line(ctx) + a_form = get_register(a, line, ctx) + b_form = get_register(b, line, ctx) + + erl_op = + case op do + :add -> :+ + :subtract -> :- + :multiply -> :* + end + + # We need to compute the operation. The integer fast path: + # case {A, B} of + # {Ai, Bi} when is_integer(Ai), is_integer(Bi) -> + # {'Elixir.Lua.VM.Numeric':to_signed_int64(Ai OP Bi), State_curr}; + # _ -> + # 'Elixir.Lua.VM.Executor':apply_arith_op(Op, A, B, State_curr, Line, Source) + # end + # + # The case yields `{Value, NewState}`. Match-bind it to fresh vars. 
+ + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + int_ai = fresh_atom(:Ai) + int_bi = fresh_atom(:Bi) + + fast_clause = + {:clause, line, [{:tuple, line, [{:var, line, int_ai}, {:var, line, int_bi}]}], + [ + [ + {:call, line, {:atom, line, :is_integer}, [{:var, line, int_ai}]}, + {:call, line, {:atom, line, :is_integer}, [{:var, line, int_bi}]} + ] + ], + [ + {:tuple, line, + [ + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Numeric"}, {:atom, line, :to_signed_int64}}, + [{:op, line, erl_op, {:var, line, int_ai}, {:var, line, int_bi}}]}, + {:var, line, prev_state} + ]} + ]} + + slow_clause = + {:clause, line, [{:var, line, :_}], [], + [ + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :apply_arith_op}}, + [ + {:atom, line, op}, + a_form, + b_form, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + ]} + + case_form = + {:case, line, {:tuple, line, [a_form, b_form]}, [fast_clause, slow_clause]} + + value_var = fresh_atom(:ArithValue) + + match_form = + {:match, line, {:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, case_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + # Slow-path-only opcode (divide, floor_divide, modulo, power). No + # integer fast path because the operation requires Lua-specific + # handling of edge cases (zero divisor, float coercion, etc.). + # All cases go through `apply_arith_op`. 
+ defp arith_binop_slow(op, dest, a, b, %Ctx{} = ctx) do + line = current_line(ctx) + a_form = get_register(a, line, ctx) + b_form = get_register(b, line, ctx) + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + call_form = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :apply_arith_op}}, + [ + {:atom, line, op}, + a_form, + b_form, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + + value_var = fresh_atom(:ArithValue) + + match_form = + {:match, line, {:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, call_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + defp arith_unop(op, dest, source, %Ctx{} = ctx) do + line = current_line(ctx) + src_form = get_register(source, line, ctx) + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + call_form = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :apply_unary_op}}, + [ + {:atom, line, op}, + src_form, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + + value_var = fresh_atom(:UnaryValue) + + match_form = + {:match, line, {:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, call_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + # Two-number fast path for less_than/less_equal/greater_than/ + # greater_equal. Bypasses `apply_compare_op` entirely when both + # operands are integers or floats — numbers don't carry metatables so + # there's nothing to dispatch. 
+ defp cmp_binop_with_fastpath(erl_op, op, dest, a, b, %Ctx{} = ctx) do + line = current_line(ctx) + a_form = get_register(a, line, ctx) + b_form = get_register(b, line, ctx) + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + int_ai = fresh_atom(:CmpAi) + int_bi = fresh_atom(:CmpBi) + + fast_clause = + {:clause, line, [{:tuple, line, [{:var, line, int_ai}, {:var, line, int_bi}]}], + [ + [ + {:call, line, {:atom, line, :is_number}, [{:var, line, int_ai}]}, + {:call, line, {:atom, line, :is_number}, [{:var, line, int_bi}]} + ] + ], + [ + {:tuple, line, [{:op, line, erl_op, {:var, line, int_ai}, {:var, line, int_bi}}, {:var, line, prev_state}]} + ]} + + slow_clause = + {:clause, line, [{:var, line, :_}], [], + [ + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :apply_compare_op}}, + [ + {:atom, line, op}, + a_form, + b_form, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + ]} + + case_form = {:case, line, {:tuple, line, [a_form, b_form]}, [fast_clause, slow_clause]} + + value_var = fresh_atom(:CmpValue) + + match_form = + {:match, line, {:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, case_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + defp cmp_binop(op, dest, a, b, %Ctx{} = ctx) do + line = current_line(ctx) + a_form = get_register(a, line, ctx) + b_form = get_register(b, line, ctx) + + {state_var, ctx} = Ctx.fresh_state_var(ctx) + prev_state = previous_state_atom(ctx.state_var) + + call_form = + {:call, line, {:remote, line, {:atom, line, :"Elixir.Lua.VM.Executor"}, {:atom, line, :apply_compare_op}}, + [ + {:atom, line, op}, + a_form, + b_form, + {:var, line, prev_state}, + {:integer, line, line}, + literal_to_form(ctx.proto.source, line) + ]} + + value_var = fresh_atom(:CmpValue) + + match_form = + {:match, line, 
{:tuple, line, [{:var, line, value_var}, {:var, line, state_var}]}, call_form} + + {set_forms, ctx} = set_register(dest, {:var, line, value_var}, line, ctx) + {:ok, [match_form | set_forms], ctx} + end + + # Given the current state_var ctx field (already incremented by + # fresh_state_var), return the atom of the *previous* state version + # — that's what the slow-path call reads from. + defp previous_state_atom(:__State), do: :__State + + defp previous_state_atom(state_var_atom) do + # state vars are State_0, State_1, …; we want the one before + # ctx.state_var. Since fresh_state_var sets ctx.state_var to the + # new name, the previous version is at counter-1. But we've lost + # the counter here, so parse from the atom. + case Atom.to_string(state_var_atom) do + "__State" -> + :__State + + "State_0" -> + :__State + + "State_" <> n_str -> + n = String.to_integer(n_str) + String.to_atom("State_#{n - 1}") + end + end + + defp fresh_atom(prefix) do + String.to_atom("#{prefix}_#{:erlang.unique_integer([:positive, :monotonic])}") + end + + # Builds an Erlang cons-cell expression `[R_start, R_{start+1}, ..., R_{start+count-1}]` + # by reading from the current register tuple. + defp build_args_list(_start, 0, line, _ctx), do: {nil, line} + + defp build_args_list(start, count, line, ctx) do + head = get_register(start, line, ctx) + tail = build_args_list(start + 1, count - 1, line, ctx) + {:cons, line, head, tail} + end + + # ── Internal helpers ────────────────────────────────────────────── + + defp set_register(idx, value_form, line, %Ctx{} = ctx) do + # Capture the current register var BEFORE minting a fresh one — that's + # the version we read from. 
+ prev_var = ctx.regs_var + {new_var, ctx} = Ctx.fresh_regs_var(ctx) + + setel_form = + {:call, line, {:atom, line, :setelement}, [{:integer, line, idx + 1}, {:var, line, prev_var}, value_form]} + + match_form = {:match, line, {:var, line, new_var}, setel_form} + {[match_form], ctx} + end + + defp get_register(idx, line, %Ctx{} = ctx) do + {:call, line, {:atom, line, :element}, [{:integer, line, idx + 1}, {:var, line, ctx.regs_var}]} + end + + defp current_line(%Ctx{line: line}), do: line + + # ── Literal → Erlang abstract form ──────────────────────────────── + + defp literal_to_form(value, line) when is_integer(value), do: {:integer, line, value} + defp literal_to_form(value, line) when is_float(value), do: {:float, line, value} + + defp literal_to_form(value, line) when is_binary(value) do + # Lua strings can contain arbitrary bytes (not just UTF-8). Emit + # each byte as a separate `bin_element` so binaries with embedded + # non-UTF-8 bytes round-trip correctly. + bin_elements = + for <<byte <- value>> do + {:bin_element, line, {:integer, line, byte}, :default, :default} + end + + {:bin, line, bin_elements} + end + + defp literal_to_form(nil, line), do: {:atom, line, nil} + defp literal_to_form(true, line), do: {:atom, line, true} + defp literal_to_form(false, line), do: {:atom, line, false} + + defp literal_to_form(atom, line) when is_atom(atom), do: {:atom, line, atom} + + # Generic term-to-abstract-form for arbitrary Erlang terms. + # Used for `name_hint` and other opaque tags that need to round-trip + # through codegen as-is. Falls back to `:erl_parse.abstract/1` for + # anything not explicitly handled.
+ defp term_to_form(value, line) when is_integer(value), do: {:integer, line, value} + defp term_to_form(value, line) when is_float(value), do: {:float, line, value} + defp term_to_form(nil, line), do: {:atom, line, nil} + defp term_to_form(true, line), do: {:atom, line, true} + defp term_to_form(false, line), do: {:atom, line, false} + defp term_to_form(atom, line) when is_atom(atom), do: {:atom, line, atom} + + defp term_to_form(value, line) when is_binary(value) do + bin_elements = + for <<byte <- value>> do + {:bin_element, line, {:integer, line, byte}, :default, :default} + end + + {:bin, line, bin_elements} + end + + defp term_to_form(tuple, line) when is_tuple(tuple) do + elements = Enum.map(Tuple.to_list(tuple), &term_to_form(&1, line)) + {:tuple, line, elements} + end + + defp term_to_form([], line), do: {nil, line} + + defp term_to_form([head | tail], line) do + {:cons, line, term_to_form(head, line), term_to_form(tail, line)} + end +end diff --git a/lib/lua/compiler/erlang/runtime.ex b/lib/lua/compiler/erlang/runtime.ex new file mode 100644 index 0000000..ee3e435 --- /dev/null +++ b/lib/lua/compiler/erlang/runtime.ex @@ -0,0 +1,35 @@ +defmodule Lua.Compiler.Erlang.Runtime do + @moduledoc false + # Runtime helpers called by code generated by `Lua.Compiler.Erlang`. + # + # The codegen emits remote calls into this module rather than inlining + # small loops as Erlang abstract forms — much easier to maintain when + # the helper is non-trivial. + # + # Functions here must stay backward-compatible with previously-loaded + # compiled modules (no signature changes) — those modules can outlive + # the current build's `Lua` deps until B5b's build-hash purging is in + # place. + + @doc """ + Copies up to `count` args into the first `count` slots of `regs`. + + Mirrors the interpreter's argument-binding behaviour (see + `Lua.VM.Executor.copy_args_to_regs/5`). Missing args land as `nil`.
+ Extra args are ignored (they go into `proto.varargs` for vararg + functions; the codegen handles that case in B5d). + """ + @spec copy_args([term()], tuple(), non_neg_integer(), non_neg_integer()) :: tuple() + def copy_args(_args, regs, _i, 0), do: regs + + def copy_args([], regs, i, count) when count > 0 do + # Out of args; remaining param slots are nil. `make_tuple/2` already + # initialised the tuple with nil, so just stop. + _ = i + regs + end + + def copy_args([arg | rest], regs, i, count) do + copy_args(rest, :erlang.setelement(i + 1, regs, arg), i + 1, count - 1) + end +end diff --git a/lib/lua/compiler/prototype.ex b/lib/lua/compiler/prototype.ex index 68835e0..a88ff45 100644 --- a/lib/lua/compiler/prototype.ex +++ b/lib/lua/compiler/prototype.ex @@ -20,7 +20,8 @@ defmodule Lua.Compiler.Prototype do is_vararg: boolean(), max_registers: non_neg_integer(), source: binary(), - lines: {non_neg_integer(), non_neg_integer()} + lines: {non_neg_integer(), non_neg_integer()}, + compiled_module: {module(), atom()} | nil } defstruct instructions: [], @@ -31,7 +32,8 @@ defmodule Lua.Compiler.Prototype do max_registers: 0, source: <<"-no-source-">>, lines: {0, 0}, - varargs: [] + varargs: [], + compiled_module: nil @doc """ Creates a new prototype with the given options. diff --git a/lib/lua/util.ex b/lib/lua/util.ex index 3ef75d8..f4e5d52 100644 --- a/lib/lua/util.ex +++ b/lib/lua/util.ex @@ -16,6 +16,7 @@ defmodule Lua.Util do def encoded?(number) when is_number(number), do: true def encoded?({:tref, _}), do: true def encoded?({:lua_closure, _, _}), do: true + def encoded?({:compiled_closure, _, _, _, _}), do: true def encoded?({:native_func, _}), do: true def encoded?({:udref, _}), do: true def encoded?(_), do: false diff --git a/lib/lua/vm.ex b/lib/lua/vm.ex index 80184d9..440bbcd 100644 --- a/lib/lua/vm.ex +++ b/lib/lua/vm.ex @@ -13,9 +13,22 @@ defmodule Lua.VM do Executes a compiled prototype. Returns {:ok, results, state} on success. 
+ + When `proto.compiled_module` is set (the Erlang codegen accepted + the prototype) execution dispatches directly to the loaded BEAM + module. Otherwise the interpreter executes the instruction stream + as usual. """ @spec execute(Prototype.t(), State.t()) :: {:ok, list(), State.t()} - def execute(%Prototype{} = proto, state \\ State.new()) do + def execute(%Prototype{compiled_module: {mod, fun}}, state) do + # No upvalues at the top-level chunk; the chunk's `_ENV` is set up + # at codegen time via the upvalue chain on inner prototypes. For + # the chunk itself, pass an empty upvalues tuple. + {results, final_state} = apply(mod, fun, [[], {}, state]) + {:ok, results, final_state} + end + + def execute(%Prototype{} = proto, state) do # Create register file sized to the prototype's needs. # The +16 buffer covers multi-return expansion slots that the codegen doesn't # always track in max_registers (call results can land beyond the stated max). @@ -27,4 +40,6 @@ defmodule Lua.VM do {:ok, results, final_state} end + + def execute(%Prototype{} = proto), do: execute(proto, State.new()) end diff --git a/lib/lua/vm/display.ex b/lib/lua/vm/display.ex index dbe16b7..b4f2c56 100644 --- a/lib/lua/vm/display.ex +++ b/lib/lua/vm/display.ex @@ -97,6 +97,10 @@ defmodule Lua.VM.Display do wrap_closure(ref) end + def wrap_value({:compiled_closure, _, _, _, _} = ref, _state, _decode?) do + wrap_closure(ref) + end + def wrap_value({:native_func, fun} = ref, _state, _decode?) do %NativeFunc{fun: fun, ref: ref} end @@ -129,6 +133,18 @@ defmodule Lua.VM.Display do } end + defp wrap_closure({:compiled_closure, _mod, _fun, _upvalues, proto} = ref) do + {first_line, _last_line} = proto.lines || {0, 0} + + %Closure{ + source: proto.source, + line: first_line, + arity: proto.param_count, + vararg?: proto.is_vararg, + ref: ref + } + end + # Build a `peek` value for an unencoded table reference. Sequences # (1..N keys) render as a list; mixed-key tables render as a map. 
# Nested tables/closures are recursively wrapped so `Inspect` does diff --git a/lib/lua/vm/executor.ex b/lib/lua/vm/executor.ex index dc291b0..2b7d9d7 100644 --- a/lib/lua/vm/executor.ex +++ b/lib/lua/vm/executor.ex @@ -134,6 +134,17 @@ defmodule Lua.VM.Executor do end end + # Compiled prototype — dispatched directly to a BEAM module generated by + # Lua.Compiler.Erlang. Bypasses register-tuple construction entirely. + # The upvalues tuple threads through the same way as for a :lua_closure + # so opcode-level upvalue resolution stays consistent across compiled + # and interpreted prototypes. The trailing `_proto` element is the + # source `%Prototype{}` carried for introspection (used by Display, + # debug.getinfo, etc.) — execution itself only needs the module. + def call_function({:compiled_closure, mod, fun, upvalues, _proto}, args, state) do + apply(mod, fun, [args, upvalues, state]) + end + def call_function(nil, _args, _state) do raise TypeError, value: "attempt to call a nil value", @@ -166,6 +177,97 @@ defmodule Lua.VM.Executor do end end + # ── Public dispatch helpers for compiled prototypes ──────────────────────── + # + # Compiled code generated by `Lua.Compiler.Erlang` calls into these + # functions for the slow paths of arithmetic, comparison, and unary ops. + # `@doc false` keeps them out of the user-facing API. Fast paths + # (integer-integer add/sub/mul) are inlined into the compiled module + # bodies; only the slow paths come here. + + @doc false + # Calls into `call_function/3` after stashing `(line, source)` in the + # process dict so native callbacks (assert/error/stdlib raises) pick + # up the correct caller position. Mirrors the interpreter's + # `:native_func` branch in the `:call` opcode. + # + # Lua-to-Lua and compiled-to-compiled calls skip the bridge entirely + # — only `:native_func` invocations need the process-dict + # bookkeeping. Pure-Lua call chains pay nothing on the success path. 
+ @spec call_function_with_position(term(), list(), State.t(), integer(), binary()) :: + {list(), State.t()} + def call_function_with_position({:lua_closure, _, _} = callable, args, state, _line, _source) do + call_function(callable, args, state) + end + + def call_function_with_position({:compiled_closure, _, _, _, _} = callable, args, state, _line, _source) do + call_function(callable, args, state) + end + + def call_function_with_position(callable, args, state, line, source) do + prev_pos = Process.get(@position_key, @unset) + set_position(line, source) + + try do + call_function(callable, args, state) + after + restore_position(prev_pos) + end + end + + @doc false + @spec apply_arith_op(atom(), term(), term(), State.t(), integer(), binary()) :: + {term(), State.t()} + def apply_arith_op(op, a, b, state, line, source) do + {mm_name, safe_fn} = + case op do + :add -> {"__add", fn -> safe_add(a, b, line, source) end} + :subtract -> {"__sub", fn -> safe_subtract(a, b, line, source) end} + :multiply -> {"__mul", fn -> safe_multiply(a, b, line, source) end} + :divide -> {"__div", fn -> safe_divide(a, b, line, source) end} + :floor_divide -> {"__idiv", fn -> safe_floor_divide(a, b, line, source) end} + :modulo -> {"__mod", fn -> safe_modulo(a, b, line, source) end} + :power -> {"__pow", fn -> safe_power(a, b, line, source) end} + end + + try_binary_metamethod(mm_name, a, b, state, safe_fn) + end + + @doc false + @spec apply_unary_op(atom(), term(), State.t(), integer(), binary()) :: + {term(), State.t()} + def apply_unary_op(:negate, a, state, line, source) do + try_unary_metamethod("__unm", a, state, fn -> safe_negate(a, line, source) end) + end + + @doc false + @spec apply_compare_op(atom(), term(), term(), State.t(), integer(), binary()) :: + {boolean(), State.t()} + def apply_compare_op(:equal, a, b, state, _line, _source) do + try_equality_metamethod(a, b, state, fn -> lua_equal(a, b) end) + end + + def apply_compare_op(:not_equal, a, b, state, _line, _source) do 
+ {eq, state} = try_equality_metamethod(a, b, state, fn -> lua_equal(a, b) end) + {not eq, state} + end + + def apply_compare_op(:less_than, a, b, state, line, source) do + try_binary_metamethod("__lt", a, b, state, fn -> safe_compare_lt(a, b, line, source) end) + end + + def apply_compare_op(:less_equal, a, b, state, line, source) do + compare_le(a, b, state, line, source) + end + + def apply_compare_op(:greater_than, a, b, state, line, source) do + try_binary_metamethod("__lt", b, a, state, fn -> safe_compare_lt(b, a, line, source) end) + end + + def apply_compare_op(:greater_equal, a, b, state, line, source) do + compare_le(b, a, state, line, source) + end + # ── Break ────────────────────────────────────────────────────────────────── defp do_execute([:break | _rest], regs, upvalues, proto, state, cont, frames, line) do @@ -574,7 +676,14 @@ defmodule Lua.VM.Executor do end) captured_upvalues = Enum.reverse(captured_upvalues_reversed) - closure = {:lua_closure, nested_proto, List.to_tuple(captured_upvalues)} + upvalues_tuple = List.to_tuple(captured_upvalues) + + closure = + case nested_proto.compiled_module do + {mod, fun} -> {:compiled_closure, mod, fun, upvalues_tuple, nested_proto} + nil -> {:lua_closure, nested_proto, upvalues_tuple} + end + regs = put_elem(regs, dest, closure) do_execute(rest, regs, upvalues, proto, state, cont, frames, line) end @@ -654,6 +763,15 @@ defmodule Lua.VM.Executor do line ) + {:compiled_closure, mod, fun, callee_upvalues, _callee_proto} -> + # Compiled prototype — bypass register-tuple construction entirely. + # The compiled module receives (args, upvalues, state) and returns + # {results, state}. Upvalues thread through just like for a + # :lua_closure. 
+ args = collect_args(regs, base + 1, total_args) + {results, state} = apply(mod, fun, [args, callee_upvalues, state]) + continue_after_call(results, regs, rest, upvalues, proto, state, cont, frames, line, base, result_count) + {:native_func, fun} -> # Native callbacks still consume args as a list — materialize it here. args = collect_args(regs, base + 1, total_args) @@ -1690,6 +1808,10 @@ defmodule Lua.VM.Executor do {results, state} end + defp call_value({:compiled_closure, _, _, _, _} = closure, args, _proto, state, _line) do + call_function(closure, args, state) + end + defp call_value({:native_func, fun}, args, proto, state, line) do # Same source-position bridge as the `:call` opcode's native dispatch. # Used by `for` loop iteration when the iterator is native. @@ -1779,11 +1901,12 @@ defmodule Lua.VM.Executor do defp get_metatable(_value, _state), do: nil - defp index_value({:tref, _} = tref, key, state, _line, _source, _name_hint) do + @doc false + def index_value({:tref, _} = tref, key, state, _line, _source, _name_hint) do table_index(tref, key, state) end - defp index_value(value, key, state, line, source, name_hint) do + def index_value(value, key, state, line, source, name_hint) do case get_metatable(value, state) do nil -> raise_index_type_error(value, line, source, name_hint) @@ -2057,6 +2180,10 @@ defmodule Lua.VM.Executor do {results, new_state} = call_function(func, args, state) {List.first(results), new_state} + {:compiled_closure, _, _, _, _} = func -> + {results, new_state} = call_function(func, args, state) + {List.first(results), new_state} + _ -> {default_fn.(), state} end @@ -2400,6 +2527,7 @@ defmodule Lua.VM.Executor do defp value_type(v) when is_binary(v), do: :string defp value_type({:tref, _}), do: :table defp value_type({:lua_closure, _, _}), do: :function + defp value_type({:compiled_closure, _, _, _, _}), do: :function defp value_type({:native_func, _}), do: :function defp value_type(_), do: :unknown diff --git 
a/lib/lua/vm/stdlib.ex b/lib/lua/vm/stdlib.ex index dd9d8c5..265bd91 100644 --- a/lib/lua/vm/stdlib.ex +++ b/lib/lua/vm/stdlib.ex @@ -421,6 +421,10 @@ defmodule Lua.VM.Stdlib do load_from_reader(reader, state) end + defp lua_load([{:compiled_closure, _, _, _, _} = reader | _rest], state) do + load_from_reader(reader, state) + end + defp lua_load([{:native_func, _} = reader | _rest], state) do load_from_reader(reader, state) end @@ -476,7 +480,13 @@ defmodule Lua.VM.Stdlib do # Compiler currently never returns errors, always succeeds — see # `Lua.Compiler.compile!/2` for the matching note. {:ok, prototype} = Lua.Compiler.compile(ast) - closure = {:lua_closure, prototype, {}} + + closure = + case prototype.compiled_module do + {mod, fun} -> {:compiled_closure, mod, fun, {}, prototype} + nil -> {:lua_closure, prototype, {}} + end + {[closure], state} {:error, reason} -> diff --git a/lib/lua/vm/stdlib/debug.ex b/lib/lua/vm/stdlib/debug.ex index d54ab8c..bb05e26 100644 --- a/lib/lua/vm/stdlib/debug.ex +++ b/lib/lua/vm/stdlib/debug.ex @@ -67,6 +67,18 @@ defmodule Lua.VM.Stdlib.Debug do "isvararg" => if(Map.get(proto, :is_vararg, false), do: true, else: false) } + {:compiled_closure, _mod, _fun, _upvalues, proto} -> + %{ + "source" => Map.get(proto, :source, "=?"), + "currentline" => -1, + "what" => "Lua", + "name" => nil, + "linedefined" => elem(Map.get(proto, :lines, {0, 0}), 0), + "lastlinedefined" => elem(Map.get(proto, :lines, {0, 0}), 1), + "nparams" => Map.get(proto, :param_count, 0), + "isvararg" => if(Map.get(proto, :is_vararg, false), do: true, else: false) + } + {:native_func, _} -> %{ "source" => "=[C]", diff --git a/lib/lua/vm/stdlib/string.ex b/lib/lua/vm/stdlib/string.ex index 21c86b0..fcddfdb 100644 --- a/lib/lua/vm/stdlib/string.ex +++ b/lib/lua/vm/stdlib/string.ex @@ -776,7 +776,8 @@ defmodule Lua.VM.Stdlib.String do {value, st} end - match?({:lua_closure, _, _}, repl) or match?({:native_func, _}, repl) -> + match?({:lua_closure, _, _}, repl) or 
match?({:compiled_closure, _, _, _, _}, repl) or + match?({:native_func, _}, repl) -> fn args, st -> {results, st} = Executor.call_function(repl, args, st) result = List.first(results) diff --git a/lib/lua/vm/stdlib/util.ex b/lib/lua/vm/stdlib/util.ex index 3523e99..9005801 100644 --- a/lib/lua/vm/stdlib/util.ex +++ b/lib/lua/vm/stdlib/util.ex @@ -12,6 +12,7 @@ defmodule Lua.VM.Stdlib.Util do def typeof(v) when is_binary(v), do: "string" def typeof({:tref, _}), do: "table" def typeof({:lua_closure, _, _}), do: "function" + def typeof({:compiled_closure, _, _, _, _}), do: "function" def typeof({:native_func, _}), do: "function" def typeof(_), do: "unknown" diff --git a/lib/lua/vm/value.ex b/lib/lua/vm/value.ex index 250407c..74e4176 100644 --- a/lib/lua/vm/value.ex +++ b/lib/lua/vm/value.ex @@ -21,6 +21,7 @@ defmodule Lua.VM.Value do def type_name(v) when is_binary(v), do: "string" def type_name({:tref, _}), do: "table" def type_name({:lua_closure, _, _}), do: "function" + def type_name({:compiled_closure, _, _, _, _}), do: "function" def type_name({:native_func, _}), do: "function" def type_name({:udref, _}), do: "userdata" def type_name(_), do: "userdata" @@ -58,6 +59,7 @@ defmodule Lua.VM.Value do def to_string({:tref, id}), do: "table: 0x#{String.pad_leading(Integer.to_string(id, 16), 14, "0")}" def to_string({:lua_closure, _, _}), do: "function" + def to_string({:compiled_closure, _, _, _, _}), do: "function" def to_string({:native_func, _}), do: "function" def to_string(other), do: inspect(other) diff --git a/test/lua/vm/display_test.exs b/test/lua/vm/display_test.exs index b0c2ffd..2c3878e 100644 --- a/test/lua/vm/display_test.exs +++ b/test/lua/vm/display_test.exs @@ -73,16 +73,23 @@ defmodule Lua.VM.DisplayTest do line: 1, arity: 2, vararg?: false, - ref: {:lua_closure, _, _} + ref: ref } = c + assert match?({:lua_closure, _, _}, ref) or + match?({:compiled_closure, _, _, _, _}, ref) + assert inspect(c) == "#Lua.Closure\", line: 1, arity: 2>" end test 
"wraps Lua closures returned in decode: false mode" do {[c], _} = Lua.eval!(Lua.new(), "return function() end", decode: false) - assert %Closure{ref: {:lua_closure, _, _}} = c + assert %Closure{ref: ref} = c + + assert match?({:lua_closure, _, _}, ref) or + match?({:compiled_closure, _, _, _, _}, ref) + assert inspect(c) =~ "#Lua.Closure<" end @@ -157,8 +164,10 @@ defmodule Lua.VM.DisplayTest do test "returns the underlying lua_closure for closures" do {[c], _} = Lua.eval!(Lua.new(), "return function() end") + unwrapped = Lua.unwrap(c) - assert match?({:lua_closure, _, _}, Lua.unwrap(c)) + assert match?({:lua_closure, _, _}, unwrapped) or + match?({:compiled_closure, _, _, _, _}, unwrapped) end test "returns the underlying native_func for native funcs" do From 4a7cfac739cb4ff07555ad142d85abe7a46390e1 Mon Sep 17 00:00:00 2001 From: Dave Lucia Date: Fri, 22 May 2026 08:49:58 -0700 Subject: [PATCH 3/3] chore(B5a): mark plan as review and record discoveries --- .../plans/B5a-erlang-codegen-foundation.md | 75 ++++++++++++++++++- 1 file changed, 72 insertions(+), 3 deletions(-) diff --git a/.agents/plans/B5a-erlang-codegen-foundation.md b/.agents/plans/B5a-erlang-codegen-foundation.md index c11e930..3807c36 100644 --- a/.agents/plans/B5a-erlang-codegen-foundation.md +++ b/.agents/plans/B5a-erlang-codegen-foundation.md @@ -2,10 +2,10 @@ id: B5a title: Erlang codegen foundation — compile arithmetic + control flow prototypes to BEAM modules issue: null -pr: null +pr: 235 branch: perf/erlang-codegen-foundation base: main -status: in-progress +status: review direction: B unlocks: - B5b (lifecycle), B5c (tables), B5d (closures), B5e (errors) @@ -360,4 +360,73 @@ IO.puts("table fallback OK") ## Discoveries -(populated during implementation) +### Perf reality vs spike — the 5x target was not hit + +Spike measured 12.4x faster than interpreter on fib(25). Production +codegen achieves only ~1.4x faster on fib(30) (1.07x vs Luerl). The +gap traces to three sources: + +1. 
**`throw/catch` for non-tail `:return`** — every `:return` inside + a `:test` branch becomes `throw({:b5_return, _, _})` caught at the + function entry. Spike fib uses Erlang clause-matching to express + the base case, so it never throws. Tail-position `:return` is now + optimised to a natural return, saving roughly half the throws on + fib (the recursive-case return). Returns inside branches still + throw — fib hits this on every base-case exit. + +2. **`setelement/3` per register write** — 22% of profile time, ~2.2M + calls for fib(25). Equivalent to the interpreter's register-tuple + cost; eliminated only by SSA promotion of registers to Erlang + variables (deferred follow-up). + +3. **Slow-path fallback for `apply_arith_op` etc.** — the integer + fast path is inlined for `:add`/`:subtract`/`:multiply` and + comparison, but `:divide` and friends always call into Executor. + For fib all arithmetic stays on the fast path, so this is small. + `apply_compare_op` is consulted only for `:equal`/`:not_equal`. + +### Sub-prototype compile-status cascade + +Original B5 plan said "if any sub-prototype falls back, the parent +falls back too." Spike honoured this rule. Real-world Lua almost +always wraps function definitions in chunks that use unsupported +opcodes (`:set_field` for `function f(...) end` writing to `_ENV`). +That cascade made every compile-eligible function fall back. + +Fix: sub-prototypes compile independently. The parent's `:closure` +opcode (interpreter side, since `:closure` itself isn't B5a-covered +yet) checks `nested_proto.compiled_module` and emits either +`{:compiled_closure, ...}` or `{:lua_closure, ...}`. After this +change fib's `function fib(...)` compiles even though the chunk that +defines it doesn't. + +### `:compiled_closure` is a 5-tuple, not 4 + +Initial design: `{:compiled_closure, mod, fun, upvalues}`. Display +needed the prototype back (for source/line/arity metadata).
Rather +than carry a separate proto lookup table, the value tuple gained a +5th element holding the source `%Prototype{}`. Execution itself +ignores it; only Display and `debug.getinfo` use it. + +### `unsafe_var` lint warning in some `:test` shapes + +When a `:test` branch writes a register and the function continues +past the branch, Erlang's lint reports `unsafe_var` (the register +variable is "exported" from a case branch). Currently those +prototypes fail to load and fall back. The `:test` lowering should +fork ctx per branch and emit phi-style register reconciliation; +deferred to a follow-up. + +### Open-cell upvalue lowering needed per-clause variables + +`:get_open_upvalue` initially used `:__OpenCellRef` as the bind name +in both case clauses. Erlang's lint flagged this as unsafe (variable +defined in one clause used in another). Fixed by minting a fresh +per-call atom with an `OpenRef_` prefix. + +### Lua binary literals must round-trip byte-by-byte + +`String.to_charlist/1` raises on non-UTF-8 binaries. Lua strings can +hold arbitrary bytes. The codegen's binary-literal lowering now emits +each byte as a separate `bin_element` rather than going through the +string-as-charlist encoding.
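
The byte-by-byte lowering described in the last discovery can be sketched in isolation. This is a minimal illustration, not the codegen's actual code: the module `BinLiteralSketch` and its functions `to_form/2` and `round_trip/1` are hypothetical names introduced here, and the abstract form is evaluated with the standard library's `:erl_eval.expr/2` only to show that arbitrary bytes survive the trip.

```elixir
defmodule BinLiteralSketch do
  # Hypothetical helper: lowers a binary to an Erlang abstract `bin` form,
  # one `bin_element` per byte, so non-UTF-8 bytes are preserved as-is.
  def to_form(value, line) when is_binary(value) do
    elements =
      for <<byte <- value>> do
        {:bin_element, line, {:integer, line, byte}, :default, :default}
      end

    {:bin, line, elements}
  end

  # Evaluates the abstract form with no bindings and returns the
  # resulting term, which should equal the input binary.
  def round_trip(value) do
    {:value, result, _bindings} = :erl_eval.expr(to_form(value, 1), [])
    result
  end
end

# 255 is not valid UTF-8 on its own, so a charlist-based encoding would raise here.
raw = <<255, 0, 104, 105>>
true = BinLiteralSketch.round_trip(raw) == raw
```

Because each element is an integer literal with default size and type, the form never passes through a Unicode conversion, which is the property the discovery relies on.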