diff --git a/ROADMAP.md b/ROADMAP.md index 666a452..443fa5b 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -2,12 +2,14 @@ This is the strategic overview. For per-PR detail, see [`.agents/plans/`](.agents/plans). -## Status: 2026-05-06 +## Status: 2026-05-21 -- **Unit tests**: 1,420 passing, 0 failing, 31 skipped. -- **Lua 5.3 official suite**: 5/29 files passing (`simple_test.lua`, - `api.lua`, `bitwise.lua`, `code.lua`, `vararg.lua`). -- **Current focus**: [Direction A — Suite Triage](#in-flight-direction-a--suite-triage-milestone-100). +- **Unit tests**: 1,705 passing, 0 failing, 30 skipped. +- **Lua 5.3 official suite**: 6/29 files passing (`simple_test.lua`, + `api.lua`, `bitwise.lua`, `code.lua`, `tpack.lua`, `vararg.lua`). +- **Current focus**: post-B-series consolidation. See [Direction B — + Performance](#direction-b--performance-1-0-x) for what was tried, + what shipped, and what we learned about the limits. ## Done @@ -30,6 +32,13 @@ The new Elixir-native VM (replacing Luerl) is built up through: - O(N²) → O(N) upvalue collection in closure handler (PR #154). - O(1) upvalue access by storing upvalues as a tuple (PR #155). - Fully tail-recursive CPS executor with line tracking off heap (PR #156). + - Fast-path executor dispatch (numeric arith, comparisons, string + concat, `get_field` / `set_field`) (PR #223). + - In-range fast path for `Numeric.to_signed_int64/1` (B8, PR #227). + -3% on fib(30). + - Bench harness: quick mode + multi-n inputs via + `LUA_BENCH_MODE` (PR #230). 17 min → 80 s for the full suite in + quick mode; full mode preserved for publishable numbers. ## In flight: Direction A — Suite Triage (milestone `1.0.0`) @@ -72,17 +81,102 @@ under the [`0.5.0` milestone](https://github.com/tv-labs/lua/milestone/1). - **A12**: README and CHANGELOG for 1.0.0-rc.1. - **A13**: Cut `1.0.0-rc.1` (blocked on the rest). -## Next: Direction B — Performance (milestone `1.0.x`) - -Several B-direction wins shipped already (PRs #153–#156). What remains: - -- **B1**: Drop `source_line` instructions in non-debug compilation. -- **B2**: Codegen peephole pass (fold `load_constant N k; move M N` → `load_constant M k`). -- **B3**: Re-baseline benchmarks against Luerl and PUC-Lua. Decide whether further - architectural work (e.g. flat instruction stream + PC dispatch) is justified. - -Per-PR plans land in [`.agents/plans/B*.md`](.agents/plans) when Direction A -wraps. +## Direction B — Performance (`1.0.x`) + +Several B-direction wins landed early on (PRs #153–#156, #223). The +B4–B8 sweep in May 2026 then attempted four larger architectural +levers; the results are summarised here so the lessons survive the +ephemeral plan files. + +### Shipped + +- **B8 — Numeric narrowing fast path** (PR #227). Guard-clause short + circuits `Numeric.to_signed_int64/1` for in-range integers. + −3.3% on fib(30) chunk, no regressions. The realised win came + entirely from the guard short-circuit; `@compile {:inline, ...}` + does not cross module boundaries, so the cross-module call sites in + `Executor` / `Value` still trip a function boundary. +- **Bench harness rework** (PR #230). `LUA_BENCH_MODE=quick` (default) + cuts the full suite from ~17 min to ~80 s; `LUA_BENCH_MODE=full` + preserves the long windows plus a multi-`n` sweep (`{10, 100, + 1000}`) for the table workloads. This harness is what surfaced B7's + scale regression — the single-`n` measurement we had before would + have hidden it. + +### Tried and deferred (with findings) + +- **B6 — Eliminate per-tref `Map.fetch!` re-resolution.** Deferred in + PR #229 / #231. Post-PR #223 profile no longer supports the + hypothesis: `Map.get` is ~3.3% on fib(22) and ~0.04% on table_build. + The earlier headline number (~6.4%) was absorbed by the fast-path + work in PR #223. The remaining audit cleanup is worth doing later + as a refactor, not as a perf plan. +- **B7 — Array + hash split for `Lua.VM.Table`.** Implemented in PR + #229, closed unmerged. Wins at small `n` (-14% to -21% at `n=100`), + loses badly at large `n` (+30% to +40% at `n=1000`). Memory + regresses 3-5x at `n=1000`. The crossover is structural: BEAM + tuples are immutable, so every `setelement/3` on a 1024-cell tuple + copies the whole tuple. PUC-Lua avoids this with in-place mutation + in C; we cannot. A future plan could revisit with + *threshold-based promotion* (stay in the data map until + `array_len ≥ N`, then promote) — the small-`n` wins are real and + worth preserving if the regression can be avoided. +- **B4 — Flat instruction stream + PC dispatch.** Implemented end-to- + end on a throwaway branch (all 1705 tests + 29 lua53 suite tests + passed), closed unmerged (PR #233 records the findings). fib(30) + regressed 3%; `do_execute` self-time was unchanged (50.6% vs main's + 50.8%). On the BEAM, `[head | rest]` head-match destructures + head + tail in one op while `case :erlang.element(pc + 1, instrs) + do` is two ops (fetch + case discriminate); the hoped-for jump- + table optimization did not produce a net win. The + `Lua.Compiler.Linearize` design that the implementation used is + reusable as a **compile-time** input to B5 without affecting the + runtime executor. + +### What we learned + +- **Measure against today's profile, not the plan's old profile.** + B6's hypothesis was already obsolete when we got to it — PR #223 + had absorbed the win. Each B-plan should re-baseline before + starting. +- **Multi-`n` measurement is essential for table workloads.** A + single `n=500` data point is right on the BEAM-tuple-copy crossover + for B7-style array promotion; either side of that crossover tells + a completely different story. The bench harness rework was net + positive for the rest of the series — without it the B7 regression + at scale would have shipped. +- **BEAM optimisations are subtle.** `[head | rest]` head-matching is + heavily optimized and is hard to beat with `case`-on-tuple-element. + `@compile {:inline, ...}` does not cross module boundaries. + Refactors that *should* help on theoretical grounds may not on the + BEAM specifically; we have to measure. +- **Immutable data structures bound how fast we can be.** B7 hit this + with `setelement/3` on large tuples. The same constraint shapes + what B5 can deliver — register-tuple `setelement/3` is still 25% + of every workload's profile and the BEAM gives us no way around + that without going outside the VM (NIFs, ETS, persistent_term). + +### Remaining lever: B5 — Compile prototypes to Erlang functions + +B5 is the architectural lever for serious throughput: translate each +`%Lua.Compiler.Prototype{}` to an Erlang function body and call +`:compile.forms/2`, letting the BEAM JIT (BEAMASM on OTP 25+) +natively optimize the hot path. Plan stretch: fib parity with Luerl +(±5%). Plan: +[`.agents/plans/B5-compile-prototypes-to-erlang.md`](.agents/plans/B5-compile-prototypes-to-erlang.md). + +B4's deferral does not block B5: the `Lua.Compiler.Linearize` +implementation from B4 can be reintroduced as a compile-time +preparation step (feeding B5's codegen flat bytecode) without +touching the runtime executor. + +B5 is larger than B4 — full Erlang-AST codegen, module compile / load +/ purge lifecycle, fallback path for opcodes not yet translated. The +plan acknowledges that landing the framework is itself a +multi-month effort. Default position until a clear motivating +workload appears: **paused, with the implementation findings above +documenting why incremental dispatch-shape work is unlikely to move +the needle**. ## Deferred (intentional, not in 1.0)