126 changes: 110 additions & 16 deletions ROADMAP.md
@@ -2,12 +2,14 @@

This is the strategic overview. For per-PR detail, see [`.agents/plans/`](.agents/plans).

## Status: 2026-05-06
## Status: 2026-05-21

- **Unit tests**: 1,420 passing, 0 failing, 31 skipped.
- **Lua 5.3 official suite**: 5/29 files passing (`simple_test.lua`,
`api.lua`, `bitwise.lua`, `code.lua`, `vararg.lua`).
- **Current focus**: [Direction A — Suite Triage](#in-flight-direction-a--suite-triage-milestone-100).
- **Unit tests**: 1,705 passing, 0 failing, 30 skipped.
- **Lua 5.3 official suite**: 6/29 files passing (`simple_test.lua`,
`api.lua`, `bitwise.lua`, `code.lua`, `tpack.lua`, `vararg.lua`).
- **Current focus**: post-B-series consolidation. See [Direction B —
Performance](#direction-b--performance-1-0-x) for what was tried,
what shipped, and what we learned about the limits.

## Done

@@ -30,6 +32,13 @@ The new Elixir-native VM (replacing Luerl) is built up through:
- O(N²) → O(N) upvalue collection in closure handler (PR #154).
- O(1) upvalue access by storing upvalues as a tuple (PR #155).
- Fully tail-recursive CPS executor with line tracking off heap (PR #156).
- Fast-path executor dispatch (numeric arith, comparisons, string
concat, `get_field` / `set_field`) (PR #223).
- In-range fast path for `Numeric.to_signed_int64/1` (B8, PR #227).
-3% on fib(30).
- Bench harness: quick mode + multi-n inputs via
`LUA_BENCH_MODE` (PR #230). 17 min → 80 s for the full suite in
quick mode; full mode preserved for publishable numbers.
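
The in-range fast path listed above for `Numeric.to_signed_int64/1` can be illustrated roughly as follows. This is a minimal sketch, not the project's code: the module name `NarrowSketch` is hypothetical, and the wrap-around slow path is an assumption based on Lua 5.3's 64-bit integer semantics.

```elixir
defmodule NarrowSketch do
  # Hypothetical stand-in for the real Numeric module.
  import Bitwise

  @min -0x8000_0000_0000_0000
  @max 0x7FFF_FFFF_FFFF_FFFF

  # Fast path: an integer already inside the signed 64-bit range is
  # returned untouched, so the guard is the only work done.
  def to_signed_int64(n) when is_integer(n) and n >= @min and n <= @max, do: n

  # Slow path (assumed behaviour): wrap out-of-range integers modulo 2^64
  # and reinterpret the result as a signed value.
  def to_signed_int64(n) when is_integer(n) do
    wrapped = band(n, 0xFFFF_FFFF_FFFF_FFFF)
    if wrapped > @max, do: wrapped - 0x1_0000_0000_0000_0000, else: wrapped
  end
end
```

The clause order matters here: the guarded clause is tried first, so in-range values never reach the masking arithmetic.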

## In flight: Direction A — Suite Triage (milestone `1.0.0`)

@@ -72,17 +81,102 @@ under the [`0.5.0` milestone](https://github.com/tv-labs/lua/milestone/1).
- **A12**: README and CHANGELOG for 1.0.0-rc.1.
- **A13**: Cut `1.0.0-rc.1` (blocked on the rest).

## Next: Direction B — Performance (milestone `1.0.x`)

Several B-direction wins shipped already (PRs #153–#156). What remains:

- **B1**: Drop `source_line` instructions in non-debug compilation.
- **B2**: Codegen peephole pass (fold `load_constant N k; move M N` → `load_constant M k`).
- **B3**: Re-baseline benchmarks against Luerl and PUC-Lua. Decide whether further
architectural work (e.g. flat instruction stream + PC dispatch) is justified.

Per-PR plans land in [`.agents/plans/B*.md`](.agents/plans) when Direction A
wraps.
## Direction B — Performance (`1.0.x`)

Several B-direction wins landed early on (PRs #153–#156, #223). The
B4–B8 sweep in May 2026 then attempted four larger architectural
levers; the results are summarised here so the lessons survive the
ephemeral plan files.

### Shipped

- **B8 — Numeric narrowing fast path** (PR #227). Guard-clause short
circuits `Numeric.to_signed_int64/1` for in-range integers.
−3.3% on the fib(30) chunk, no regressions. The realised win came
entirely from the guard short-circuit: `@compile {:inline, ...}`
only inlines calls within the defining module, so the call sites in
`Executor` / `Value` still make a full cross-module call.
- **Bench harness rework** (PR #230). `LUA_BENCH_MODE=quick` (default)
cuts the full suite from ~17 min to ~80 s; `LUA_BENCH_MODE=full`
preserves the long windows plus a multi-`n` sweep (`{10, 100,
1000}`) for the table workloads. This harness is what surfaced B7's
scale regression — the single-`n` measurement we had before would
have hidden it.
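
As a rough illustration of the mode switch described above, a harness might resolve `LUA_BENCH_MODE` along these lines. The module `BenchModeSketch`, its function names, and the single quick-mode size are hypothetical; only the variable name, the `quick` default, and the `{10, 100, 1000}` sweep come from the text.

```elixir
defmodule BenchModeSketch do
  # Hypothetical helper: resolves the bench mode from LUA_BENCH_MODE.
  # Anything other than "full" falls back to :quick (the stated default).
  def mode(env \\ System.get_env("LUA_BENCH_MODE")) do
    case env do
      "full" -> :full
      _ -> :quick
    end
  end

  # Full mode sweeps several table sizes so scale regressions are visible.
  def table_sizes(:full), do: [10, 100, 1000]

  # Assumption: quick mode measures a single size to keep runs short.
  def table_sizes(:quick), do: [100]
end
```

Keeping the size list behind a function makes it easy to report results per `n` rather than as one aggregate figure.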

### Tried and deferred (with findings)

- **B6 — Eliminate per-tref `Map.fetch!` re-resolution.** Deferred in
PR #229 / #231. The profile taken after PR #223 no longer supports
the hypothesis: `Map.get` is ~3.3% on fib(22) and ~0.04% on
table_build. The earlier headline number (~6.4%) was absorbed by the
fast-path work in PR #223. The remaining audit cleanup is worth doing
later as a refactor, not as a perf plan.
- **B7 — Array + hash split for `Lua.VM.Table`.** Implemented in PR
#229, closed unmerged. Wins at small `n` (-14% to -21% at `n=100`),
loses badly at large `n` (+30% to +40% at `n=1000`). Memory
regresses 3-5x at `n=1000`. The crossover is structural: BEAM
tuples are immutable, so every `setelement/3` on a 1024-cell tuple
copies the whole tuple. PUC-Lua avoids this with in-place mutation
in C; we cannot. A future plan could revisit with
*threshold-based promotion* (stay in the data map until
`array_len ≥ N`, then promote) — the small-`n` wins are real and
worth preserving if the regression can be avoided.
- **B4 — Flat instruction stream + PC dispatch.** Implemented end-to-
end on a throwaway branch (all 1705 tests + 29 lua53 suite tests
passed), closed unmerged (PR #233 records the findings). fib(30)
regressed 3%; `do_execute` self-time was unchanged (50.6% vs main's
50.8%). On the BEAM, `[head | rest]` head-match destructures
head + tail in one op while `case :erlang.element(pc + 1, instrs)
do` is two ops (fetch + case discriminate); the hoped-for jump-
table optimization did not produce a net win. The
`Lua.Compiler.Linearize` design that the implementation used is
reusable as a **compile-time** input to B5 without affecting the
runtime executor.
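
The two dispatch shapes compared in the B4 finding can be sketched on a toy instruction set. Everything below is hypothetical (a two-opcode stack machine, not the project's register-based executor); it only shows what "head-match on a list" and "`case` on a tuple element at `pc`" look like side by side.

```elixir
defmodule DispatchSketch do
  # Shape 1: instructions as a list. A single head-match binds both the
  # current instruction and the remaining stream.
  def run_list([{:push, n} | rest], stack), do: run_list(rest, [n | stack])
  def run_list([:add | rest], [b, a | stack]), do: run_list(rest, [a + b | stack])
  def run_list([], [top | _]), do: top

  # Shape 2: instructions as a tuple indexed by a program counter. Each
  # step fetches the element, then discriminates on it with `case`.
  def run_pc(instrs, pc, stack) when pc >= tuple_size(instrs), do: hd(stack)

  def run_pc(instrs, pc, stack) do
    case :erlang.element(pc + 1, instrs) do
      {:push, n} ->
        run_pc(instrs, pc + 1, [n | stack])

      :add ->
        [b, a | rest] = stack
        run_pc(instrs, pc + 1, [a + b | rest])
    end
  end
end
```

Both functions compute the same result; which one is faster on a given workload is exactly the kind of question that has to be settled by profiling rather than by reading the code.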

### What we learned

- **Measure against today's profile, not the plan's old profile.**
B6's hypothesis was already obsolete when we got to it — PR #223
had absorbed the win. Each B-plan should re-baseline before
starting.
- **Multi-`n` measurement is essential for table workloads.** A
single `n=500` data point is right on the BEAM-tuple-copy crossover
for B7-style array promotion; either side of that crossover tells
a completely different story. The bench harness rework was net
positive for the rest of the series — without it the B7 regression
at scale would have shipped.
- **BEAM optimisations are subtle.** `[head | rest]` head-matching is
heavily optimized and is hard to beat with `case`-on-tuple-element.
`@compile {:inline, ...}` does not cross module boundaries.
Refactors that *should* help on theoretical grounds may not on the
BEAM specifically; we have to measure.
- **Immutable data structures bound how fast we can be.** B7 hit this
with `setelement/3` on large tuples. The same constraint shapes
what B5 can deliver — register-tuple `setelement/3` is still 25%
of every workload's profile and the BEAM gives us no way around
that without going outside the VM (NIFs, ETS, persistent_term).
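
The immutability constraint can be seen directly with `put_elem/3`, the Elixir counterpart of `setelement/3`. This is a minimal sketch with a hypothetical module name; the timing helper prints nothing and asserts nothing about speed, since the numbers depend on the OTP release and hardware.

```elixir
defmodule TupleCopySketch do
  # Builds a register file of `size` slots, all initialised to nil.
  def new(size), do: Tuple.duplicate(nil, size)

  # put_elem/3 returns a new tuple; the original is left unchanged.
  def write(regs, index, value), do: put_elem(regs, index, value)

  # Times `iterations` successive writes against a tuple of `size`
  # slots and returns the elapsed time in microseconds.
  def time_writes(size, iterations) do
    regs = new(size)

    {micros, _final} =
      :timer.tc(fn ->
        Enum.reduce(1..iterations, regs, fn i, acc ->
          put_elem(acc, rem(i, size), i)
        end)
      end)

    micros
  end
end
```

Comparing `TupleCopySketch.time_writes(16, 100_000)` with `TupleCopySketch.time_writes(1024, 100_000)` on a local machine is one way to check how per-write cost grows with tuple size.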

### Remaining lever: B5 — Compile prototypes to Erlang functions

B5 is the architectural lever for serious throughput: translate each
`%Lua.Compiler.Prototype{}` to an Erlang function body and call
`:compile.forms/2`, letting the BEAM JIT (BeamAsm, introduced in
OTP 24 for x86-64 and extended to AArch64 in OTP 25) compile the hot
path to native code. Stretch goal: fib parity with Luerl (±5%). Plan:
[`.agents/plans/B5-compile-prototypes-to-erlang.md`](.agents/plans/B5-compile-prototypes-to-erlang.md).

B4's deferral does not block B5: the `Lua.Compiler.Linearize`
implementation from B4 can be reintroduced as a compile-time
preparation step (feeding B5's codegen flat bytecode) without
touching the runtime executor.

B5 is larger than B4 — full Erlang-AST codegen, module compile / load
/ purge lifecycle, fallback path for opcodes not yet translated. The
plan acknowledges that landing the framework is itself a
multi-month effort. Default position until a clear motivating
workload appears: **paused, with the implementation findings above
documenting why incremental dispatch-shape work is unlikely to move
the needle**.
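
For orientation, the compile / load / purge lifecycle that B5 would need can be sketched with a hand-written Erlang abstract form. This is an illustration under assumptions, not the plan's design: the module `B5Sketch`, the generated module name, and the trivial `add/2` body are all hypothetical, and a real translator would derive the forms from a `%Lua.Compiler.Prototype{}`.

```elixir
defmodule B5Sketch do
  # Hypothetical: Erlang abstract forms for a module exporting add/2.
  # A real B5 codegen would generate these from a prototype instead.
  def forms(mod) do
    [
      {:attribute, 1, :module, mod},
      {:attribute, 2, :export, [add: 2]},
      {:function, 3, :add, 2,
       [
         {:clause, 3, [{:var, 3, :A}, {:var, 3, :B}], [],
          [{:op, 3, :+, {:var, 3, :A}, {:var, 3, :B}}]}
       ]}
    ]
  end

  # Compiles the forms in memory and loads the resulting binary.
  def compile_and_load(mod) do
    {:ok, ^mod, binary} = :compile.forms(forms(mod), [:return_errors])
    {:module, ^mod} = :code.load_binary(mod, ~c"b5_sketch_generated", binary)
    mod
  end

  # Marks the current code as old, then purges it.
  def unload(mod) do
    :code.delete(mod)
    :code.purge(mod)
    :ok
  end
end
```

A call such as `B5Sketch.compile_and_load(:b5_sketch_generated).add(1, 2)` then runs the generated function like any other compiled module, which is what would let the JIT treat translated Lua code as ordinary BEAM code.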

## Deferred (intentional, not in 1.0)
