perf(vm): split table storage into array + hash parts#229

Closed
davydog187 wants to merge 5 commits into main from perf/table-array-hash-split

@davydog187
Contributor

Split table storage into array + hash parts

Plan: .agents/plans/B7-table-array-hash-split.md

Also defers B6 (.agents/plans/B6-direct-table-refs.md) and marks B8 as merged via #227.

Goal

Reshape Lua.VM.Table so contiguous integer keys live in a tuple-backed
array part with exponential capacity growth, while non-integer / sparse keys
stay in the existing hash map. Mirrors PUC-Lua's internal layout. The
plan's main lever was reducing the per-key cost of sequential t[i] = ...
writes (the dominant pattern on every table-heavy workload).

Design

  • array :: tuple() holds values for the contiguous prefix 1..array_len.
    Tuple capacity (tuple_size(array)) is >= array_len; the headroom is
    filled with nil so reads beyond array_len short-circuit naturally.
  • Exponential growth (doubling, floor 4) keeps amortized append O(1).
    Within a 500-element build only ~9 grow events fire (capacities
    4 → 8 → ... → 512 → 1024).
  • array_has_holes :: boolean() keeps #t at O(1) for the
    overwhelmingly common case where no slot has been explicitly cleared
    via t[k] = nil. Falls back to a linear scan only when the user
    punches a hole.
  • Nil-valued slots are allowed in the array part as PUC-Lua-compatible
    hole markers. t[k] = nil mid-array sets the slot to nil and flips
    the holes flag rather than demoting the tail to the hash side.
  • Table.get/2, Table.has?/2, Table.length/1, Table.next_entry/2,
    Table.to_map/1, and Table.keys/1 all consult both parts. Every
    call site that read table.data for an integer key has been migrated
    to the new helpers.
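
A minimal sketch of the layout described above, not the code in lib/lua/vm/table.ex. The module name TableSketch, the new/0 and border/1 helpers (border/1 stands in for Table.length/1), and the grow/1 strategy are assumptions for illustration; only the field names come from this PR. Float-key normalization and migration of keys from the hash part into the array part are left out.

```elixir
defmodule TableSketch do
  # Hypothetical stand-in for Lua.VM.Table. Field names follow the PR text;
  # everything else is an assumption made for illustration.
  defstruct array: {}, array_len: 0, array_has_holes: false, data: %{}

  @min_capacity 4

  def new, do: %__MODULE__{}

  # In-range integer keys are a plain tuple read (Lua index k is slot k - 1).
  def get(%__MODULE__{array: arr, array_len: len}, k)
      when is_integer(k) and k >= 1 and k <= len,
      do: elem(arr, k - 1)

  # Everything else falls through to the hash part.
  def get(%__MODULE__{data: data}, k), do: Map.get(data, k)

  # t[k] = nil inside the prefix: keep the slot, store nil, flag the hole.
  def put(%__MODULE__{array: arr, array_len: len} = t, k, nil)
      when is_integer(k) and k >= 1 and k <= len,
      do: %{t | array: put_elem(arr, k - 1, nil), array_has_holes: true}

  # Overwrite inside the prefix.
  def put(%__MODULE__{array: arr, array_len: len} = t, k, v)
      when is_integer(k) and k >= 1 and k <= len,
      do: %{t | array: put_elem(arr, k - 1, v)}

  # Append at array_len + 1: double the capacity only when the tuple is full.
  def put(%__MODULE__{array: arr, array_len: len} = t, k, v)
      when is_integer(k) and k == len + 1 and not is_nil(v) do
    arr = if tuple_size(arr) == len, do: grow(arr), else: arr
    %{t | array: put_elem(arr, len, v), array_len: len + 1}
  end

  # Non-integer or sparse keys stay in the map; storing nil deletes the key.
  def put(%__MODULE__{data: data} = t, k, nil), do: %{t | data: Map.delete(data, k)}
  def put(%__MODULE__{data: data} = t, k, v), do: %{t | data: Map.put(data, k, v)}

  # O(1) when no hole has been punched.
  def border(%__MODULE__{array_has_holes: false, array_len: len}), do: len

  # Otherwise scan for the first nil slot; the border is the index before it.
  def border(%__MODULE__{array: arr, array_len: len}) do
    Enum.find(1..len//1, len + 1, fn i -> elem(arr, i - 1) == nil end) - 1
  end

  # New capacity is max(4, 2 * old); the headroom is filled with nil.
  defp grow(arr) do
    old = tuple_size(arr)
    new = max(@min_capacity, old * 2)
    List.to_tuple(Tuple.to_list(arr) ++ List.duplicate(nil, new - old))
  end
end

t = Enum.reduce(1..500, TableSketch.new(), fn i, acc -> TableSketch.put(acc, i, i * 10) end)
TableSketch.get(t, 250)                           # 2500
TableSketch.border(t)                             # 500
TableSketch.border(TableSketch.put(t, 3, nil))    # 2
```

The sketch keeps the cleared slot in place, so a later read of t[3] still resolves through the array part and returns nil.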

Migrations (correctness)

Sites that previously inspected table.data directly and could see
integer keys have been migrated:

  • Lua.VM.Stdlib.lua_rawget, lua_rawlen, lua_ipairs
  • Lua.VM.Executor.table_index/4, table_newindex/5, table_length/2,
    the :length opcode, the get_table fast path
  • Lua.eval!/Lua.get!/Lua.set! traversal in lib/lua.ex
  • Lua.VM.Value.decode/2
  • Lua.VM.Display.peek_table/3
  • Lua.VM.State.globals/1

Sites that only read known-string keys (mt.data for __call,
__index, __newindex; package.data for "loaded", "path",
"preload"; loaded_table.data for module names; _G.data for global
names; string.data for "unpack") continue to read data directly —
those keys never live in the array part.
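
To illustrate why the integer-key call sites had to move, here is a small sketch with a plain map standing in for the table. MigrationSketch, old_read/2, and new_read/2 are hypothetical names, not functions from this codebase.

```elixir
defmodule MigrationSketch do
  # Hypothetical split table as a plain map; not the real Lua.VM.Table struct.
  def example do
    %{array: {10, 20, 30}, array_len: 3, data: %{"__index" => :some_fun}}
  end

  # Old-style read: only looks at the hash part.
  def old_read(table, key), do: Map.get(table.data, key)

  # Helper-style read: array part first for in-range integers, then the hash part.
  def new_read(%{array: arr, array_len: len}, key)
      when is_integer(key) and key >= 1 and key <= len,
      do: elem(arr, key - 1)

  def new_read(%{data: data}, key), do: Map.get(data, key)
end

table = MigrationSketch.example()
MigrationSketch.old_read(table, 2)           # nil, key 2 is not in the hash part
MigrationSketch.new_read(table, 2)           # 20
MigrationSketch.old_read(table, "__index")   # :some_fun, string keys are unaffected
```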

Success criteria

  • Lua.VM.Table carries array and array_len fields plus an
    array_has_holes invariant flag.
  • Lua.VM.Table.length/1 is O(1) for the no-holes case (the
    dominant workload). Falls back to O(n) scan only when holes are
    known to exist.
  • t[i] for integer i in 1..array_len is element/2. No
    Map.get, no key normalization for in-range integer keys.
  • t[#t + 1] = v is amortized O(1) via exponential tuple growth.
  • ipairs(t) iterates the array via Table.get/2.
  • mix test passes — 1692 tests, 51 properties, 55 doctests, 0
    failures.
  • mix test --only lua53 does not regress — 29 tests, 0 failures.
  • Microbenchmarks improve (see below). Stretch targets partially
    hit; floor "no workload regresses by more than 2% on time" met.

Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path)

| workload      | baseline  | after B7  | delta              | beats luerl?              |
|---------------|-----------|-----------|--------------------|---------------------------|
| Table Build   | 89.45 µs  | 83.80 µs  | -6.3%              | yes (was tied)            |
| Table Sort    | 245.45 µs | 191.93 µs | -21.8%             | no (was 2.2x, now 1.7x)   |
| Iterate/Sum   | 129.91 µs | 117.19 µs | -9.8%              | yes (was tied)            |
| Map+Reduce    | 277.32 µs | 249.02 µs | -10.2%             | yes (was 1.07x slower)    |
| OOP           | 135.69 µs | 122.26 µs | -10%               | no (was 1.27x, now 1.14x) |
| table.concat  | 44.21 µs  | 32.22 µs  | -27%               | yes                       |
| fib(30) chunk | 873 ms    | ~860 ms   | within noise (±3%) |                           |

Memory regressed on table-heavy workloads (e.g. table_build 0.65 MB →
1.68 MB). The cause is intermediate tuple copies during the build.
PUC-Lua avoids this by mutating its array part in place; on the BEAM each
setelement/3 is conceptually a copy. The time wins on these
workloads were the priority. Memory is still order-of-magnitude
comparable to luerl; only specific table-heavy bench shapes regress.

Stretch targets not hit

  • table_build 30% faster (plan target: ~160 µs): realized 6.3%
    (~84 µs). Most of the plan's projected win came from eliminating the
    per-key Map.put + order_tail cons + dead check pipeline.
    Replacing that with an exponential-growth tuple plus setelement/3 is
    cheaper, but setelement/3 on a doubled-capacity tuple is not free.
  • Memory ratio 2-3x luerl: realized ~3x on Build, worse on Sort.
    Bounded by BEAM tuple semantics; would require either NIF mutability
    or a more invasive co-design with the codegen (e.g. sized-tuple
    emission for table literals) to close further. Out of scope here.
  • big.lua < 30s: not assessed; it's still in the lua53 skipped
    list pending the larger A10 work it gates.

Discoveries

  • The first iteration without exponential growth used :erlang.append_element/2
    which is O(n) per call. That regressed table_build by +12% time and
    +160% memory. The plan's risks section warned about exactly this;
    exponential growth was the cited mitigation and is what landed.
  • The first version eagerly demoted array slots to the hash part when
    t[k] = nil punched a hole, to keep the array contiguous. That broke
    for k,v in pairs(t) do t[k] = nil end because the cleared key was
    no longer findable for next(t, k) to advance past it. Switching to
    PUC-Lua's nil-as-hole semantics (set the slot to nil in place, flip
    array_has_holes) is both simpler and correct.
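
The first discovery can be sanity-checked with a back-of-the-envelope count of cells copied by resize events alone. This is a sketch under two assumptions: the tuple grows only when it is full, and a resize copies every existing cell. GrowthCost is a hypothetical module, and the count ignores whatever setelement/3 itself copies on each write.

```elixir
defmodule GrowthCost do
  # Cells copied by resize events when appending n elements one at a time.

  # Growing by exactly one slot per append (the :erlang.append_element/2 shape):
  # the i-th append copies the i - 1 cells that already exist.
  def per_append(n) when n >= 1, do: Enum.sum(0..(n - 1))

  # Doubling with a floor of 4: each resize copies the old capacity's cells.
  # Returns {final_capacity, cells_copied}.
  def doubling(n) when n >= 1, do: doubling(n, 0, 0)

  defp doubling(n, cap, copied) when cap >= n, do: {cap, copied}
  defp doubling(n, cap, copied), do: doubling(n, max(4, cap * 2), copied + cap)
end

GrowthCost.per_append(500)   # 124750
GrowthCost.doubling(500)     # {512, 508}
```

Under these assumptions the per-append strategy copies a few hundred times more cells than doubling for a 500-element build, which is the direction of the regression noted above.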

Changes

 lib/lua.ex                                       |   6 +-
 lib/lua/vm/display.ex                            |   2 +-
 lib/lua/vm/executor.ex                           |  63 +++++-
 lib/lua/vm/state.ex                              |   2 +-
 lib/lua/vm/stdlib.ex                             |   8 +-
 lib/lua/vm/table.ex                              | 508 +++++++++++++++++++++---
 lib/lua/vm/value.ex                              |   2 +-
 test/lua/vm/value_test.exs                       |   9 +-
 .agents/plans/B6-direct-table-refs.md            |  35 +-
 .agents/plans/B7-table-array-hash-split.md       |   2 +-
 .agents/plans/B8-inline-numeric-narrowing.md     |   2 +-

Verification

mix format
mix compile --warnings-as-errors
mix test                # 1692 tests, 0 failures
mix test --only lua53   # 29 tests, 0 failures, 23 skipped
mix run benchmarks/table_ops.exs
mix run benchmarks/oop.exs
mix run benchmarks/string_ops.exs
mix run benchmarks/fibonacci.exs   # confirm no regression

Out of scope (intentional)

  • Migrating to a "true" mutable array via NIFs or ETS. Stays on the BEAM.
  • Codegen sized-tuple emission for table literals ({1, 2, 3}). Would
    reduce memory churn on the static-table case but adds a new compiler
    contract; follow-up.
  • Cleaning up Lua.VM.Value.sequence_length/1 (the old map-based
    helper). Still works on a bare map; removing it is a public-API
    question, separate concern.
  • Closing further memory gap with luerl on table workloads.

davydog187 added a commit that referenced this pull request May 21, 2026
Records PR #229. Documents the discovery that the plan's projected
30% win was reachable in theory but bounded by BEAM tuple semantics;
the realized wins concentrate on time (6-22% across table workloads,
new wins over luerl on 3/4 of them) with a memory regression that
follows from immutable-tuple growth.

Also records B6's deferral and B8's merge in the plan changelogs.

Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at
3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on
table_build. The real table-workload bottlenecks live inside
Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%)
and in :erlang.setelement (17.5% on table writes, 20.9% on OOP).
Those are B7's targets, not B6's.

B6's projected wall-clock win is now below 1%, inside benchee's
deviation band on every measured workload. Audit cleanup may still be
worth doing later as a refactor, but not as a perf plan and not
before B7.

Reshapes Lua.VM.Table so contiguous integer keys live in a tuple-backed
array part with exponential capacity growth, while non-integer / sparse
keys stay in the existing hash map. Mirrors PUC-Lua's internal layout.

Adds Table.get/2, Table.has?/2, Table.length/1 helpers that consult both
parts; migrates every site that read `table.data` for an integer key
(rawget, rawlen, ipairs, get_table fast path, lua.ex traversal, decode,
display) onto the new helpers. Sites that only touch known-string keys
(metatable __index/__newindex/__call lookups, package.loaded module
caching, _G global lookups) continue reading `data` directly.

The array part uses exponential capacity growth (doubling with a floor
of 4) so sequential `t[i] = ...` writes are amortized O(1) per append
rather than O(n) for naive Tuple.append. An `array_has_holes` flag
keeps `#t` at O(1) for the overwhelmingly common case where no slot
has been explicitly cleared.

Nil-valued slots are allowed in the array part as PUC-Lua-compatible
hole markers; `t[k] = nil` mid-array sets the slot to nil and flags
holes rather than demoting the tail to the hash side. Reads return nil
naturally via element/2; iteration via next_entry skips nil slots.

Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path):

- Table Build:    89.45 µs → 83.80 µs  (-6.3%; beats luerl)
- Table Sort:    245.45 µs → 191.93 µs  (-21.8%)
- Iterate/Sum:   129.91 µs → 117.19 µs  (-9.8%; beats luerl)
- Map+Reduce:    277.32 µs → 249.02 µs  (-10.2%; beats luerl)
- OOP:           135.69 µs → 122.26 µs  (-10%)
- table.concat:   44.21 µs →  32.22 µs  (-27%)
- fib(30):       within noise (±3%)

The plan's 30% stretch on table_build was not hit: most of the win
the plan projected came from eliminating per-key Map.put. The
exponential-growth tuple is faster than the per-key Map.put, but
setelement on a growing tuple still has real cost.

Memory regresses on table-heavy workloads (e.g. table_build 0.65 MB
→ 1.68 MB) because of intermediate tuple copies during the build.
PUC-Lua avoids this by mutating its array part in place; on the BEAM
each setelement is a copy. Still well below C Lua × 1.0 of course, and
time wins on these workloads were the priority.

Plan: .agents/plans/B7-table-array-hash-split.md
@davydog187 force-pushed the perf/table-array-hash-split branch from 230344f to 2f020b9 on May 22, 2026 00:31
@davydog187
Contributor Author

Closing after multi-n measurement on the merged bench harness (#230) revealed a hard crossover that makes this PR unsafe to ship.

What the data showed

Five-run variance check + full-mode multi-n sweep (n ∈ {10, 100, 1000}):

| Workload @ n      | main chunk | B7 chunk  | delta   | luerl     |
|-------------------|------------|-----------|---------|-----------|
| Build n=10        | 1.91 µs    | 1.92 µs   | flat    | 3.10 µs   |
| Build n=100       | 17.09 µs   | 14.03 µs  | -18%    | 18.47 µs  |
| Build n=1000      | 197.96 µs  | 265.82 µs | +34% ⚠️ | 184.55 µs |
| Sort n=100        | 34.91 µs   | 27.57 µs  | -21%    | 21.41 µs  |
| Sort n=1000       | 490.49 µs  | 655.72 µs | +34% ⚠️ | 216.04 µs |
| Iterate n=100     | 24.59 µs   | 21.11 µs  | -14%    | 28.28 µs  |
| Iterate n=1000    | 276.74 µs  | 358.64 µs | +30% ⚠️ | 283.40 µs |
| Map+Reduce n=100  | 49.79 µs   | 42.78 µs  | -14%    | 51.27 µs  |
| Map+Reduce n=1000 | 603.93 µs  | 843.57 µs | +40% ⚠️ | 527.66 µs |

Memory at n=1000 is also bad: ~3-6× main's allocations (e.g. Sort 2.08 MB → 12.40 MB, roughly 6×).

Why the regression at scale

B7 routes contiguous integer keys into a tuple-backed array part with exponential capacity growth. At n=100 the tuple is ~128 cells and setelement/3's constant-factor advantage over Map.put wins out (-14% to -21%). At n=1000 the tuple is ~1024 cells; every setelement/3 copies the full tuple. PUC-Lua mitigates this with in-place mutation in C; we can't on the BEAM.
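
A rough cost model makes the crossover plausible. The sketch below assumes, as stated above, that every setelement/3 copies the whole tuple, and that capacity doubles from a floor of 4. WriteCost is a hypothetical module and the numbers count copied cells, not time.

```elixir
defmodule WriteCost do
  # Capacity after doubling (floor 4) until at least i slots fit.
  def capacity(i) when i >= 1, do: grow(4, i)

  defp grow(cap, i) when cap >= i, do: cap
  defp grow(cap, i), do: grow(cap * 2, i)

  # Cells copied across n sequential appends if each write copies the full tuple.
  def cells_copied(n) when n >= 1 do
    1..n |> Enum.map(&capacity/1) |> Enum.sum()
  end
end

WriteCost.cells_copied(100)    # 7344   (about 73 cells per write)
WriteCost.cells_copied(1000)   # 674480 (about 674 cells per write)
```

In this model the per-write copy cost grows roughly linearly with table size, while a Map.put on the hash part does not, so the tuple's constant-factor advantage is expected to run out somewhere between the two sizes.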

The single n=500 number that motivated investigation was right at the crossover, which explains the run-to-run inconsistency we saw before #230 landed.

What would unblock this

Threshold-based promotion: keep contiguous integer keys in the tuple-backed array part only while array_len < N (e.g. 256), and move them to the hash map once a table grows past that. Preserves the small-table win without the large-table loss. That's a substantial revision of this PR and arguably a different plan; closing rather than letting the scope creep.

Plan status

.agents/plans/B7-table-array-hash-split.md is being moved from review to deferred with this measurement data as the rationale. Threshold-based promotion can be considered as a future plan; the conditions for re-opening are documented in the plan file.
