perf(vm): split table storage into array + hash parts by davydog187 · Pull Request #229 · tv-labs/lua

davydog187 · 2026-05-21T22:24:44Z

Split table storage into array + hash parts

Plan: .agents/plans/B7-table-array-hash-split.md

Also defers B6 (.agents/plans/B6-direct-table-refs.md) and marks B8 as merged via #227.

Goal

Reshape Lua.VM.Table so contiguous integer keys live in a tuple-backed
array part with exponential capacity growth, while non-integer / sparse keys
stay in the existing hash map. Mirrors PUC-Lua's internal layout. The
plan's main lever was reducing the per-key cost of sequential t[i] = ...
writes (the dominant pattern on every table-heavy workload).

Design

- array :: tuple() holds values for the contiguous prefix 1..array_len.
  Tuple capacity (tuple_size(array)) is >= array_len; the headroom is
  filled with nil so reads beyond array_len short-circuit naturally.
- Exponential growth (doubling, floor 4) keeps amortized append O(1).
  Within a 500-element build only ~9 grow events fire (capacities
  4 → 8 → ... → 512 → 1024).
- array_has_holes :: boolean() keeps #t at O(1) for the
  overwhelmingly common case where no slot has been explicitly cleared
  via t[k] = nil. Falls back to a linear scan only when the user
  punches a hole.
- Nil-valued slots are allowed in the array part as PUC-Lua-compatible
  hole markers. t[k] = nil mid-array sets the slot to nil and flips
  the holes flag rather than demoting the tail to the hash side.
- Table.get/2, Table.has?/2, Table.length/1, Table.next_entry/2,
  Table.to_map/1, and Table.keys/1 all consult both parts. Every
  call site that read table.data for an integer key has been migrated
  to the new helpers.
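
The append path described above can be sketched roughly as follows. This is a minimal illustration, not the code from lib/lua/vm/table.ex: the module name ArrayPartSketch, its new/0, append/2 and get/2 functions, and the grow-when-full rule are all assumptions made for the example. Only the field names array and array_len, the doubling policy, and the floor of 4 come from the description.

```elixir
defmodule ArrayPartSketch do
  # Hypothetical stand-in for the array half of Lua.VM.Table.
  defstruct array: {}, array_len: 0

  # Floor for the first allocation, as stated in the design notes.
  @min_capacity 4

  def new, do: %__MODULE__{}

  # Append at Lua index array_len + 1, doubling capacity when the tuple is full.
  def append(%__MODULE__{array: array, array_len: len} = t, value) do
    array = if len == tuple_size(array), do: grow(array), else: array
    %{t | array: put_elem(array, len, value), array_len: len + 1}
  end

  # Read a 1-based Lua index; anything outside 1..array_len is nil.
  def get(%__MODULE__{array: array, array_len: len}, i)
      when is_integer(i) and i >= 1 and i <= len,
      do: elem(array, i - 1)

  def get(%__MODULE__{}, _i), do: nil

  # Copy the old cells into a nil-padded tuple of twice the capacity.
  defp grow(array) do
    old = tuple_size(array)
    new = max(old * 2, @min_capacity)
    List.to_tuple(Tuple.to_list(array) ++ List.duplicate(nil, new - old))
  end
end

t = Enum.reduce(1..100, ArrayPartSketch.new(), &ArrayPartSketch.append(&2, &1))
IO.inspect(t.array_len)                  # 100
IO.inspect(tuple_size(t.array))          # 128
IO.inspect(ArrayPartSketch.get(t, 101))  # nil
```

Under these assumptions, 100 appends leave the tuple at 128 cells, with the unused headroom holding nil.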

Migrations (correctness)

Sites that previously inspected table.data directly and could see
integer keys have been migrated:

- Lua.VM.Stdlib.lua_rawget, lua_rawlen, lua_ipairs
- Lua.VM.Executor.table_index/4, table_newindex/5, table_length/2,
  the :length opcode, the get_table fast path
- Lua.eval!/Lua.get!/Lua.set! traversal in lib/lua.ex
- Lua.VM.Value.decode/2
- Lua.VM.Display.peek_table/3
- Lua.VM.State.globals/1

Sites that only read known-string keys (mt.data for __call,
__index, __newindex; package.data for "loaded", "path",
"preload"; loaded_table.data for module names; _G.data for global
names; string.data for "unpack") continue to read data directly —
those keys never live in the array part.

Success criteria

- Lua.VM.Table carries array and array_len fields plus an
  array_has_holes invariant flag.
- Lua.VM.Table.length/1 is O(1) for the no-holes case (the
  dominant workload). Falls back to an O(n) scan only when holes are
  known to exist.
- t[i] for integer i in 1..array_len is element/2. No
  Map.get, no key normalization for in-range integer keys.
- t[#t + 1] = v is amortized O(1) via exponential tuple growth.
- ipairs(t) iterates the array via Table.get/2.
- mix test passes: 1692 tests, 51 properties, 55 doctests, 0
  failures.
- mix test --only lua53 does not regress: 29 tests, 0 failures.
- Microbenchmarks improve (see below). Stretch targets partially
  hit; floor "no workload regresses by more than 2% on time" met.

Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path)

workload	baseline	after B7	delta	beats luerl?
Table Build	89.45 µs	83.80 µs	-6.3%	yes (was tied)
Table Sort	245.45 µs	191.93 µs	-21.8%	no (was 2.2x, now 1.7x)
Iterate/Sum	129.91 µs	117.19 µs	-9.8%	yes (was tied)
Map+Reduce	277.32 µs	249.02 µs	-10.2%	yes (was 1.07x slower)
OOP	135.69 µs	122.26 µs	-10%	no (was 1.27x, now 1.14x)
table.concat	44.21 µs	32.22 µs	-27%	yes
fib(30) chunk	873 ms	~860 ms	within noise (±3%)	—

Memory regressed on table-heavy workloads (e.g. table_build 0.65 MB →
1.68 MB). The cause is intermediate tuple copies during the build.
PUC-Lua avoids them by mutating its array part in place in C; on the
BEAM each setelement/3 is conceptually a copy. The time wins on these
workloads were the priority. Memory is still order-of-magnitude
comparable to luerl; only specific table-heavy bench shapes regress.

Stretch targets not hit

- table_build 30% faster (plan target: ~160 µs): realized 6.3%
  (~84 µs). Most of the plan's projected win came from eliminating the
  per-key Map.put + order_tail cons + dead check pipeline.
  Replacing that with an exponential-growth tuple plus setelement/3 is
  cheaper, but setelement/3 on a doubled-capacity tuple is not free.
- Memory ratio 2-3x luerl: realized ~3x on Build, worse on Sort.
  Bounded by BEAM tuple semantics; would require either NIF mutability
  or a more invasive co-design with the codegen (e.g. sized-tuple
  emission for table literals) to close further. Out of scope here.
- big.lua < 30s: not assessed; it's still in the lua53 skipped
  list pending the larger A10 work it gates.

Discoveries

- The first iteration without exponential growth used :erlang.append_element/2,
  which is O(n) per call. That regressed table_build by +12% time and
  +160% memory. The plan's risks section warned about exactly this;
  exponential growth was the cited mitigation and is what landed.
- The first version eagerly demoted array slots to the hash part when
  t[k] = nil punched a hole, to keep the array contiguous. That broke
  for k,v in pairs(t) do t[k] = nil end because the cleared key was
  no longer findable for next(t, k) to advance past it. Switching to
  PUC-Lua's nil-as-hole semantics (set the slot to nil in place, flip
  array_has_holes) is both simpler and correct.
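
The nil-as-hole behaviour can be illustrated with a small model of the array part. Everything here is a sketch under assumptions: HoleSketch and its functions (from_list/1, clear/2, lua_length/1, next_index/2, clear_all/1) are hypothetical names, and the border scan simply returns the first border it finds, which may differ from what the real Table.length/1 returns for a table with holes.

```elixir
defmodule HoleSketch do
  # Hypothetical model of the array part only; not the actual Lua.VM.Table code.
  defstruct array: {}, array_len: 0, array_has_holes: false

  def from_list(values) do
    %__MODULE__{array: List.to_tuple(values), array_len: Kernel.length(values)}
  end

  # t[k] = nil inside the array part: clear the slot in place and flip the flag.
  def clear(%__MODULE__{array: array, array_len: len} = t, k)
      when is_integer(k) and k >= 1 and k <= len do
    %{t | array: put_elem(array, k - 1, nil), array_has_holes: true}
  end

  # #t: O(1) while no hole has been punched, otherwise scan for the first border.
  def lua_length(%__MODULE__{array_has_holes: false, array_len: len}), do: len

  def lua_length(%__MODULE__{array: array, array_len: len}) do
    Enum.find(0..len, fn n -> n == len or elem(array, n) == nil end)
  end

  # next(t, k) over the array part: a cleared key k is still a valid position,
  # so traversal resumes at k + 1 and skips nil slots. Returns nil when done.
  def next_index(%__MODULE__{array: array, array_len: len}, k)
      when is_integer(k) and k >= 0 do
    Enum.find((k + 1)..len//1, fn i -> elem(array, i - 1) != nil end)
  end

  # Models: for k, v in pairs(t) do t[k] = nil end
  def clear_all(t), do: clear_from(t, next_index(t, 0))

  defp clear_from(t, nil), do: t

  defp clear_from(t, k) do
    t = clear(t, k)
    clear_from(t, next_index(t, k))
  end
end

t = HoleSketch.from_list([10, 20, 30])
IO.inspect(HoleSketch.lua_length(t))                       # 3 (flag path)
IO.inspect(HoleSketch.lua_length(HoleSketch.clear(t, 2)))  # 1 (scan path)
IO.inspect(HoleSketch.clear_all(t).array)                  # {nil, nil, nil}
```

Because the cleared slot keeps its position in the tuple, the clear-while-iterating loop always has a well-defined next index, which is the property the eager-demotion version lost.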

Changes

 lib/lua.ex                                       |   6 +-
 lib/lua/vm/display.ex                            |   2 +-
 lib/lua/vm/executor.ex                           |  63 +++++-
 lib/lua/vm/state.ex                              |   2 +-
 lib/lua/vm/stdlib.ex                             |   8 +-
 lib/lua/vm/table.ex                              | 508 +++++++++++++++++++++---
 lib/lua/vm/value.ex                              |   2 +-
 test/lua/vm/value_test.exs                       |   9 +-
 .agents/plans/B6-direct-table-refs.md            |  35 +-
 .agents/plans/B7-table-array-hash-split.md       |   2 +-
 .agents/plans/B8-inline-numeric-narrowing.md     |   2 +-

Verification

mix format
mix compile --warnings-as-errors
mix test                # 1692 tests, 0 failures
mix test --only lua53   # 29 tests, 0 failures, 23 skipped
mix run benchmarks/table_ops.exs
mix run benchmarks/oop.exs
mix run benchmarks/string_ops.exs
mix run benchmarks/fibonacci.exs   # confirm no regression

Out of scope (intentional)

- Migrating to a "true" mutable array via NIFs or ETS. Stays on the BEAM.
- Codegen sized-tuple emission for table literals ({1, 2, 3}). Would
  reduce memory churn on the static-table case but adds a new compiler
  contract; follow-up.
- Cleaning up Lua.VM.Value.sequence_length/1 (the old map-based
  helper). Still works on a bare map; removing it is a public-API
  question, separate concern.
- Closing further memory gap with luerl on table workloads.

Records PR #229. Documents the discovery that the plan's projected 30% win was reachable in theory but bounded by BEAM tuple semantics; the realized wins concentrate on time (6-22% across table workloads, new wins over luerl on 3/4 of them) with a memory regression that follows from immutable-tuple growth. Also records B6's deferral and B8's merge in the plan changelogs.

Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at 3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on table_build. The real table-workload bottlenecks live inside Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%) and in :erlang.setelement (17.5% on table writes, 20.9% on OOP). Those are B7's targets, not B6's. B6's projected wall-clock win is now below 1%, inside benchee's deviation band on every measured workload. Audit cleanup may still be worth doing later as a refactor, but not as a perf plan and not before B7.

Reshapes Lua.VM.Table so contiguous integer keys live in a tuple-backed array part with exponential capacity growth, while non-integer / sparse keys stay in the existing hash map. Mirrors PUC-Lua's internal layout.

Adds Table.get/2, Table.has?/2, Table.length/1 helpers that consult both parts; migrates every site that read `table.data` for an integer key (rawget, rawlen, ipairs, get_table fast path, lua.ex traversal, decode, display) onto the new helpers. Sites that only touch known-string keys (metatable __index/__newindex/__call lookups, package.loaded module caching, _G global lookups) continue reading `data` directly.

The array part uses exponential capacity growth (doubling with a floor of 4) so sequential `t[i] = ...` writes are amortized O(1) per append rather than O(n) for naive Tuple.append. An `array_has_holes` flag keeps `#t` at O(1) for the overwhelmingly common case where no slot has been explicitly cleared. Nil-valued slots are allowed in the array part as PUC-Lua-compatible hole markers; `t[k] = nil` mid-array sets the slot to nil and flags holes rather than demoting the tail to the hash side. Reads return nil naturally via element/2; iteration via next_entry skips nil slots.

Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path):

- Table Build: 89.45 µs → 83.80 µs (-6.3%; beats luerl)
- Table Sort: 245.45 µs → 191.93 µs (-21.8%)
- Iterate/Sum: 129.91 µs → 117.19 µs (-9.8%; beats luerl)
- Map+Reduce: 277.32 µs → 249.02 µs (-10.2%; beats luerl)
- OOP: 135.69 µs → 122.26 µs (-10%)
- table.concat: 44.21 µs → 32.22 µs (-27%)
- fib(30): within noise (±3%)

The plan's 30% stretch on table_build was not hit: most of the win the plan projected came from eliminating per-key Map.put. The exponential-growth tuple is faster than the per-key Map.put, but setelement on a growing tuple still has real cost.

Memory regresses on table-heavy workloads (e.g. table_build 0.65 MB → 1.68 MB) because of intermediate tuple copies during the build. PUC-Lua avoids this by mutating its array part in place; on the BEAM each setelement is a copy. Still well below C Lua × 1.0 of course, and time wins on these workloads were the priority.

Plan: .agents/plans/B7-table-array-hash-split.md


davydog187 · 2026-05-22T01:08:55Z

Closing after multi-n measurement on the merged bench harness (#230) revealed a hard crossover that makes this PR unsafe to ship.

What the data showed

Five-run variance check + full-mode multi-n sweep (n ∈ {10, 100, 1000}):

Workload @ n	main chunk	B7 chunk	delta	luerl
Build n=10	1.91 µs	1.92 µs	flat	3.10 µs
Build n=100	17.09 µs	14.03 µs	-18%	18.47 µs
Build n=1000	197.96 µs	265.82 µs	+34% ⚠️	184.55 µs
Sort n=100	34.91 µs	27.57 µs	-21%	21.41 µs
Sort n=1000	490.49 µs	655.72 µs	+34% ⚠️	216.04 µs
Iterate n=100	24.59 µs	21.11 µs	-14%	28.28 µs
Iterate n=1000	276.74 µs	358.64 µs	+30% ⚠️	283.40 µs
Map+Reduce n=100	49.79 µs	42.78 µs	-14%	51.27 µs
Map+Reduce n=1000	603.93 µs	843.57 µs	+40% ⚠️	527.66 µs

Memory at n=1000 is also bad: ~3-5× main's allocations (e.g. Sort 2.08 MB → 12.40 MB).

Why the regression at scale

B7 routes contiguous integer keys into a tuple-backed array part with exponential capacity growth. At n=100 the tuple is ~128 cells and setelement/3's constant-factor advantage over Map.put wins out (-14% to -21%). At n=1000 the tuple is ~1024 cells; every setelement/3 copies the full tuple. PUC-Lua mitigates this with in-place mutation in C; we can't on the BEAM.
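
A back-of-the-envelope model shows why the cost grows faster than n. The sketch below assumes, as a simplification, that every write copies the whole tuple at its current capacity and ignores the extra copy performed at each grow step. CopyCostModel and its functions are hypothetical names, and the numbers count tuple cells, not time; the BEAM's real behaviour may be better than this model in cases where it can update in place.

```elixir
defmodule CopyCostModel do
  # Assumed cost model: write number i copies a tuple of capacity_for(i) cells.
  def cells_copied(n) when is_integer(n) and n >= 1 do
    Enum.reduce(1..n, 0, fn i, total -> total + capacity_for(i) end)
  end

  # Smallest capacity in 4, 8, 16, ... that holds i elements
  # (doubling with a floor of 4, as described in the PR).
  def capacity_for(i), do: grow_until(4, i)

  defp grow_until(cap, i) when cap >= i, do: cap
  defp grow_until(cap, i), do: grow_until(cap * 2, i)
end

IO.inspect(CopyCostModel.cells_copied(100))   # 7344
IO.inspect(CopyCostModel.cells_copied(1000))  # 674480
```

Under this model a 10× larger build copies roughly 92× as many cells, which is consistent in direction with the crossover between n=100 and n=1000 in the table above.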

The single n=500 number that motivated investigation was right at the crossover, which explains the run-to-run inconsistency we saw before #230 landed.

What would unblock this

Threshold-based promotion: keep contiguous integer keys in the hash map until array_len ≥ N (e.g. 256), then promote. Preserves the small-table win without the large-table loss. That's a substantial revision of this PR and arguably a different plan; closing rather than scope-creep.

Plan status

.agents/plans/B7-table-array-hash-split.md is being moved from review → deferred with this measurement data as the rationale. Threshold-based promotion can be considered as a future plan; the conditions for re-opening are documented in the plan file.

davydog187 added 5 commits May 21, 2026 17:31

chore(B8): mark plan merged via PR #227

f25da3f

chore(B7): start plan

28ef76d

davydog187 force-pushed the perf/table-array-hash-split branch from 230344f to 2f020b9 Compare May 22, 2026 00:31

davydog187 closed this May 22, 2026

This was referenced May 22, 2026

chore(B7): defer plan; tuple cost crossover makes large tables a regression #231

Merged

docs(roadmap): consolidate B-series findings (B4, B6, B7, B8 + harness) #234

Merged

davydog187 wants to merge 5 commits into main from perf/table-array-hash-split
