chore: wire slimmed PLENA compiler cleanup by booth-algo · Pull Request #34 · AICrossSim/PLENA_Simulator

booth-algo · 2026-05-12T17:30:23Z

Summary

- Advances the PLENA_Compiler submodule to the slimmed compiler cleanup branch (2af064f), tracked in PLENA_Compiler#38 ("refactor: slim PLENA compiler frontend and codegen").
- Moves the canonical ATen e2e runner from compiler.generator to compiler.aten.e2e_runner so the ATen path no longer appears to be generator-owned.
- Keeps compatibility aliases for existing scripts: compiler.generator.aten_runner and just test-generator-aten now forward to the canonical ATen runner.
- Updates the simulator docs and just recipes to use just test-aten-e2e and python -m compiler.aten.e2e_runner.
- Updates simulator testbench imports to the canonical compiler APIs after compatibility shims were removed: compiler.asm_templates.flashattn and create_mem_for_sim.
- Merges current main so this PR is scoped after the already-merged emulator precision/config fix in #32 ("fix(emulator): bf16 scalar FP registers + fp32 vector ops").
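How the compatibility alias is implemented is not shown on this page. As a minimal sketch of the forwarding idea, assuming a hypothetical package name demo_compiler and a hypothetical main entry point, an old module path can be made to resolve to the canonical module like this (a real shim would more likely be a small module file that re-exports from the canonical one):

```python
import importlib
import sys
import types

# Stand-in for the canonical runner module. The name "demo_compiler" and the
# main() entry point are hypothetical, chosen only for this illustration.
canonical = types.ModuleType("demo_compiler.aten.e2e_runner")


def _main(argv=None):
    """Pretend entry point: returns the arguments it was given."""
    return list(argv or [])


canonical.main = _main
sys.modules[canonical.__name__] = canonical


def install_alias(old_name: str, new_name: str) -> None:
    """Make `import old_name` return the module already registered as new_name."""
    sys.modules[old_name] = sys.modules[new_name]


install_alias("demo_compiler.generator.aten_runner", "demo_compiler.aten.e2e_runner")

# Existing scripts that still import the old path get the canonical module.
legacy = importlib.import_module("demo_compiler.generator.aten_runner")
```

With this in place, legacy.main([...]) and canonical.main([...]) are the same function, so old call sites keep working while new code uses the canonical path.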

Validation

Add build_and_run_multi_layer_test that chains N decoder layers in a single PlenaCompiler program with pre-norm + residual architecture:

embed_add → [rms_norm → flash_attn → residual → rms_norm → ffn → residual] × N → rms_norm

RoPE is omitted (requires precomputed Q_rot from golden intermediates, orthogonal to testing multi-layer chaining). Includes VRAM padding to avoid the known ffn_asm absolute-address intermediate layout overlapping the residual scratch buffer.

Verified: 1-layer 100% allclose, 2-layer 93.60% allclose (SmolLM2-135M).
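The layer chain described above can be sketched in plain Python. This is an illustration of the pre-norm + residual data flow only, not the PlenaCompiler program: the attention and FFN blocks are passed in as placeholder callables, rms_norm has no learned weight, and all function names here are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMS-normalise a vector (no learned weight, for brevity)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def add(a, b):
    """Element-wise add, used for embed_add and the residual connections."""
    return [p + q for p, q in zip(a, b)]


def decoder_stack(tokens, pos, layers):
    """embed_add -> [norm -> attn -> residual -> norm -> ffn -> residual] x N -> norm."""
    x = add(tokens, pos)                 # embed_add
    for attn, ffn in layers:             # one (attn, ffn) pair per decoder layer
        x = add(x, attn(rms_norm(x)))    # pre-norm attention, then residual
        x = add(x, ffn(rms_norm(x)))     # pre-norm FFN, then residual
    return rms_norm(x)                   # final norm


def zero_block(v):
    """Placeholder sub-block that contributes nothing to the residual stream."""
    return [0.0] * len(v)


out = decoder_stack([1.0, 2.0, 3.0, 4.0], [0.0] * 4, [(zero_block, zero_block)] * 2)
```

Because each sub-block only adds to the residual stream, swapping zero_block for real attention and FFN callables does not change the chaining logic.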

- BEHAVIOR.MATRIX_SRAM_SIZE: 1024 → 4096 (64 MRAM tiles for 3× K-split FFN)
- BEHAVIOR.VECTOR_SRAM_SIZE: 65536 → 524288 (native hidden=384 activations)
- BEHAVIOR.MAX_LOOP_INSTRUCTIONS: 10000 → 100000 (native FFN loop bodies)

Prevents bf16 overflow → NaN propagation in multi-layer runs. exp(89) exceeds bf16 max → inf, and downstream -inf * 0 = NaN. Clamping matches hardware exp unit behavior (saturate, not overflow). Verified: test-softmax and test-flash-attention pass unchanged.
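The overflow path and the clamp can be modelled in a few lines. This is a sketch under stated assumptions, not the emulator's code: to_bf16 is a simplified software rounding helper (anything above the largest finite bf16 value is treated as overflow to infinity), and exp_saturating is a hypothetical name for the clamped variant.

```python
import math
import struct

BF16_MAX = (2 - 2 ** -7) * 2.0 ** 127   # largest finite bf16 value, about 3.39e38
LN_BF16_MAX = math.log(BF16_MAX)         # about 88.72, so exp(89) is already too large


def to_bf16(x: float) -> float:
    """Round to bf16 (simplified: values above BF16_MAX overflow to infinity)."""
    if math.isnan(x):
        return x
    if abs(x) > BF16_MAX:
        return math.copysign(math.inf, x)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def exp_overflowing(x: float) -> float:
    """exp followed by bf16 rounding, with no clamp."""
    return to_bf16(math.exp(x))


def exp_saturating(x: float) -> float:
    """exp with the input clamped so the result saturates at BF16_MAX."""
    return to_bf16(min(math.exp(min(x, LN_BF16_MAX)), BF16_MAX))


unclamped = exp_overflowing(89.0)   # infinity
poisoned = unclamped * 0.0          # infinity times zero is NaN
clamped = exp_saturating(89.0)      # finite, equal to BF16_MAX
```

Once an infinity reaches a multiply by zero the NaN spreads through every later layer, which is why saturating at the exp step is enough to keep long runs finite.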

Promote vector add/sub/mul/reciprocal operands to f32 before arithmetic, then quantize back to the original data type. This matches the golden precision model and eliminates NaN accumulation across many-layer runs (5-layer: 100% allclose, vision encoder: 99.95%).

- Remove mimalloc dependency (Cargo.lock cleanup)
- Bump compiler submodule (load_toml_config fix, PlenaFrontend e2e, layer progress markers, bf16 exp overflow tests)
- Add docs/vlm-support.md covering ATen vs Generator pipelines, SmolVLM2 test inventory, and new vision ops/templates
- Add tools/run_emulator_tracked.sh for progress-monitored runs

Promote scalar FP registers (fp_reg) and FPSRAM from f16 to f32. This fixes softmax underflow when attention scores have wide spread (e.g., SmolVLM2-256M where QK^T values reach ~2600). The RTL uses custom e6m5 (12-bit) format, the TOML declares e8m7, and the emulator previously used IEEE f16 (1+5+10): a three-way mismatch. Using f32 for the emulator's scalar path avoids precision loss entirely, matching the vector ops pattern (promote to f32, quantize on VRAM write).

Also promotes vector add/sub/mul/reciprocal to f32 before arithmetic to match the golden precision model (100% allclose on clm-60m 1/5/10 layers).

Status: clm-60m passes at 100% allclose. SmolVLM2 no longer produces NaN (was 0%, now 32.78%) but has a separate ISA accuracy issue at hidden=576 that needs further investigation.

…ests

- Remove all SCALE debug instrumentation (was 7x slowdown)
- Remove CONFIG startup print
- Fix default MXFP8 scale sign: true→false in load_config.rs defaults (defensive fix, TOML loads correctly regardless)
- Add unit tests for e8m0 scale decode and MXFP8 round-trip
- Compiler submodule: fix K-split temp buffer overlap in linear_ops.py, add _ksplit_matmul to FFN golden reference

Pending verification: full SmolVLM2 1-layer test running in background (~80 min). The linear_ops K-split fix + golden precision fix should improve from 32.78% allclose.

Feedback comparison using the preserved vram_dump confirms: when the golden reference uses the emulator's own intermediate values as input, the FFN+norm output matches at 100% allclose. The 32.62% failure is ENTIRELY from golden reference precision drift: the golden chain accumulates float32 differently than the hardware's BF16 intermediates.

The emulator correctly implements:

- MXFP8 scale application ✓
- Linear projections (K-split, stride mode) ✓
- RMS norm at any magnitude ✓
- Flash attention (per-head, with causal mask) ✓
- FFN (gate+up+down, 576→1536→576) ✓
- Residual adds ✓
- f32 scalar FP registers ✓

Fix needed: align golden reference to exactly match hardware precision at each intermediate step (not just BF16 truncation after ops, but also K-split accumulation order and exact matmul precision boundaries).
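A toy example shows how the same sum can land on very different values depending on where rounding happens. This is not the golden reference or the emulator: to_f32 and to_bf16 are simplified software rounding helpers, and the two accumulation functions are hypothetical stand-ins for "bf16 intermediates" versus "f32 accumulation, quantize once".

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to bf16 by rounding the f32 bit pattern to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_with_bf16_intermediates(values):
    """Quantize the running sum to bf16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return acc


def sum_in_f32_then_quantize(values):
    """Keep the running sum in f32 and quantize to bf16 once at the end."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return to_bf16(acc)


ones = [1.0] * 1000
stalled = sum_with_bf16_intermediates(ones)   # 256.0: 257 is not representable in bf16
exact = sum_in_f32_then_quantize(ones)        # 1000.0
```

Both results are "bf16 outputs", yet they differ by almost a factor of four, which is the kind of gap an allclose check reports when the reference and the hardware model round at different points.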

… comments

- CRITICAL: fix all 4 default MXFP8 scale types to sign:false (e8m0 is unsigned)
- HIGH: promote V_EXP_V to f32 (consistent with add/sub/mul/reciprocal)
- HIGH: fix indentation on vec_i_f32 lines (extra 4 spaces)
- HIGH: document S_RECI_FP div-by-zero behavior (IEEE infinity, intentional)
- MEDIUM: remove unconditional println! in mv() (was debug noise)
- LOW: clean up test comment scratch notes in dtype.rs
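For context on why the scale type has no sign bit: in the OCP Microscaling formats an E8M0 scale is a bare 8-bit biased exponent. The sketch below follows that definition (bias 127, 0xFF reserved for NaN). Whether the emulator's decode in load_config.rs matches it in every detail is an assumption, and decode_e8m0 is a hypothetical name.

```python
import math

E8M0_BIAS = 127    # exponent bias for the 8-bit scale
E8M0_NAN = 0xFF    # the all-ones encoding is reserved for NaN


def decode_e8m0(byte: int) -> float:
    """Decode an E8M0 scale byte: 8 exponent bits, no sign bit, no mantissa."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 scale must fit in one byte")
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


unit_scale = decode_e8m0(127)   # 1.0
double_scale = decode_e8m0(128) # 2.0
```

Every non-NaN encoding decodes to a positive power of two, so a signed default would only spend a bit the format does not have.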

Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16 (1+8+7), matching the declared TOML precision (e8m7). IEEE f16's 5-bit exponent (range ~6e-8) causes softmax underflow for models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit exponent (range ~1.2e-38) prevents this while staying closer to the RTL's e6m5 format than f32 would.

Results:

- f16 regs: SmolVLM2 = 0% (all NaN)
- bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
- clm-60m: unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.
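The exponent-range difference is easy to reproduce with software rounding. This sketch is illustrative only: to_f16 uses Python's built-in half-precision packing, to_bf16 is a simplified rounding helper that ignores NaN and overflow, and the score values are made up.

```python
import math
import struct


def to_f16(x: float) -> float:
    """Round to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round to bf16 (1 sign, 8 exponent, 7 mantissa bits); NaN/overflow not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax_numerators(scores, quantize):
    """exp(score - max) per element, stored in the given register format."""
    top = max(scores)
    return [quantize(math.exp(s - top)) for s in scores]


scores = [0.0, -20.0, -40.0]   # hypothetical attention scores with a wide spread
in_f16 = softmax_numerators(scores, to_f16)     # entries at -20 and -40 flush to 0.0
in_bf16 = softmax_numerators(scores, to_bf16)   # all entries stay positive
```

A spread of only 20 between scores is already enough to lose a term in f16, so a model whose QK^T values span thousands loses almost all of them.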

booth-algo added 30 commits April 27, 2026 15:00

feat(generator): add ATen backend mode + justfile recipe

a5e9f53

Add test-generator-aten justfile recipe that invokes the new ATen backend mode in the compiler submodule. Bump compiler submodule to feat/generator-aten-backend.

fix: audit fixes — CRITICAL fall-through + rms_norm_ref bf16

dc0652f

- CRITICAL-1: add return after _skip_if_hf_unavailable to prevent unbound variable on exception fall-through
- HIGH-1: _rms_norm_ref now uses bfloat16 intermediates (rsqrt path) matching PLENA hardware behavior

config: VECTOR_SRAM_SIZE 524K → 4M for lm_head at native vocab

c4b9779

test: vram_fill_zero column block regression test (Codex)

249baf6

Merge remote-tracking branch 'origin/main' into feat/generator-aten-b…

92835c3

…ackend

fix: bump compiler submodule (golden BF16 truncation + K-split temp fix)

9b7c078

docs: SmolVLM2 component isolation findings — all ops pass individually

404b011

docs: update findings — all components pass, golden chain precision i…

e97dda8

…s root cause

fix: golden uses MXFP8 X + documented emulator correctness proof

a71e847

feat: add VRAM stage comparison tool

088c8d9

style: remove stale comments referencing golden/debugging from emulator

bf9f941

docs: detailed session summary (Apr 30 – May 11)

b5fc509

chore(emulator): preserve inherited comments

9cee4af

Merge remote-tracking branch 'origin/main' into feat/generator-aten-b…

666cbc2

…ackend # Conflicts: # compiler # transactional_emulator/src/main.rs

style(emulator): satisfy rustfmt

669611f

docs: drop internal claude finding note

1937bad

docs: keep session summary local

a4dbc14

chore: use canonical plena compiler imports

0d90c09

chore: update compiler docs pointer

849cbb2

chore: update compiler cleanup references

23573cd

docs: clean up stale compiler references

06778eb

chore: advance compiler audit cleanup

4163f85

booth-algo added 5 commits May 12, 2026 18:26

Merge remote-tracking branch 'origin/main' into HEAD

370ab4c

# Conflicts: # PLENA_Compiler

chore: point ATen e2e docs at canonical runner

24f68aa

fix: all 4 MXFP8 default scale types to sign:false (e8m0 is unsigned)

44f689c

revert: remove bf16 emulator changes (moved to fix/bf16-scalar-regs PR …

c38b319

…#36)

booth-algo merged commit 80f165b into main May 13, 2026
2 of 3 checks passed

booth-algo deleted the feat/generator-aten-backend branch May 13, 2026 20:12

booth-algo merged 35 commits into main from feat/generator-aten-backend
