chore: wire slimmed PLENA compiler cleanup #34

Merged
booth-algo merged 35 commits into main from feat/generator-aten-backend
May 13, 2026

@booth-algo booth-algo commented May 12, 2026

Summary

  • Advances the PLENA_Compiler submodule to the slimmed compiler cleanup branch (2af064f), tracked in PLENA_Compiler#38 (refactor: slim PLENA compiler frontend and codegen).
  • Moves the canonical ATen e2e runner from compiler.generator to compiler.aten.e2e_runner so the ATen path no longer appears to be generator-owned.
  • Keeps compatibility aliases for existing scripts: compiler.generator.aten_runner and just test-generator-aten now forward to the canonical ATen runner.
  • Updates simulator docs/just recipes to use just test-aten-e2e and python -m compiler.aten.e2e_runner.
  • Updates simulator testbench imports to canonical compiler APIs after compatibility shims were removed: compiler.asm_templates.flashattn and create_mem_for_sim.
  • Merges current main so this PR is scoped after the already-merged emulator precision/config fix in #32 (fix(emulator): bf16 scalar FP registers + fp32 vector ops).
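
The compatibility-alias idea in the summary above can be sketched as follows. This is a minimal illustration, not the code in the PR: the module names `canonical_e2e_runner` and `legacy_aten_runner` are hypothetical stand-ins for compiler.aten.e2e_runner and compiler.generator.aten_runner, and the `main(argv)` entry point is an assumption.

```python
import sys
import types

# Hypothetical stand-in for the canonical runner module
# (plays the role of compiler.aten.e2e_runner in this sketch).
canonical = types.ModuleType("canonical_e2e_runner")


def _canonical_main(argv=None):
    """Pretend entry point: echoes the arguments it was given."""
    return ["canonical", *(argv or [])]


canonical.main = _canonical_main
sys.modules[canonical.__name__] = canonical

# Hypothetical legacy alias module (plays the role of
# compiler.generator.aten_runner). It owns no logic of its own.
legacy = types.ModuleType("legacy_aten_runner")


def _legacy_main(argv=None):
    # Import lazily so the alias always resolves to whatever the
    # canonical module currently exposes.
    from canonical_e2e_runner import main
    return main(argv)


legacy.main = _legacy_main
sys.modules[legacy.__name__] = legacy

import legacy_aten_runner

result = legacy_aten_runner.main(["--help"])
print(result)
```

Keeping the alias as a thin forwarder means old scripts keep working while all behavior lives in one place.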

Validation

  • git diff --check in PLENA_Compiler
  • git diff --check in PLENA_Simulator
  • tracked-file audit for stale compatibility names/imports (build_sim_env, tile_compiler, old flash_attn_asm import path, old compiler/... file paths, DSL, stale generator-owned ATen runner references)
  • compile checks for touched Python files
  • focused ruff checks for touched compiler/testbench files
  • python -m compiler.aten.e2e_runner --help
  • python -m generator.aten_runner --help (compatibility alias)
  • asm_templates/tests/test_vram_sub_projection.py - 5 passed
  • aten/tests/test_plena_compiler.py - 13 passed
  • asm_templates/tests/test_large_immediate.py - 23 passed
  • clm-60m 2-layer ATen/e2e passed before the alias-only runner move: 100% allclose, max abs error 0.171875, MAE 2.233887e-02
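
The three numbers quoted in the last item (percent allclose, max abs error, MAE) can be computed along these lines. This is a hedged sketch: the function name `compare_outputs` is hypothetical, and the tolerances are assumptions, since the PR does not state which `rtol`/`atol` the e2e runner uses.

```python
def compare_outputs(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise comparison of two equal-length sequences of floats.

    The tolerance defaults are assumptions for illustration only.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    abs_err = [abs(a - e) for a, e in zip(actual, expected)]
    # Same closeness rule as the usual allclose: |a - e| <= atol + rtol * |e|
    close = [err <= atol + rtol * abs(e) for err, e in zip(abs_err, expected)]
    return {
        "allclose_pct": 100.0 * sum(close) / len(close),
        "max_abs_err": max(abs_err),
        "mae": sum(abs_err) / len(abs_err),
    }


report = compare_outputs([1.0, 2.0], [1.0, 2.5], rtol=0.0, atol=0.1)
print(report)
```

Reporting the percentage of passing elements rather than a single boolean makes partial failures such as the 93.60% and 32.78% runs mentioned later easier to track.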

Add test-generator-aten justfile recipe that invokes the new ATen
backend mode in the compiler submodule.

Bump compiler submodule to feat/generator-aten-backend.

Add build_and_run_multi_layer_test that chains N decoder layers in a
single PlenaCompiler program with pre-norm + residual architecture:

  embed_add → [rms_norm → flash_attn → residual → rms_norm → ffn → residual] × N → rms_norm

RoPE is omitted (requires precomputed Q_rot from golden intermediates,
orthogonal to testing multi-layer chaining).
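
The chain above can be written out as a small reference loop. This is a sketch under stated assumptions, not the PlenaCompiler program: `run_stack` is a hypothetical name, `embed_add` is assumed to be an element-wise add of two embeddings, the RMS norm omits any learned weight, and attention and FFN are passed in as plain callables.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMS normalization without a learned weight (simplification)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def add(a, b):
    """Element-wise add, used for embed_add and for the residuals."""
    return [p + q for p, q in zip(a, b)]


def run_stack(token_emb, pos_emb, layers):
    """Pre-norm decoder stack: each layer is an (attn, ffn) pair of callables."""
    h = add(token_emb, pos_emb)          # embed_add
    for attn, ffn in layers:
        h = add(h, attn(rms_norm(h)))    # rms_norm -> flash_attn -> residual
        h = add(h, ffn(rms_norm(h)))     # rms_norm -> ffn -> residual
    return rms_norm(h)                   # final rms_norm


def zero_block(v):
    """Placeholder sub-layer that contributes nothing to the residual."""
    return [0.0] * len(v)


out = run_stack([3.0, 4.0], [0.0, 0.0], [(zero_block, zero_block)] * 2)
print(out)
```

With zero sub-layers the residual stream passes through unchanged, which gives a quick sanity check on the chaining itself.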

Includes VRAM padding to avoid the known ffn_asm absolute-address
intermediate layout overlapping the residual scratch buffer.

Verified: 1-layer 100% allclose, 2-layer 93.60% allclose (SmolLM2-135M).

- BEHAVIOR.MATRIX_SRAM_SIZE: 1024 → 4096 (64 MRAM tiles for 3× K-split FFN)
- BEHAVIOR.VECTOR_SRAM_SIZE: 65536 → 524288 (native hidden=384 activations)
- BEHAVIOR.MAX_LOOP_INSTRUCTIONS: 10000 → 100000 (native FFN loop bodies)

- CRITICAL-1: add return after _skip_if_hf_unavailable to prevent
  unbound variable on exception fall-through
- HIGH-1: _rms_norm_ref now uses bfloat16 intermediates (rsqrt path)
  matching PLENA hardware behavior

Prevents bf16 overflow → NaN propagation in multi-layer runs.
exp(89) exceeds bf16 max → inf, and downstream -inf * 0 = NaN.
Clamping matches hardware exp unit behavior (saturate, not overflow).
Verified: test-softmax and test-flash-attention pass unchanged.
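
The overflow described above, and the effect of saturating instead, can be reproduced with a small bf16 model. This is a sketch, not the emulator code: `to_bf16`, `exp_unclamped`, `exp_clamped`, and the threshold `EXP_INPUT_MAX` are hypothetical names, and clamping the input at log(bf16 max) is an assumption about where the clamp sits.

```python
import math
import struct

# Largest finite bf16 value (bit pattern 0x7F7F), widened to a Python float.
BF16_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7F0000))[0]
# Assumed clamp threshold: the largest input whose exp() still fits in bf16.
EXP_INPUT_MAX = math.log(BF16_MAX)


def to_bf16(x):
    """Round a Python float to bf16 (round-to-nearest-even); overflow gives inf."""
    if math.isnan(x):
        return x
    try:
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
    except OverflowError:  # magnitude exceeds even the f32 range
        return math.copysign(math.inf, x)
    # Add the rounding bias, then drop the low 16 bits of the f32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def exp_unclamped(x):
    """exp() followed by bf16 rounding; large inputs overflow to inf."""
    return to_bf16(math.exp(x))


def exp_clamped(x):
    """exp() with the input clamped so the result saturates at BF16_MAX."""
    return to_bf16(math.exp(min(x, EXP_INPUT_MAX)))


print(exp_unclamped(89.0), exp_unclamped(89.0) * 0.0)
print(exp_clamped(89.0) == BF16_MAX)
```

The first print shows the inf and the NaN produced by multiplying it by zero; the clamped variant stays finite, so later multiplications by zero yield zero.
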

Promote vector add/sub/mul/reciprocal operands to f32 before
arithmetic, then quantize back to the original data type. This matches
the golden precision model and eliminates NaN accumulation across
many-layer runs (5-layer: 100% allclose, vision encoder: 99.95%).

- Remove mimalloc dependency (Cargo.lock cleanup)
- Bump compiler submodule (load_toml_config fix, PlenaFrontend e2e,
  layer progress markers, bf16 exp overflow tests)
- Add docs/vlm-support.md covering ATen vs Generator pipelines,
  SmolVLM2 test inventory, and new vision ops/templates
- Add tools/run_emulator_tracked.sh for progress-monitored runs

Promote scalar FP registers (fp_reg) and FPSRAM from f16 to f32.
This fixes softmax underflow when attention scores have wide spread
(e.g., SmolVLM2-256M where QK^T values reach ~2600).

The RTL uses custom e6m5 (12-bit) format, the TOML declares e8m7,
and the emulator previously used IEEE f16 (1+5+10) — a three-way
mismatch. Using f32 for the emulator's scalar path avoids precision
loss entirely, matching the vector ops pattern (promote to f32,
quantize on VRAM write).

Also promotes vector add/sub/mul/reciprocal to f32 before arithmetic
to match the golden precision model (100% allclose on clm-60m 1/5/10
layers).
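
The "promote to f32, compute, quantize back" pattern can be sketched as below. This is an illustration under assumptions: the storage type is taken to be bf16, `vec_op_f32` is a hypothetical helper rather than an emulator function, and inputs are assumed finite and within f32 range. Python floats are doubles, so `round_f32` models the f32 result of each operation.

```python
import operator
import struct


def round_f32(x):
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_bf16(x):
    """Round an f32-representable value to bf16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def vec_op_f32(op, a, b):
    """Apply op element-wise in f32, then quantize each result to bf16."""
    return [round_bf16(round_f32(op(x, y))) for x, y in zip(a, b)]


sums = vec_op_f32(operator.add, [1.0, 1.0], [2.0, 2.0 ** -10])
products = vec_op_f32(operator.mul, [1.5], [2.0])
print(sums, products)
```

The second sum shows the quantization step: 1 + 2^-10 is exact in f32 but rounds back to 1.0 when stored as bf16, so precision is lost only once, at the write.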

Status: clm-60m passes at 100% allclose. SmolVLM2 no longer produces
NaN (was 0%, now 32.78%) but has a separate ISA accuracy issue at
hidden=576 that needs further investigation.

…ests

- Remove all SCALE debug instrumentation (was 7x slowdown)
- Remove CONFIG startup print
- Fix default MXFP8 scale sign: true→false in load_config.rs defaults
  (defensive fix, TOML loads correctly regardless)
- Add unit tests for e8m0 scale decode and MXFP8 round-trip
- Compiler submodule: fix K-split temp buffer overlap in linear_ops.py,
  add _ksplit_matmul to FFN golden reference
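
For readers unfamiliar with the e8m0 scale mentioned above, a decode can be sketched as follows. It assumes the common MX convention (8 exponent bits, no sign, no mantissa, bias 127, 0xFF reserved for NaN); whether the emulator's decode matches this exactly is not stated in the PR, and `decode_e8m0` is a hypothetical name.

```python
import math

E8M0_BIAS = 127   # assumed exponent bias
E8M0_NAN = 0xFF   # assumed NaN encoding


def decode_e8m0(byte):
    """Decode an unsigned 8-bit e8m0 scale into a power-of-two float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("e8m0 scale must fit in one byte")
    if byte == E8M0_NAN:
        return math.nan
    # No sign bit and no mantissa: the value is exactly 2**(byte - bias).
    return math.ldexp(1.0, byte - E8M0_BIAS)


print(decode_e8m0(127), decode_e8m0(130), decode_e8m0(0))
```

Because the format has no sign bit, a default that treats the scale as signed would misread the top bit, which is what the sign true→false fix guards against.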

Pending verification: full SmolVLM2 1-layer test running in background
(~80 min). The linear_ops K-split fix + golden precision fix should
improve from 32.78% allclose.

Feedback comparison using the preserved vram_dump confirms:
when the golden reference uses the emulator's own intermediate
values as input, the FFN+norm output matches at 100% allclose.

The 32.62% failure is ENTIRELY from golden reference precision
drift — the golden chain accumulates float32 differently than
the hardware's BF16 intermediates.

The emulator correctly implements:
- MXFP8 scale application ✓
- Linear projections (K-split, stride mode) ✓
- RMS norm at any magnitude ✓
- Flash attention (per-head, with causal mask) ✓
- FFN (gate+up+down, 576→1536→576) ✓
- Residual adds ✓
- f32 scalar FP registers ✓

Fix needed: align golden reference to exactly match hardware
precision at each intermediate step (not just BF16 truncation
after ops, but also K-split accumulation order and exact
matmul precision boundaries).
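
Why accumulation order and precision boundaries matter can be seen in a toy K-split dot product. This is a sketch only: `dot_ksplit` is a hypothetical helper, and quantizing each partial sum and the running accumulator is an assumption about where the hardware rounds, not a description of the actual datapath.

```python
def dot_ksplit(a, b, k_chunk, quantize=lambda v: v):
    """Dot product accumulated chunk by chunk along K.

    quantize models a precision boundary applied to every partial sum
    and to the running accumulator (identity by default).
    """
    acc = 0.0
    for start in range(0, len(a), k_chunk):
        partial = sum(
            x * y
            for x, y in zip(a[start:start + k_chunk], b[start:start + k_chunk])
        )
        acc = quantize(acc + quantize(partial))
    return acc


def coarse(v):
    """Deliberately coarse quantizer: round to a multiple of 4."""
    return 4.0 * round(v / 4.0)


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 1.0, 1.0, 1.0]
exact = dot_ksplit(a, b, k_chunk=2)
split_quantized = dot_ksplit(a, b, k_chunk=2, quantize=coarse)
end_quantized = coarse(exact)
print(exact, split_quantized, end_quantized)
```

With the coarse quantizer, rounding per chunk and rounding once at the end give different answers for the same inputs, which is the kind of drift a golden reference has to mirror step by step.
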
… comments

- CRITICAL: fix all 4 default MXFP8 scale types to sign:false (e8m0 is unsigned)
- HIGH: promote V_EXP_V to f32 (consistent with add/sub/mul/reciprocal)
- HIGH: fix indentation on vec_i_f32 lines (extra 4 spaces)
- HIGH: document S_RECI_FP div-by-zero behavior (IEEE infinity, intentional)
- MEDIUM: remove unconditional println! in mv() (was debug noise)
- LOW: clean up test comment scratch notes in dtype.rs

…ackend

# Conflicts:
#	compiler
#	transactional_emulator/src/main.rs

Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16
(1+8+7), matching the declared TOML precision (e8m7).

IEEE f16's 5-bit exponent (smallest positive value ~6e-8) causes softmax underflow for
models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit
exponent (smallest positive normal ~1.2e-38) prevents this while staying closer to the
RTL's e6m5 format than f32 would.
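
The underflow argument above can be checked numerically. This is a small sketch using Python's `struct` half-precision format for f16 and a hand-rolled bf16 rounding helper; the score gap of 20 is an arbitrary illustrative value, not one taken from SmolVLM2.

```python
import math
import struct


def round_f16(x):
    """Round to IEEE f16 via struct's 'e' format (tiny values flush to 0)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x):
    """Round an f32-range value to bf16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A softmax term for a score 20 below the row maximum: exp(-20) ~ 2.06e-9.
term = math.exp(-20.0)
term_f16 = round_f16(term)
term_bf16 = round_bf16(term)
print(term, term_f16, term_bf16)
```

The f16 result is exactly zero while the bf16 result keeps a small positive value, so terms far from the row maximum still contribute in bf16.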

Results:
  f16 regs:  SmolVLM2 = 0% (all NaN)
  bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
  clm-60m:   unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.
@booth-algo booth-algo merged commit 80f165b into main May 13, 2026
2 of 3 checks passed
@booth-algo booth-algo deleted the feat/generator-aten-backend branch May 13, 2026 20:12