chore: wire slimmed PLENA compiler cleanup #34
Merged
Add test-generator-aten justfile recipe that invokes the new ATen backend mode in the compiler submodule. Bump compiler submodule to feat/generator-aten-backend.
Add build_and_run_multi_layer_test that chains N decoder layers in a single PlenaCompiler program with pre-norm + residual architecture:

embed_add → [rms_norm → flash_attn → residual → rms_norm → ffn → residual] × N → rms_norm

RoPE is omitted (it requires precomputed Q_rot from golden intermediates and is orthogonal to testing multi-layer chaining). Includes VRAM padding to keep the known ffn_asm absolute-address intermediate layout from overlapping the residual scratch buffer.

Verified: 1-layer 100% allclose, 2-layer 93.60% allclose (SmolLM2-135M).
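The chaining order described above can be written out as a flat op list. This is only an illustrative sketch: `decoder_op_sequence` and the `L{i}.` name prefixes are hypothetical and are not part of PlenaCompiler or the test harness.

```python
from typing import List


def decoder_op_sequence(num_layers: int) -> List[str]:
    """Return the op order for N chained pre-norm decoder layers (RoPE omitted)."""
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    ops = ["embed_add"]
    for i in range(num_layers):
        # Attention sub-block: norm, attention, then residual add.
        ops += [f"L{i}.rms_norm", f"L{i}.flash_attn", f"L{i}.residual"]
        # FFN sub-block: norm, FFN, then residual add.
        ops += [f"L{i}.rms_norm", f"L{i}.ffn", f"L{i}.residual"]
    ops.append("rms_norm")  # final norm after the last layer
    return ops
```

For two layers this yields 14 ops: one `embed_add`, six ops per layer, and the final `rms_norm`.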
- BEHAVIOR.MATRIX_SRAM_SIZE: 1024 → 4096 (64 MRAM tiles for 3× K-split FFN)
- BEHAVIOR.VECTOR_SRAM_SIZE: 65536 → 524288 (native hidden=384 activations)
- BEHAVIOR.MAX_LOOP_INSTRUCTIONS: 10000 → 100000 (native FFN loop bodies)
- CRITICAL-1: add return after _skip_if_hf_unavailable to prevent unbound variable on exception fall-through
- HIGH-1: _rms_norm_ref now uses bfloat16 intermediates (rsqrt path) matching PLENA hardware behavior
Prevents bf16 overflow → NaN propagation in multi-layer runs. exp(89) exceeds bf16 max → inf, and downstream -inf * 0 = NaN. Clamping matches hardware exp unit behavior (saturate, not overflow). Verified: test-softmax and test-flash-attention pass unchanged.
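The failure mode and the saturating fix can be reproduced in a few lines of Python. This is a sketch under stated assumptions: bf16 is emulated by rounding the upper 16 bits of an IEEE f32, and the helper names and the choice of ln(bf16 max) as the clamp threshold are illustrative, not the emulator's actual code or constants.

```python
import math
import struct

# Largest finite bf16 value: (2 - 2**-7) * 2**127, about 3.39e38.
BF16_MAX = float.fromhex("0x1.fep127")
# Assumed clamp threshold: the largest input whose exp still fits in bf16.
EXP_INPUT_MAX = math.log(BF16_MAX)


def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value (ties to even), returned as a float."""
    if math.isnan(x):
        return math.nan
    try:
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
    except OverflowError:  # finite, but beyond the f32 range
        return math.copysign(math.inf, x)
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def exp_bf16_unclamped(x: float) -> float:
    """exp followed by bf16 rounding; overflows to inf for large x."""
    return to_bf16(math.exp(x))


def exp_bf16_clamped(x: float) -> float:
    """exp that saturates at BF16_MAX instead of overflowing to inf."""
    return to_bf16(min(math.exp(min(x, EXP_INPUT_MAX)), BF16_MAX))
```

With these helpers, `exp_bf16_unclamped(89.0)` is `inf` and multiplying it by `0.0` gives NaN, while `exp_bf16_clamped(89.0)` stays at `BF16_MAX` and multiplies cleanly to `0.0`.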
Promote vector add/sub/mul/reciprocal operands to f32 before arithmetic, then quantize back to the original data type. This matches the golden precision model and eliminates NaN accumulation across many-layer runs (5-layer: 100% allclose, vision encoder: 99.95%).

- Remove mimalloc dependency (Cargo.lock cleanup)
- Bump compiler submodule (load_toml_config fix, PlenaFrontend e2e, layer progress markers, bf16 exp overflow tests)
- Add docs/vlm-support.md covering ATen vs Generator pipelines, SmolVLM2 test inventory, and new vision ops/templates
- Add tools/run_emulator_tracked.sh for progress-monitored runs
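The promote-then-quantize pattern can be sketched in Python as follows. The function names are hypothetical, bf16 stands in for "the original data type", and the emulator itself is not written this way; the point is only that arithmetic happens at f32 and the result is rounded once on the way out.

```python
import operator
import struct
from typing import Callable, List, Sequence


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite, in-range value to bf16 (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def vec_binary_f32(
    op: Callable[[float, float], float],
    a: Sequence[float],
    b: Sequence[float],
) -> List[float]:
    """Apply op element-wise at f32 precision, then quantize each result once."""
    out = []
    for x, y in zip(a, b):
        wide = to_f32(op(to_f32(x), to_f32(y)))  # arithmetic at f32
        out.append(to_bf16(wide))  # single quantization step
    return out


def vec_reciprocal_f32(a: Sequence[float]) -> List[float]:
    """Element-wise reciprocal at f32 precision, quantized once to bf16."""
    return [to_bf16(to_f32(1.0 / to_f32(x))) for x in a]
```

For example, `vec_binary_f32(operator.mul, [1.5, 3.0], [2.0, 0.5])` returns `[3.0, 1.5]`, and `vec_reciprocal_f32([3.0])` returns `[0.333984375]`, the nearest bf16 value to 1/3.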
Promote scalar FP registers (fp_reg) and FPSRAM from f16 to f32. This fixes softmax underflow when attention scores have a wide spread (e.g., SmolVLM2-256M, where QK^T values reach ~2600).

The RTL uses a custom e6m5 (12-bit) format, the TOML declares e8m7, and the emulator previously used IEEE f16 (1+5+10): a three-way mismatch. Using f32 for the emulator's scalar path avoids precision loss entirely, matching the vector ops pattern (promote to f32, quantize on VRAM write).

Also promotes vector add/sub/mul/reciprocal to f32 before arithmetic to match the golden precision model (100% allclose on clm-60m 1/5/10 layers).

Status: clm-60m passes at 100% allclose. SmolVLM2 no longer produces NaN (was 0%, now 32.78%) but has a separate ISA accuracy issue at hidden=576 that needs further investigation.
…ests

- Remove all SCALE debug instrumentation (was 7x slowdown)
- Remove CONFIG startup print
- Fix default MXFP8 scale sign: true→false in load_config.rs defaults (defensive fix, TOML loads correctly regardless)
- Add unit tests for e8m0 scale decode and MXFP8 round-trip
- Compiler submodule: fix K-split temp buffer overlap in linear_ops.py, add _ksplit_matmul to FFN golden reference

Pending verification: full SmolVLM2 1-layer test running in background (~80 min). The linear_ops K-split fix + golden precision fix should improve from 32.78% allclose.
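For context on why the scale type carries no sign bit, here is a minimal sketch of e8m0 decoding as defined by the OCP Microscaling (MX) format: an 8-bit biased exponent with bias 127, no sign, no mantissa, and 0xFF reserved for NaN. The function names are hypothetical, and it is an assumption that the emulator follows the OCP encoding exactly.

```python
import math

E8M0_BIAS = 127
E8M0_NAN = 0xFF


def decode_e8m0(code: int) -> float:
    """Decode an e8m0 scale byte to 2**(code - 127); 0xFF decodes to NaN."""
    if not 0 <= code <= 0xFF:
        raise ValueError("e8m0 code must fit in one byte")
    if code == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, code - E8M0_BIAS)


def encode_e8m0(scale: float) -> int:
    """Encode a positive power-of-two scale as an e8m0 byte."""
    if not (scale > 0.0 and math.isfinite(scale)):
        raise ValueError("scale must be positive and finite")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    if mantissa != 0.5:
        raise ValueError("scale must be an exact power of two")
    code = (exponent - 1) + E8M0_BIAS
    if not 0 <= code < E8M0_NAN:
        raise ValueError("scale is outside the e8m0 range")
    return code
```

Every decoded value is a positive power of two, so there is nothing for a sign bit to represent. A round-trip over all 255 non-NaN codes returns each code unchanged.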
Feedback comparison using the preserved vram_dump confirms: when the golden reference uses the emulator's own intermediate values as input, the FFN+norm output matches at 100% allclose. The 32.62% failure is ENTIRELY from golden reference precision drift: the golden chain accumulates float32 differently than the hardware's BF16 intermediates.

The emulator correctly implements:

- MXFP8 scale application ✓
- Linear projections (K-split, stride mode) ✓
- RMS norm at any magnitude ✓
- Flash attention (per-head, with causal mask) ✓
- FFN (gate+up+down, 576→1536→576) ✓
- Residual adds ✓
- f32 scalar FP registers ✓

Fix needed: align golden reference to exactly match hardware precision at each intermediate step (not just BF16 truncation after ops, but also K-split accumulation order and exact matmul precision boundaries).
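A small sketch shows why the position of the rounding boundaries matters, even when both sides do the same mathematical dot product. Everything here is an assumption for illustration: the function names are hypothetical, Python's f64 stands in for a wide accumulator, and the real K-split scheme in the hardware and in `_ksplit_matmul` may round at different points.

```python
import struct
from typing import Sequence


def to_bf16(x: float) -> float:
    """Round a finite, in-range value to bf16 (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot_wide(x: Sequence[float], w: Sequence[float]) -> float:
    """Accumulate the whole dot product at wide precision, then round once."""
    return to_bf16(sum(a * b for a, b in zip(x, w)))


def dot_ksplit_bf16(x: Sequence[float], w: Sequence[float], k_splits: int) -> float:
    """Split K into chunks; round each partial and the running sum to bf16."""
    if len(x) != len(w) or len(x) % k_splits != 0:
        raise ValueError("lengths must match and divide evenly by k_splits")
    chunk = len(x) // k_splits
    acc = 0.0
    for start in range(0, len(x), chunk):
        xs, ws = x[start:start + chunk], w[start:start + chunk]
        partial = to_bf16(sum(a * b for a, b in zip(xs, ws)))
        acc = to_bf16(acc + partial)  # bf16 boundary after every chunk
    return acc
```

With `x = [256.0, 1.0, 1.0, 1.0]` and unit weights, `dot_wide` returns 260.0 while `dot_ksplit_bf16(x, w, 2)` returns 258.0. The inputs are identical and both results are valid bf16 values, yet they differ, which is the kind of drift a golden reference has to reproduce step by step.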
… comments

- CRITICAL: fix all 4 default MXFP8 scale types to sign:false (e8m0 is unsigned)
- HIGH: promote V_EXP_V to f32 (consistent with add/sub/mul/reciprocal)
- HIGH: fix indentation on vec_i_f32 lines (extra 4 spaces)
- HIGH: document S_RECI_FP div-by-zero behavior (IEEE infinity, intentional)
- MEDIUM: remove unconditional println! in mv() (was debug noise)
- LOW: clean up test comment scratch notes in dtype.rs
…ackend

# Conflicts:
#   compiler
#   transactional_emulator/src/main.rs
# Conflicts:
#   PLENA_Compiler
Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16 (1+8+7), matching the declared TOML precision (e8m7).

IEEE f16's 5-bit exponent (range ~6e-8) causes softmax underflow for models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit exponent (range ~1.2e-38) prevents this while staying closer to the RTL's e6m5 format than f32 would.

Results:

- f16 regs: SmolVLM2 = 0% (all NaN)
- bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
- clm-60m: unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.
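The range difference is easy to check with Python's `struct` module, which supports IEEE half precision through the `e` format. The sketch below is illustrative only: the helper names are hypothetical, bf16 is emulated by rounding the top 16 bits of an f32, and the clamp shown for the reciprocal is an assumed reading of "MIN_POSITIVE clamp" (the smallest positive normal f32, 2**-126), not the emulator's code.

```python
import math
import struct

# Smallest positive normal f32, which is also the smallest positive normal bf16.
MIN_POSITIVE = 2.0 ** -126


def to_f16(x: float) -> float:
    """Round to IEEE f16; values below the subnormal range flush to zero."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite, in-range value to bf16 (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def reci_clamped(x: float) -> float:
    """Reciprocal with the denominator's magnitude clamped to MIN_POSITIVE."""
    if abs(x) < MIN_POSITIVE:
        x = math.copysign(MIN_POSITIVE, x)  # keeps the sign of a signed zero
    return to_bf16(1.0 / x)
```

A value such as `1e-10` becomes `0.0` under `to_f16` but stays nonzero under `to_bf16`, so a later reciprocal or normalization does not divide by zero. With the clamp, `reci_clamped(0.0)` returns the finite value `2.0 ** 126` rather than infinity.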
Summary
- [...] `PLENA_Compiler` submodule to the slimmed compiler cleanup branch (2af064f), tracked in refactor: slim PLENA compiler frontend and codegen PLENA_Compiler#38.
- [...] `compiler.generator` to `compiler.aten.e2e_runner` so the ATen path no longer appears to be generator-owned.
- `compiler.generator.aten_runner` and `just test-generator-aten` now forward to the canonical ATen runner.
- [...] `just test-aten-e2e` and `python -m compiler.aten.e2e_runner`.
- [...] `compiler.asm_templates.flashattn` and `create_mem_for_sim`.
- [...] `main` so this PR is scoped after the already-merged emulator precision/config fix in fix(emulator): bf16 scalar FP registers + fp32 vector ops #32.

Validation

- `git diff --check` in `PLENA_Compiler`
- `git diff --check` in `PLENA_Simulator`
- [...] (`build_sim_env`, `tile_compiler`, old `flash_attn_asm` import path, old `compiler/...` file paths, `DSL`, stale generator-owned ATen runner references)
- `python -m compiler.aten.e2e_runner --help`
- `python -m generator.aten_runner --help` compatibility alias
- `asm_templates/tests/test_vram_sub_projection.py` - 5 passed
- `aten/tests/test_plena_compiler.py` - 13 passed
- `asm_templates/tests/test_large_immediate.py` - 23 passed
- `clm-60m` 2-layer ATen/e2e passed before the alias-only runner move: 100% allclose, max abs error 0.171875, MAE 2.233887e-02