fix(emulator): bf16 scalar FP registers (matches TOML e8m7 spec) #36
Merged
booth-algo added a commit that referenced this pull request on May 13, 2026
1f99faa to bb4f431
Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16 (1+8+7), matching the declared TOML precision (e8m7).

IEEE f16's 5-bit exponent (range ~6e-8) causes softmax underflow for models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit exponent (range ~1.2e-38) prevents this while staying closer to the RTL's e6m5 format than f32 would.

Results:
  f16 regs: SmolVLM2 = 0% (all NaN)
  bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
  clm-60m: unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.
35966b3 to ad5d14e
ad5d14e to 885c3fc
Summary
This PR keeps the simulator `main.rs` behavior aligned with `main`, then applies the intended scalar FP format change:
- `main.rs` changes from the earlier PR lineage
- `f16` to `bf16`
- `PLENA_Compiler` at compiler `main` after "fix: validate final layer in VRAM stage compare" PLENA_Compiler#41 (9c12f0b)

Compiler #41 only updates the VRAM-stage validation helper so that multi-layer native runs can validate the final decoder layer instead of being stuck on layer 0.

Problem
The emulator previously used IEEE f16 for scalar FP registers and FPSRAM. That format has a 5-bit exponent and is too narrow for SmolVLM2 softmax scalar state: values like `exp(S - max)` can underflow to zero, which can make the reciprocal become `Inf` and then propagate NaNs.
It also did not match the TOML precision intent for scalar FP state (e8m7).
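
The underflow described above can be reproduced with a small, standard-library-only sketch. The helpers `round_to_f16` and `round_to_bf16` are hypothetical names used for illustration; they are not the emulator's code, which presumably relies on its own `f16`/`bf16` types. The f16 helper rounds ties away from zero for brevity, which differs from IEEE round-to-nearest-even only on exact ties.

```rust
/// Quantize an f32 to the nearest IEEE f16 value (1+5+10), returned as f32.
/// Simplification: ties round away from zero instead of to even.
fn round_to_f16(x: f32) -> f32 {
    if !x.is_finite() || x == 0.0 {
        return x;
    }
    let a = x.abs();
    // Unbiased f32 exponent, clamped to f16's smallest normal exponent (-14)
    // so that everything below 2^-14 lands on the subnormal grid.
    let e = (((a.to_bits() >> 23) & 0xff) as i32 - 127).max(-14);
    // Spacing between adjacent f16 values in this binade (10 mantissa bits).
    let quantum = 2f32.powi(e - 10);
    let q = (a / quantum).round() * quantum;
    // Anything that rounds past the largest finite f16 (65504) overflows.
    let q = if q > 65504.0 { f32::INFINITY } else { q };
    q.copysign(x)
}

/// Round an f32 to the nearest bf16 value (1+8+7), returned as f32.
/// Uses round-to-nearest-even on the upper 16 bits of the f32 encoding.
fn round_to_bf16(x: f32) -> f32 {
    if x.is_nan() {
        return f32::NAN;
    }
    let bits = x.to_bits();
    let lsb = (bits >> 16) & 1;
    f32::from_bits(bits.wrapping_add(0x7fff + lsb) & 0xffff_0000)
}
```

For a score 20 below the row maximum, `exp(-20)` is about 2.1e-9. That is far below f16's smallest subnormal (about 6e-8), so `round_to_f16` returns zero and a following reciprocal yields `Inf`, while `round_to_bf16` keeps a nonzero value.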
Emulator Changes
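
As a rough illustration of the changes listed in this section, the sketch below shows one way the preload conversion and the two clamps could look using only the standard library. All function names (`preload_fpsram`, `s_exp_fp`, `s_reci_fp`, and the bit helpers) are hypothetical, the little-endian layout of `fp_sram.bin` is an assumption, and the exact form of the `S_RECI_FP` clamp in `main.rs` is not shown in this PR text. bf16 values are stored as raw `u16` bits here.

```rust
/// Round an f32 to bf16 (round-to-nearest-even) and return the raw 16 bits.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // keep the NaN quiet
    }
    let lsb = (bits >> 16) & 1;
    (bits.wrapping_add(0x7fff + lsb) >> 16) as u16
}

/// Widen raw bf16 bits back to f32 (exact: bf16 is the top half of an f32).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Decode raw IEEE f16 bits (1+5+10) into f32.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = if h & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((h >> 10) & 0x1f) as i32;
    let mant = (h & 0x3ff) as f32;
    let mag = match exp {
        0 => mant * 2f32.powi(-24), // zero and subnormals
        31 => if mant == 0.0 { f32::INFINITY } else { f32::NAN },
        _ => (1.0 + mant / 1024.0) * 2f32.powi(exp - 15),
    };
    sign * mag
}

/// Assumed preload path: f16 words on disk (little-endian is an assumption)
/// are converted to bf16 bits in memory.
fn preload_fpsram(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|c| f32_to_bf16_bits(f16_bits_to_f32(u16::from_le_bytes([c[0], c[1]]))))
        .collect()
}

/// Exponential with the [-88, 88] input clamp, result rounded to bf16.
fn s_exp_fp(x: f32) -> f32 {
    bf16_bits_to_f32(f32_to_bf16_bits(x.clamp(-88.0, 88.0).exp()))
}

/// Reciprocal with a MIN_POSITIVE clamp on the operand (assumed form).
/// bf16 shares f32's 8-bit exponent, so its smallest positive normal value
/// equals f32::MIN_POSITIVE (2^-126).
fn s_reci_fp(x: f32) -> f32 {
    let d = if x.abs() < f32::MIN_POSITIVE {
        f32::MIN_POSITIVE.copysign(x)
    } else {
        x
    };
    bf16_bits_to_f32(f32_to_bf16_bits(1.0 / d))
}
```

With this shape, an operand that has underflowed to zero produces a large finite reciprocal rather than `Inf`, which is the defense-in-depth role the commit message attributes to the `S_RECI_FP` clamp.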
- `fp_reg: [f16; 8]` -> `fp_reg: [bf16; 8]`
- `fpsram: Vec<f16>` -> `fpsram: Vec<bf16>`
- `bf16::from_f32(...)`
- `fp_sram.bin` remains f16-encoded on disk for compatibility and is converted to bf16 on preload
- `S_EXP_FP` keeps the `[-88, 88]` clamp to avoid exponential overflow

Validation
Native SmolVLM2 2-layer emulator run:
- `HuggingFaceTB/SmolVLM2-256M-Video-Instruct`
- `seq_len=64`, `num_layers=2`, `hidden=576`, `inter=1536`
- `106,726` lines
- `0`
- `36,864`
- `nan_count=0`, `inf_count=0`
- `-13.9375 / 21.625 / 0.0407127`

Comparison results:
The standard final-output comparison is not the correctness gate for native SmolVLM2 because PyTorch and PLENA `M_MM` accumulate matmuls in different orders and tile chunks, and the drift compounds over the full chain.
The authoritative check is the VRAM-in-the-loop stage comparison: read emulator VRAM at the stage boundary, compute the next-stage golden from that exact input, and compare that segment.
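
The stage comparison described above can be sketched as follows. This is only an illustration in Rust (the real helper lives in the compiler repository and is not shown here); `check_stage`, `allclose_fraction`, `mse`, and the tolerance parameters are hypothetical names, and the PR does not state which tolerances are actually used.

```rust
/// Fraction of elements satisfying |actual - golden| <= atol + rtol * |golden|.
/// NaN elements never satisfy the inequality, so they count as mismatches.
fn allclose_fraction(actual: &[f32], golden: &[f32], rtol: f32, atol: f32) -> f32 {
    assert_eq!(actual.len(), golden.len());
    assert!(!actual.is_empty());
    let close = actual
        .iter()
        .zip(golden)
        .filter(|(a, g)| (**a - **g).abs() <= atol + rtol * g.abs())
        .count();
    close as f32 / actual.len() as f32
}

/// Mean squared error, accumulated in f64.
fn mse(actual: &[f32], golden: &[f32]) -> f64 {
    assert_eq!(actual.len(), golden.len());
    assert!(!actual.is_empty());
    let sum: f64 = actual
        .iter()
        .zip(golden)
        .map(|(a, g)| {
            let d = *a as f64 - *g as f64;
            d * d
        })
        .sum();
    sum / actual.len() as f64
}

/// Compare one stage in isolation: the golden output is computed from the
/// emulator's own stage input, so drift from earlier stages is excluded.
fn check_stage<F>(vram_in: &[f32], vram_out: &[f32], golden_fn: F, rtol: f32, atol: f32) -> (f32, f64)
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let golden = golden_fn(vram_in);
    (allclose_fraction(vram_out, &golden, rtol, atol), mse(vram_out, &golden))
}
```

Because each stage's golden is recomputed from the emulator's actual input, a mismatch points at that stage alone rather than at accumulated matmul-order drift from earlier in the chain.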
Final-layer VRAM-stage comparison for the 2-layer run:
- `layer_idx=1`
- `O_proj + residual`: 100.00% allclose, MSE `4.18e-06`
- `norm + FFN + residual + final_norm`: 100.00% allclose, MSE `2.61e-05`

Test Plan
- `main.rs` is scoped to the strict revert-plus-`f16`-to-`bf16` change
- `py_compile` for updated compiler VRAM-stage helper