fix(emulator): bf16 scalar FP registers (matches TOML e8m7 spec) by booth-algo · Pull Request #36 · AICrossSim/PLENA_Simulator

booth-algo · 2026-05-13T20:07:27Z

Summary

This PR keeps the simulator main.rs behavior aligned with main, then applies the intended scalar FP format change:

revert prior unrelated main.rs changes from the earlier PR lineage
change scalar FP architectural state from IEEE f16 to bf16
preserve the existing instruction behavior and file formats unless needed for the scalar format conversion
point the PLENA_Compiler submodule at compiler main after PLENA_Compiler#41 ("fix: validate final layer in VRAM stage compare", 9c12f0b)

Compiler #41 only updates the VRAM-stage validation helper so multi-layer native runs can validate the final decoder layer instead of being stuck on layer 0.

Problem

The emulator previously used IEEE f16 for scalar FP registers and FPSRAM. That format has a 5-bit exponent and is too narrow for SmolVLM2 softmax scalar state: values like exp(S - max) can underflow to zero, which can make the reciprocal become Inf and then propagate NaNs.
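
The failure mode can be reproduced outside the emulator. The sketch below is illustrative Python, not the emulator's Rust code: `to_f16`, `to_bf16`, and `ieee_reciprocal` are hypothetical helpers, and the score gap of -20 is an assumed value chosen so that exp(gap) falls below f16's smallest subnormal (2^-24, about 6e-8). Only finite inputs are handled.

```python
import math
import struct


def to_f16(x: float) -> float:
    """Round to IEEE binary16 and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def ieee_reciprocal(x: float) -> float:
    """1/x with IEEE semantics for zero (Python would raise instead)."""
    return math.inf if x == 0.0 else 1.0 / x


gap = -20.0  # assumed value of S - max, for illustration only

e_f16 = to_f16(math.exp(gap))    # underflows to 0.0
e_bf16 = to_bf16(math.exp(gap))  # stays a small positive number

r_f16 = ieee_reciprocal(e_f16)             # inf
r_bf16 = to_bf16(ieee_reciprocal(e_bf16))  # large but finite

poisoned = e_f16 * r_f16   # 0 * inf = NaN
healthy = e_bf16 * r_bf16  # close to 1.0

print(e_f16, r_f16, poisoned)
print(e_bf16, r_bf16, healthy)
```

The bf16 path loses mantissa precision (7 bits instead of 10) but keeps the float32 exponent range, which is what this scalar state needs.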

It also did not match the TOML precision intent for scalar FP state:

Storage	TOML intent	Old emulator
Scalar FP regs	e8m7 / bf16-shaped range	IEEE f16
FPSRAM	scalar FP storage	IEEE f16
Vector SRAM	e8m7 via QuantTensor	bf16 path

Emulator Changes

fp_reg: [f16; 8] -> fp_reg: [bf16; 8]
fpsram: Vec<f16> -> fpsram: Vec<bf16>
scalar FP op results are stored through bf16::from_f32(...)
vector reductions write scalar results as bf16
fp_sram.bin remains f16-encoded on disk for compatibility and is converted to bf16 on preload
S_EXP_FP keeps the [-88, 88] clamp to avoid exponential overflow
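
As a rough model of these changes, the Python sketch below stores scalar state as 16-bit bf16 patterns, converts an f16-encoded image on preload, and clamps the exponent input. It is not the emulator's code: the function names are hypothetical, the little-endian layout of `fp_sram.bin` is an assumption, and NaN inputs are not handled.

```python
import math
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Encode a finite value as a bf16 bit pattern (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Decode a bf16 bit pattern by padding the low 16 bits with zeros."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def preload_fpsram(raw: bytes) -> list:
    """Read f16 words (assumed little-endian) and re-encode each as bf16."""
    n = len(raw) // 2
    values = struct.unpack(f"<{n}e", raw[: 2 * n])
    return [f32_to_bf16_bits(v) for v in values]


def s_exp_fp(x_bits: int) -> int:
    """exp() on a bf16 operand with the [-88, 88] input clamp."""
    x = bf16_bits_to_f32(x_bits)
    x = max(-88.0, min(88.0, x))
    return f32_to_bf16_bits(math.exp(x))


# A three-word f16 image standing in for fp_sram.bin.
raw = struct.pack("<3e", 1.0, 0.5, -2.0)
fpsram = preload_fpsram(raw)
print([hex(w) for w in fpsram])

# Without the clamp, exp(1000.0) would overflow; with it the result is finite.
big = s_exp_fp(f32_to_bf16_bits(1000.0))
print(bf16_bits_to_f32(big))
```

Every f16 value is exactly representable in float32, so the preload conversion only loses the three mantissa bits that bf16 does not have.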

Validation

Native SmolVLM2 2-layer emulator run:

model: HuggingFaceTB/SmolVLM2-256M-Video-Instruct
native dims: seq_len=64, num_layers=2, hidden=576, inter=1536
machine code: 106,726 lines
emulator exit code: 0
decoded outputs: 36,864
finite output: true
nan_count=0, inf_count=0
output min/max/mean: -13.9375 / 21.625 / 0.0407127

Comparison results:

Scalar format	SmolVLM2 standard compare	SmolVLM2 VRAM-stage compare	clm-60m
f16 baseline	0% (all NaN)	0%	100%
f16 + clamp	25% (partial NaN)	19.9%	not rerun
bf16, this PR	32.87-36.95% (expected drift)	100%	100%

The standard final-output comparison is not the correctness gate for native SmolVLM2 because PyTorch and PLENA M_MM accumulate matmuls in different orders/tile chunks, and drift compounds over the full chain.
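
A toy example of why accumulation order matters at low precision is sketched below. It assumes, purely for illustration, that every partial sum is rounded to bf16; the actual accumulator precision of M_MM is not stated here, and `to_bf16`, `bf16_sum`, and `tiled_bf16_sum` are hypothetical helpers.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 (round to nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values) -> float:
    """Left-to-right accumulation, rounding after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return acc


def tiled_bf16_sum(values, tile: int) -> float:
    """Sum each tile separately, then sum the per-tile partials."""
    partials = [bf16_sum(values[i:i + tile]) for i in range(0, len(values), tile)]
    return bf16_sum(partials)


values = [256.0, 1.0, 1.0, 1.0]  # exact sum is 259

sequential = bf16_sum(values)      # each +1 is rounded away next to 256
tiled = tiled_bf16_sum(values, 2)  # the 1 + 1 pair survives as 2

print(sequential, tiled)
```

Both results are valid roundings of the same dot product, so a mismatch of this kind at the final output does not by itself indicate a bug.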

The authoritative check is the VRAM-in-the-loop stage comparison: read emulator VRAM at the stage boundary, compute the next-stage golden from that exact input, and compare that segment.
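
The stage comparison can be sketched as follows. This is a simplified stand-in for the compiler's VRAM-stage helper, not its actual code: `stage_compare` and `allclose_fraction` are hypothetical names, the tolerances are placeholder values, and the doubling function stands in for a real stage such as O_proj + residual.

```python
from typing import Callable, Sequence, Tuple


def allclose_fraction(actual: Sequence[float], expected: Sequence[float],
                      rtol: float, atol: float) -> float:
    """Fraction of elements with |a - e| <= atol + rtol * |e| (NaN never passes)."""
    ok = sum(1 for a, e in zip(actual, expected)
             if abs(a - e) <= atol + rtol * abs(e))
    return ok / len(expected)


def stage_compare(stage_input: Sequence[float],
                  stage_output: Sequence[float],
                  golden_fn: Callable[[Sequence[float]], Sequence[float]],
                  rtol: float = 1e-2,
                  atol: float = 1e-2) -> Tuple[float, float]:
    """Compare one stage, computing the golden from the emulator's own input."""
    expected = golden_fn(stage_input)
    fraction = allclose_fraction(stage_output, expected, rtol, atol)
    mse = sum((a - e) ** 2 for a, e in zip(stage_output, expected)) / len(expected)
    return fraction, mse


# Toy data: values read from emulator VRAM before and after the stage.
vram_in = [1.0, 2.0, 3.0]
vram_out = [2.0, 4.01, 6.0]


def toy_stage(xs):
    """Placeholder golden model for the stage under test."""
    return [2.0 * x for x in xs]


fraction, mse = stage_compare(vram_in, vram_out, toy_stage)
print(f"{fraction:.2%} allclose, MSE {mse:.2e}")
```

Because the golden is recomputed from the emulator's own stage input, drift accumulated in earlier stages does not count against the stage under test.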

Final-layer VRAM-stage comparison for the 2-layer run:

inferred layer_idx=1
O_proj + residual: 100.00% allclose, MSE 4.18e-06
norm + FFN + residual + final_norm: 100.00% allclose, MSE 2.61e-05
no NaNs/Infs in final output

Test Plan

Verified main.rs is scoped to the strict revert plus the f16-to-bf16 change
SmolVLM2 native 2-layer emulator completes with no NaNs/Infs
SmolVLM2 final-layer VRAM-stage comparison is 100% allclose
clm-60m remains 100%
Verified f16 baseline produces NaNs
Verified f16 + clamp is insufficient
py_compile for updated compiler VRAM-stage helper
parent ruff lint passes with checked-out submodule excluded, matching GitHub checkout behavior

…#36)

Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16 (1+8+7), matching the declared TOML precision (e8m7).

IEEE f16's 5-bit exponent (smallest positive value ~6e-8) causes softmax underflow for models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit exponent (smallest normal value ~1.2e-38) prevents this while staying closer to the RTL's e6m5 format than f32 would.

Results:
f16 regs: SmolVLM2 = 0% (all NaN)
bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
clm-60m: unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.

booth-algo added a commit that referenced this pull request May 13, 2026

revert: remove bf16 emulator changes (moved to fix/bf16-scalar-regs PR …

c38b319

…#36)

booth-algo force-pushed the fix/bf16-scalar-regs branch from 1f99faa to bb4f431 Compare May 13, 2026 20:21

booth-algo force-pushed the fix/bf16-scalar-regs branch 2 times, most recently from 35966b3 to ad5d14e Compare May 13, 2026 23:18

fix: all 4 MXFP8 default scale types to sign:false (e8m0 is unsigned)

885c3fc

booth-algo force-pushed the fix/bf16-scalar-regs branch from ad5d14e to 885c3fc Compare May 13, 2026 23:30

booth-algo added 7 commits May 14, 2026 00:57

fix(emulator): strictly revert main changes before bf16

6b76a23

Clarify sliced simulator harnesses

512ea07

Point compiler submodule at sliced entrypoint branch

089b0cc

chore: bump compiler for final-layer VRAM compare

0bf724b

Merge origin/main into fix/bf16-scalar-regs

77e7fa1

fix: remove stale noqa from layer builder wrapper

d775e70

chore: point compiler submodule at main

5a8769a

booth-algo merged commit 0ea761f into main May 14, 2026
2 of 3 checks passed

booth-algo deleted the fix/bf16-scalar-regs branch May 14, 2026 19:08

booth-algo merged 9 commits into main from fix/bf16-scalar-regs
