fix(emulator): bf16 scalar FP registers (matches TOML e8m7 spec)#36

Merged
booth-algo merged 9 commits into main from fix/bf16-scalar-regs on May 14, 2026

@booth-algo booth-algo commented May 13, 2026

Summary

This PR keeps the simulator main.rs behavior aligned with main, then applies the intended scalar FP format change:

  • revert prior unrelated main.rs changes from the earlier PR lineage
  • change scalar FP architectural state from IEEE f16 to bf16
  • preserve the existing instruction behavior and file formats unless needed for the scalar format conversion
  • point PLENA_Compiler at compiler main after PLENA_Compiler#41 ("fix: validate final layer in VRAM stage compare", 9c12f0b)

Compiler #41 only updates the VRAM-stage validation helper so multi-layer native runs can validate the final decoder layer instead of being stuck on layer 0.

Problem

The emulator previously used IEEE f16 for scalar FP registers and FPSRAM. That format has a 5-bit exponent and is too narrow for SmolVLM2 softmax scalar state: values like exp(S - max) can underflow to zero, which can make the reciprocal become Inf and then propagate NaNs.
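The failure mode described above can be sketched in a few lines. This is a std-only illustration, not the emulator's code: `store_as_f16_underflow_only`, `store_as_bf16`, and `reciprocal_of_stored_exp` are hypothetical names, the f16 helper models only the flush to zero below half of f16's smallest subnormal (2^-24), and the softmax step is reduced to "store exp(S - max), then take its reciprocal".

```rust
/// 2^e as an exact f32, valid for normal exponents -126..=127.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Models ONLY the underflow behavior of an IEEE f16 store (assumption:
/// round-to-nearest-even). Anything at or below 2^-25, half of f16's
/// smallest subnormal, rounds to zero. Larger values pass through
/// unrounded, which is enough for this demonstration.
fn store_as_f16_underflow_only(x: f32) -> f32 {
    if x.abs() <= pow2(-25) { 0.0 } else { x }
}

/// Round an f32 to the nearest bf16 (ties to even) and widen back to f32.
fn store_as_bf16(x: f32) -> f32 {
    if x.is_nan() {
        return f32::NAN;
    }
    let bits = x.to_bits();
    let lsb = (bits >> 16) & 1;
    f32::from_bits((bits + 0x7FFF + lsb) & 0xFFFF_0000)
}

/// Hypothetical softmax scalar step: store exp(s - max) in the scalar
/// format, then take the reciprocal of the stored value.
fn reciprocal_of_stored_exp(s_minus_max: f32, store: fn(f32) -> f32) -> f32 {
    1.0 / store(s_minus_max.exp())
}
```

With `s_minus_max = -20.0`, exp(-20) is about 2e-9: the f16 model stores zero and the reciprocal is Inf, while the bf16 store keeps a nonzero value and the reciprocal stays finite. Multiplying that Inf by a zero elsewhere in the chain is one way NaNs can then appear.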

It also did not match the TOML precision intent for scalar FP state:

| Storage | TOML intent | Old emulator |
| --- | --- | --- |
| Scalar FP regs | e8m7 / bf16-shaped range | IEEE f16 |
| FPSRAM | scalar FP storage | IEEE f16 |
| Vector SRAM | e8m7 via QuantTensor | bf16 path |

Emulator Changes

  • fp_reg: [f16; 8] -> fp_reg: [bf16; 8]
  • fpsram: Vec<f16> -> fpsram: Vec<bf16>
  • scalar FP op results are stored through bf16::from_f32(...)
  • vector reductions write scalar results as bf16
  • fp_sram.bin remains f16-encoded on disk for compatibility and is converted to bf16 on preload
  • S_EXP_FP keeps the [-88, 88] clamp to avoid exponential overflow
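The preload conversion and the S_EXP_FP clamp listed above could look roughly like the following. This is a std-only sketch working on raw bit patterns. The function names, the little-endian byte order of fp_sram.bin, and the canonical NaN value are assumptions for illustration, not taken from the emulator, which stores results through `bf16::from_f32(...)`.

```rust
/// 2^e as an exact f32, valid for normal exponents -126..=127.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Decode an IEEE f16 bit pattern (1 sign, 5 exponent, 10 fraction bits).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = if h & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((h >> 10) & 0x1F) as i32;
    let frac = (h & 0x03FF) as f32;
    let mag = match exp {
        0 => frac * pow2(-24), // zero and subnormals
        0x1F => if frac == 0.0 { f32::INFINITY } else { f32::NAN },
        _ => (1.0 + frac / 1024.0) * pow2(exp - 15),
    };
    sign * mag
}

/// Round an f32 to a bf16 bit pattern (round to nearest, ties to even).
fn f32_to_bf16_bits(x: f32) -> u16 {
    if x.is_nan() {
        return 0x7FC0; // assumed canonical quiet NaN
    }
    let bits = x.to_bits();
    let lsb = (bits >> 16) & 1;
    ((bits + 0x7FFF + lsb) >> 16) as u16
}

/// Widen a bf16 bit pattern to f32 (exact: bf16 is the top half of f32).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Hypothetical preload: the file holds f16 words (little-endian assumed);
/// each word is converted to a bf16 bit pattern for in-memory FPSRAM.
fn preload_fpsram(raw: &[u8]) -> Vec<u16> {
    raw.chunks_exact(2)
        .map(|w| f32_to_bf16_bits(f16_bits_to_f32(u16::from_le_bytes([w[0], w[1]]))))
        .collect()
}

/// Sketch of S_EXP_FP: clamp the input to [-88, 88] so exp() cannot
/// overflow f32, then store the result as bf16.
fn s_exp_fp(x: u16) -> u16 {
    f32_to_bf16_bits(bf16_bits_to_f32(x).clamp(-88.0, 88.0).exp())
}
```

Every finite f16 value falls inside bf16's exponent range, so the preload step never overflows or flushes to zero. It can still drop mantissa bits, since f16 carries 10 fraction bits and bf16 carries 7.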

Validation

Native SmolVLM2 2-layer emulator run:

  • model: HuggingFaceTB/SmolVLM2-256M-Video-Instruct
  • native dims: seq_len=64, num_layers=2, hidden=576, inter=1536
  • machine code: 106,726 lines
  • emulator exit code: 0
  • decoded outputs: 36,864
  • finite output: true
  • nan_count=0, inf_count=0
  • output min/max/mean: -13.9375 / 21.625 / 0.0407127

Comparison results:

| Scalar format | SmolVLM2 standard compare | SmolVLM2 VRAM-stage compare | clm-60m |
| --- | --- | --- | --- |
| f16 baseline | 0% (all NaN) | 0% | 100% |
| f16 + clamp | 25% (partial NaN) | 19.9% | not rerun |
| bf16, this PR | 32.87-36.95% (expected drift) | 100% | 100% |

The standard final-output comparison is not the correctness gate for native SmolVLM2 because PyTorch and PLENA M_MM accumulate matmuls in different orders/tile chunks, and drift compounds over the full chain.
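The order sensitivity behind that drift shows up even in a four-element f32 sum. The sketch below is a toy illustration with deliberately extreme values. `sum_sequential` and `sum_tiled` are hypothetical helpers and do not reflect how M_MM or PyTorch actually tile their accumulations.

```rust
/// Left-to-right accumulation in f32.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Accumulate each tile separately, then add the partial sums.
/// `tile` must be nonzero (slice::chunks panics on a zero chunk size).
fn sum_tiled(xs: &[f32], tile: usize) -> f32 {
    xs.chunks(tile)
        .map(|chunk| chunk.iter().fold(0.0f32, |acc, &x| acc + x))
        .fold(0.0f32, |acc, partial| acc + partial)
}
```

For `[1e8, 1.0, -1e8, 1.0]` the exact sum is 2. Sequential accumulation returns 1.0 and tiling by 2 returns 0.0, because f32 values near 1e8 are spaced 8 apart and each `+ 1.0` next to them is rounded away. Real activations are far less extreme, but the same effect compounds across a full decoder chain.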

The authoritative check is the VRAM-in-the-loop stage comparison: read emulator VRAM at the stage boundary, compute the next-stage golden from that exact input, and compare that segment.
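A minimal sketch of that stage comparison, assuming an allclose criterion of the form |actual - golden| <= atol + rtol * |golden|. The function names, the tolerance parameters, and the closure standing in for the golden stage are all hypothetical. The real helper lives in PLENA_Compiler and is not reproduced here.

```rust
/// Fraction of elements satisfying |a - g| <= atol + rtol * |g|.
/// NaNs never satisfy the inequality, so they count as mismatches.
fn allclose_fraction(actual: &[f32], golden: &[f32], rtol: f32, atol: f32) -> f64 {
    assert_eq!(actual.len(), golden.len());
    if actual.is_empty() {
        return 1.0;
    }
    let close = actual
        .iter()
        .zip(golden)
        .filter(|(a, g)| (**a - **g).abs() <= atol + rtol * g.abs())
        .count();
    close as f64 / actual.len() as f64
}

/// Mean squared error, accumulated in f64.
fn mse(actual: &[f32], golden: &[f32]) -> f64 {
    assert_eq!(actual.len(), golden.len());
    if actual.is_empty() {
        return 0.0;
    }
    let sum: f64 = actual
        .iter()
        .zip(golden)
        .map(|(a, g)| {
            let d = (*a - *g) as f64;
            d * d
        })
        .sum();
    sum / actual.len() as f64
}

/// VRAM-in-the-loop check for one stage: the golden output is computed
/// from the emulator's own stage input, so earlier drift is excluded.
fn compare_stage(
    vram_in: &[f32],
    vram_out: &[f32],
    golden_stage: impl Fn(&[f32]) -> Vec<f32>,
    rtol: f32,
    atol: f32,
) -> (f64, f64) {
    let golden = golden_stage(vram_in);
    (allclose_fraction(vram_out, &golden, rtol, atol), mse(vram_out, &golden))
}
```

Because `golden_stage` starts from the emulator's own VRAM contents, a mismatch here points at the stage itself rather than at accumulated drift from earlier layers.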

Final-layer VRAM-stage comparison for the 2-layer run:

  • inferred layer_idx=1
  • O_proj + residual: 100.00% allclose, MSE 4.18e-06
  • norm + FFN + residual + final_norm: 100.00% allclose, MSE 2.61e-05
  • no NaNs/Infs in final output

Test Plan

  • Verified main.rs is scoped strictly to the revert plus the f16-to-bf16 change
  • SmolVLM2 native 2-layer emulator completes with no NaNs/Infs
  • SmolVLM2 final-layer VRAM-stage comparison is 100% allclose
  • clm-60m remains 100%
  • Verified f16 baseline produces NaNs
  • Verified f16 + clamp is insufficient
  • py_compile for updated compiler VRAM-stage helper
  • parent ruff lint passes with checked-out submodule excluded, matching GitHub checkout behavior

Change scalar FP registers and FPSRAM from IEEE f16 (1+5+10) to bf16
(1+8+7), matching the declared TOML precision (e8m7).

IEEE f16's 5-bit exponent (smallest subnormal ~6e-8) causes softmax underflow for
models with large attention scores (SmolVLM2: all NaN). bf16's 8-bit
exponent (smallest normal ~1.2e-38) prevents this while staying closer to the
RTL's e6m5 format than f32 would.

Results:
  f16 regs:  SmolVLM2 = 0% (all NaN)
  bf16 regs: SmolVLM2 = 100% VRAM stage allclose (MSE=3.54e-05)
  clm-60m:   unaffected (works with any format)

Also adds S_RECI_FP MIN_POSITIVE clamp as defense-in-depth.
@booth-algo booth-algo force-pushed the fix/bf16-scalar-regs branch 2 times, most recently from 35966b3 to ad5d14e Compare May 13, 2026 23:18
@booth-algo booth-algo force-pushed the fix/bf16-scalar-regs branch from ad5d14e to 885c3fc Compare May 13, 2026 23:30
@booth-algo booth-algo merged commit 0ea761f into main May 14, 2026
2 of 3 checks passed
@booth-algo booth-algo deleted the fix/bf16-scalar-regs branch May 14, 2026 19:08