test(moe_blockscale): fix the broken e2e test (inflated activations + f16 overflow); kernel is correct #643

Merged: coderfeli merged 2 commits into ROCm:main from jhinpan:fix/moe-blockscale-test-assert on Jun 4, 2026

@jhinpan (Contributor) commented Jun 3, 2026

Fixes the broken test_moe_blockscale_e2e from #642.

What #642 actually was

The test reported the FlyDSL 2-stage block-scale pipeline as 100% wrong
(stage2 err_ratio = 1.0, all-NaN output), but the kernel is correct. The
1.0 was a test-harness artifact from three compounding bugs in the test:

  1. Activation data prep produced values ~2000x too large: the block-scale
    activation was built from the raw fp8 codes (x_q.float(), ~448-scale)
    instead of the dequantized activation (x_q * x_scale, ~0.2). The stage1
    intermediate then reached absmax ≈ 2.4e7.
  2. f16 overflow — those magnitudes exceed f16 (max 65504), so the f16
    stage1/stage2 kernels overflowed to NaN, and the torch reference's own
    out1_torch_ref.to(torch.float16) cast overflowed too, poisoning the 2-stage
    reference. Pure-torch check, no kernel involved:
    err(2-stage ref vs fused ref) = 0.9999, cosine = 0.04, ‖2stage‖/‖fused‖ = 0.0025.
    aiter's own CK stage2 kernel fails the same comparison identically
    (err_vs_ref = 1.0) — the tell that the harness, not the kernel, was wrong.
  3. The test returned timings instead of asserting, which hid all of it.
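
The arithmetic behind items 1 and 2 can be checked without a GPU. The sketch below is illustrative only: `to_f16` is a hypothetical helper that emulates a saturating cast to half precision using the standard library's `struct` module, and the magnitudes (448, 0.2, 2.4e7) are the approximate figures quoted above, not measured values.

```python
import math
import struct

F16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_f16(value: float) -> float:
    """Emulate a float -> f16 cast that becomes +/-inf on overflow.

    Hypothetical helper for illustration; the real test casts with torch.
    """
    if math.isnan(value) or math.isinf(value):
        return value
    try:
        # Round-trip through the 2-byte half-precision encoding.
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except (OverflowError, struct.error):
        # The value does not fit in f16, so model the cast as infinity.
        return math.copysign(math.inf, value)


# Item 1: raw fp8 codes (~448) versus the dequantized activation (~0.2).
inflation = 448.0 / 0.2

# Item 2: the quoted stage1 absmax is far beyond F16_MAX.
stage1_absmax = 2.4e7
overflowed = to_f16(stage1_absmax)

# Once an inf is present, ordinary arithmetic such as inf - inf yields NaN.
poisoned = overflowed - overflowed

# An in-range activation survives the same cast with only rounding error.
in_range = to_f16(0.2)
```

Here `inflation` lands in the "~2000x" range, `overflowed` is infinite, and `poisoned` is NaN, which matches the all-NaN output described above under these assumptions.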

This PR

  • Dequantize x before re-block-quantizing the activation (for both the torch
    reference and the FlyDSL kernel input) so the f16 pipeline stays in range.
  • Keep the finiteness + err_fly <= 0.05 asserts so a real regression can't pass
    silently.
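
A minimal sketch of the data-prep change, assuming a simple per-block absmax scheme. `block_quantize`, `dequantize`, the block size, and the sample values are all hypothetical stand-ins (the test's real helpers operate on torch tensors and are not shown here), and rounding to actual fp8 codes is omitted because only the magnitudes matter for the overflow.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the fp8 e4m3fn format


def block_quantize(values, block_size):
    """Return (codes, scales) with one absmax-derived scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(v / scale for v in block)
    return codes, scales


def dequantize(codes, scales, block_size):
    """Multiply each code by the scale of the block it belongs to."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


BLOCK = 2                         # illustrative block size
x_scale = 0.2 / FP8_E4M3_MAX      # scale chosen so the true absmax is ~0.2
x_q = [448.0, -224.0, 112.0, -56.0]  # made-up fp8 codes


def absmax_after_roundtrip(activation):
    """Magnitude the kernel effectively sees after block quantization."""
    codes, scales = block_quantize(activation, BLOCK)
    return max(abs(v) for v in dequantize(codes, scales, BLOCK))


# Before the fix: the raw codes are treated as if they were the activation.
raw_absmax = absmax_after_roundtrip(x_q)

# After the fix: dequantize first (x_q * x_scale), then block-quantize.
dequant_absmax = absmax_after_roundtrip([q * x_scale for q in x_q])
```

With these made-up inputs the raw-code path represents an activation of magnitude 448 while the dequantized path represents about 0.2, so everything downstream of the second path stays well inside the f16 range.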

With the fix, on gfx950 / MI350X (small-E8, medium-E8, and DS-V3 E=256/topk=8):

stage1 err = 0.0000   stage2 err = 0.0000
flydsl pipeline err vs ref = 0.0000–0.0016
aiter fused err = 0.0000
ck stage1/stage2 err_vs_ref = 0.0000 / 0.0000
passed

Notes

  • The FlyDSL 2-stage block-scale stage2 kernel is correct; no kernel change is
    needed. In isolation, fed clean in-range inputs and the same a2_bq, it
    matches torch_stage2_blockscale_ref at err = 0.0000, cos = 1.0000.
  • torch_stage2_blockscale_ref is also correct; it reproduces the fused reference
    exactly when fed an in-range activation. It was only poisoned by the overflowed
    input.
  • bf16 output is not the fix — stage1's cshuffle epilog is f16-only; the
    data-prep fix keeps the existing f16 path valid. (A separate hardening option is
    to give stage1/stage2 a bf16 output path for genuinely large-magnitude
    workloads.)
  • The trailing return (us_fly_total, us_aiter_fused) is kept for the __main__
    benchmark path and still triggers pytest's return-not-assert warning; splitting
    the benchmark out would remove it.
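
One way the benchmark split mentioned in the last note could look, as a sketch only. `run_pipeline`, `check_and_time`, and `relative_error` are hypothetical names with canned data, and the error metric is a generic relative L2 error that may differ from the test's actual err_fly. The warning in question is presumably pytest's `PytestReturnNotNoneWarning`, which is raised when a test function returns a non-None value.

```python
import math

ERR_TOL = 0.05  # threshold quoted in the PR description


def relative_error(out, ref):
    """Stand-in metric: ||out - ref|| / ||ref||."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den if den > 0 else math.inf


def run_pipeline():
    """Hypothetical placeholder for the kernel run; returns canned data."""
    out = [0.10, -0.20, 0.05]
    ref = [0.10, -0.20, 0.05]
    us_fly_total, us_aiter_fused = 123.0, 100.0  # made-up timings
    return out, ref, us_fly_total, us_aiter_fused


def check_and_time():
    """Shared body: assert correctness, then hand back the timings."""
    out, ref, us_fly_total, us_aiter_fused = run_pipeline()
    assert all(math.isfinite(v) for v in out), "output contains NaN/inf"
    err_fly = relative_error(out, ref)
    assert err_fly <= ERR_TOL, f"err_fly={err_fly:.4f} exceeds {ERR_TOL}"
    return us_fly_total, us_aiter_fused


def test_moe_blockscale_e2e_sketch():
    # Discard the timings so the test returns None and pytest stays quiet.
    check_and_time()


def benchmark():
    # The __main__ path keeps access to the timings.
    return check_and_time()
```

The test and the benchmark share one code path, so the asserts guard both, while only the benchmark entry point returns a value.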

Closes #642.

Copilot AI review requested due to automatic review settings June 3, 2026 09:13

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Tightens test_moe_blockscale validation so the FlyDSL block-scale 2-stage pipeline can’t silently pass when it diverges from the torch reference.

Changes:

  • Add assertions that FlyDSL output is finite.
  • Add a hard failure if the computed err_fly exceeds a threshold, with a pointer to #642.


4 comment threads on tests/kernels/test_moe_blockscale.py (2 outdated)
@jhinpan force-pushed the fix/moe-blockscale-test-assert branch from 4bd3b24 to ddf6e2c on June 3, 2026 09:51
@jhinpan changed the title from "test(moe_blockscale): assert the e2e pipeline (currently passes on a wrong stage2)" to "test(moe_blockscale): fix the broken e2e test (inflated activations + f16 overflow); kernel is correct" on Jun 3, 2026
… f16 overflow)

test_moe_blockscale_e2e reported the FlyDSL 2-stage block-scale pipeline as 100%
wrong (stage2 err_ratio = 1.0, all-NaN), but the kernel is correct. The 1.0 was a
test-harness artifact from three compounding bugs in the test:

1. The block-scale activation was built from the raw fp8 codes (x_q.float(),
   ~448-scale) instead of the dequantized activation (x_q * x_scale, ~0.2),
   inflating activations ~2000x; the stage1 intermediate reached ~2.4e7.
2. Those magnitudes overflow f16 (max 65504): the f16 stage1/stage2 kernels
   produced NaN, and the reference's own out1_torch_ref.to(torch.float16) cast
   overflowed and poisoned the 2-stage torch reference (pure-torch check:
   err(2-stage ref vs fused ref) = 0.9999, cos = 0.04). aiter's own CK stage2
   kernel failed the same comparison identically (err_vs_ref = 1.0).
3. The test returned timings instead of asserting, hiding all of it.

Fix: dequantize x before re-block-quantizing (for both the torch reference and the
FlyDSL kernel input), and assert finiteness + err_fly <= 0.05. With realistic
magnitudes the whole pipeline is finite and correct on gfx950 / MI350X across
small-E8, medium-E8, and DS-V3 (E=256/topk=8): stage1/stage2 err = 0.0000,
pipeline err <= 0.0016, all passed.

The FlyDSL 2-stage block-scale stage2 kernel and torch_stage2_blockscale_ref are
both correct; no kernel change is needed.

Closes ROCm#642

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderfeli coderfeli merged commit f3c8ff5 into ROCm:main Jun 4, 2026
9 checks passed
@jhinpan jhinpan deleted the fix/moe-blockscale-test-assert branch June 4, 2026 10:18
Development

Successfully merging this pull request may close these issues.

test_moe_blockscale_e2e harness bug: raw FP8-code activations cause f16 overflow and false 100% failures

3 participants