test(moe_blockscale): fix the broken e2e test (inflated activations + f16 overflow); kernel is correct by jhinpan · Pull Request #643 · ROCm/FlyDSL

jhinpan · 2026-06-03T09:13:14Z

Fixes the broken test_moe_blockscale_e2e from #642.

What #642 actually was

The test reported the FlyDSL 2-stage block-scale pipeline as 100% wrong
(stage2 err_ratio = 1.0, all-NaN output), but the kernel is correct. The
1.0 was a test-harness artifact from three compounding bugs in the test:

1. Activation data prep was ~2000x too large: the block-scale activation was
   built from the raw fp8 codes (x_q.float(), ~448-scale) instead of the
   dequantized activation (x_q * x_scale, ~0.2). The stage1 intermediate then
   reached absmax ≈ 2.4e7.
2. f16 overflow: those magnitudes exceed the f16 range (max 65504), so the f16
   stage1/stage2 kernels overflowed to NaN, and the torch reference's own
   out1_torch_ref.to(torch.float16) cast overflowed too, poisoning the 2-stage
   reference. Pure-torch check, no kernel involved:
   err(2-stage ref vs fused ref) = 0.9999, cosine = 0.04, ‖2stage‖/‖fused‖ = 0.0025.
   aiter's own CK stage2 kernel fails the same comparison identically
   (err_vs_ref = 1.0), which is the tell that the harness, not the kernel, was wrong.
3. The test used return instead of assert, which hid all of it.
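
The inflation and the overflow can be reproduced without any kernel. The following is a scalar, pure-Python sketch rather than the test's code: `to_f16` is a hypothetical helper that emulates a cast to half precision through the `struct` module, `x_scale` is an assumed value chosen so the dequantized activation lands near 0.2, and the magnitudes 448, 0.2, and 2.4e7 are the ones quoted above.

```python
import math
import struct

F16_MAX = 65504.0  # largest finite half-precision value


def to_f16(x: float) -> float:
    """Emulate a float -> f16 cast: round to half precision, overflow to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; a tensor cast would yield inf instead.
        return math.copysign(math.inf, x)


# Inflated activation: the raw fp8 code is used instead of the dequantized value.
x_q = 448.0             # raw fp8 code magnitude quoted in the description
x_scale = 0.2 / 448.0   # assumed scale so that x_q * x_scale is about 0.2
inflation = x_q / (x_q * x_scale)
print(f"activation inflated by about {inflation:.0f}x")

# f16 overflow: a stage1 intermediate near 2.4e7 does not fit in half precision.
overflowed = to_f16(2.4e7)
in_range = to_f16(0.2)
print("overflowed:", overflowed, "in range:", in_range)

# inf - inf is NaN, which is one way overflow turns into NaN further downstream.
poisoned = overflowed - to_f16(3.0e7)
print("NaN after subtracting two overflowed values:", math.isnan(poisoned))
```

The same effect applies to any tensor cast to f16: once a value exceeds 65504 by more than rounding allows, it becomes infinite, and later arithmetic on it can produce NaN.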

This PR

- Dequantize x before re-block-quantizing the activation (for both the torch
  reference and the FlyDSL kernel input) so the f16 pipeline stays in range.
- Keep the finiteness + err_fly <= 0.05 asserts so a real regression can't pass
  silently.
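
A minimal sketch of what the finiteness and threshold asserts guard against, in pure Python. The names `rel_err` and `check_pipeline` are hypothetical, and the relative-L2 metric is an assumption: the test's actual err_fly computation is not shown in this description. Only the 0.05 threshold comes from the text above.

```python
import math

ERR_TOL = 0.05  # threshold quoted in the PR description


def rel_err(out, ref):
    """Assumed metric: ||out - ref|| / ||ref|| over flat sequences of floats."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den if den else num


def check_pipeline(out, ref, tol=ERR_TOL):
    """Fail loudly on NaN/Inf first, then on an error above the tolerance."""
    assert all(math.isfinite(v) for v in out), "output contains NaN or Inf"
    err = rel_err(out, ref)
    assert err <= tol, f"err {err:.4f} exceeds tolerance {tol}"
    return err


# A matching output passes; an output containing NaN is rejected.
print(check_pipeline([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
try:
    check_pipeline([float("nan"), 2.0, 3.0], [1.0, 2.0, 3.0])
except AssertionError as exc:
    print("rejected:", exc)
```

Checking finiteness before the threshold gives a clearer failure message: a NaN error would also fail `err <= tol`, but without saying why.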

With the fix, on gfx950 / MI350X (small-E8, medium-E8, and DS-V3 E=256/topk=8):

stage1 err = 0.0000   stage2 err = 0.0000
flydsl pipeline err vs ref = 0.0000–0.0016
aiter fused err = 0.0000
ck stage1/stage2 err_vs_ref = 0.0000 / 0.0000
passed

Notes

- The FlyDSL 2-stage block-scale stage2 kernel is correct; no kernel change is
  needed. In isolation, fed clean in-range inputs and the same a2_bq, it
  matches torch_stage2_blockscale_ref at err = 0.0000, cos = 1.0000.
- torch_stage2_blockscale_ref is also correct; it reproduces the fused reference
  exactly when fed an in-range activation. It was only poisoned by the overflowed
  input.
- bf16 output is not the fix: stage1's cshuffle epilog is f16-only; the
  data-prep fix keeps the existing f16 path valid. (A separate hardening option is
  to give stage1/stage2 a bf16 output path for genuinely large-magnitude
  workloads.)
- The trailing return (us_fly_total, us_aiter_fused) is kept for the __main__
  benchmark path and still triggers pytest's return-not-assert warning; splitting
  the benchmark out would remove it.
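
One way the benchmark split mentioned in the last note could look, as a sketch only. `run_moe_blockscale_e2e` is a hypothetical helper, and the numbers it returns are placeholders, not measurements. The test function and the timing names are the only parts taken from the description.

```python
def run_moe_blockscale_e2e():
    """Hypothetical shared body: run the pipeline once, return (err, timings)."""
    err_fly = 0.0                             # placeholder, not a measured error
    us_fly_total, us_aiter_fused = 0.0, 0.0   # placeholder timings
    return err_fly, (us_fly_total, us_aiter_fused)


def test_moe_blockscale_e2e():
    """pytest entry point: asserts only, returns None."""
    err_fly, _timings = run_moe_blockscale_e2e()
    assert err_fly <= 0.05
    # No return statement here, so pytest has no non-None return to warn about.


if __name__ == "__main__":
    # Benchmark path: uses the timings that the test deliberately ignores.
    _, (us_fly_total, us_aiter_fused) = run_moe_blockscale_e2e()
    print(f"flydsl total: {us_fly_total} us, aiter fused: {us_aiter_fused} us")
```

With this shape the correctness check and the timing report share one code path, and only the `__main__` path consumes the return value.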

Closes #642.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Tightens test_moe_blockscale validation so the FlyDSL block-scale 2-stage pipeline can’t silently pass when it diverges from the torch reference.

Changes:

Add assertions that FlyDSL output is finite.
Add a hard failure if the computed err_fly exceeds a threshold, with a pointer to #642.

… f16 overflow)

test_moe_blockscale_e2e reported the FlyDSL 2-stage block-scale pipeline as 100% wrong (stage2 err_ratio = 1.0, all-NaN), but the kernel is correct. The 1.0 was a test-harness artifact from three compounding bugs in the test:

1. The block-scale activation was built from the raw fp8 codes (x_q.float(), ~448-scale) instead of the dequantized activation (x_q * x_scale, ~0.2), inflating activations ~2000x; the stage1 intermediate reached ~2.4e7.
2. Those magnitudes overflow f16 (max 65504): the f16 stage1/stage2 kernels produced NaN, and the reference's own out1_torch_ref.to(torch.float16) cast overflowed and poisoned the 2-stage torch reference (pure-torch check: err(2-stage ref vs fused ref) = 0.9999, cos = 0.04). aiter's own CK stage2 kernel failed the same comparison identically (err_vs_ref = 1.0).
3. The test return-ed timings instead of asserting, hiding all of it.

Fix: dequantize x before re-block-quantizing (for both the torch reference and the FlyDSL kernel input), and assert finiteness + err_fly <= 0.05.

With realistic magnitudes the whole pipeline is finite and correct on gfx950 / MI350X across small-E8, medium-E8, and DS-V3 (E=256/topk=8): stage1/stage2 err = 0.0000, pipeline err <= 0.0016, all passed.

The FlyDSL 2-stage block-scale stage2 kernel and torch_stage2_blockscale_ref are both correct; no kernel change is needed.

Closes ROCm#642

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 3, 2026 09:13

Copilot AI reviewed Jun 3, 2026

Comment thread tests/kernels/test_moe_blockscale.py

Comment thread tests/kernels/test_moe_blockscale.py Outdated

Comment thread tests/kernels/test_moe_blockscale.py

Comment thread tests/kernels/test_moe_blockscale.py Outdated

jhinpan force-pushed the fix/moe-blockscale-test-assert branch from 4bd3b24 to ddf6e2c Compare June 3, 2026 09:51

jhinpan changed the title ~~test(moe_blockscale): assert the e2e pipeline (currently passes on a wrong stage2)~~ test(moe_blockscale): fix the broken e2e test (inflated activations + f16 overflow); kernel is correct Jun 3, 2026

jhinpan force-pushed the fix/moe-blockscale-test-assert branch from ddf6e2c to 1079351 Compare June 3, 2026 09:57

jhinpan mentioned this pull request Jun 3, 2026

Fix both MoE kernels' correctness: aiter shuffle, f16-overflow data-prep, fp8 MoE tolerance (refs #4 #6, ROCm/FlyDSL#642) jhinpan/flydsl-kernel-profiling#5

Open

Merge branch 'main' into fix/moe-blockscale-test-assert

d540abc

coderfeli approved these changes Jun 4, 2026

coderfeli merged commit f3c8ff5 into ROCm:main Jun 4, 2026
9 checks passed

jhinpan deleted the fix/moe-blockscale-test-assert branch June 4, 2026 10:18

jhinpan mentioned this pull request Jun 4, 2026

📋 FlyDSL upstream tracker — jhinpan issues & PRs jhinpan/flydsl-kernel-profiling#7

Open

14 tasks

coderfeli merged 2 commits into ROCm:main from jhinpan:fix/moe-blockscale-test-assert

3 participants
