[Feat] Fuse TopKGatingSoftmax and MoE Sorting kernels by amd-wsung102 · Pull Request #582 · ROCm/FlyDSL

amd-wsung102 · 2026-05-28T06:10:51Z

Motivation

Fusing the kernels in topk_gating_softmax_kernel.py and moe_sorting_kernel.py improves performance in eager mode, graph mode, and raw kernel time.

Relevant Files

kernels/moe_sorting_kernel.py - added fused topk and sorting
kernels/topk_gating_softmax_kernel.py - topk gating softmax kernel
tests/kernels/test_moe_sorting.py - unit test for the fused kernels

Additional Details

The fusion applies only to the decode path in moe_sorting, and only when the number of tokens T is at most 16 (T <= 16). For T > 16, the fusion does not yet yield an improvement; this is under investigation and can be addressed in a future PR.
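The condition described above can be sketched as a small dispatch helper. This is an illustration only: `FUSED_MAX_TOKENS`, `use_fused_topk_sort`, and the idea of choosing the path in a Python wrapper are assumptions made for the example, not names or structure taken from this PR.

```python
# Hypothetical threshold: the PR reports gains only for T <= 16 on the decode path.
FUSED_MAX_TOKENS = 16


def use_fused_topk_sort(num_tokens: int, is_decode: bool) -> bool:
    """Return True when the fused topk + sorting path would be selected."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Fall back to the separate kernels for prefill or for larger token counts.
    return is_decode and num_tokens <= FUSED_MAX_TOKENS
```

With this helper, T=16 on the decode path selects the fused kernel, while T=17 falls back to the unfused pair.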

Test Result - DeepSeek-R1: E=256, topk=8, model_dim=7168, bf16

All times are in microseconds (us)

Eager: 2-2.4x improvement
Graph: 1.12-1.14x improvement
Raw kernel: 1.3-1.4x improvement

T	unfused_eager	fused_eager	unfused_graph	fused_graph	unfused_kernel	fused_kernel	eager speedup	graph speedup	kernel speedup
1	32	13.1	16.2	14.3	13.5	10.3	2.44	1.13	1.32
2	32.5	16	16.2	14.4	14	10.5	2.03	1.12	1.34
4	32.5	16.2	16.7	14.8	14.5	11	2	1.13	1.32
8	33.7	15.9	17.3	15.4	15	11.4	2.11	1.12	1.32
12	34	14.9	19.3	16.9	17.1	12.2	2.28	1.14	1.4
16	33.9	15.4	19.7	17.4	17.7	12.9	2.2	1.13	1.37

Test Result - GPT-OSS 120B: E=128, topk=4, model_dim=2880, bf16

All times are in microseconds (us)

Eager: 2.13-2.24x improvement
Graph: 1.21-1.27x improvement
Raw kernel: 1.38-1.44x improvement

T	unfused_eager	fused_eager	unfused_graph	fused_graph	unfused_kernel	fused_kernel	eager speedup	graph speedup	kernel speedup
1	31.2	14.4	13.3	10.9	9.9	7.1	2.16	1.22	1.39
2	31.4	14.6	13.5	11.1	10.2	7.3	2.15	1.21	1.39
4	31.7	14.8	13.8	11.3	10.4	7.5	2.14	1.21	1.38
8	31.6	14.8	13.9	11.4	10.5	7.6	2.13	1.21	1.38
12	32.8	14.6	15.6	12.3	12.2	8.5	2.24	1.27	1.44
16	33.4	14.9	15.8	12.5	12.3	8.7	2.24	1.27	1.42
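The speedup columns are consistent with dividing the unfused time by the fused time. The sketch below recomputes them for the T=1 row of the GPT-OSS 120B table; the helper name `speedups` is made up for this example. Because the timings in the table are already rounded, a recomputed ratio can differ from the listed value in the last digit.

```python
# Timings in microseconds for T=1, copied from the GPT-OSS 120B table above.
row_t1 = {
    "unfused_eager": 31.2, "fused_eager": 14.4,
    "unfused_graph": 13.3, "fused_graph": 10.9,
    "unfused_kernel": 9.9, "fused_kernel": 7.1,
}


def speedups(row):
    """Return the unfused/fused time ratio for each measurement mode."""
    return {
        mode: row[f"unfused_{mode}"] / row[f"fused_{mode}"]
        for mode in ("eager", "graph", "kernel")
    }


for mode, ratio in speedups(row_t1).items():
    print(f"{mode}: {ratio:.2f}x")
```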

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

coderfeli · 2026-05-29T02:29:10Z

+@contextmanager
+def _if_then(if_op):


Why is this scf.if needed?

The fused kernel needs the explicit scf.IfOp and the _if_then helper because, with a plain Python if, the kernel JIT fails:

File ".../kernels/moe_sorting_kernel.py", line 1242, in __then_1
for _z in range(_zs, _ze, _z1):
TypeError: 'ArithValue' object cannot be interpreted as an integer
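The message itself is ordinary Python behaviour: the built-in `range` accepts only objects that implement `__index__`. The sketch below reproduces the same kind of TypeError outside the JIT. `TracedValue` is a hypothetical stand-in for a symbolic value whose integer is unknown at trace time; it is not FlyDSL's `ArithValue`, and the sketch does not explain why the kernel's loop bounds became traced values in the first place.

```python
class TracedValue:
    """Hypothetical stand-in for a traced DSL value (not FlyDSL's ArithValue)."""

    def __init__(self, name):
        # Only a symbolic name is known; there is no concrete integer to index with.
        self.name = name


def try_range(start, stop, step):
    """Build a Python range, returning the error message if range() rejects the inputs."""
    try:
        return list(range(start, stop, step))
    except TypeError as exc:
        return str(exc)


# Concrete Python ints are accepted by range().
print(try_range(0, 4, 1))

# Objects without __index__ are rejected, as in the traceback above.
print(try_range(TracedValue("zs"), TracedValue("ze"), TracedValue("z1")))
```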

I've fixed a similar bug before, but I'm not sure whether this is the same issue. Could you simplify the failing test case so that I can reproduce it?

Hi @xudoyuan, you can use commit 5627ef2, which uses a regular Python if instead of scf.if. Then run pytest tests/kernels/test_moe_sorting.py::test_moe_softmax_sort_fused_oneshot -k "1-256-8-bf16".

It fails with:

FAILED tests/kernels/test_moe_sorting.py::test_moe_softmax_sort_fused_oneshot[1-256-8-bf16] - TypeError: 'ArithValue' object cannot be interpreted as an integer

Once the regular Python if is switched to scf.if, as in commit 07ea93d, pytest no longer reports the error.

okay, let me check

Hi,

Please take a look at bugfix PR #601. It had not been merged earlier for some reason; once it is merged, you will no longer need scf.if.
Also, kernel-internal 'device' functions such as _emit_topk_gating_softmax_body, which are defined outside of @flyc.kernel and contain control flow (if/for, etc.), still need to be decorated with @flyc.jit.
With that PR merged and @flyc.jit added, you can drop scf.if; I've verified locally that this works.

Thanks.

Got it, thank you Xudong! I will keep an eye out for the status of PR 601, and make the appropriate changes that you suggested.

Hi @xudoyuan @coderfeli, I have addressed the above issues in commit 2538368. The commit replaces scf.if with a plain Python if and adds @flyc.jit to device functions such as _emit_topk_gating_softmax_body.

Both the unit tests and CI pass, and the performance improvement matches the data and tables in the PR description.

amd-wsung102 and others added 4 commits May 27, 2026 23:33

Topk and sorting fusion

88e5954

Fix python black formatting

9324cad

Merge branch 'main' into fuse_topk_sorting_updated

0d1a323

Fix CI issue

231900e

coderfeli reviewed May 29, 2026

amd-wsung102 and others added 10 commits May 29, 2026 03:58

Relaxing assertion for avg_us

94c8fa0

Remove duplicated code and dead code, and moved topk functions back to topkgating for better clarity

5627ef2

Merge branch 'main' into fuse_topk_sorting_updated

4824b44

Adding scf if back due to TypeError

07ea93d

Fix python black formatting

cce3221

Deleted inline closures

5abc042

Fix python black formatting

eb29da0

Merge branch 'main' into fuse_topk_sorting_updated

a5ffcb1

Merge branch 'ROCm:main' into fuse_topk_sorting_updated

d5ef558

Change scf.if to if, and add flyc.jit to decorate device functions

2538368

amd-wsung102 requested review from coderfeli and xudoyuan June 4, 2026 03:28

amd-wsung102 wants to merge 14 commits into ROCm:main from amd-wsung102:fuse_topk_sorting_updated
