Triton RMSNorm Optimizations by Micky774 · Pull Request #593 · ROCm/TransformerEngine

Micky774 · 2026-05-20T20:28:25Z

Description

Optimizes the Triton RMSNorm forward and backward kernels and adds an LDS-tiled FP8 transpose path. Across a representative shape sweep, measured improvements are 10%-50% for bf16 with either no quantization or FP8 quantization, and 3x-8x for FP8 transpose outputs.

Benchmarks generated by this script.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Loop-invariant hoisting.
- Fwd non-blocked path: gamma load + ZERO_CENTERED_GAMMA adjustment + 1/n_cols hoisted outside the persistent row loop.
- Bwd non-blocked path: same gamma hoist; inv_n_cols hoisted.
- Bwd both paths: per-row c_scalar = nf*nf*grad_sum*inv_n_cols computed once before the dx/dg loop; dx expression refactored to nf * (dz*g - c*x) (saves one multiply per element).
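For reviewers checking the refactored dx expression, here is a minimal pure-Python sketch of the algebra (not the Triton kernel itself). It assumes `nf` is the per-row reciprocal RMS and `grad_sum` is the per-row sum of `dz*g*x`; the function names, the `eps` default, and the "unfactored" form used for comparison are illustrative assumptions, not code from this PR.

```python
import math


def rmsnorm_bwd_dx_unfactored(x_rows, dz_rows, gamma, eps=1e-6):
    """Assumed pre-refactor form: dx = dz*g*nf - nf^3 * grad_sum / n * x."""
    n_cols = len(gamma)
    out = []
    for x, dz in zip(x_rows, dz_rows):
        nf = 1.0 / math.sqrt(sum(v * v for v in x) / n_cols + eps)
        grad_sum = sum(d * w * v for d, w, v in zip(dz, gamma, x))
        out.append([d * w * nf - nf * nf * nf * grad_sum / n_cols * v
                    for d, w, v in zip(dz, gamma, x)])
    return out


def rmsnorm_bwd_dx_hoisted(x_rows, dz_rows, gamma, eps=1e-6,
                           zero_centered_gamma=False):
    """Hoisted form: row-loop invariants computed once, c_scalar once per row."""
    n_cols = len(gamma)
    # Loop-invariant work, done once before the row loop.
    g = [w + 1.0 for w in gamma] if zero_centered_gamma else list(gamma)
    inv_n_cols = 1.0 / n_cols
    out = []
    for x, dz in zip(x_rows, dz_rows):
        # Recomputed here for self-containment; a real kernel may load a saved value.
        nf = 1.0 / math.sqrt(sum(v * v for v in x) * inv_n_cols + eps)
        grad_sum = sum(d * w * v for d, w, v in zip(dz, g, x))
        # Per-row scalar, computed once before the per-element dx loop.
        c_scalar = nf * nf * grad_sum * inv_n_cols
        out.append([nf * (d * w - c_scalar * v) for d, w, v in zip(dz, g, x)])
    return out
```

Both functions should agree to floating-point rounding, since `nf * (dz*g - c*x)` expands to `dz*g*nf - nf^3 * grad_sum / n * x`.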
Autotune wiring for bwd kernels. _rmsnorm_bwd_triton and _rmsnorm_bwd_dg_reduce_triton now follow the impl + autotune-wrapper dispatch pattern already used by the fwd kernel. te_rmsnorm_bwd_triton takes an autotune: bool = True kwarg; when off it uses the previously-hardcoded num_warps=8 + fixed BLOCK_SIZE_M/N=128/64 reduce tile.
External LDS-tiled FP8 transpose kernel. New _fp8_transpose_2d_impl (+ autotune wrapper) replaces the in-kernel out_transpose_ptr + cols * stride + row_idx strided byte stores that were uncoalesced (one byte per thread to a different cache line). The new kernel does a coalesced (BLOCK_M, BLOCK_N) read, tl.trans() for LDS-staged transpose, then coalesced strided write.
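The access pattern of the tiled transpose can be sketched on the host in plain Python. This is only an illustration of the read-tile / transpose-on-chip / write-tile idea, not `_fp8_transpose_2d_impl`; the function name `transpose_2d_tiled`, the flat row-major layout, and the default block sizes are assumptions made for the example.

```python
def transpose_2d_tiled(src, n_rows, n_cols, block_m=4, block_n=4):
    """Transpose a flat row-major (n_rows, n_cols) buffer tile by tile.

    Returns a flat row-major (n_cols, n_rows) buffer.
    """
    dst = [0] * (n_rows * n_cols)
    for m0 in range(0, n_rows, block_m):
        m1 = min(m0 + block_m, n_rows)
        for n0 in range(0, n_cols, block_n):
            n1 = min(n0 + block_n, n_cols)
            # Read: every tile row is one contiguous run of src.
            tile = [src[r * n_cols + n0: r * n_cols + n1] for r in range(m0, m1)]
            # Transpose inside the tile (the step tl.trans() performs in the PR).
            tile_t = [list(col) for col in zip(*tile)]
            # Write: every transposed tile row is one contiguous run of dst,
            # rather than one element per destination row.
            for j, row in enumerate(tile_t):
                start = (n0 + j) * n_rows + m0
                dst[start:start + len(row)] = row
    return dst
```

The per-element alternative described above writes each value to `dst[col * n_rows + row]` directly, so neighbouring source elements land `n_rows` entries apart; tiling turns both the reads and the writes into contiguous runs.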

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

alextmagro

LGTM!

aris134

LGTM!

ipanfilo · 2026-05-30T02:39:35Z

 * NVTE_USE_CAST_TRANSPOSE_TRITON=1 can be used to enable cast transpose (bgrad) triton kernels;
 * NVTE_USE_LAYERNORM_TRITON=1 can be used to enable layernorm triton kernels.
 * NVTE_USE_RMSNORM_TRITON=1 can be used to enable rmsnorm triton kernels.
+* NVTE_RMS_EXTERNAL_TRANSPOSE=0 disables external transpose in RMSNorm Triton kernels and


It is not used in code

@ipanfilo Could you check whether the comments have been addressed?

wenchenvincent · 2026-06-01T15:53:42Z

@alextmagro @aris134 I saw that you had approved the PR. For the inline comments, let's also resolve the conversations if the comments have been addressed.

Micky774 added 4 commits May 20, 2026 19:19

Initial optimizations

4b39845

Updated rmsnorm kernel w/ RMW accumulation pattern and autotuning

baaaec9

Added external transpose kernel for LDS optimized transpose

de2f7d8

Update test to account for new autotuning

12c680a

Micky774 requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 20, 2026 20:28

Micky774 added the ci-level 3 label (CI test level 3) May 21, 2026

Trim comments

5f2a993

Micky774 requested review from alextmagro, aris134 and matthiasdiener May 21, 2026 17:06

Added readme entry

06fac94

wenchenvincent requested a review from brunomazzottiamd May 27, 2026 16:20

aris134 reviewed May 27, 2026

Comment thread transformer_engine/pytorch/triton_kernels/norms_common.py

aris134 reviewed May 27, 2026

Comment thread transformer_engine/pytorch/triton_kernels/rmsnorm.py Outdated

aris134 reviewed May 27, 2026

Comment thread README.rst Outdated

aris134 requested changes May 27, 2026

alextmagro requested changes May 27, 2026

Micky774 added 4 commits May 27, 2026 21:01

Move transpose kernel

62f0b34

Updated readme

cad0786

Revert dg RMW changes

0d736a1

Always use LDS transpose for Triton RMSNorm fwd

6adbb66

Micky774 requested a review from alextmagro May 29, 2026 17:13

alextmagro approved these changes May 29, 2026

Comment thread transformer_engine/pytorch/triton_kernels/rmsnorm.py

Micky774 requested a review from aris134 May 29, 2026 19:17

aris134 reviewed May 29, 2026

Comment thread transformer_engine/pytorch/triton_kernels/rmsnorm.py Outdated

aris134 approved these changes May 29, 2026

ipanfilo requested changes May 30, 2026

Updated readme, fixed inline comment

c75679a
