speed up nvte_multi_padding / nvte_multi_unpadding by matthiasdiener · Pull Request #592 · ROCm/TransformerEngine

matthiasdiener · 2026-05-20T15:25:03Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes https://github.com/ROCm/frameworks-internal/issues/16530

See https://github.com/ROCm/frameworks-internal/issues/16530#issuecomment-4502138388 for performance.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

aris134 · 2026-06-01T17:36:03Z

+    for (int i2 = 0; i2 < nvec; ++i2) {
+      const int row = tile_row + i1 * nvec + i2;
+      const int col = tile_col + j1 * nvec;
+      const int remaining = row_length - col;


Can we factor out these lines since they don't depend on i2? This also lets us compute valid_cols once per iter instead of repeating the ternary in both load/store calls:

    const int col = tile_col + j1 * nvec;
    const int remaining = row_length - col;
    const int valid_cols = remaining > 0 ? min(remaining, nvec) : 0;
    #pragma unroll
    for (int i2 = 0; i2 < nvec; ++i2)

Done in cb9221d. Does not really affect performance; I suspect the compiler is able to factor these out by itself.
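For readers following the thread, the hoisting suggested above can be sketched on the host side. This is an illustration only, not the code from cb9221d: `valid_cols_for` and `elements_touched` are hypothetical helpers, and `num_rows` is an assumed bound that does not appear in the quoted diff. The point is that `col`, `remaining`, and `valid_cols` depend only on `j1`, so they can be computed once before the `i2` loop.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: number of in-bounds columns for one vector access.
// Mirrors the expression from the review comment; not the PR's actual code.
inline int valid_cols_for(int tile_col, int j1, int nvec, int row_length) {
  assert(nvec > 0);
  const int col = tile_col + j1 * nvec;
  const int remaining = row_length - col;
  return remaining > 0 ? std::min(remaining, nvec) : 0;
}

// Hypothetical host-side stand-in for the inner loop: counts the elements one
// (i1, j1) iteration would touch. num_rows is an assumed row bound.
inline int elements_touched(int tile_row, int tile_col, int i1, int j1,
                            int nvec, int num_rows, int row_length) {
  // Loop-invariant: depends on j1 only, so it is computed once, outside the loop.
  const int valid_cols = valid_cols_for(tile_col, j1, nvec, row_length);
  int touched = 0;
  for (int i2 = 0; i2 < nvec; ++i2) {
    const int row = tile_row + i1 * nvec + i2;  // the only i2-dependent index
    if (row < num_rows) touched += valid_cols;
  }
  return touched;
}
```

As noted in the reply, an optimizing compiler will often perform this hoisting on its own, so the main benefit is likely readability and a single definition of `valid_cols`.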

aris134 · 2026-06-01T17:42:27Z

 // Parameters to tune
 constexpr int n_warps_per_tile = 4;
 constexpr int threads_per_block = THREADS_PER_WARP * n_warps_per_tile;
 constexpr int desired_load_store_size = 8;
 constexpr int kMaxTensorsPerKernel = 64;  // Args must be <4 KB


Did you try tuning any of these parameters, by chance? The current block size seems small. I am also wondering whether you could squeeze out more performance by tuning the load/store size.

Good point, thanks. Bumping n_warps_per_tile (in dc708c6) does lead to a significant performance increase. Increasing desired_load_store_size reduced performance.
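To make the relationship between these tunables concrete, here is a minimal sketch of how they might fit together. Several things are assumptions rather than facts from the thread: `THREADS_PER_WARP = 64`, the interpretation of `desired_load_store_size` as a byte count, the helper `nvec_for`, and the entire `MultiPaddingArgsSketch` struct, which exists only to show how the "<4 KB" argument budget could be checked at compile time.

```cpp
#include <cstddef>

// Values marked "assumed" are illustrative; the thread does not state them.
constexpr int THREADS_PER_WARP = 64;        // assumed wavefront size
constexpr int n_warps_per_tile = 4;         // value quoted in the review, before the bump
constexpr int threads_per_block = THREADS_PER_WARP * n_warps_per_tile;
constexpr int desired_load_store_size = 8;  // assumed to be bytes per vectorized access
constexpr int kMaxTensorsPerKernel = 64;

// Hypothetical helper: elements moved per vectorized access for element type T.
template <typename T>
constexpr int nvec_for() {
  return desired_load_store_size / static_cast<int>(sizeof(T));
}

// Hypothetical argument struct; the field layout is an assumption made only
// to demonstrate the size check, not the PR's actual definition.
struct MultiPaddingArgsSketch {
  void* input_list[kMaxTensorsPerKernel];
  void* output_list[kMaxTensorsPerKernel];
  int num_rows_list[kMaxTensorsPerKernel];
  int padded_num_rows_list[kMaxTensorsPerKernel];
  int row_length_list[kMaxTensorsPerKernel];
  int block_range[kMaxTensorsPerKernel + 1];
  int num_tensors;
};

// Fails to compile if the by-value kernel arguments would exceed the budget.
static_assert(sizeof(MultiPaddingArgsSketch) < 4096,
              "kernel arguments must stay under 4 KB");
```

Under these assumptions, raising `n_warps_per_tile` increases `threads_per_block` proportionally, while `desired_load_store_size` changes how many elements each thread moves per access.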

I wonder if templating out load and store separately would help -- previously I found that 16 outperformed for some kernels for loads, but store was well optimized at 8. This was for FP8 though so may be different here. Also non-temporal stores may be a potential improvement, check out rocm device utils for reference

> I wonder if templating out load and store separately would help -- previously I found that 16 outperformed for some kernels for loads, but store was well optimized at 8. This was for FP8 though so may be different here.

No, that also seems to reduce performance in this case.

> Also non-temporal stores may be a potential improvement, check out rocm device utils for reference

This did help with performance, done in 84b7d09
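For context, a non-temporal store hints that the written data will not be read again soon, which suits a copy kernel that writes each output element exactly once. The sketch below is not the helper used in 84b7d09, which the thread does not show: `store_streaming` and `pad_copy` are hypothetical names, and the code relies on the Clang builtin `__builtin_nontemporal_store`, falling back to a plain store on compilers that lack it.

```cpp
#include <cstddef>

#ifndef __has_builtin
#define __has_builtin(x) 0  // compilers without __has_builtin take the fallback
#endif

// Hypothetical helper; the PR's actual store utility is not shown in the thread.
template <typename T>
inline void store_streaming(T* dst, T value) {
#if __has_builtin(__builtin_nontemporal_store)
  __builtin_nontemporal_store(value, dst);  // Clang builtin: non-temporal hint
#else
  *dst = value;  // plain store when the builtin is unavailable
#endif
}

// Hypothetical example: copy `valid` elements and zero-fill up to `padded`.
// Every output element is written exactly once and never read back here.
inline void pad_copy(const float* src, float* dst,
                     std::size_t valid, std::size_t padded) {
  for (std::size_t i = 0; i < padded; ++i) {
    store_streaming(dst + i, i < valid ? src[i] : 0.0f);
  }
}
```

Whether the hint helps depends on the hardware and access pattern, so it is worth measuring, as was done in this thread.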

aris134 · 2026-06-01T17:46:26Z

+#pragma unroll
+    for (int i2 = 0; i2 < nvec; ++i2) {
+      const int row = tile_row + i1 * nvec + i2;
+      const int col = tile_col + j1 * nvec;


nit: similar to above for the multi_padding_kernel, I think col and the ternary op can be hoisted up

Done in cb9221d.

speed up nvte_multi_padding / nvte_multi_unpadding

ce6e865

matthiasdiener requested review from alextmagro and aris134 May 20, 2026 15:25

matthiasdiener self-assigned this May 20, 2026

matthiasdiener added the ci-level 1 (CI test level 1) label May 20, 2026

matthiasdiener added 3 commits May 20, 2026 18:37

factor out binary search

a470ecb

Merge branch 'dev' into mdiener/speedup-pad-unpad

45b996a

guard

5f011ae

matthiasdiener marked this pull request as ready for review May 20, 2026 20:01

matthiasdiener requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 20, 2026 20:01

aris134 reviewed Jun 1, 2026

Merge remote-tracking branch 'origin/dev' into mdiener/speedup-pad-unpad

a35459c

aris134 reviewed Jun 1, 2026

aris134 requested changes Jun 1, 2026

matthiasdiener added 2 commits June 1, 2026 12:56

factor out cols

cb9221d

bump n_warps_per_tile

dc708c6

matthiasdiener requested a review from aris134 June 1, 2026 19:59

use NT stores

84b7d09

matthiasdiener wants to merge 8 commits into dev from mdiener/speedup-pad-unpad


3 participants
