speed up nvte_multi_padding / nvte_multi_unpadding#592
Conversation
| for (int i2 = 0; i2 < nvec; ++i2) { | ||
| const int row = tile_row + i1 * nvec + i2; | ||
| const int col = tile_col + j1 * nvec; | ||
| const int remaining = row_length - col; |
There was a problem hiding this comment.
Can we factor out these lines since they don't depend on i2? This also lets us compute valid_cols once per iter instead of repeating the ternary in both load/store calls:
const int col = tile_col + j1 * nvec;
const int remaining = row_length - col;
const int valid_cols = remaining > 0 ? min(remaining, nvec) : 0;
#pragma unroll
for (int i2 = 0; i2 < nvec; ++i2)
There was a problem hiding this comment.
Done in cb9221d. Does not really affect performance; I suspect the compiler is able to factor these out by itself.
| // Parameters to tune | ||
| constexpr int n_warps_per_tile = 4; | ||
| constexpr int threads_per_block = THREADS_PER_WARP * n_warps_per_tile; | ||
| constexpr int desired_load_store_size = 8; | ||
| constexpr int kMaxTensorsPerKernel = 64; // Args must be <4 KB |
There was a problem hiding this comment.
Did you try tuning any of these parameters by chance? The current block size seems small. I am also wondering if you can squeeze out any more performance by tuning the load/store size
There was a problem hiding this comment.
Good point, thanks. Bumping n_warps_per_tile (in dc708c6) does lead to a significant performance increase. Increasing desired_load_store_size reduced performance.
There was a problem hiding this comment.
I wonder if templating out load and store separately would help -- previously I found that 16 outperformed for some kernels for loads, but store was well optimized at 8. This was for FP8 though so may be different here. Also non-temporal stores may be a potential improvement, check out rocm device utils for reference
There was a problem hiding this comment.
I wonder if templating out load and store separately would help -- previously I found that 16 outperformed for some kernels for loads, but store was well optimized at 8. This was for FP8 though so may be different here.
No, that also seems to reduce performance in this case.
Also non-temporal stores may be a potential improvement, check out rocm device utils for reference
This did help with performance, done in 84b7d09
| #pragma unroll | ||
| for (int i2 = 0; i2 < nvec; ++i2) { | ||
| const int row = tile_row + i1 * nvec + i2; | ||
| const int col = tile_col + j1 * nvec; |
There was a problem hiding this comment.
nit: similar to above for the multi_padding_kernel, I think col and the ternary op can be hoisted up
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes https://github.com/ROCm/frameworks-internal/issues/16530
See https://github.com/ROCm/frameworks-internal/issues/16530#issuecomment-4502138388 for performance.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: