[gfx1250][gemm] Add PTPC FP8/A8W4 support by aoli26 · Pull Request #649 · ROCm/FlyDSL

aoli26 · 2026-06-03T17:00:43Z

Motivation

Add per-token per-channel (PTPC) scaling to the gfx1250 GEMM kernel. PTPC scales are fp32 values, per-token sa[M] and per-channel sb[N], that are constant along K, so they are applied once in the epilogue rather than per K-block.
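
To illustrate why a scale that is constant along K can move to the epilogue, here is a minimal pure-Python sketch. It is not FlyDSL code: the function names are hypothetical, small integers stand in for the quantized FP8/FP4 values, and the scales are powers of two so that both forms agree exactly (with general fp32 scales they would agree only up to rounding).

```python
import random

def matmul(a, b):
    """Plain (M x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gemm_scale_in_epilogue(a_q, b_q, sa, sb):
    """Unscaled GEMM, then one sa[m] * sb[n] multiply per output element."""
    acc = matmul(a_q, b_q)
    return [[acc[m][n] * sa[m] * sb[n] for n in range(len(sb))]
            for m in range(len(sa))]

def gemm_scale_inputs(a_q, b_q, sa, sb):
    """Scale A by row and B by column first, then multiply."""
    a = [[x * sa[m] for x in row] for m, row in enumerate(a_q)]
    b = [[x * sb[n] for n, x in enumerate(row)] for row in b_q]
    return matmul(a, b)

random.seed(0)
M, N, K = 4, 6, 16
# Hypothetical stand-ins for quantized operands and their scales.
a_q = [[random.randint(-8, 7) for _ in range(K)] for _ in range(M)]
b_q = [[random.randint(-8, 7) for _ in range(N)] for _ in range(K)]
sa = [2.0 ** random.randint(-3, 3) for _ in range(M)]  # per-token, length M
sb = [2.0 ** random.randint(-3, 3) for _ in range(N)]  # per-channel, length N

epilogue_result = gemm_scale_in_epilogue(a_q, b_q, sa, sb)
scaled_inputs_result = gemm_scale_inputs(a_q, b_q, sa, sb)
```

Because sa[m] and sb[n] do not depend on k, they factor out of the sum over K, which is what lets the K-loop run without any scale handling.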

Technical Details

PTPC FP8 runs the unscaled WMMA in the K-loop, while A8W4 uses the scaled f8f6f4 op with an identity scale; in both cases sa*sb is applied in fp32 in the epilogue. Split-K is supported by scaling each K-chunk's partial result and combining the chunks with atomic adds. All changes are compile-time gated to PTPC, so the mxscale path is untouched. PTPC additionally skips the scale TDM/LDS traffic (only 2 loader waves are needed) and prefetches the epilogue sa/sb loads behind the last WMMAs.
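
The split-K behaviour described above can be sketched in plain Python, under stated assumptions. The function name `ptpc_split_k` is hypothetical, the sequential `+=` stands in for the kernel's atomic add, and the even chunking is an illustrative choice rather than the kernel's actual work partitioning.

```python
def matmul(a, b):
    """Plain (M x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def ptpc_split_k(a_q, b_q, sa, sb, num_splits):
    """Scale each K-chunk's partial product by sa[m] * sb[n], then accumulate."""
    M, K, N = len(a_q), len(b_q), len(b_q[0])
    out = [[0.0] * N for _ in range(M)]
    chunk = (K + num_splits - 1) // num_splits  # ceil(K / num_splits)
    for k0 in range(0, K, chunk):
        a_chunk = [row[k0:k0 + chunk] for row in a_q]  # M x chunk slice of A
        b_chunk = b_q[k0:k0 + chunk]                   # chunk x N slice of B
        partial = matmul(a_chunk, b_chunk)             # unscaled partial GEMM
        for m in range(M):
            for n in range(N):
                # Per-chunk scale, then accumulate (atomic add in the kernel).
                out[m][n] += partial[m][n] * sa[m] * sb[n]
    return out

# Tiny hand-checkable case: M=2, K=4, N=2, power-of-two scales.
a_q = [[1, 2, 3, 4], [5, 6, 7, 8]]
b_q = [[1, 0], [0, 1], [1, 1], [2, -1]]
sa = [0.5, 2.0]
sb = [4.0, 0.25]
result = ptpc_split_k(a_q, b_q, sa, sb, num_splits=2)
```

Scaling per chunk works because multiplication by sa[m] * sb[n] distributes over the sum of the partial products, so no extra pass over the output is needed after the chunks are combined.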

Test Plan

pytest tests/kernels/test_gemm_fp8fp4_gfx1250.py -k ptpc, plus ISA inspection of the PTPC kernels.

Test Result

All 14 PTPC tests pass (FP8 + A8W4 + split-K); ISA confirms scale TDM removal and epilogue prefetch with lower VGPR count and 0 spill.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

aoli26 added 4 commits June 3, 2026 16:42

feat: add ptpc fp8, a8w4 gemm

d013fae

optimize ptpc epilogue vgpr prefetch

1d1ad98

ptpc use no-scale wmma for compatibility

5058565

mxscale/ptpc a8w4 use latest fp8 scheduler

81ab9bc

aoli26 wants to merge 4 commits into gfx1250/gemm_fp8_opt from gfx1250/gemm_ptpc
