
[inductor][rocm] make AMD MM matrix_instr_nonkdim configurable #3234

Open
reger-men wants to merge 1 commit into ROCm:develop from reger-men:pr3-mfma-nonkdim

Conversation

@reger-men

Adds torch._inductor.config.rocm.mfma_nonkdim, which reads the env var TORCHINDUCTOR_MFMA_NONKDIM. The AMD MM Triton template autotune sweep is now driven by this knob rather than being hard-coded to [0, 16]. Default behaviour is unchanged on ROCm, and the knob is ignored on other backends.

Recognised values:

| value | autotune sweep | default matrix_instr_nonkdim |
| --- | --- | --- |
| unset | [0, 16] | 16 (upstream) |
| 0 / 16 / 32 | [value] | value |
| auto | [0, 16, 32] | 16 |

mfma_32x32x*_bf16 is only emitted when 32 is in the sweep, so auto is the safe opt-in for shapes where the mfma_32 path might win, while 32 forces it on. This is a per-workload tuning knob; do not set it system-wide.
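The mapping in the table above can be sketched as a small pure-Python function. This is an illustration only, not the code in this PR: `resolve_mfma_nonkdim` and the constant names are hypothetical. The fallback to the upstream default for empty or unparseable input and the case-insensitive match on "auto" are both assumptions, since the description says only that those inputs are tested.

```python
from typing import List, Optional, Tuple

# Upstream behaviour described in the PR: sweep [0, 16], default 16.
UPSTREAM_SWEEP = [0, 16]
UPSTREAM_DEFAULT = 16


def resolve_mfma_nonkdim(raw: Optional[str]) -> Tuple[List[int], int]:
    """Map a raw knob value to (autotune sweep, default matrix_instr_nonkdim).

    Hypothetical helper; the name and signature are not from the PR.
    """
    if raw is None:
        # Unset: keep upstream behaviour.
        return list(UPSTREAM_SWEEP), UPSTREAM_DEFAULT

    text = raw.strip().lower()  # assumption: matching is case-insensitive

    if text == "auto":
        # Extend the sweep with 32; the default stays 16.
        return [0, 16, 32], UPSTREAM_DEFAULT

    if text in ("0", "16", "32"):
        # Forced value: the sweep collapses to that single value.
        value = int(text)
        return [value], value

    # Assumption: empty or unrecognised input falls back to upstream.
    return list(UPSTREAM_SWEEP), UPSTREAM_DEFAULT


if __name__ == "__main__":
    for candidate in (None, "0", "16", "32", "auto", "AUTO", "", "banana"):
        print(repr(candidate), "->", resolve_mfma_nonkdim(candidate))
```

Returning the sweep and the default together mirrors the two columns of the table, which makes it easy to see that only a forced value changes the default.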

Test plan

  • test_amd_mfma_nonkdim_config.py covers unset / forced 0 / forced 16 / forced 32 / auto / garbage via torch._inductor.config.patch
  • subprocess probe asserts the import-time env parser handles 0, 16, 32, auto, AUTO, an empty string, and a non-integer
  • existing AMD MM autotune tests continue to pass with env unset
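The subprocess probe in the test plan follows a common pattern: because the env var is parsed at import time, each value has to be checked in a fresh interpreter. The sketch below shows that pattern only. The `probe` helper and `CHILD_CODE` are hypothetical, and the child merely echoes the env var so the example runs without PyTorch; the real test presumably imports torch._inductor.config in the child and inspects the parsed attribute instead.

```python
import os
import subprocess
import sys
from typing import Optional

ENV_VAR = "TORCHINDUCTOR_MFMA_NONKDIM"

# Stand-in child program (assumption): it only reports what it sees in its
# environment. A real probe would import the config module here instead.
CHILD_CODE = (
    "import os; "
    "print(repr(os.environ.get('TORCHINDUCTOR_MFMA_NONKDIM')))"
)


def probe(value: Optional[str]) -> str:
    """Run a fresh Python with ENV_VAR set to `value` (or unset if None)."""
    env = dict(os.environ)
    env.pop(ENV_VAR, None)  # start from a clean state for this variable
    if value is not None:
        env[ENV_VAR] = value
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for candidate in (None, "0", "16", "32", "auto", "AUTO"):
        print(repr(candidate), "->", probe(candidate))
```

Passing the variable through `env=` scopes it to the child process, which also matches the advice to treat the knob as per-workload rather than system-wide.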

@rocm-repo-management-api

rocm-repo-management-api Bot commented May 19, 2026

Jenkins build for 78e99ac5b00742298bb83fe8d49b1a3d5991856c commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented May 20, 2026

Jenkins build for e2926f2d3ada9da02d55fc19a97bd0028fe6f1f5 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Adds torch._inductor.config.rocm.mfma_nonkdim, reading the env var
TORCHINDUCTOR_MFMA_NONKDIM. The AMD MM Triton template autotune sweep
is now driven from this knob rather than being hard-coded to [0, 16].
Default behaviour is unchanged on ROCm; the knob is ignored on other backends.

Recognised values:
  unset             upstream default ([0, 16] sweep, ROCmGemmConfig default 16)
  "0" / "16" / "32" force a single value; sweep collapses to [value]
  "auto"            extend the autotune sweep to [0, 16, 32]; default stays 16

mfma_32x32x*_bf16 is only emitted when 32 is in the sweep, so "auto"
is the safe opt-in for shapes where the mfma_32 path might win.

Test under test/inductor/test_amd_mfma_nonkdim_config.py covers all
modes (unset / forced int / "auto" / garbage) by patching the config
attribute in-process via torch._inductor.config.patch, plus a
subprocess probe that spawns a fresh Python with the env var set to
exercise the import-time env parser.
@rocm-repo-management-api

rocm-repo-management-api Bot commented May 21, 2026

Jenkins build for f26feec7ab0bf69d6f3b70be5a08fd354f690cc8 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
