[ROCm] Fix TorchScript JIT BF16 HIPRTC overload conflict (ROCM-23829) by srinivamd · Pull Request #3255 · ROCm/pytorch

srinivamd · 2026-05-26T03:39:02Z

Summary

Fixes ROCM-23829: Megatron-DeepSpeed LLaMA2 pretraining crashes during TorchScript JIT BF16 warmup with a __float2bfloat16 return-type overload conflict.

Root Cause

On ROCm >= 7.13 (rocm-systems#4727), HIPRTC headers now bundle amd_hip_bf16.h, which defines __float2bfloat16(float) returning __hip_bfloat16. PyTorch's TorchScript JIT fuser (resource_strings.h) emits its own inline __float2bfloat16(const float) returning __nv_bfloat16 into every JIT-generated GPU kernel. The top-level const on the parameter does not change the function signature, and __hip_bfloat16 and __nv_bfloat16 are different types, so the two definitions differ only in return type. That produces a fatal HIPRTC compile error:

error: functions that differ only in their return type cannot be overloaded
   51 | __attribute__((device)) __nv_bfloat16 __float2bfloat16(const float a) {
      |                         ~~~~~~~~~~~~~ ^
hiprtc_runtime.h:13560: note: previous definition is here
13560 | __attribute__((device)) static inline __hip_bfloat16 __float2bfloat16(float f) {
      |                                       ~~~~~~~~~~~~~~ ^

This blocks all BF16 JIT fusion workloads (for example, the Megatron-DeepSpeed bias_gelu warmup) on MI300X/MI350X.

Fix

In the emitted JIT kernel string (bfloat16_support_literal), detect the HIP bf16 header guard (_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_) at HIPRTC compile time:

When present (ROCm >= 7.13 with HIPRTC bf16 headers): typedef __hip_bfloat16 __nv_bfloat16, which aliases the name to the native HIP type. The HIP-provided __float2bfloat16 / __bfloat162float intrinsics then work transparently, since __nv_bfloat16 is the same type as __hip_bfloat16.
When absent (older ROCm): preserve the existing inline struct and intrinsic definitions unchanged.

This is backward-compatible: on older ROCm, where HIPRTC does not include the bf16 headers, the guard macro is undefined and behavior is identical to before.
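The shape of the guarded preamble described above can be sketched as follows. This is only an illustration, not the actual contents of resource_strings.h: the helper `build_bf16_preamble` and the placeholder legacy body are hypothetical, and only the guard macro name and the typedef are taken from the description above.

```python
# Hypothetical helper that assembles the text of a guarded bf16 preamble.
# Only the macro name and the typedef come from the PR description.
HIP_BF16_GUARD = "_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_"


def build_bf16_preamble(legacy_definitions: str) -> str:
    """Return kernel-source text that picks one bf16 definition at HIPRTC compile time."""
    lines = [
        f"#if defined({HIP_BF16_GUARD})",
        "// HIPRTC already ships __hip_bfloat16 and its conversion intrinsics.",
        "typedef __hip_bfloat16 __nv_bfloat16;",
        "#else",
        "// Older ROCm: keep the fuser's own struct and intrinsics.",
        legacy_definitions.strip(),
        "#endif",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder text stands in for the existing inline definitions.
    print(build_bf16_preamble("/* existing inline struct + intrinsics */"))
```

Because the check happens in the preprocessor of the JIT-compiled source, the same PyTorch build can run against either header layout without a rebuild.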

Reproducer

import torch
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_override_can_fuse_on_gpu(True)

@torch.jit.script
def bias_gelu(bias, y):
    x = bias + y
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

bias = torch.rand(128, dtype=torch.bfloat16, device="cuda")
inp  = torch.rand((32, 4, 128), dtype=torch.bfloat16, device="cuda")
bias.requires_grad, inp.requires_grad = True, True
for _ in range(5):
    out = bias_gelu(bias, inp)
print("OK")

Affected Models

pyt_deepspeed_megatron_llama2_7b
pyt_deepspeed_megatron_llama2_13b
pyt_deepspeed_megatron_llama2_70b
pyt_deepspeed_megatron_gpt3_13b

Test Plan

Run reproducer script above on ROCm 7.13 with MI350X (gfx950)
Verify Megatron-DeepSpeed LLaMA2-70B pretraining starts successfully
Verify no regression on ROCm 7.12 (where HIPRTC bf16 headers are absent)
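One way to run the first and third checks with the same script is sketched below. It wraps the reproducer in a function and prints whether the bundled bf16 headers are expected for the installed stack. The helper names (`rocm_major_minor`, `expects_bundled_bf16_headers`, `run_reproducer`) are hypothetical, and the version check assumes that the major.minor prefix of `torch.version.hip` tracks the ROCm release, which should be confirmed for the build under test.

```python
import re

try:
    import torch
except ImportError:  # keeps the version helpers usable without PyTorch
    torch = None


def rocm_major_minor(hip_version):
    """Return (major, minor) parsed from a HIP version string, or None."""
    if not hip_version:
        return None
    match = re.match(r"(\d+)\.(\d+)", hip_version)
    return (int(match.group(1)), int(match.group(2))) if match else None


def expects_bundled_bf16_headers(hip_version, threshold=(7, 13)):
    """Assumption: the HIP major.minor reported by PyTorch tracks the ROCm release."""
    version = rocm_major_minor(hip_version)
    return version is not None and version >= threshold


def bias_gelu(bias, y):
    x = bias + y
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))


def run_reproducer(iterations=5):
    # Same private JIT toggles as the reproducer in this PR.
    torch._C._jit_set_profiling_mode(False)
    torch._C._jit_set_profiling_executor(False)
    torch._C._jit_override_can_fuse_on_cpu(True)
    torch._C._jit_override_can_fuse_on_gpu(True)
    scripted = torch.jit.script(bias_gelu)
    bias = torch.rand(128, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    inp = torch.rand((32, 4, 128), dtype=torch.bfloat16, device="cuda", requires_grad=True)
    for _ in range(iterations):
        scripted(bias, inp)
    torch.cuda.synchronize()  # surface any asynchronous GPU error before reporting


def main():
    if torch is None or not torch.cuda.is_available():
        print("SKIP: PyTorch with a visible GPU is required")
        return
    hip_version = getattr(torch.version, "hip", None)
    print("HIP version:", hip_version)
    print("bundled bf16 headers expected:", expects_bundled_bf16_headers(hip_version))
    run_reproducer()
    print("OK")


if __name__ == "__main__":
    main()
```

On a stack without the fix, the HIPRTC compile error would be expected to surface as an exception from the scripted call on ROCm >= 7.13, so "OK" is printed only when the fused kernel compiled.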

Cherry-pick targets

release/2.11
release/2.10

Co-Authored-By: Claude Opus 4 (1M context) noreply@anthropic.com

(cherry picked from commit a66eeda) Fixes #ISSUE_NUMBER Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

========================================== Triton build conditionalized on ROCM_VERSION Include the ROCm version in triton version (cherry picked from commit 7d33910) (cherry picked from commit 0412eb4) Update triton-rocm.txt to triton.txt (cherry picked from commit 0ce9f6e) Use ROCm/triton for install_triton.sh (cherry picked from commit 6e9714b) update triton commit Revert "Use ROCm/triton for install_triton.sh" This reverts commit 81b0cbc8435122030044049c661f252ee8aa7ae5. change triton repo Update triton.txt to use release/internal/3.3.x branch Use ROCm/triton Use ROCm/triton for install_triton.sh (cherry picked from commit 0036db5)

…on (#2482) Related to https://github.com/ROCm/builder/pull/90/files http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/305/ PyTorch wheel installs successfully when building torchvision/torchaudio (cherry picked from commit c1ee54d)

Fixes #ISSUE_NUMBER (cherry picked from commit 0ea0592)

…A helper functions ======================================================================================= Implementation of PyTorch ut parsing script - QA helper function (#1386) * Initial implementation of PyTorch ut parsing script * Extracted path variables * Use nested dict to save results * Fixes typo * Cleanup * Fixes several issues * Minor name change * Update run_pytorch_unit_tests.py * Added file banners * Supported running from API * Added more help info * Consistent naming * Format help text --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Print consolidated log file for pytorch unit test automation scripts (#1433) * Print consolidated log file for pytorch uts * Update run_entire_tests subprocess call as well * lint * Add ERROR string [SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491) * Check that >1 GPUs are visible when running TEST_CONFIG=distributed * Add EXECUTION_TIME to file-level and aggregate statistics PyTorch unit test helper scripts enhancements (#1517) * Fail earlier for distributed-on-1-GPU scenario * print cmd in consolidated log with prettier formatting * python->python3 Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264 --------- Co-authored-by: blorange-amd <bo.li2@amd.com> Several issues fix of QA helper script (#1564) Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071 Removed args inside function (#1595) Fixes SWDEV-475071 (cherry picked from commit 041aa1b47978154de63edc6b7ffcdea218a847a3) QA script - Added multi gpu check with priority_tests (#1604) Fixes SWDEV-487907. Verified throwing exception for distributed is working correctly on single gpu with command: python .automation_scripts/run_pytorch_unit_tests.py --priority_test (cherry picked from commit 57cc742271cbf4547f9213710e57f6444bbc983e) (cherry picked from commit 6d5c3dc) (cherry picked from commit 2ee3aa2)

* Use triton commit same as that used for release/2.6 branch since both are triton version 3.2.0, so assuming they're compatible. Relates to: https://github.com/ROCm/rocAutomation/pull/660/files https://github.com/ROCm/builder/pull/70/files Validation http://ml-ci-internal.amd.com:8080/job/pytorch/job/manylinux_rocm_wheels/568/ --------- Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit 14c1417) (cherry picked from commit c20a8f8)

* Add trailing comma for consistency in gfx architecture list Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> * ROCm: Enable tf32 testing on test_nn Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> --------- Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit c113e14)

…-deps flags (#2121) Cherry-pick of #2103 Co-authored-by: Ethan Wee <Ethan.Wee@amd.com> (cherry picked from commit 1dea6e8)

Relates to: ROCm/builder#82 Validation: http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/98/ Using `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_IT_upgrade_numpy_452f3df6`: ``` root@d92befdbb2a6:/# pip list | egrep "numpy|pandas" numpy 2.1.2 pandas 2.2.3 root@d92befdbb2a6:/# python3 Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas >>> import torch >>> import numpy >>> exit() root@d92befdbb2a6:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11369450092315674 Throughput [img/sec] : 562.9120096428937 ``` --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit cf32479)

…2269) Fixes SWDEV-536456 Fixes error post-#2256: ``` 00:12:44.248 #22 155.3 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.61.0 Requires-Python >=3.10; 0.61.0rc1 Requires-Python >=3.10; 0.61.0rc2 Requires-Python >=3.10; 0.61.1rc1 Requires-Python >=3.10; 0.61.2 Requires-Python >=3.10; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.5 Requires-Python >=3.11; 3.5rc0 Requires-Python >=3.11; 8.2.0 Requires-Python >=3.10; 8.2.1 Requires-Python >=3.10 00:12:44.248 #22 155.3 ERROR: Could not find a version that satisfies the requirement numba==0.61.2 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.0, 0.38.0, 0.38.1, 0.39.0, 0.40.0, 0.40.1, 0.41.0, 0.42.0, 0.42.1, 0.43.0, 0.43.1, 0.44.0, 0.44.1, 0.45.0, 0.45.1, 0.46.0, 0.47.0, 0.48.0, 0.49.0, 0.49.1rc1, 0.49.1, 0.50.0rc1, 0.50.0, 0.50.1, 0.51.0rc1, 0.51.0, 0.51.1, 0.51.2, 0.52.0rc2, 0.53.0rc1.post1, 0.53.0rc2, 0.53.0rc3, 0.53.0, 0.53.1, 0.54.0rc2, 0.54.0rc3, 0.54.0, 0.54.1rc1, 0.54.1, 0.55.0rc1, 0.55.0, 0.55.1, 0.55.2, 0.56.0rc1, 0.56.0, 0.56.2, 0.56.3, 0.56.4, 0.57.0rc1, 0.57.0, 0.57.1rc1, 0.57.1, 0.58.0rc1, 0.58.0rc2, 0.58.0, 0.58.1, 0.59.0rc1, 0.59.0, 0.59.1, 0.60.0rc1, 0.60.0) 00:12:44.248 #22 155.3 ERROR: No matching distribution found for numba==0.61.2 ``` Validation: * Docker image: http://rocm-ci.amd.com/job/mainline-framework-pytorch-internal-cs9-ci/132 * Wheels: 
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/102/ From `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu22.04_py3.9_pytorch_lw_rocm7.0_IT_py3.9_a11d94ad`: ``` root@f43861a0a856:/# pip list | egrep "numpy|pandas" numpy 2.0.2 pandas 2.2.3 root@f43861a0a856:/# python Python 3.9.23 (main, Jun 4 2025, 08:55:38) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> import numpy >>> import pandas root@f43861a0a856:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11354223489761353 Throughput [img/sec] : 563.6669038416574 ``` (cherry picked from commit a0a9d81)

…cm7.0/7.1 (#2239) Revamped version of #2108 PR to: - enable complex data types for sparse matmul on ROCm - fix sparse addmm/baddbmm on ROCm - fix sparse hipification for ROCm - fix/enable sparse tests on ROCm (~50 tests total for non-fp16/bf16): - enable fp16/bf16 sparse path for rocm7.0 - enable fp16/bf16 sparse tests for rocm7.0/7.1 ``` test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_* test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_* test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64 test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS* test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_* test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float16 ``` (cherry picked from commit cc2a69c)

#2326) Fixes https://ontrack-internal.amd.com/browse/SWDEV-541809 Upgrading tensorboard after numpy upgrade Ran in **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16381_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_internal_testing_afe8b782** ``` 7 git checkout rocm7.0_IT_upgrade_tensorboard 8 pip install .ci/docker/requirements-ci.txt 9 pip install -r .ci/docker/requirements-ci.txt 10 PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler root@ubb4-rack-22:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler /opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC). _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0) . ---------------------------------------------------------------------- Ran 1 test in 0.327s OK root@ubb4-rack-22:/var/lib/jenkins/pytorch# ``` (cherry picked from commit c7f61f4)

Tested locally successfully ``` root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements.txt Ignoring numpy: markers 'python_version == "3.9"' don't match your environment Requirement already satisfied: setuptools<80.0,>=70.1.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 2)) (79.0.1) Requirement already satisfied: cmake>=3.31.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 3)) (4.0.0) Requirement already satisfied: ninja==1.11.1.3 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 4)) (1.11.1.3) Requirement already satisfied: numpy==2.1.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 5)) (2.1.2) Requirement already satisfied: packaging==25.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 6)) (25.0) Requirement already satisfied: pyyaml==6.0.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 7)) (6.0.2) Requirement already satisfied: requests==2.32.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.32.4) Requirement already satisfied: six==1.17.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 9)) (1.17.0) Requirement already satisfied: typing-extensions==4.14.1 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 10)) (4.14.1) Requirement already satisfied: expecttest==0.3.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.3.0) Requirement already satisfied: filelock==3.18.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (3.18.0) Requirement already satisfied: 
fsspec==2025.7.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2025.7.0) Requirement already satisfied: hypothesis==5.35.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (5.35.1) Requirement already satisfied: jinja2==3.1.6 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (3.1.6) Requirement already satisfied: lintrunner==0.12.7 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (0.12.7) Requirement already satisfied: networkx==2.8.8 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (2.8.8) Requirement already satisfied: optree==0.13.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (0.13.0) Requirement already satisfied: psutil==7.0.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (7.0.0) Requirement already satisfied: sympy==1.13.3 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 20)) (1.13.3) Requirement already satisfied: wheel==0.45.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 22)) (0.45.1) Requirement already satisfied: build[uv] in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.3.0) Requirement already satisfied: charset_normalizer<4,>=2 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.4.3) Requirement already satisfied: idna<4,>=2.5 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.5.0) Requirement already satisfied: certifi>=2017.4.17 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r 
/var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2025.8.3) Requirement already satisfied: attrs>=19.2.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (25.3.0) Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (2.4.0) Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.10/site-packages (from jinja2==3.1.6->-r requirements.txt (line 12)) (3.0.2) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from sympy==1.13.3->-r requirements.txt (line 20)) (1.3.0) Requirement already satisfied: pyproject_hooks in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (1.2.0) Requirement already satisfied: tomli>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (2.2.1) Requirement already satisfied: uv>=0.1.18 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (0.8.10) root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements-build.txt ``` (cherry picked from commit 6e6e454)

This also fixes a problem in gesvd driver when UV is not needed. (cherry picked from commit 4ce57ec) (cherry picked from commit 167b4c1)

(cherry picked from commit d6879fa) (cherry picked from commit 123a164)

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d) (cherry picked from commit 519160d)

Fixes #ISSUE_NUMBER

- Need to use upstream/main for rocm/pytorch's develop branch. For release branches, `github.event.pull_request.base.ref` should work as is.
- Need to remove any trailing space in the PR title so the branch name can be formed correctly.

Fixes #ISSUE_NUMBER

# Conflicts: # .ci/docker/requirements-ci.txt

[AUTOGENERATED] develop_IFU_20251104

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # requirements.txt

To keep triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move triton hash to ToT of that branch.

[AUTOGENERATED] develop_IFU_20251118

[AUTOGENERATED] develop_IFU_20251124

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/requirements-ci.txt # .ci/docker/triton_version.txt # .circleci/scripts/binary_populate_env.sh # .github/scripts/build_triton_wheel.py # test/test_sparse_csr.py

[AUTOGENERATED] develop_IFU_20260211

Adds workflow automation so IFU merges generate issues for commits in range and assign them to commit authors. Includes cold-start handling for first IFU on a branch, normal case when previous IFU tags exist, and dedupe logic to prevent duplicate issues on reruns.

[AUTOGENERATED] develop_IFU_20260218

# Conflicts: # CMakeLists.txt

[AUTOGENERATED] develop_IFU_20260316

…um (#3076) In case of github workflow failing when it gets triggered via PR merge of an IFU PR, we want to be able to run workflow manually to debug and correctly create tags and issues. For this purpose, I have changed the workflow file to take in rocm/pytorch's branch and PR number and run the entire workflow on that. Action Running: https://github.com/ROCm/pytorch/actions/runs/23174239617 IFU PR: #3069

## Summary - Add `pytorch-unit-test-scripts/` directory with all parity scripts (download_testlogs, summarize_xml_testreports, parity.sh, and supporting utilities) - Add `parity.yml` GitHub Actions workflow that can be manually triggered to download CI artifacts and generate parity CSVs - All `download_testlogs` and `summarize_xml_testreports.py` flags are exposed as workflow inputs (SHA, PR ID, arch, exclude flags, filter, set names, etc.) - Architectures are configurable via comma-separated input (default: mi200,mi300,mi355) - Generated CSVs and logs are uploaded as downloadable workflow artifacts ## Setup Requires these repository secrets: - [x] - `IFU_GITHUB_TOKEN` (already exists) - [x] - `AWS_ACCESS_KEY_ID` - [x] - `AWS_SECRET_ACCESS_KEY` ## Test plan - [x] Trigger workflow via Actions tab or `gh workflow run parity.yml --ref add-parity-scripts-dashboard` - [x] Verify artifacts download and CSVs generate for each architecture - [x] Verify CSV artifacts are downloadable from the workflow run https://github.com/ethanwee1/pytorch/actions/runs/23413634454 --------- Co-authored-by: Jithun Nair <jithun.nair@amd.com>

…arity workflow (#3147)

## Summary

Adds log-based failure detection to the parity workflow. Tests that time out (exit code 124), crash (SIGIOT, SIGSEGV), hit Fatal Python errors, or OOM never produce JUnit XML output, so they are invisible to the existing XML-based parity report. This PR closes that gap.

### Changes

- **New script: `detect_log_failures.py`**: Parses raw CI `.txt` log files to detect test failures not captured in XML reports. Classifies failures as TIMEOUT, CRASH, CONSISTENT_FAILURE, or NON_ZERO_EXIT. Outputs a CSV with platform, workflow, test file, category, and reason.
- **`generate_summary.py`**: Adds a `--log-failures` argument to accept CSV(s) from `detect_log_failures.py`. Appends a "LOG-BASED FAILURES (not in XML)" section to both CSV and markdown output.
- **`parity.yml`**: Adds a "Detect log-based failures" step after XML processing (runs when `include_logs` is enabled). Wires the resulting CSV into the summarize job via `--log-failures`.
- Adding in shard information
- Also adding in which workflow we are downloading for in download testlogs

### How it works

1. `detect_log_failures.py` scans `.txt` log files for patterns like:
   - `Got exit code 124` (timeout)
   - `Segmentation fault`, `SIGSEGV`, `SIGIOT`, `Fatal Python error` (crash)
   - `FAILED CONSISTENTLY`
   - `OutOfMemoryError`, `bad_alloc` (OOM)
2. Results are saved as `log_failures_<arch>.csv` and uploaded as part of the per-arch artifact
3. The summarize job collects all log failure CSVs and passes them to `generate_summary.py`
4. The final parity report includes a dedicated section listing these failures

## Test plan

- [x] Syntax-checked both Python files (`py_compile`)
- [x] Validated `parity.yml` YAML syntax
- [x] Tested `detect_log_failures.py` against actual CI log files from parity runs
- [x] Verified all files match fork/main (with correct `.automation_scripts/` paths)
- [x] Run parity workflow with `include_logs: true` to verify end-to-end

Validation: https://github.com/ethanwee1/pytorch/actions/runs/24352395766

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
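The pattern scan described in this commit message could look roughly like the sketch below. It is not the code of `detect_log_failures.py`: the function `classify_log_lines` and the table layout are hypothetical, the "OOM" label is assumed here because the message does not say which category OOM patterns map to, and NON_ZERO_EXIT is left out because no pattern for it is given.

```python
import re

# Hypothetical pattern table. The pattern strings come from the commit message;
# the "OOM" category label is an assumption made for this sketch.
PATTERNS = [
    ("TIMEOUT", re.compile(r"Got exit code 124")),
    ("CRASH", re.compile(r"Segmentation fault|SIGSEGV|SIGIOT|Fatal Python error")),
    ("CONSISTENT_FAILURE", re.compile(r"FAILED CONSISTENTLY")),
    ("OOM", re.compile(r"OutOfMemoryError|bad_alloc")),
]


def classify_log_lines(lines):
    """Return (line_number, category, text) for each line matching a known pattern."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for category, pattern in PATTERNS:
            if pattern.search(line):
                findings.append((lineno, category, line.strip()))
                break  # first matching category wins for a given line
    return findings


if __name__ == "__main__":
    sample = [
        "test_foo.py::test_bar PASSED",
        "Got exit code 124",
        "Fatal Python error: Segmentation fault",
    ]
    for finding in classify_log_lines(sample):
        print(finding)
```

Ordering the table from most to least specific keeps a line that mentions several signals from being counted twice.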

Copied from https://github.com/AMD-ROCm-Internal/rocm-npi-dev/actions/workflows/build_portable_linux_pytorch_dockers.yml Latest run and docker generated: docker.io/rocm/pytorch-private:pytorch-nightly-f8d08404-rocm7.13.0a20260413-ubuntu24.04-py3.12-gfx950-dcgpu https://github.com/ethanwee1/pytorch/actions/runs/24441876981

…3159) Docker credentials were using the ones from my fork and not rocm/pytorch credentials: https://github.com/ROCm/pytorch/actions/runs/24479854145/job/71541505148 Latest build https://github.com/ROCm/pytorch/actions/runs/24480169722/job/71542549933

…sting on that arch

…umn (#3153) ## Summary - Only display tests where ROCm status is FAILED in the summary (CUDA status shown as a context column alongside). Previously both ROCm and CUDA failures were shown. - Add "Also Failing In" column that shows which other architectures have the same test tuple (test_file, test_class, test_name) failing, making it easy to distinguish all-ROCm issues from architecture-specific ones. - Includes count of failed tests in the section header. - Add job-level and test-level shard info to "LOG-BASED FAILURES (not in XML)" and "FAILED TESTS" section - Includes flaky tests in "LOG-BASED FAILURES (not in XML)" section for any tests that pass when run in new process ## Test plan - [x] Cross-arch detection confirmed: tests failing on all 3 archs show the other 2 in "Also Failing In"; single-arch failures show empty - [x] CSV and Markdown output both updated consistently Latest run https://github.com/ROCm/pytorch/actions/runs/24798004968 Run without this PR on the same commit: https://github.com/ROCm/pytorch/actions/runs/24796654604

Repro job without this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342470426/job/74303089638 Validation run with this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342235984 Current issue: existing testing is not able to pick up the CUDA artifacts because the CUDA job and artifact names changed from `test` to `test-osdc` for default and distributed shards. Repro inputs: `sha=b1b5b61ddb689ea65aab0915ecfac5cc459b92fb`, `arch=mi355`, `skip_rocm=false`, `csv_name=pr3199-pre-change-repro`. CUDA job names now use `test-osdc` for default and distributed shards, for example: `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)` `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, 1, 3, ...)` CUDA artifact names now look like: `test-reports-test-osdc-default-1-5` `test-reports-test-osdc-distributed-1-3`

## Summary - Update MI355 parity report shard counts to match current CI artifacts. - Change default shards from 6 to 10 and distributed shards from 3 to 4. ## Validation * Combined parity workflow for `5b9a4786ea4b1a6170c6e5a4878269e7f591224b` on `mi300, mi355`: <https://github.com/ROCm/pytorch/actions/runs/25738157290> --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

## Motivation Old IFU_GITHUB_TOKEN [seems to have expired](https://github.com/ROCm/pytorch/actions/runs/25856299592/job/75974982737) ## Technical Details Replace with PARITY_GITHUB_TOKEN (meant specifically for this workflow) ## Test Plan Run parity.yml with this PR branch and see if it still gives credential error. ## Test Result "Download artifacts" step succeeded in https://github.com/ROCm/pytorch/actions/runs/25857211908/job/75978008711 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Summary - Select the CUDA test artifact kind from the jobs present for the target SHA. - Detect whether the target SHA uses test-osdc or legacy test CUDA jobs, then use the detected kind when building log keys and artifact prefixes. - Apply the same dynamic selection to CUDA inductor jobs. - Treat missing per-arch summary buckets as zero so mixed ROCm/CUDA coverage does not crash report generation. ## Validation - PR/ciflow case: dispatched `Parity Report` on this branch with `sha=386f38175e3aaee2dadb36b5c364deff0869664d` and `arch=mi355, mi300, mi200, navi31`. CUDA default/distributed and inductor selected `test`. - Run: https://github.com/ROCm/pytorch/actions/runs/25866762885 - Main branch case: dispatched `Parity Report` on this branch with `sha=f38b1ec280bafa2ad11f6e767558e73e9eb508a6`, `arch=mi300`, `skip_rocm=true`, and `exclude_distributed=true`. CUDA default and inductor selected `test-osdc`. - Run: https://github.com/ROCm/pytorch/actions/runs/25867046276 - Local syntax check: `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`.

## Summary - Prefer the arch-specific MI200 workflows in `download_testlogs`: `rocm-mi200`, `periodic-rocm-mi200`, and `inductor-rocm-mi200`. - Match arch-specific MI200 test jobs with the `linux-jammy-rocm-py3.10-mi200` prefix for default, distributed, and inductor shards. - Keep `trunk-rocm-sandbox` as the fallback workflow for older SHAs that do not have the MI200-specific workflows, using the legacy `linux-jammy-rocm-py3.10` prefix in that fallback path. ## Motivation A parity run for `50d07a990e33f9822ae4d48bed2d7f06c96522d0` tried to collect MI200 distributed jobs with: `linux-jammy-rocm-py3.10 / test (distributed, ...)` The upstream jobs for this SHA are arch-specific and include `-mi200`, so the log lookup missed all three shards and XML artifact collection fell through to empty results. The script should look for the MI200-specific workflows first, then fall back to `trunk-rocm-sandbox` for older commits. ## Validation - `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs` - Confirmed the fixed prefix matches upstream jobs for `50d07a990e33f9822ae4d48bed2d7f06c96522d0`: - `rocm-mi200`: 6 default shard matches - `periodic-rocm-mi200`: 3 distributed shard matches - `inductor-rocm-mi200`: 2 inductor shard matches - Dispatched `Parity Report` on this branch with `sha=50d07a990e33f9822ae4d48bed2d7f06c96522d0`, `arch=mi200`, and `skip_cuda=true` to validate collection end-to-end. - Initial run before fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920564353 (success) - Current branch run after fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920808611 (queued) Made with [Cursor](https://cursor.com)

## Summary

- Raise the Python CSV parser field limit in `generate_summary.py` so large parity CSV diagnostic fields can be read.
- Truncate oversized diagnostic text fields while loading rows so long failure/skip messages do not make summary generation or output unwieldy.
- Preserve test identity, status, timing, and shard fields used by the parity report tables.

## Root Cause

A parity run failed in the `summarize` job when Python's default CSV field limit rejected a generated-code assertion message larger than 131,072 bytes: https://github.com/ROCm/pytorch/actions/runs/26168276671/job/76979094769

The first offending row was `inductor.test_torchinductor_codegen_dynamic_shapes::DynamicShapesCodegenGPUTests::test_vmap_dot_decomposes_bmm_dynamic_shapes_cuda`, where `message_rocm` was 145,748 bytes.

## Test plan

- `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`
- Re-ran `generate_summary.py` locally against the artifact from the failed run:
  - Input: `20260520_all_tests_status_mi355.csv` from run `26168276671`
  - Output: summary CSV and markdown generated successfully instead of failing with `_csv.Error: field larger than field limit (131072)`.
- Triggered `parity.yml` on this branch with the same upstream commit and arch as the failing run:
  - SHA: `27f2e80e30fb950bc455c777a5e8079e9657a157`
  - Arch: `mi355`
  - Validation run: https://github.com/ROCm/pytorch/actions/runs/26175417191
  - Result: `setup-matrix`, `generate-parity (mi355)`, and `summarize` all completed successfully.
  - The summarize log shows `CSV written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.csv` and `Markdown written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.md`.
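A minimal sketch of the two changes described above, raising the CSV field limit and truncating long diagnostic fields on load, might look like this. It is not the code from `generate_summary.py`: `raise_csv_field_limit`, `load_rows`, the 2000-character cap, and the truncation marker are all hypothetical. Only `csv.field_size_limit` and the `message_rocm` column name are taken from the standard library and the description respectively.

```python
import csv
import io
import sys

MAX_FIELD_CHARS = 2000  # hypothetical cap for diagnostic text
TRUNCATION_MARKER = "...[truncated]"


def raise_csv_field_limit():
    """Raise the csv module's field limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject values larger than a C long.
            limit //= 10


def load_rows(handle, text_fields=("message_rocm",), max_chars=MAX_FIELD_CHARS):
    """Read CSV rows, truncating only the named diagnostic text fields."""
    raise_csv_field_limit()
    rows = []
    for row in csv.DictReader(handle):
        for name in text_fields:
            value = row.get(name)
            if value and len(value) > max_chars:
                row[name] = value[:max_chars] + TRUNCATION_MARKER
        rows.append(row)
    return rows


if __name__ == "__main__":
    # A field larger than the default 131072-character limit.
    data = "test_name,message_rocm\nt1," + "x" * 200000 + "\n"
    rows = load_rows(io.StringIO(data))
    print(rows[0]["test_name"], len(rows[0]["message_rocm"]))
```

Truncating only the listed text columns leaves identity, status, timing, and shard fields untouched, which matches the stated goal of preserving the data used by the report tables.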

On ROCm >= 7.13 (rocm-systems PR pytorch#4727), HIPRTC headers now bundle amd_hip_bf16.h which defines __float2bfloat16(float) returning __hip_bfloat16. PyTorch's TorchScript JIT fuser emits its own inline __float2bfloat16(const float) returning __nv_bfloat16 into every JIT-generated kernel. These two definitions differ only in return type, causing a fatal HIPRTC compile error:

"functions that differ only in their return type cannot be overloaded"

This breaks all Megatron-DeepSpeed / BF16 JIT fusion workloads (bias_gelu warmup) at training startup on MI300X/MI350X.

Fix: detect the HIP bf16 header guard (_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_) in the emitted JIT string. When present, typedef __nv_bfloat16 to the native __hip_bfloat16 type and skip inline intrinsic definitions. When absent (older ROCm), preserve existing inline definitions for backward compatibility.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

rocm-repo-management-api · 2026-05-26T03:52:01Z

Jenkins build for 2a10123010ff117f03ba3c6b0a9d616633ab9b17 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

pragupta and others added 30 commits October 29, 2025 17:24

Add github workflows to automate IFU (#2688) (#2748)

b97cff1

(cherry picked from commit a66eeda) Fixes #ISSUE_NUMBER Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

[rocm7.1_internal_testing] Add triton_kernels wheel generation (#2566)

6b3a141

Fixes #ISSUE_NUMBER (cherry picked from commit 0ea0592)

[AUTOGENERATED] [rocm6.5_internal_testing] Remove --no-index and --no-deps flags (#2121)

d14e5a9

Cherry-pick of #2103 Co-authored-by: Ethan Wee <Ethan.Wee@amd.com> (cherry picked from commit 1dea6e8)

Enable gesvda for ROCM >= 6.1 (#1339)

11ca2d0

This also fixes a problem in gesvd driver when UV is not needed. (cherry picked from commit 4ce57ec) (cherry picked from commit 167b4c1)

Remove ROCmloops specific test

629e824

(cherry picked from commit d6879fa) (cherry picked from commit 123a164)

Bump triton to 3.5.x and update related_commits

ab4714d

Revert to prev sccache by ROCm

2536631

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d) (cherry picked from commit 519160d)

pytorch_ifu.yml: Change date format (#2776)

777e73c

Fixes #ISSUE_NUMBER

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251104

223b9c5

# Conflicts: # .ci/docker/requirements-ci.txt

Fix merge conflict

b4c1e1e

Merge pull request #2784 from ROCm/develop_IFU_20251104

3d74218

[AUTOGENERATED] develop_IFU_20251104

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251118

da5ac4a

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # requirements.txt

Fix conflicts and move triton ver to 3.5.0

a3c49a9

To keep triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move triton hash to ToT of that branch.

Merge pull request #2812 from ROCm/develop_IFU_20251118

5ca076d

[AUTOGENERATED] develop_IFU_20251118

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251124

ecdea86

Merge pull request #2827 from ROCm/develop_IFU_20251124

f742da3

[AUTOGENERATED] develop_IFU_20251124

Fix merge conflicts + bump triton to 3.6.x branch

7e17fb9

Remove stale opentelemetry-cpp submodule

4d67363

pragupta and others added 25 commits February 12, 2026 03:50

Fix merge conflicts

3ee04a9

Merge pull request #2969 from ROCm/develop_IFU_20260211

cc3acaf

[AUTOGENERATED] develop_IFU_20260211

Merge remote-tracking branch 'upstream/main' into develop_IFU_20260218

3fb1b1c

Merge pull request #2989 from ROCm/develop_IFU_20260218

a0f3692

[AUTOGENERATED] develop_IFU_20260218

Fix Issue creation workflow to filter ROCM-only commits (#3017)

af4f7f5

Fix automatic issue creation workflow to filter ROCM-only commits (#3018)

7735e5b

Merge remote-tracking branch 'upstream/main' into develop_IFU_20260316

44ed9df

# Conflicts: # CMakeLists.txt

Fix merge conflicts

0168e75

Merge pull request #3069 from ROCm/develop_IFU_20260316

ebc32c3

[AUTOGENERATED] develop_IFU_20260316

Make gfx94x-dcgpu the default since theRock CI currently runs full testing on that arch

293ee53

Ensure one of pr_id/sha1 is provided to download_testlogs

f401954

srinivamd requested review from jataylo, jeffdaily, jithunnair-amd and pruthvistony as code owners May 26, 2026 03:39

srinivamd wants to merge 57 commits into release/2.12 from fix/rocm-23829-bf16-jit-hiprtc


11 participants
