
[ROCm] Fix TorchScript JIT BF16 HIPRTC overload conflict (ROCM-23829) #3255

Open
srinivamd wants to merge 57 commits into release/2.12 from fix/rocm-23829-bf16-jit-hiprtc


@srinivamd

Summary

Fixes ROCM-23829: Megatron-DeepSpeed LLaMA2 pretraining crashes during TorchScript JIT BF16 warmup with a __float2bfloat16 return-type overload conflict.

Root Cause

On ROCm >= 7.13 (rocm-systems#4727), HIPRTC headers now bundle amd_hip_bf16.h which defines __float2bfloat16(float) returning __hip_bfloat16. PyTorch's TorchScript JIT fuser (resource_strings.h) emits its own inline __float2bfloat16(const float) returning __nv_bfloat16 into every JIT-generated GPU kernel. Since __hip_bfloat16 and __nv_bfloat16 are different types, these two definitions differ only in return type, producing a fatal HIPRTC compile error:

error: functions that differ only in their return type cannot be overloaded
   51 | __attribute__((device)) __nv_bfloat16 __float2bfloat16(const float a) {
      |                         ~~~~~~~~~~~~~ ^
hiprtc_runtime.h:13560: note: previous definition is here
13560 | __attribute__((device)) static inline __hip_bfloat16 __float2bfloat16(float f) {
      |                                       ~~~~~~~~~~~~~~ ^

This blocks all BF16 JIT fusion workloads (for example, the Megatron-DeepSpeed bias_gelu warmup) on MI300X/MI350X.

Fix

In the emitted JIT kernel string (bfloat16_support_literal), detect the HIP bf16 header guard (_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_) at HIPRTC compile time:

  • When present (ROCm >= 7.13 with HIPRTC bf16 headers): typedef __hip_bfloat16 __nv_bfloat16 — alias to the native HIP type. HIP-provided __float2bfloat16 / __bfloat162float intrinsics work transparently since __nv_bfloat16 IS __hip_bfloat16.
  • When absent (older ROCm): preserve existing inline struct + intrinsic definitions unchanged.

This is backward-compatible — on older ROCm where HIPRTC does not include bf16 headers, the guard is false and behavior is identical to before.
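
The branch selection described above can be modeled as follows. This is a minimal Python sketch, not the contents of resource_strings.h: `BF16_PRELUDE_SKETCH` and `active_branch` are hypothetical names, and the preprocessor text is only an assumed shape. The guard macro name and the typedef are the parts taken from this description.

```python
# Hypothetical model of the guard logic; the real change lives in a C++ string literal.
HIP_BF16_GUARD = "_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_"

# Assumed shape of the emitted preprocessor text (illustrative only).
BF16_PRELUDE_SKETCH = f"""\
#if defined({HIP_BF16_GUARD})
typedef __hip_bfloat16 __nv_bfloat16;
#else
/* existing inline struct and intrinsic definitions, unchanged */
#endif
"""


def active_branch(defined_macros):
    """Return which branch HIPRTC would compile for a given set of defined macros."""
    return "alias" if HIP_BF16_GUARD in defined_macros else "inline"
```

With the guard defined, only the typedef is compiled, so the HIP-provided intrinsics are the sole definitions in the translation unit and the return-type conflict cannot occur.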

Reproducer

import torch
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_override_can_fuse_on_gpu(True)

@torch.jit.script
def bias_gelu(bias, y):
    x = bias + y
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

bias = torch.rand(128, dtype=torch.bfloat16, device="cuda")
inp  = torch.rand((32, 4, 128), dtype=torch.bfloat16, device="cuda")
bias.requires_grad, inp.requires_grad = True, True
for _ in range(5):
    out = bias_gelu(bias, inp)
print("OK")

Affected Models

  • pyt_deepspeed_megatron_llama2_7b
  • pyt_deepspeed_megatron_llama2_13b
  • pyt_deepspeed_megatron_llama2_70b
  • pyt_deepspeed_megatron_gpt3_13b

Test Plan

  • Run reproducer script above on ROCm 7.13 with MI350X (gfx950)
  • Verify Megatron-DeepSpeed LLaMA2-70B pretraining starts successfully
  • Verify no regression on ROCm 7.12 (where HIPRTC bf16 headers are absent)
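
When running the plan on several ROCm versions, it may help to record which branch of the fix each environment is expected to exercise. The helper below is a sketch under assumptions: the function names are hypothetical, the 7.13 threshold is taken from the Root Cause section, and how the version string is obtained (for instance from the installed ROCm package metadata) is left to the caller.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '7.13.0a20260413'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def expects_hip_bf16_alias(version, threshold=(7, 13)):
    """True when the typedef branch is expected, per the ROCm >= 7.13 statement."""
    return parse_major_minor(version) >= threshold
```

Comparing tuples rather than strings avoids the usual pitfall where "7.9" sorts after "7.13" lexically.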

Cherry-pick targets

  • release/2.11
  • release/2.10

Co-Authored-By: Claude Opus 4 (1M context) noreply@anthropic.com

pragupta and others added 30 commits October 29, 2025 17:24
(cherry picked from commit a66eeda)

Fixes #ISSUE_NUMBER

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
==========================================

Triton build conditionalized on ROCM_VERSION

Include the ROCm version in triton version

(cherry picked from commit 7d33910)
(cherry picked from commit 0412eb4)

Update triton-rocm.txt to triton.txt

(cherry picked from commit 0ce9f6e)

Use ROCm/triton for install_triton.sh

(cherry picked from commit 6e9714b)

update triton commit

Revert "Use ROCm/triton for install_triton.sh"

This reverts commit 81b0cbc8435122030044049c661f252ee8aa7ae5.

change triton repo

Update triton.txt to use release/internal/3.3.x branch

Use ROCm/triton

Use ROCm/triton for install_triton.sh

(cherry picked from commit 0036db5)
…A helper functions

=======================================================================================

Implementation of PyTorch ut parsing script - QA helper function (#1386)

* Initial implementation of PyTorch ut parsing script

* Extracted path variables

* Use nested dict to save results

* Fixes typo

* Cleanup

* Fixes several issues

* Minor name change

* Update run_pytorch_unit_tests.py

* Added file banners

* Supported running from API

* Added more help info

* Consistent naming

* Format help text

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>

Print consolidated log file for pytorch unit test automation scripts (#1433)

* Print consolidated log file for pytorch uts

* Update run_entire_tests subprocess call as well

* lint

* Add ERROR string

[SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491)

* Check that >1 GPUs are visible when running TEST_CONFIG=distributed

* Add EXECUTION_TIME to file-level and aggregate statistics

PyTorch unit test helper scripts enhancements (#1517)

* Fail earlier for distributed-on-1-GPU scenario
* print cmd in consolidated log with prettier formatting
* python->python3

Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264

---------

Co-authored-by: blorange-amd <bo.li2@amd.com>

Several issues fix of QA helper script (#1564)

Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071

Removed args inside function (#1595)

Fixes SWDEV-475071

(cherry picked from commit 041aa1b47978154de63edc6b7ffcdea218a847a3)

QA script - Added multi gpu check with priority_tests (#1604)

Fixes SWDEV-487907. Verified that the exception for distributed tests is
thrown correctly on a single GPU with the command: python
.automation_scripts/run_pytorch_unit_tests.py --priority_test

(cherry picked from commit 57cc742271cbf4547f9213710e57f6444bbc983e)
(cherry picked from commit 6d5c3dc)
(cherry picked from commit 2ee3aa2)
* Use triton commit same as that used for release/2.6 branch since both
are triton version 3.2.0, so assuming they're compatible.

Relates to:
https://github.com/ROCm/rocAutomation/pull/660/files
https://github.com/ROCm/builder/pull/70/files

Validation

http://ml-ci-internal.amd.com:8080/job/pytorch/job/manylinux_rocm_wheels/568/

---------

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 14c1417)
(cherry picked from commit c20a8f8)
* Add trailing comma for consistency in gfx architecture list

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

* ROCm: Enable tf32 testing on test_nn

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
(cherry picked from commit c113e14)
…-deps flags (#2121)

Cherry-pick of #2103

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit 1dea6e8)
Relates to: ROCm/builder#82

Validation:
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/98/

Using
`registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_IT_upgrade_numpy_452f3df6`:
```
root@d92befdbb2a6:/# pip list | egrep "numpy|pandas"
numpy                   2.1.2
pandas                  2.2.3
root@d92befdbb2a6:/# python3
Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import torch
>>> import numpy
>>> exit()
root@d92befdbb2a6:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.11369450092315674
Throughput [img/sec] : 562.9120096428937
```

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit cf32479)
…2269)

Fixes SWDEV-536456

Fixes error post-#2256:
```
00:12:44.248  #22 155.3 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.61.0 Requires-Python >=3.10; 0.61.0rc1 Requires-Python >=3.10; 0.61.0rc2 Requires-Python >=3.10; 0.61.1rc1 Requires-Python >=3.10; 0.61.2 Requires-Python >=3.10; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.5 Requires-Python >=3.11; 3.5rc0 Requires-Python >=3.11; 8.2.0 Requires-Python >=3.10; 8.2.1 Requires-Python >=3.10
00:12:44.248  #22 155.3 ERROR: Could not find a version that satisfies the requirement numba==0.61.2 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.0, 0.38.0, 0.38.1, 0.39.0, 0.40.0, 0.40.1, 0.41.0, 0.42.0, 0.42.1, 0.43.0, 0.43.1, 0.44.0, 0.44.1, 0.45.0, 0.45.1, 0.46.0, 0.47.0, 0.48.0, 0.49.0, 0.49.1rc1, 0.49.1, 0.50.0rc1, 0.50.0, 0.50.1, 0.51.0rc1, 0.51.0, 0.51.1, 0.51.2, 0.52.0rc2, 0.53.0rc1.post1, 0.53.0rc2, 0.53.0rc3, 0.53.0, 0.53.1, 0.54.0rc2, 0.54.0rc3, 0.54.0, 0.54.1rc1, 0.54.1, 0.55.0rc1, 0.55.0, 0.55.1, 0.55.2, 0.56.0rc1, 0.56.0, 0.56.2, 0.56.3, 0.56.4, 0.57.0rc1, 0.57.0, 0.57.1rc1, 0.57.1, 0.58.0rc1, 0.58.0rc2, 0.58.0, 0.58.1, 0.59.0rc1, 0.59.0, 0.59.1, 0.60.0rc1, 0.60.0)
00:12:44.248  #22 155.3 ERROR: No matching distribution found for numba==0.61.2
```

Validation:
* Docker image:
http://rocm-ci.amd.com/job/mainline-framework-pytorch-internal-cs9-ci/132
* Wheels:
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/102/

From
`registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu22.04_py3.9_pytorch_lw_rocm7.0_IT_py3.9_a11d94ad`:
```
root@f43861a0a856:/# pip list | egrep "numpy|pandas"
numpy                   2.0.2
pandas                  2.2.3
root@f43861a0a856:/# python
Python 3.9.23 (main, Jun  4 2025, 08:55:38)
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import numpy
>>> import pandas
root@f43861a0a856:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.11354223489761353
Throughput [img/sec] : 563.6669038416574
```

(cherry picked from commit a0a9d81)
…cm7.0/7.1 (#2239)

Revamped version of #2108

PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~50 tests total for non-fp16/bf16):
- enable fp16/bf16 sparse path for rocm7.0
- enable fp16/bf16 sparse tests for rocm7.0/7.1
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float16
```

(cherry picked from commit cc2a69c)
#2326)

Fixes https://ontrack-internal.amd.com/browse/SWDEV-541809

Upgrading tensorboard after numpy upgrade
Ran in
**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16381_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_internal_testing_afe8b782**

```
    7  git checkout rocm7.0_IT_upgrade_tensorboard
    8  pip install .ci/docker/requirements-ci.txt
    9  pip install -r .ci/docker/requirements-ci.txt
   10  PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler

root@ubb4-rack-22:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
.
----------------------------------------------------------------------
Ran 1 test in 0.327s

OK
root@ubb4-rack-22:/var/lib/jenkins/pytorch#

```

(cherry picked from commit c7f61f4)
Tested locally successfully
```
root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements.txt
Ignoring numpy: markers 'python_version == "3.9"' don't match your environment
Requirement already satisfied: setuptools<80.0,>=70.1.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 2)) (79.0.1)
Requirement already satisfied: cmake>=3.31.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 3)) (4.0.0)
Requirement already satisfied: ninja==1.11.1.3 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 4)) (1.11.1.3)
Requirement already satisfied: numpy==2.1.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 5)) (2.1.2)
Requirement already satisfied: packaging==25.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 6)) (25.0)
Requirement already satisfied: pyyaml==6.0.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 7)) (6.0.2)
Requirement already satisfied: requests==2.32.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.32.4)
Requirement already satisfied: six==1.17.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 9)) (1.17.0)
Requirement already satisfied: typing-extensions==4.14.1 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 10)) (4.14.1)
Requirement already satisfied: expecttest==0.3.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.3.0)
Requirement already satisfied: filelock==3.18.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (3.18.0)
Requirement already satisfied: fsspec==2025.7.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2025.7.0)
Requirement already satisfied: hypothesis==5.35.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (5.35.1)
Requirement already satisfied: jinja2==3.1.6 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (3.1.6)
Requirement already satisfied: lintrunner==0.12.7 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (0.12.7)
Requirement already satisfied: networkx==2.8.8 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (2.8.8)
Requirement already satisfied: optree==0.13.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (0.13.0)
Requirement already satisfied: psutil==7.0.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (7.0.0)
Requirement already satisfied: sympy==1.13.3 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 20)) (1.13.3)
Requirement already satisfied: wheel==0.45.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 22)) (0.45.1)
Requirement already satisfied: build[uv] in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.4.3)
Requirement already satisfied: idna<4,>=2.5 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2025.8.3)
Requirement already satisfied: attrs>=19.2.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (25.3.0)
Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (2.4.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.10/site-packages (from jinja2==3.1.6->-r requirements.txt (line 12)) (3.0.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from sympy==1.13.3->-r requirements.txt (line 20)) (1.3.0)
Requirement already satisfied: pyproject_hooks in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (1.2.0)
Requirement already satisfied: tomli>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (2.2.1)
Requirement already satisfied: uv>=0.1.18 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (0.8.10)
root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements-build.txt

```

(cherry picked from commit 6e6e454)
This also fixes a problem in gesvd driver when UV is not needed.

(cherry picked from commit 4ce57ec)
(cherry picked from commit 167b4c1)
(cherry picked from commit d6879fa)
(cherry picked from commit 123a164)
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

(cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d)
(cherry picked from commit 519160d)
- Need to use upstream/main for rocm/pytorch's develop branch. For
  release branches, `github.event.pull_request.base.ref` should work as
  is.

- Need to remove any trailing space in the PR title so the branch name can be
  formed correctly

Fixes #ISSUE_NUMBER
# Conflicts:
#	.ci/docker/requirements-ci.txt
[AUTOGENERATED] develop_IFU_20251104
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	requirements.txt
To keep triton version consistent with what is in rocm/triton's
release/internal/3.5.x branch, we need to keep triton_version.txt at
3.5.0 and move triton hash to ToT of that branch.
[AUTOGENERATED] develop_IFU_20251118
[AUTOGENERATED] develop_IFU_20251124
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/requirements-ci.txt
#	.ci/docker/triton_version.txt
#	.circleci/scripts/binary_populate_env.sh
#	.github/scripts/build_triton_wheel.py
#	test/test_sparse_csr.py
pragupta and others added 25 commits February 12, 2026 03:50
[AUTOGENERATED] develop_IFU_20260211
Adds workflow automation so IFU merges generate issues for commits in
range and assign them to commit authors. Includes cold-start handling
for first IFU on a branch, normal case when previous IFU tags exist, and
dedupe logic to prevent duplicate issues on reruns.
[AUTOGENERATED] develop_IFU_20260218
[AUTOGENERATED] develop_IFU_20260316
…um (#3076)

If the GitHub workflow fails when it is triggered by the merge of an IFU
PR, we want to be able to run the workflow manually to debug it and to
create tags and issues correctly. To that end, the workflow file now
takes rocm/pytorch's branch and PR number as inputs and runs the entire
workflow on them.

Action Running: https://github.com/ROCm/pytorch/actions/runs/23174239617
IFU PR: #3069
## Summary
- Add `pytorch-unit-test-scripts/` directory with all parity scripts
(download_testlogs, summarize_xml_testreports, parity.sh, and supporting
utilities)
- Add `parity.yml` GitHub Actions workflow that can be manually
triggered to download CI artifacts and generate parity CSVs
- All `download_testlogs` and `summarize_xml_testreports.py` flags are
exposed as workflow inputs (SHA, PR ID, arch, exclude flags, filter, set
names, etc.)
- Architectures are configurable via comma-separated input (default:
mi200,mi300,mi355)
- Generated CSVs and logs are uploaded as downloadable workflow
artifacts

## Setup
Requires these repository secrets:

- [x] - `IFU_GITHUB_TOKEN` (already exists)
- [x] - `AWS_ACCESS_KEY_ID`
- [x] - `AWS_SECRET_ACCESS_KEY`

## Test plan
- [x] Trigger workflow via Actions tab or `gh workflow run parity.yml
--ref add-parity-scripts-dashboard`
- [x] Verify artifacts download and CSVs generate for each architecture
- [x] Verify CSV artifacts are downloadable from the workflow run
https://github.com/ethanwee1/pytorch/actions/runs/23413634454

---------

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
…arity workflow (#3147)

## Summary

Adds log-based failure detection to the parity workflow. Tests that
timeout (exit code 124), crash (SIGIOT, SIGSEGV), hit Fatal Python
errors, or OOM never produce JUnit XML output, so they are invisible to
the existing XML-based parity report. This PR closes that gap.

### Changes

- **New script: `detect_log_failures.py`** — Parses raw CI `.txt` log
files to detect test failures not captured in XML reports. Classifies
failures as TIMEOUT, CRASH, CONSISTENT_FAILURE, or NON_ZERO_EXIT.
Outputs a CSV with platform, workflow, test file, category, and reason.
- **`generate_summary.py`** — Adds `--log-failures` argument to accept
CSV(s) from `detect_log_failures.py`. Appends a "LOG-BASED FAILURES (not
in XML)" section to both CSV and markdown output.
- **`parity.yml`** — Adds a "Detect log-based failures" step after XML
processing (runs when `include_logs` is enabled). Wires the resulting
CSV into the summarize job via `--log-failures`.
- Adds shard information.
- Also reports which workflow is being downloaded in
`download_testlogs`.

### How it works

1. `detect_log_failures.py` scans `.txt` log files for patterns like:
   - `Got exit code 124` (timeout)
   - `Segmentation fault`, `SIGSEGV`, `SIGIOT`, `Fatal Python error` (crash)
   - `FAILED CONSISTENTLY`
   - `OutOfMemoryError`, `bad_alloc` (OOM)
2. Results are saved as `log_failures_<arch>.csv` and uploaded as part
of the per-arch artifact
3. The summarize job collects all log failure CSVs and passes them to
`generate_summary.py`
4. The final parity report includes a dedicated section listing these
failures
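
The scanning step can be sketched as a small pattern table. This is not the code in `detect_log_failures.py`: the function names are hypothetical, and `OOM` is an assumed label, since the summary above names only TIMEOUT, CRASH, CONSISTENT_FAILURE, and NON_ZERO_EXIT as categories.

```python
import re

# Patterns come from the list above; the "OOM" label is an assumption.
PATTERNS = [
    (re.compile(r"Got exit code 124"), "TIMEOUT"),
    (re.compile(r"Segmentation fault|SIGSEGV|SIGIOT|Fatal Python error"), "CRASH"),
    (re.compile(r"FAILED CONSISTENTLY"), "CONSISTENT_FAILURE"),
    (re.compile(r"OutOfMemoryError|bad_alloc"), "OOM"),
]


def classify_line(line):
    """Return the first matching category for a log line, or None."""
    for pattern, category in PATTERNS:
        if pattern.search(line):
            return category
    return None


def scan_log(lines):
    """Yield (line_number, category, text) for every line that matches a pattern."""
    for number, line in enumerate(lines, start=1):
        category = classify_line(line)
        if category is not None:
            yield number, category, line.strip()
```

Because the table is ordered, a line that matches several patterns is reported under the first one only; whether the real script does the same is not stated here.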

## Test plan

- [x] Syntax-checked both Python files (`py_compile`)
- [x] Validated `parity.yml` YAML syntax
- [x] Tested `detect_log_failures.py` against actual CI log files from
parity runs
- [x] Verified all files match fork/main (with correct
`.automation_scripts/` paths)
- [x] Run parity workflow with `include_logs: true` to verify end-to-end
Validation:
https://github.com/ethanwee1/pytorch/actions/runs/24352395766

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Copied from
https://github.com/AMD-ROCm-Internal/rocm-npi-dev/actions/workflows/build_portable_linux_pytorch_dockers.yml

Latest run and docker generated:
docker.io/rocm/pytorch-private:pytorch-nightly-f8d08404-rocm7.13.0a20260413-ubuntu24.04-py3.12-gfx950-dcgpu
https://github.com/ethanwee1/pytorch/actions/runs/24441876981
…umn (#3153)

## Summary
- Only display tests where ROCm status is FAILED in the summary (CUDA
status shown as a context column alongside). Previously both ROCm and
CUDA failures were shown.
- Add "Also Failing In" column that shows which other architectures have
the same test tuple (test_file, test_class, test_name) failing, making
it easy to distinguish all-ROCm issues from architecture-specific ones.
- Includes count of failed tests in the section header.
- Add job-level and test-level shard info to "LOG-BASED FAILURES (not in
XML)" and "FAILED TESTS" section
- Includes flaky tests in "LOG-BASED FAILURES (not in XML)" section for
any tests that pass when run in new process

## Test plan

- [x] Cross-arch detection confirmed: tests failing on all 3 archs show
the other 2 in "Also Failing In"; single-arch failures show empty
- [x] CSV and Markdown output both updated consistently
Latest run https://github.com/ROCm/pytorch/actions/runs/24798004968
Run without this PR on the same commit:
https://github.com/ROCm/pytorch/actions/runs/24796654604
Repro job without this PR's change:
https://github.com/ROCm/pytorch/actions/runs/25342470426/job/74303089638

Validation run with this PR's change:
https://github.com/ROCm/pytorch/actions/runs/25342235984

Current issue: existing testing is not able to pick up the CUDA
artifacts because the CUDA job and artifact names changed from `test` to
`test-osdc` for default and distributed shards.

Repro inputs: `sha=b1b5b61ddb689ea65aab0915ecfac5cc459b92fb`,
`arch=mi355`, `skip_rocm=false`, `csv_name=pr3199-pre-change-repro`.

CUDA job names now use `test-osdc` for default and distributed shards,
for example:

`linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)`
`linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, 1, 3, ...)`

CUDA artifact names now look like:

`test-reports-test-osdc-default-1-5`
`test-reports-test-osdc-distributed-1-3`
## Summary
- Update MI355 parity report shard counts to match current CI artifacts.
- Change default shards from 6 to 10 and distributed shards from 3 to 4.

## Validation
* Combined parity workflow for
`5b9a4786ea4b1a6170c6e5a4878269e7f591224b` on `mi300, mi355`:
<https://github.com/ROCm/pytorch/actions/runs/25738157290>

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
## Motivation

Old IFU_GITHUB_TOKEN [seems to have
expired](https://github.com/ROCm/pytorch/actions/runs/25856299592/job/75974982737)

## Technical Details

Replace with PARITY_GITHUB_TOKEN (meant specifically for this workflow)

## Test Plan

Run parity.yml with this PR branch and see if it still gives credential
error.

## Test Result

"Download artifacts" step succeeded in
https://github.com/ROCm/pytorch/actions/runs/25857211908/job/75978008711

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary
- Select the CUDA test artifact kind from the jobs present for the
target SHA.
- Detect whether the target SHA uses test-osdc or legacy test CUDA jobs,
then use the detected kind when building log keys and artifact prefixes.
- Apply the same dynamic selection to CUDA inductor jobs.
- Treat missing per-arch summary buckets as zero so mixed ROCm/CUDA
coverage does not crash report generation.
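
A minimal sketch of the detection and the zero-default lookup, under assumptions: the function names are hypothetical, the job-name format follows the `test-osdc` examples quoted in the earlier commit, and the legacy artifact prefix is inferred by analogy rather than confirmed.

```python
import re

# Matches the job kind in names like
# "linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)".
JOB_KIND = re.compile(r" / (test-osdc|test) \(")


def detect_cuda_test_kind(job_names, default="test"):
    """Return 'test-osdc' if any job uses it, else 'test' if present, else `default`."""
    kinds = {m.group(1) for name in job_names if (m := JOB_KIND.search(name))}
    if "test-osdc" in kinds:
        return "test-osdc"
    if "test" in kinds:
        return "test"
    return default


def artifact_prefix(kind, config):
    """Build the artifact-name prefix for a detected kind and shard config."""
    return f"test-reports-{kind}-{config}-"


def bucket_count(summary, arch, status):
    """Treat a missing per-arch bucket as zero instead of raising KeyError."""
    return summary.get(arch, {}).get(status, 0)
```

Detecting the kind from the jobs actually present for a SHA means the same script handles both older commits and commits after the rename without a hard-coded cutover.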

## Validation
- PR/ciflow case: dispatched `Parity Report` on this branch with
`sha=386f38175e3aaee2dadb36b5c364deff0869664d` and `arch=mi355, mi300,
mi200, navi31`. CUDA default/distributed and inductor selected `test`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25866762885
- Main branch case: dispatched `Parity Report` on this branch with
`sha=f38b1ec280bafa2ad11f6e767558e73e9eb508a6`, `arch=mi300`,
`skip_rocm=true`, and `exclude_distributed=true`. CUDA default and
inductor selected `test-osdc`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25867046276
- Local syntax check: `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/download_testlogs
.automation_scripts/pytorch-unit-test-scripts/generate_summary.py`.
## Summary
- Prefer the arch-specific MI200 workflows in `download_testlogs`:
`rocm-mi200`, `periodic-rocm-mi200`, and `inductor-rocm-mi200`.
- Match arch-specific MI200 test jobs with the
`linux-jammy-rocm-py3.10-mi200` prefix for default, distributed, and
inductor shards.
- Keep `trunk-rocm-sandbox` as the fallback workflow for older SHAs that
do not have the MI200-specific workflows, using the legacy
`linux-jammy-rocm-py3.10` prefix in that fallback path.
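
The prefer-then-fall-back selection can be sketched as below. The workflow names and job prefixes are the ones listed above; the function name and its input (a collection of workflow names found for the SHA) are hypothetical and do not reflect the actual structure of `download_testlogs`.

```python
ARCH_WORKFLOWS = ("rocm-mi200", "periodic-rocm-mi200", "inductor-rocm-mi200")
FALLBACK_WORKFLOW = "trunk-rocm-sandbox"


def select_mi200_sources(available_workflows):
    """Return (workflows, job_prefix), preferring the MI200-specific workflows."""
    present = [name for name in ARCH_WORKFLOWS if name in available_workflows]
    if present:
        return present, "linux-jammy-rocm-py3.10-mi200"
    # Older SHAs: only the legacy workflow and the legacy prefix exist.
    return [FALLBACK_WORKFLOW], "linux-jammy-rocm-py3.10"
```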

## Motivation
A parity run for `50d07a990e33f9822ae4d48bed2d7f06c96522d0` tried to
collect MI200 distributed jobs with:

`linux-jammy-rocm-py3.10 / test (distributed, ...)`

The upstream jobs for this SHA are arch-specific and include `-mi200`,
so the log lookup missed all three shards and XML artifact collection
fell through to empty results. The script should look for the
MI200-specific workflows first, then fall back to `trunk-rocm-sandbox`
for older commits.

## Validation
- `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/download_testlogs`
- Confirmed the fixed prefix matches upstream jobs for
`50d07a990e33f9822ae4d48bed2d7f06c96522d0`:
  - `rocm-mi200`: 6 default shard matches
  - `periodic-rocm-mi200`: 3 distributed shard matches
  - `inductor-rocm-mi200`: 2 inductor shard matches
- Dispatched `Parity Report` on this branch with
`sha=50d07a990e33f9822ae4d48bed2d7f06c96522d0`, `arch=mi200`, and
`skip_cuda=true` to validate collection end-to-end.
- Initial run before fallback commit:
https://github.com/ROCm/pytorch/actions/runs/25920564353 (success)
- Current branch run after fallback commit:
https://github.com/ROCm/pytorch/actions/runs/25920808611 (queued)

Made with [Cursor](https://cursor.com)
## Summary
- Raise the Python CSV parser field limit in `generate_summary.py` so
large parity CSV diagnostic fields can be read.
- Truncate oversized diagnostic text fields while loading rows so long
failure/skip messages do not make summary generation or output unwieldy.
- Preserve test identity, status, timing, and shard fields used by the
parity report tables.

## Root Cause
A parity run failed in the `summarize` job when Python's default CSV
field limit rejected a generated-code assertion message larger than
131,072 bytes:
https://github.com/ROCm/pytorch/actions/runs/26168276671/job/76979094769

The first offending row was
`inductor.test_torchinductor_codegen_dynamic_shapes::DynamicShapesCodegenGPUTests::test_vmap_dot_decomposes_bmm_dynamic_shapes_cuda`,
where `message_rocm` was 145,748 bytes.
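
For illustration, the two parts of the fix (raising the parser limit and truncating long text) could look like the sketch below. `csv.field_size_limit` is the standard-library call that controls the 131,072 default; the truncation length, the chosen limit, the column tuple, and the function name are assumptions and not the values used in `generate_summary.py`.

```python
import csv
import io

MAX_FIELD_CHARS = 2000           # hypothetical truncation length
TEXT_FIELDS = ("message_rocm",)  # assumed; other diagnostic columns may apply


def load_rows(handle, limit=10 * 1024 * 1024):
    """Read CSV rows with a raised field limit, truncating long diagnostic text."""
    csv.field_size_limit(limit)  # process-wide setting in the csv module
    rows = []
    for row in csv.DictReader(handle):
        for name in TEXT_FIELDS:
            value = row.get(name) or ""
            if len(value) > MAX_FIELD_CHARS:
                row[name] = value[:MAX_FIELD_CHARS] + "...[truncated]"
        rows.append(row)
    return rows
```

Only the listed text columns are shortened, so identity, status, timing, and shard fields pass through unchanged.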

## Test plan
- `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/generate_summary.py`
- Re-ran `generate_summary.py` locally against the artifact from the
failed run:
  - Input: `20260520_all_tests_status_mi355.csv` from run `26168276671`
  - Output: summary CSV and markdown generated successfully instead of
    failing with `_csv.Error: field larger than field limit (131072)`.
- Triggered `parity.yml` on this branch with the same upstream commit
and arch as the failing run:
  - SHA: `27f2e80e30fb950bc455c777a5e8079e9657a157`
  - Arch: `mi355`
  - Validation run:
    https://github.com/ROCm/pytorch/actions/runs/26175417191
  - Result: `setup-matrix`, `generate-parity (mi355)`, and `summarize` all
    completed successfully.
  - The summarize log shows `CSV written to
    27f2e80e30fb950bc455c777a5e8079e9657a157_summary.csv` and `Markdown
    written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.md`.
On ROCm >= 7.13 (rocm-systems PR #4727), HIPRTC headers now bundle
amd_hip_bf16.h which defines __float2bfloat16(float) returning
__hip_bfloat16. PyTorch's TorchScript JIT fuser emits its own inline
__float2bfloat16(const float) returning __nv_bfloat16 into every
JIT-generated kernel. These two definitions differ only in return type,
causing a fatal HIPRTC compile error:

  "functions that differ only in their return type cannot be overloaded"

This breaks all Megatron-DeepSpeed / BF16 JIT fusion workloads
(bias_gelu warmup) at training startup on MI300X/MI350X.

Fix: detect the HIP bf16 header guard
(_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_) in the emitted JIT string.
When present, typedef __nv_bfloat16 to the native __hip_bfloat16 type
and skip inline intrinsic definitions. When absent (older ROCm),
preserve existing inline definitions for backward compatibility.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@rocm-repo-management-api

rocm-repo-management-api Bot commented May 26, 2026

Jenkins build for 2a10123010ff117f03ba3c6b0a9d616633ab9b17 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

