[ROCm] Fix TorchScript JIT BF16 HIPRTC overload conflict (ROCM-23829) #3255
Open
srinivamd wants to merge 57 commits into
========================================== Triton build conditionalized on ROCM_VERSION Include the ROCm version in triton version (cherry picked from commit 7d33910) (cherry picked from commit 0412eb4) Update triton-rocm.txt to triton.txt (cherry picked from commit 0ce9f6e) Use ROCm/triton for install_triton.sh (cherry picked from commit 6e9714b) update triton commit Revert "Use ROCm/triton for install_triton.sh" This reverts commit 81b0cbc8435122030044049c661f252ee8aa7ae5. change triton repo Update triton.txt to use release/internal/3.3.x branch Use ROCm/triton Use ROCm/triton for install_triton.sh (cherry picked from commit 0036db5)
…on (#2482) Related to https://github.com/ROCm/builder/pull/90/files http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/305/ PyTorch wheel installs successfully when building torchvision/torchaudio (cherry picked from commit c1ee54d)
Fixes #ISSUE_NUMBER (cherry picked from commit 0ea0592)
…A helper functions ======================================================================================= Implementation of PyTorch ut parsing script - QA helper function (#1386) * Initial implementation of PyTorch ut parsing script * Extracted path variables * Use nested dict to save results * Fixes typo * Cleanup * Fixes several issues * Minor name change * Update run_pytorch_unit_tests.py * Added file banners * Supported running from API * Added more help info * Consistent naming * Format help text --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Print consolidated log file for pytorch unit test automation scripts (#1433) * Print consolidated log file for pytorch uts * Update run_entire_tests subprocess call as well * lint * Add ERROR string [SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491) * Check that >1 GPUs are visible when running TEST_CONFIG=distributed * Add EXECUTION_TIME to file-level and aggregate statistics PyTorch unit test helper scripts enhancements (#1517) * Fail earlier for distributed-on-1-GPU scenario * print cmd in consolidated log with prettier formatting * python->python3 Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264 --------- Co-authored-by: blorange-amd <bo.li2@amd.com> Several issues fix of QA helper script (#1564) Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071 Removed args inside function (#1595) Fixes SWDEV-475071 (cherry picked from commit 041aa1b47978154de63edc6b7ffcdea218a847a3) QA script - Added multi gpu check with priority_tests (#1604) Fixes SWDEV-487907. Verified throwing exception for distributed is working correctly on single gpu with command: python .automation_scripts/run_pytorch_unit_tests.py --priority_test (cherry picked from commit 57cc742271cbf4547f9213710e57f6444bbc983e) (cherry picked from commit 6d5c3dc) (cherry picked from commit 2ee3aa2)
* Use triton commit same as that used for release/2.6 branch since both are triton version 3.2.0, so assuming they're compatible. Relates to: https://github.com/ROCm/rocAutomation/pull/660/files https://github.com/ROCm/builder/pull/70/files Validation http://ml-ci-internal.amd.com:8080/job/pytorch/job/manylinux_rocm_wheels/568/ --------- Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit 14c1417) (cherry picked from commit c20a8f8)
* Add trailing comma for consistency in gfx architecture list Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> * ROCm: Enable tf32 testing on test_nn Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> --------- Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit c113e14)
Relates to: ROCm/builder#82 Validation: http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/98/ Using `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_IT_upgrade_numpy_452f3df6`: ``` root@d92befdbb2a6:/# pip list | egrep "numpy|pandas" numpy 2.1.2 pandas 2.2.3 root@d92befdbb2a6:/# python3 Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas >>> import torch >>> import numpy >>> exit() root@d92befdbb2a6:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11369450092315674 Throughput [img/sec] : 562.9120096428937 ``` --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit cf32479)
…2269) Fixes SWDEV-536456 Fixes error post-#2256: ``` 00:12:44.248 #22 155.3 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.61.0 Requires-Python >=3.10; 0.61.0rc1 Requires-Python >=3.10; 0.61.0rc2 Requires-Python >=3.10; 0.61.1rc1 Requires-Python >=3.10; 0.61.2 Requires-Python >=3.10; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.5 Requires-Python >=3.11; 3.5rc0 Requires-Python >=3.11; 8.2.0 Requires-Python >=3.10; 8.2.1 Requires-Python >=3.10 00:12:44.248 #22 155.3 ERROR: Could not find a version that satisfies the requirement numba==0.61.2 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.0, 0.38.0, 0.38.1, 0.39.0, 0.40.0, 0.40.1, 0.41.0, 0.42.0, 0.42.1, 0.43.0, 0.43.1, 0.44.0, 0.44.1, 0.45.0, 0.45.1, 0.46.0, 0.47.0, 0.48.0, 0.49.0, 0.49.1rc1, 0.49.1, 0.50.0rc1, 0.50.0, 0.50.1, 0.51.0rc1, 0.51.0, 0.51.1, 0.51.2, 0.52.0rc2, 0.53.0rc1.post1, 0.53.0rc2, 0.53.0rc3, 0.53.0, 0.53.1, 0.54.0rc2, 0.54.0rc3, 0.54.0, 0.54.1rc1, 0.54.1, 0.55.0rc1, 0.55.0, 0.55.1, 0.55.2, 0.56.0rc1, 0.56.0, 0.56.2, 0.56.3, 0.56.4, 0.57.0rc1, 0.57.0, 0.57.1rc1, 0.57.1, 0.58.0rc1, 0.58.0rc2, 0.58.0, 0.58.1, 0.59.0rc1, 0.59.0, 0.59.1, 0.60.0rc1, 0.60.0) 00:12:44.248 #22 155.3 ERROR: No matching distribution found for numba==0.61.2 ``` Validation: * Docker image: http://rocm-ci.amd.com/job/mainline-framework-pytorch-internal-cs9-ci/132 * Wheels: 
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/102/ From `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu22.04_py3.9_pytorch_lw_rocm7.0_IT_py3.9_a11d94ad`: ``` root@f43861a0a856:/# pip list | egrep "numpy|pandas" numpy 2.0.2 pandas 2.2.3 root@f43861a0a856:/# python Python 3.9.23 (main, Jun 4 2025, 08:55:38) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> import numpy >>> import pandas root@f43861a0a856:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11354223489761353 Throughput [img/sec] : 563.6669038416574 ``` (cherry picked from commit a0a9d81)
…cm7.0/7.1 (#2239) Revamped version of #2108 PR to: - enable complex data types for sparse matmul on ROCm - fix sparse addmm/baddbmm on ROCm - fix sparse hipification for ROCm - fix/enable sparse tests on ROCm (~50 tests total for non-fp16/bf16): - enable fp16/bf16 sparse path for rocm7.0 - enable fp16/bf16 sparse tests for rocm7.0/7.1 ``` test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_* test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_* test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64 test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS* test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_* test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float16 ``` (cherry picked from commit cc2a69c)
#2326) Fixes https://ontrack-internal.amd.com/browse/SWDEV-541809 Upgrading tensorboard after numpy upgrade Ran in **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16381_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_internal_testing_afe8b782** ``` 7 git checkout rocm7.0_IT_upgrade_tensorboard 8 pip install .ci/docker/requirements-ci.txt 9 pip install -r .ci/docker/requirements-ci.txt 10 PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler root@ubb4-rack-22:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler /opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC). _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0) . ---------------------------------------------------------------------- Ran 1 test in 0.327s OK root@ubb4-rack-22:/var/lib/jenkins/pytorch# ``` (cherry picked from commit c7f61f4)
Tested locally successfully ``` root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements.txt Ignoring numpy: markers 'python_version == "3.9"' don't match your environment Requirement already satisfied: setuptools<80.0,>=70.1.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 2)) (79.0.1) Requirement already satisfied: cmake>=3.31.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 3)) (4.0.0) Requirement already satisfied: ninja==1.11.1.3 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 4)) (1.11.1.3) Requirement already satisfied: numpy==2.1.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 5)) (2.1.2) Requirement already satisfied: packaging==25.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 6)) (25.0) Requirement already satisfied: pyyaml==6.0.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 7)) (6.0.2) Requirement already satisfied: requests==2.32.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.32.4) Requirement already satisfied: six==1.17.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 9)) (1.17.0) Requirement already satisfied: typing-extensions==4.14.1 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 10)) (4.14.1) Requirement already satisfied: expecttest==0.3.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.3.0) Requirement already satisfied: filelock==3.18.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (3.18.0) Requirement already satisfied: 
fsspec==2025.7.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2025.7.0) Requirement already satisfied: hypothesis==5.35.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (5.35.1) Requirement already satisfied: jinja2==3.1.6 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (3.1.6) Requirement already satisfied: lintrunner==0.12.7 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (0.12.7) Requirement already satisfied: networkx==2.8.8 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (2.8.8) Requirement already satisfied: optree==0.13.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (0.13.0) Requirement already satisfied: psutil==7.0.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (7.0.0) Requirement already satisfied: sympy==1.13.3 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 20)) (1.13.3) Requirement already satisfied: wheel==0.45.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 22)) (0.45.1) Requirement already satisfied: build[uv] in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.3.0) Requirement already satisfied: charset_normalizer<4,>=2 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.4.3) Requirement already satisfied: idna<4,>=2.5 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.5.0) Requirement already satisfied: certifi>=2017.4.17 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r 
/var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2025.8.3) Requirement already satisfied: attrs>=19.2.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (25.3.0) Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (2.4.0) Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.10/site-packages (from jinja2==3.1.6->-r requirements.txt (line 12)) (3.0.2) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from sympy==1.13.3->-r requirements.txt (line 20)) (1.3.0) Requirement already satisfied: pyproject_hooks in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (1.2.0) Requirement already satisfied: tomli>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (2.2.1) Requirement already satisfied: uv>=0.1.18 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (0.8.10) root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements-build.txt ``` (cherry picked from commit 6e6e454)
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d) (cherry picked from commit 519160d)
Fixes #ISSUE_NUMBER
- Need to use upstream/main for rocm/pytorch's develop branch. For release branches, `github.event.pull_request.base.ref` should work as is.
- Need to remove any trailing space in the PR title so the branch name can be formed correctly.

Fixes #ISSUE_NUMBER
# Conflicts: # .ci/docker/requirements-ci.txt
[AUTOGENERATED] develop_IFU_20251104
# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # requirements.txt
To keep triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move triton hash to ToT of that branch.
[AUTOGENERATED] develop_IFU_20251118
[AUTOGENERATED] develop_IFU_20251124
# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/requirements-ci.txt # .ci/docker/triton_version.txt # .circleci/scripts/binary_populate_env.sh # .github/scripts/build_triton_wheel.py # test/test_sparse_csr.py
[AUTOGENERATED] develop_IFU_20260211
Adds workflow automation so IFU merges generate issues for commits in range and assign them to commit authors. Includes cold-start handling for first IFU on a branch, normal case when previous IFU tags exist, and dedupe logic to prevent duplicate issues on reruns.
[AUTOGENERATED] develop_IFU_20260218
# Conflicts: # CMakeLists.txt
[AUTOGENERATED] develop_IFU_20260316
…um (#3076) In case of github workflow failing when it gets triggered via PR merge of an IFU PR, we want to be able to run workflow manually to debug and correctly create tags and issues. For this purpose, I have changed the workflow file to take in rocm/pytorch's branch and PR number and run the entire workflow on that. Action Running: https://github.com/ROCm/pytorch/actions/runs/23174239617 IFU PR: #3069
## Summary - Add `pytorch-unit-test-scripts/` directory with all parity scripts (download_testlogs, summarize_xml_testreports, parity.sh, and supporting utilities) - Add `parity.yml` GitHub Actions workflow that can be manually triggered to download CI artifacts and generate parity CSVs - All `download_testlogs` and `summarize_xml_testreports.py` flags are exposed as workflow inputs (SHA, PR ID, arch, exclude flags, filter, set names, etc.) - Architectures are configurable via comma-separated input (default: mi200,mi300,mi355) - Generated CSVs and logs are uploaded as downloadable workflow artifacts ## Setup Requires these repository secrets: - [x] - `IFU_GITHUB_TOKEN` (already exists) - [x] - `AWS_ACCESS_KEY_ID` - [x] - `AWS_SECRET_ACCESS_KEY` ## Test plan - [x] Trigger workflow via Actions tab or `gh workflow run parity.yml --ref add-parity-scripts-dashboard` - [x] Verify artifacts download and CSVs generate for each architecture - [x] Verify CSV artifacts are downloadable from the workflow run https://github.com/ethanwee1/pytorch/actions/runs/23413634454 --------- Co-authored-by: Jithun Nair <jithun.nair@amd.com>
…arity workflow (#3147)

## Summary

Adds log-based failure detection to the parity workflow. Tests that timeout (exit code 124), crash (SIGIOT, SIGSEGV), hit Fatal Python errors, or OOM never produce JUnit XML output, so they are invisible to the existing XML-based parity report. This PR closes that gap.

### Changes

- **New script: `detect_log_failures.py`**: Parses raw CI `.txt` log files to detect test failures not captured in XML reports. Classifies failures as TIMEOUT, CRASH, CONSISTENT_FAILURE, or NON_ZERO_EXIT. Outputs a CSV with platform, workflow, test file, category, and reason.
- **`generate_summary.py`**: Adds `--log-failures` argument to accept CSV(s) from `detect_log_failures.py`. Appends a "LOG-BASED FAILURES (not in XML)" section to both CSV and markdown output.
- **`parity.yml`**: Adds a "Detect log-based failures" step after XML processing (runs when `include_logs` is enabled). Wires the resulting CSV into the summarize job via `--log-failures`.
- Adding in shard information
- Also adding in which workflow we are downloading for in download testlogs

### How it works

1. `detect_log_failures.py` scans `.txt` log files for patterns like:
   - `Got exit code 124` (timeout)
   - `Segmentation fault`, `SIGSEGV`, `SIGIOT`, `Fatal Python error` (crash)
   - `FAILED CONSISTENTLY`
   - `OutOfMemoryError`, `bad_alloc` (OOM)
2. Results are saved as `log_failures_<arch>.csv` and uploaded as part of the per-arch artifact
3. The summarize job collects all log failure CSVs and passes them to `generate_summary.py`
4. The final parity report includes a dedicated section listing these failures

## Test plan

- [x] Syntax-checked both Python files (`py_compile`)
- [x] Validated `parity.yml` YAML syntax
- [x] Tested `detect_log_failures.py` against actual CI log files from parity runs
- [x] Verified all files match fork/main (with correct `.automation_scripts/` paths)
- [x] Run parity workflow with `include_logs: true` to verify end-to-end

Validation: https://github.com/ethanwee1/pytorch/actions/runs/24352395766

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
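The pattern scan described in "How it works" can be sketched roughly as follows. This is a minimal illustration, not the contents of `detect_log_failures.py`: the function names are hypothetical, the `OOM` label is an assumption (the commit message lists the OOM patterns but not the category they map to), and NON_ZERO_EXIT is left out because its pattern is not given.

```python
import re

# Rules are checked in order, so earlier entries win when a line matches several.
# Patterns are taken from the commit message; the "OOM" label is an assumption.
RULES = [
    ("TIMEOUT", re.compile(r"Got exit code 124")),
    ("CRASH", re.compile(r"Segmentation fault|SIGSEGV|SIGIOT|Fatal Python error")),
    ("CONSISTENT_FAILURE", re.compile(r"FAILED CONSISTENTLY")),
    ("OOM", re.compile(r"OutOfMemoryError|bad_alloc")),
]


def classify_line(line):
    """Return the first matching category for a log line, or None."""
    for category, pattern in RULES:
        if pattern.search(line):
            return category
    return None


def scan_log(text):
    """Return (line_number, category, stripped_line) for every matching line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        category = classify_line(line)
        if category is not None:
            hits.append((number, category, line.strip()))
    return hits
```

Keeping the rules in an ordered list makes the priority explicit: a line containing both `Fatal Python error` and `OutOfMemoryError` would be reported as CRASH in this sketch.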
Copied from https://github.com/AMD-ROCm-Internal/rocm-npi-dev/actions/workflows/build_portable_linux_pytorch_dockers.yml Latest run and docker generated: docker.io/rocm/pytorch-private:pytorch-nightly-f8d08404-rocm7.13.0a20260413-ubuntu24.04-py3.12-gfx950-dcgpu https://github.com/ethanwee1/pytorch/actions/runs/24441876981
…3159) Docker credentials were using the ones from my fork and not rocm/pytorch credentials: https://github.com/ROCm/pytorch/actions/runs/24479854145/job/71541505148 Latest build https://github.com/ROCm/pytorch/actions/runs/24480169722/job/71542549933
…sting on that arch
…umn (#3153)

## Summary

- Only display tests where ROCm status is FAILED in the summary (CUDA status shown as a context column alongside). Previously both ROCm and CUDA failures were shown.
- Add "Also Failing In" column that shows which other architectures have the same test tuple (test_file, test_class, test_name) failing, making it easy to distinguish all-ROCm issues from architecture-specific ones.
- Includes count of failed tests in the section header.
- Add job-level and test-level shard info to "LOG-BASED FAILURES (not in XML)" and "FAILED TESTS" section
- Includes flaky tests in "LOG-BASED FAILURES (not in XML)" section for any tests that pass when run in new process

## Test plan

- [x] Cross-arch detection confirmed: tests failing on all 3 archs show the other 2 in "Also Failing In"; single-arch failures show empty
- [x] CSV and Markdown output both updated consistently

Latest run: https://github.com/ROCm/pytorch/actions/runs/24798004968

Run without this PR on the same commit: https://github.com/ROCm/pytorch/actions/runs/24796654604
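One way the "Also Failing In" column could be computed is to group failures by the (test_file, test_class, test_name) tuple and list the other architectures in each group. The sketch below assumes failures arrive as flat tuples; the function name and input shape are hypothetical and are not taken from `generate_summary.py`.

```python
from collections import defaultdict


def also_failing_in(failures):
    """Map (arch, test_file, test_class, test_name) to the other failing archs.

    `failures` is an iterable of (arch, test_file, test_class, test_name).
    """
    archs_by_test = defaultdict(set)
    for arch, test_file, test_class, test_name in failures:
        archs_by_test[(test_file, test_class, test_name)].add(arch)

    result = {}
    for key, archs in archs_by_test.items():
        for arch in archs:
            # Sorted so the column content is stable between runs.
            result[(arch, *key)] = sorted(archs - {arch})
    return result
```

A test that fails on only one architecture gets an empty list, which matches the "single-arch failures show empty" behaviour in the test plan.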
Repro job without this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342470426/job/74303089638 Validation run with this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342235984 Current issue: existing testing is not able to pick up the CUDA artifacts because the CUDA job and artifact names changed from `test` to `test-osdc` for default and distributed shards. Repro inputs: `sha=b1b5b61ddb689ea65aab0915ecfac5cc459b92fb`, `arch=mi355`, `skip_rocm=false`, `csv_name=pr3199-pre-change-repro`. CUDA job names now use `test-osdc` for default and distributed shards, for example: `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)` `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, 1, 3, ...)` CUDA artifact names now look like: `test-reports-test-osdc-default-1-5` `test-reports-test-osdc-distributed-1-3`
## Summary - Update MI355 parity report shard counts to match current CI artifacts. - Change default shards from 6 to 10 and distributed shards from 3 to 4. ## Validation * Combined parity workflow for `5b9a4786ea4b1a6170c6e5a4878269e7f591224b` on `mi300, mi355`: <https://github.com/ROCm/pytorch/actions/runs/25738157290> --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
## Motivation Old IFU_GITHUB_TOKEN [seems to have expired](https://github.com/ROCm/pytorch/actions/runs/25856299592/job/75974982737) ## Technical Details Replace with PARITY_GITHUB_TOKEN (meant specifically for this workflow) ## Test Plan Run parity.yml with this PR branch and see if it still gives credential error. ## Test Result "Download artifacts" step succeeded in https://github.com/ROCm/pytorch/actions/runs/25857211908/job/75978008711 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary

- Select the CUDA test artifact kind from the jobs present for the target SHA.
- Detect whether the target SHA uses test-osdc or legacy test CUDA jobs, then use the detected kind when building log keys and artifact prefixes.
- Apply the same dynamic selection to CUDA inductor jobs.
- Treat missing per-arch summary buckets as zero so mixed ROCm/CUDA coverage does not crash report generation.

## Validation

- PR/ciflow case: dispatched `Parity Report` on this branch with `sha=386f38175e3aaee2dadb36b5c364deff0869664d` and `arch=mi355, mi300, mi200, navi31`. CUDA default/distributed and inductor selected `test`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25866762885
- Main branch case: dispatched `Parity Report` on this branch with `sha=f38b1ec280bafa2ad11f6e767558e73e9eb508a6`, `arch=mi300`, `skip_rocm=true`, and `exclude_distributed=true`. CUDA default and inductor selected `test-osdc`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25867046276
- Local syntax check: `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`.
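The dynamic selection might look something like the sketch below. The job-name and artifact-name shapes follow the `test-osdc` examples quoted earlier in this conversation; the function names are hypothetical, and the legacy artifact name is assumed to follow the same layout with `test` in place of `test-osdc`.

```python
def detect_cuda_test_kind(job_names):
    """Return "test-osdc" if any job for the SHA uses it, else legacy "test"."""
    for name in job_names:
        # Job names look like "<build> / test-osdc (default, 1, 5, ...)".
        if "/ test-osdc (" in name:
            return "test-osdc"
    return "test"


def artifact_prefix(kind, config, shard, num_shards):
    """Build an artifact name such as test-reports-test-osdc-default-1-5."""
    return f"test-reports-{kind}-{config}-{shard}-{num_shards}"
```

Detecting the kind once per SHA and threading it through both the log keys and the artifact prefixes keeps the two lookups from disagreeing with each other.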
## Summary

- Prefer the arch-specific MI200 workflows in `download_testlogs`: `rocm-mi200`, `periodic-rocm-mi200`, and `inductor-rocm-mi200`.
- Match arch-specific MI200 test jobs with the `linux-jammy-rocm-py3.10-mi200` prefix for default, distributed, and inductor shards.
- Keep `trunk-rocm-sandbox` as the fallback workflow for older SHAs that do not have the MI200-specific workflows, using the legacy `linux-jammy-rocm-py3.10` prefix in that fallback path.

## Motivation

A parity run for `50d07a990e33f9822ae4d48bed2d7f06c96522d0` tried to collect MI200 distributed jobs with:

`linux-jammy-rocm-py3.10 / test (distributed, ...)`

The upstream jobs for this SHA are arch-specific and include `-mi200`, so the log lookup missed all three shards and XML artifact collection fell through to empty results. The script should look for the MI200-specific workflows first, then fall back to `trunk-rocm-sandbox` for older commits.

## Validation

- `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs`
- Confirmed the fixed prefix matches upstream jobs for `50d07a990e33f9822ae4d48bed2d7f06c96522d0`:
  - `rocm-mi200`: 6 default shard matches
  - `periodic-rocm-mi200`: 3 distributed shard matches
  - `inductor-rocm-mi200`: 2 inductor shard matches
- Dispatched `Parity Report` on this branch with `sha=50d07a990e33f9822ae4d48bed2d7f06c96522d0`, `arch=mi200`, and `skip_cuda=true` to validate collection end-to-end.
  - Initial run before fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920564353 (success)
  - Current branch run after fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920808611 (queued)

Made with [Cursor](https://cursor.com)
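The prefer-then-fall-back lookup can be illustrated with a short sketch. The workflow names and job prefixes come from the description above, and the config-to-workflow mapping follows the validation counts; the function name and the idea of passing in a set of workflows found for the SHA are assumptions.

```python
# Which MI200-specific workflow carries each shard type (per the validation notes).
MI200_WORKFLOWS = {
    "default": "rocm-mi200",
    "distributed": "periodic-rocm-mi200",
    "inductor": "inductor-rocm-mi200",
}
FALLBACK_WORKFLOW = "trunk-rocm-sandbox"


def pick_mi200_source(config, available_workflows):
    """Return (workflow, job_prefix) for a shard type on a given SHA.

    `available_workflows` is the set of workflow names found for the SHA.
    """
    preferred = MI200_WORKFLOWS[config]
    if preferred in available_workflows:
        return preferred, "linux-jammy-rocm-py3.10-mi200"
    # Older SHAs: legacy workflow and legacy job prefix.
    return FALLBACK_WORKFLOW, "linux-jammy-rocm-py3.10"
```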
## Summary

- Raise the Python CSV parser field limit in `generate_summary.py` so large parity CSV diagnostic fields can be read.
- Truncate oversized diagnostic text fields while loading rows so long failure/skip messages do not make summary generation or output unwieldy.
- Preserve test identity, status, timing, and shard fields used by the parity report tables.

## Root Cause

A parity run failed in the `summarize` job when Python's default CSV field limit rejected a generated-code assertion message larger than 131,072 bytes: https://github.com/ROCm/pytorch/actions/runs/26168276671/job/76979094769

The first offending row was `inductor.test_torchinductor_codegen_dynamic_shapes::DynamicShapesCodegenGPUTests::test_vmap_dot_decomposes_bmm_dynamic_shapes_cuda`, where `message_rocm` was 145,748 bytes.

## Test plan

- `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`
- Re-ran `generate_summary.py` locally against the artifact from the failed run:
  - Input: `20260520_all_tests_status_mi355.csv` from run `26168276671`
  - Output: summary CSV and markdown generated successfully instead of failing with `_csv.Error: field larger than field limit (131072)`.
- Triggered `parity.yml` on this branch with the same upstream commit and arch as the failing run:
  - SHA: `27f2e80e30fb950bc455c777a5e8079e9657a157`
  - Arch: `mi355`
  - Validation run: https://github.com/ROCm/pytorch/actions/runs/26175417191
  - Result: `setup-matrix`, `generate-parity (mi355)`, and `summarize` all completed successfully.
  - The summarize log shows `CSV written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.csv` and `Markdown written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.md`.
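A minimal sketch of the two changes, raising the parser limit and truncating oversized text while loading, is shown below. `csv.field_size_limit` is the standard-library call involved; the limit value, the truncation length, the marker text, the `message_cuda` column name, and the `load_rows` helper are all assumptions made for illustration (only `message_rocm` is named in the root-cause note).

```python
import csv
import io

FIELD_LIMIT_BYTES = 16 * 1024 * 1024  # assumed; the default limit is 131072
MAX_TEXT_CHARS = 2000                 # assumed truncation length
TRUNCATION_MARK = "...[truncated]"    # assumed marker
TEXT_FIELDS = ("message_rocm", "message_cuda")  # message_cuda is an assumed name


def load_rows(handle):
    """Read CSV rows, shortening oversized diagnostic text fields."""
    # Process-wide setting: affects every csv reader created afterwards.
    csv.field_size_limit(FIELD_LIMIT_BYTES)
    rows = []
    for row in csv.DictReader(handle):
        for field in TEXT_FIELDS:
            value = row.get(field)
            if value and len(value) > MAX_TEXT_CHARS:
                row[field] = value[:MAX_TEXT_CHARS] + TRUNCATION_MARK
        # Identity, status, timing, and shard columns pass through untouched.
        rows.append(row)
    return rows
```

A fixed limit is used here rather than `sys.maxsize`, since very large values can raise `OverflowError` on platforms where the underlying C `long` is 32 bits.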
On ROCm >= 7.13 (rocm-systems PR pytorch#4727), HIPRTC headers now bundle amd_hip_bf16.h which defines __float2bfloat16(float) returning __hip_bfloat16. PyTorch's TorchScript JIT fuser emits its own inline __float2bfloat16(const float) returning __nv_bfloat16 into every JIT-generated kernel. These two definitions differ only in return type, causing a fatal HIPRTC compile error:

"functions that differ only in their return type cannot be overloaded"

This breaks all Megatron-DeepSpeed / BF16 JIT fusion workloads (bias_gelu warmup) at training startup on MI300X/MI350X.

Fix: detect the HIP bf16 header guard (_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_) in the emitted JIT string. When present, typedef __nv_bfloat16 to the native __hip_bfloat16 type and skip inline intrinsic definitions. When absent (older ROCm), preserve existing inline definitions for backward compatibility.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Jenkins build for 2a10123010ff117f03ba3c6b0a9d616633ab9b17 commit finished as FAILURE
Summary
Fixes ROCM-23829: Megatron-DeepSpeed LLaMA2 pretraining crashes during TorchScript JIT BF16 warmup with a `__float2bfloat16` return-type overload conflict.

Root Cause
On ROCm >= 7.13 (rocm-systems#4727), HIPRTC headers now bundle `amd_hip_bf16.h`, which defines `__float2bfloat16(float)` returning `__hip_bfloat16`. PyTorch's TorchScript JIT fuser (`resource_strings.h`) emits its own inline `__float2bfloat16(const float)` returning `__nv_bfloat16` into every JIT-generated GPU kernel. Since `__hip_bfloat16` and `__nv_bfloat16` are different types, these two definitions differ only in return type, producing a fatal HIPRTC compile error: "functions that differ only in their return type cannot be overloaded".

This blocks all BF16 JIT fusion workloads (Megatron-DeepSpeed bias_gelu warmup) on MI300X/MI350X.
Fix
In the emitted JIT kernel string (`bfloat16_support_literal`), detect the HIP bf16 header guard (`_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_`) at HIPRTC compile time:
- When the guard is present: `typedef __hip_bfloat16 __nv_bfloat16`, an alias to the native HIP type, and the inline intrinsic definitions are skipped. The HIP-provided `__float2bfloat16`/`__bfloat162float` intrinsics work transparently since `__nv_bfloat16` is `__hip_bfloat16`.
- When the guard is absent (older ROCm): the existing inline definitions are preserved.

This is backward-compatible: on older ROCm where HIPRTC does not include bf16 headers, the guard is false and behavior is identical to before.
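To make the two branches concrete, here is a small Python model of the guarded preamble. It is only an illustration: the preamble text is a simplified stand-in for `bfloat16_support_literal` (the legacy inline definitions are elided rather than reproduced), and `active_lines` is a hypothetical toy that handles just this one `#if defined(...)`/`#else`/`#endif` shape, not a real preprocessor.

```python
GUARD = "_HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BF16_H_"

# Simplified stand-in for the emitted kernel preamble (not the real literal).
BF16_PREAMBLE = f"""\
#if defined({GUARD})
typedef __hip_bfloat16 __nv_bfloat16;
#else
/* legacy inline __nv_bfloat16 definitions would go here (elided) */
#endif
"""


def active_lines(source, defined):
    """Return the lines a compiler would see, given the set of defined macros."""
    kept, keep = [], True
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#if defined("):
            macro = stripped[len("#if defined("):-1]
            keep = macro in defined
        elif stripped == "#else":
            keep = not keep
        elif stripped == "#endif":
            keep = True
        elif keep:
            kept.append(line)
    return kept
```

With the guard in `defined`, only the typedef survives, so there is a single `__float2bfloat16` in scope (the HIP one); with an empty set, only the legacy branch survives, which is the pre-existing behaviour.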
Reproducer
Affected Models
- `pyt_deepspeed_megatron_llama2_7b`
- `pyt_deepspeed_megatron_llama2_13b`
- `pyt_deepspeed_megatron_llama2_70b`
- `pyt_deepspeed_megatron_gpt3_13b`

Test Plan

Cherry-pick targets
- `release/2.11`
- `release/2.10`

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>