[CI] Add parity auto-trigger workflow#3231
Conversation
|
Jenkins build for f4dfbd8845f2d05dd28225ca78af48c1926d9e31 commit finished as FAILURE |
There was a problem hiding this comment.
Pull request overview
Adds a new GitHub Actions workflow to automatically scan recent completed trunk.yml push runs on main, determine when the relevant ROCm + CUDA check-runs for a given upstream SHA are fully complete, and then dispatch parity.yml (with a PR-only dry-run mode).
Changes:
- Introduces a scheduled (every 10 minutes) + manual + PR dry-run “parity auto-trigger” workflow.
- Implements SHA deduplication by checking existing
parity.ymlrun titles in the current repo. - Gates dispatch on completion of ROCm arch shard check-runs (only for arch workflows detected as having run) plus specific CUDA check-runs used by parity.
| COMMITS=$(gh api \ | ||
| "repos/$UPSTREAM/actions/workflows/trunk.yml/runs?branch=$BRANCH&event=push&status=completed&per_page=$MAX_COMMITS" \ | ||
| --jq ' | ||
| reduce .workflow_runs[] as $run ({seen:{}, rows:[]}; | ||
| if .seen[$run.head_sha] then . | ||
| else .seen[$run.head_sha] = true | .rows += [$run] | ||
| end | ||
| ) | ||
| | .rows[] | ||
| | "\(.head_sha) \(.created_at)" | ||
| ') |
There was a problem hiding this comment.
This is a valid point. per_page is capped at 100.
| description: 'JSON: arch -> PCRE regex that matches the check-run names of that arch''s ROCm test shards on pytorch/pytorch. An arch is considered "ready" only when every check-run whose name matches has status=completed (so we wait for all test shards, not just workflow completion).' | ||
| required: false | ||
| default: '{"mi355":"rocm.*mi355.*/ test [(](default|distributed|inductor),","mi300":"rocm.*mi300.*/ test [(](default|distributed|inductor),","mi200":"(rocm.*(mi200|mi210).*/ test [(](default|distributed|inductor),|linux-jammy-rocm-py3[.]10 / test [(](default|distributed|inductor),)","navi31":"rocm.*navi31.*/ test [(]default,","nightly":"rocm-nightly.*/ test [(](default|distributed|inductor),"}' | ||
| type: string | ||
| arch_workflow_regex_map: | ||
| description: 'JSON: arch -> PCRE regex that matches workflow file paths for upstream ROCm workflows that mean this arch ran on the SHA. Missing workflows mean the arch is not expected for that commit.' | ||
| required: false | ||
| default: '{"mi355":"(^|/)(trunk|rocm-mi355|periodic-rocm-mi355|inductor-rocm-mi355)[.]yml$","mi300":"(^|/)(rocm-mi300|periodic-rocm-mi300|inductor-rocm-mi300)[.]yml$","mi200":"(^|/)(trunk-rocm-sandbox|rocm-mi200|periodic-rocm-mi200|inductor-rocm-mi200)[.]yml$","navi31":"(^|/)(rocm-navi31|periodic-rocm-navi31|inductor-rocm-navi31)[.]yml$","nightly":"(^|/)rocm-nightly[.]yml$"}' | ||
| type: string |
There was a problem hiding this comment.
Same thoughts as https://github.com/ROCm/pytorch/pull/3231/changes#r3303391052
| # Pull recent parity runs. Run titles look like: | ||
| # "<csv_name or SHA> · mi355, mi300, mi200" | ||
| # Once any parity run exists for a SHA, we do not dispatch another | ||
| # report for that SHA. This keeps the dashboard to one report per | ||
| # upstream commit. | ||
| EXISTING=$(gh run list \ | ||
| --repo "$GITHUB_REPOSITORY" \ | ||
| --workflow parity.yml \ | ||
| --limit 1000 \ | ||
| --json displayTitle 2>/dev/null || echo '[]') | ||
|
|
||
| sha_already_dispatched() { | ||
| local sha="$1" | ||
| echo "$EXISTING" | jq -e --arg sha "$sha" \ | ||
| 'any(.[]; .displayTitle | contains($sha))' >/dev/null |
There was a problem hiding this comment.
Since the same parity.yml can be triggered outside of the auto-parity, we should narrow this down to consider only the auto-parity-triggered runs of parity.yml. In addition to SHA, one more "variable" is the arch list. For simplicity, let us get rid of the arch list workflow input entirely in auto-parity, so that we don't have to worry about whether a particular run covered all the archs we wanted.
There was a problem hiding this comment.
That being said, I do prefer the suggested version by Copilot review above, since it gets rid of the magic number of 1000 and seems to make the window logic more robust by using the right API parameters.
jithunnair-amd
left a comment
There was a problem hiding this comment.
@ethanwee1 I may not be able to help get this merged once you address the review comments, so please take Pruthvi's help for that. However, please do fix some of the issues pointed out in the review comments and make the inputs interface a bit simpler for the average user (please work with @pablo-garay to iterate on this).
I'm also okay with having the cron-scheduled runs only covering trunk in this PR, since that's our top requirement, and handle the complexity of other archs/workflows in a follow-up PR.
Thanks!
| default: 'mi355, mi300, mi200' | ||
| type: string | ||
| arch_jobname_regex_map: | ||
| description: 'JSON: arch -> PCRE regex that matches the check-run names of that arch''s ROCm test shards on pytorch/pytorch. An arch is considered "ready" only when every check-run whose name matches has status=completed (so we wait for all test shards, not just workflow completion).' |
There was a problem hiding this comment.
"not just workflow completion"... why not wait for workflow completion?
| arch_jobname_regex_map: | ||
| description: 'JSON: arch -> PCRE regex that matches the check-run names of that arch''s ROCm test shards on pytorch/pytorch. An arch is considered "ready" only when every check-run whose name matches has status=completed (so we wait for all test shards, not just workflow completion).' | ||
| required: false | ||
| default: '{"mi355":"rocm.*mi355.*/ test [(](default|distributed|inductor),","mi300":"rocm.*mi300.*/ test [(](default|distributed|inductor),","mi200":"(rocm.*(mi200|mi210).*/ test [(](default|distributed|inductor),|linux-jammy-rocm-py3[.]10 / test [(](default|distributed|inductor),)","navi31":"rocm.*navi31.*/ test [(]default,","nightly":"rocm-nightly.*/ test [(](default|distributed|inductor),"}' |
There was a problem hiding this comment.
This logic seems to repeat (at the risk of conflicting or being out-of-sync) with the download_testlogs logic that determines which workflows and jobs to download. We should avoid that.
There was a problem hiding this comment.
Also, this being a user input for workflow_dispatch seems very untenable, for a user to provide it. I'd recommend we remove these inputs, as the default value is set later anyway. And then we need to iterate to figure out how to avoid the apparent duplication vis-a-vis download_testlogs
| # Pull recent parity runs. Run titles look like: | ||
| # "<csv_name or SHA> · mi355, mi300, mi200" | ||
| # Once any parity run exists for a SHA, we do not dispatch another | ||
| # report for that SHA. This keeps the dashboard to one report per | ||
| # upstream commit. | ||
| EXISTING=$(gh run list \ | ||
| --repo "$GITHUB_REPOSITORY" \ | ||
| --workflow parity.yml \ | ||
| --limit 1000 \ | ||
| --json displayTitle 2>/dev/null || echo '[]') | ||
|
|
||
| sha_already_dispatched() { | ||
| local sha="$1" | ||
| echo "$EXISTING" | jq -e --arg sha "$sha" \ | ||
| 'any(.[]; .displayTitle | contains($sha))' >/dev/null |
There was a problem hiding this comment.
Since the same parity.yml can be triggered outside of the auto-parity, we should narrow this down to consider only the auto-parity-triggered runs of parity.yml. In addition to SHA, one more "variable" is the arch list. For simplicity, let us get rid of the arch list workflow input entirely in auto-parity, so that we don't have to worry about whether a particular run covered all the archs we wanted.
| COMMITS=$(gh api \ | ||
| "repos/$UPSTREAM/actions/workflows/trunk.yml/runs?branch=$BRANCH&event=push&status=completed&per_page=$MAX_COMMITS" \ | ||
| --jq ' | ||
| reduce .workflow_runs[] as $run ({seen:{}, rows:[]}; | ||
| if .seen[$run.head_sha] then . | ||
| else .seen[$run.head_sha] = true | .rows += [$run] | ||
| end | ||
| ) | ||
| | .rows[] | ||
| | "\(.head_sha) \(.created_at)" | ||
| ') |
There was a problem hiding this comment.
This is a valid point. per_page is capped at 100.
| description: 'JSON: arch -> PCRE regex that matches the check-run names of that arch''s ROCm test shards on pytorch/pytorch. An arch is considered "ready" only when every check-run whose name matches has status=completed (so we wait for all test shards, not just workflow completion).' | ||
| required: false | ||
| default: '{"mi355":"rocm.*mi355.*/ test [(](default|distributed|inductor),","mi300":"rocm.*mi300.*/ test [(](default|distributed|inductor),","mi200":"(rocm.*(mi200|mi210).*/ test [(](default|distributed|inductor),|linux-jammy-rocm-py3[.]10 / test [(](default|distributed|inductor),)","navi31":"rocm.*navi31.*/ test [(]default,","nightly":"rocm-nightly.*/ test [(](default|distributed|inductor),"}' | ||
| type: string | ||
| arch_workflow_regex_map: | ||
| description: 'JSON: arch -> PCRE regex that matches workflow file paths for upstream ROCm workflows that mean this arch ran on the SHA. Missing workflows mean the arch is not expected for that commit.' | ||
| required: false | ||
| default: '{"mi355":"(^|/)(trunk|rocm-mi355|periodic-rocm-mi355|inductor-rocm-mi355)[.]yml$","mi300":"(^|/)(rocm-mi300|periodic-rocm-mi300|inductor-rocm-mi300)[.]yml$","mi200":"(^|/)(trunk-rocm-sandbox|rocm-mi200|periodic-rocm-mi200|inductor-rocm-mi200)[.]yml$","navi31":"(^|/)(rocm-navi31|periodic-rocm-navi31|inductor-rocm-navi31)[.]yml$","nightly":"(^|/)rocm-nightly[.]yml$"}' | ||
| type: string |
There was a problem hiding this comment.
Same thoughts as https://github.com/ROCm/pytorch/pull/3231/changes#r3303391052
| # Pull recent parity runs. Run titles look like: | ||
| # "<csv_name or SHA> · mi355, mi300, mi200" | ||
| # Once any parity run exists for a SHA, we do not dispatch another | ||
| # report for that SHA. This keeps the dashboard to one report per | ||
| # upstream commit. | ||
| EXISTING=$(gh run list \ | ||
| --repo "$GITHUB_REPOSITORY" \ | ||
| --workflow parity.yml \ | ||
| --limit 1000 \ | ||
| --json displayTitle 2>/dev/null || echo '[]') | ||
|
|
||
| sha_already_dispatched() { | ||
| local sha="$1" | ||
| echo "$EXISTING" | jq -e --arg sha "$sha" \ | ||
| 'any(.[]; .displayTitle | contains($sha))' >/dev/null |
There was a problem hiding this comment.
That being said, I do prefer the suggested version by Copilot review above, since it gets rid of the magic number of 1000 and seems to make the window logic more robust by using the right API parameters.
|
Jenkins build for 1375a6adeea743262fe99f276cb8afb5c511af86 commit finished as NOT_BUILT |
|
Addressed the latest review feedback from Jithun:
Validation:
|
|
Jenkins build for 1375a6adeea743262fe99f276cb8afb5c511af86 commit finished as FAILURE |
Add a scheduled scanner that dispatches one parity report per ready upstream PyTorch main commit, with PR dry-runs to validate readiness without creating reports.
Scope auto parity to completed trunk mi355 default reports, remove user-facing regex inputs, and use paginated workflow APIs for candidate and dedupe windows.
Fetch completed trunk runs page by page only until the configured unique SHA limit is reached, avoiding full workflow-history scans.
1375a6a to
a2b268a
Compare
|
Jenkins build for d7ff112ed2257d50c0031bdd24d25675b3f89e52 commit finished as NOT_BUILT |
Stop passing csv_name from parity-auto so auto-dispatched parity reports use the same output names as direct parity.yml runs.
Restore the auto parity ready-arch dispatch behavior while preserving parity.yml's default CSV naming by omitting csv_name.
|
Jenkins build for d7ff112ed2257d50c0031bdd24d25675b3f89e52 commit finished as FAILURE |
Summary
pytorch/pytorchmaintrunk.ymlpushes and dispatchesparity.ymlonce per ready upstream SHA.pull_requestdry-run path with a smaller scan window to validate the scanner without creating parity reports from PR CI.How it works
pytorch/pytorchtrunk.ymlpush runs onmain. Those trunk runs provide the candidate upstream SHAs to evaluate.ROCm/pytorchparity.ymlrun titles. If any existing parity run already contains that SHA, the SHA is skipped so we keep one report per upstream commit.mainbranch of pytorch/pytorch in any 10-minute intervalstatus=completed. It also waits for the CUDA default, distributed, and inductor check-runs consumed by parity.mem_leak_checkandrerun_disabled_testsare ignored because the parity report does not consume them.parity.ymlwith the ready arch list and a CSV prefix containing the upstream SHA, for exampleautoparity-YYYYMMDD-<sha>.dry_run=true, so they exercise the scanner and log would-be dispatches without creating reports. Scheduled and manually dispatched runs can create real parity reports.Test plan
yaml.BaseLoaderandbash -n.dry_run=false, scanned 20 recent upstream trunk runs, skipped SHAs with pending parity check-runs, dispatched 5 ready SHAs, and stopped atmax_dispatches=5.d76e83ef/mi355: https://github.com/ROCm/pytorch/actions/runs/26041518406457e1890/mi355: https://github.com/ROCm/pytorch/actions/runs/2604152899660f38508/mi355: https://github.com/ROCm/pytorch/actions/runs/26041541647d1d96569/mi355: https://github.com/ROCm/pytorch/actions/runs/260415518546e3cf2e4/mi355, mi300, mi200: https://github.com/ROCm/pytorch/actions/runs/26041618237Dispatch cadence note
max_dispatches=5only to avoid flooding ROCm/pytorch during manual testing.max_dispatches=50,max_commits=200, andmax_age_hours=72unless manually overridden.