[CI] Refactor Docker workflow to single source of truth for config#3187
[CI] Refactor Docker workflow to single source of truth for config#3187ethanwee1 wants to merge 5 commits into
Conversation
Move all configurable defaults into the workflow-level env block using inputs || 'fallback' expressions. Every job and step now references env.XXX instead of the previous DEFAULT_* variables or scattered inputs.xxx || env.DEFAULT_XXX patterns. - Consolidate defaults: PYTORCH_REPO, PYTORCH_BRANCH, PYTHON_VERSION, AMDGPU_FAMILY, ROCM_VERSION, INDEX_URL, BASE_IMAGE all live in env - Remove the "Resolve config" / "Resolve inputs with defaults" steps; only truly computed values (commit SHA, discovered ROCm version, docker tag) remain as step outputs - Change default AMDGPU family from gfx950-dcgpu to gfx94X-dcgpu - Add GFX arch and Python version to nightly matrix job names - Add comments explaining that run-name and job name fields cannot access the env context (GitHub Actions limitation)
|
Jenkins build for 376dc82bd4b24a394e14c4603a3691dc500a0ea7 commit finished as FAILURE |
- Split the Docker workflow into a parent file for scheduling and manual triggers, and a reusable callee for build logic. - Added comments to clarify the purpose of each workflow and the handling of inputs. - Ensured that all required inputs are supplied by the parent workflow for better maintainability.
- Replace inline Python script with a dedicated script (`discover_rocm_from_torch_index.py`) for better maintainability and clarity. - Update environment variable handling to streamline the discovery process for ROCm versions in the Docker workflow. - Ensure consistent usage of environment variables across different jobs for improved readability.
Reduce the parent Docker workflow to schedule and workflow_dispatch callers that pass all required inputs into the reusable workflow. This completes the Strategy B split so default literals live in the parent while build logic uses inputs.* in the callee.
|
Jenkins build for 8818b6db1cbb70255f361ebf57f898fbe932ad2e commit finished as FAILURE |
| @@ -0,0 +1,231 @@ | |||
| # Reusable workflow: build and push one Portable Linux PyTorch Docker image. | |||
| # Triggered only via workflow_call (see build_portable_linux_pytorch_dockers.yml parent). | |||
| # Strategy B — inputs are required on the callee (no default: here); callers supply literals. | |||
There was a problem hiding this comment.
| # Strategy B — inputs are required on the callee (no default: here); callers supply literals. | |
| Workflow inputs are required (no defaults here); callers supply literals. |
| # Reusable workflow: build and push one Portable Linux PyTorch Docker image. | ||
| # Triggered only via workflow_call (see build_portable_linux_pytorch_dockers.yml parent). | ||
| # Strategy B — inputs are required on the callee (no default: here); callers supply literals. | ||
| # Use rocm_version: auto to run index discovery (parent normalizes empty dispatch input to auto). |
There was a problem hiding this comment.
@ethanwee1 If it's not infeasible for some reason, I'd prefer we do the "auto" resolution in the caller as well, so that the callee always sees a properly-formed ROCm version input. And since this workflow is meant to be dispatched via workflow_call anyway, it's not really a user-friendly feature.
| echo "No prefix (nightly/main branch)" | ||
| fi | ||
|
|
||
| - name: Discover ROCm version from index |
There was a problem hiding this comment.
Let's move this to the caller workflow
|
|
||
| build-docker: | ||
| if: github.event_name == 'workflow_dispatch' | ||
| name: "Build | ${{ inputs.amdgpu_family || 'gfx94X-dcgpu' }} | ${{ inputs.pytorch_repo || 'pytorch/pytorch' }}@${{ inputs.pytorch_branch || 'nightly' }}" |
There was a problem hiding this comment.
We can get rid of every single hardcoded default here, because this entire section is only run for workflow_dispatch trigger, which specifies the same defaults for each of these inputs, except rocm_version, which I suggest defaulting to auto as mentioned in another comment. Let's keep it as DRY as we can.
There was a problem hiding this comment.
Please align the naming with the other job (order of inputs, include python_version for eg.). Please also include rocm_version. In case you have strong reasons to keep the order different, please elaborate in response here.
| pytorch_branch: ${{ inputs.pytorch_branch || 'nightly' }} | ||
| python_version: ${{ inputs.python_version || '3.12' }} | ||
| amdgpu_family: ${{ inputs.amdgpu_family || 'gfx94X-dcgpu' }} | ||
| rocm_version: ${{ inputs.rocm_version || 'auto' }} |
There was a problem hiding this comment.
In fact, once we move the discover step to this caller workflow, this would always take the output value from that step, IIUC.
| - pytorch_repo: ROCm/pytorch | ||
| pytorch_branch: release/2.9 | ||
| label: "2.9" | ||
| name: "Nightly | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu" |
There was a problem hiding this comment.
| name: "Nightly | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu" | |
| name: "Build | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu | ROCm ${{ inputs.rocm_version }}" |
Not sure why "Nightly" makes more sense in the job name here, it's just a schedule detail. If you really want it to dedup vs the next job name (once you align naming as suggested in other comment), keep it.
There was a problem hiding this comment.
I'm not sure how relevant all of this will be in light of the ufb PR, esp. since we will want to use the same Dockerfile here as well, which leads to a public->private repo crossing. :(
A few points that may justify the continued existence of these workflows (despite the extra maintenance burden):
- This workflow could use single-arch index to have some canary testing of that, since ufb will use multi-arch index (unless single-arch index is being deprecated)
- This could be what our devs use for on-demand docker image generation
Please discuss with @pragupta if above (or other) reasons are strong enough to keep these workflows, or if the ufb one will satisfy all our dev requirements, before putting more effort into this PR.
If you decide to abandon this PR, please clean up any unneeded files (workflows/Dockerfile etc.) from the ROCm/pytorch repo to avoid confusion.
As for the PR itself, looks good mostly. Some important changes I'd like to see are:
- Moving the rocm_version auto-discovery logic here, so callee always sees properly-formed rocm_version
- Mention rocm_version in job names
- Remove hardcoded defaults in job for workflow_dispatch event
Refactors the
build_portable_linux_pytorch_dockersworkflow into a parent workflow plus a reusableworkflow_callcallee so defaults are owned by the parent and build logic uses explicit inputs.Changes
.github/workflows/build_portable_linux_pytorch_dockers.ymlforscheduleandworkflow_dispatchentrypoints..github/workflows/_build_portable_linux_pytorch_docker.ymlfor single-image build logic viaworkflow_call.required: truewith no callee defaults; the parent passes every value explicitly..github/scripts/discover_rocm_from_torch_index.pyand keeprocm_version: autoresolution inside the callee.gfx94X-dcgpuand include Python/GFX arch in scheduled job display names.Plan status
inputs.*for build logic: complete.rocm_versionhandling: complete via parent-providedautoand callee-side discovery.Testing
Successful upstream workflow run on the current PR head (
8818b6db1cbb70255f361ebf57f898fbe932ad2e):https://github.com/ROCm/pytorch/actions/runs/25687082662
The run completed Docker login, image build, package check, image push, and post-build summary successfully.