Skip to content

[CI] Refactor Docker workflow to single source of truth for config#3187

Open
ethanwee1 wants to merge 5 commits into
ROCm:developfrom
ethanwee1:refactor-docker-env-defaults
Open

[CI] Refactor Docker workflow to single source of truth for config#3187
ethanwee1 wants to merge 5 commits into
ROCm:developfrom
ethanwee1:refactor-docker-env-defaults

Conversation

@ethanwee1
Copy link
Copy Markdown

@ethanwee1 ethanwee1 commented Apr 27, 2026

Refactors the build_portable_linux_pytorch_dockers workflow into a parent workflow plus a reusable workflow_call callee so defaults are owned by the parent and build logic uses explicit inputs.

Changes

  • Split the workflow into:
    • Parent: .github/workflows/build_portable_linux_pytorch_dockers.yml for schedule and workflow_dispatch entrypoints.
    • Callee: .github/workflows/_build_portable_linux_pytorch_docker.yml for single-image build logic via workflow_call.
  • Use Strategy B from the plan: all callee inputs are required: true with no callee defaults; the parent passes every value explicitly.
  • Keep default literals in the parent workflow for manual dispatch UX and scheduled matrix calls.
  • Move ROCm discovery into .github/scripts/discover_rocm_from_torch_index.py and keep rocm_version: auto resolution inside the callee.
  • Preserve the default GPU family as gfx94X-dcgpu and include Python/GFX arch in scheduled job display names.

Plan status

  • Parent vs callee workflow split: complete.
  • Callable workflow using inputs.* for build logic: complete.
  • Parent workflow with scheduled matrix and manual dispatch passthrough: complete.
  • Required input forwarding validation: complete.
  • rocm_version handling: complete via parent-provided auto and callee-side discovery.

Testing

Successful upstream workflow run on the current PR head (8818b6db1cbb70255f361ebf57f898fbe932ad2e):
https://github.com/ROCm/pytorch/actions/runs/25687082662

The run completed Docker login, image build, package check, image push, and post-build summary successfully.

Move all configurable defaults into the workflow-level env block using
inputs || 'fallback' expressions.  Every job and step now references
env.XXX instead of the previous DEFAULT_* variables or scattered
inputs.xxx || env.DEFAULT_XXX patterns.

- Consolidate defaults: PYTORCH_REPO, PYTORCH_BRANCH, PYTHON_VERSION,
  AMDGPU_FAMILY, ROCM_VERSION, INDEX_URL, BASE_IMAGE all live in env
- Remove the "Resolve config" / "Resolve inputs with defaults" steps;
  only truly computed values (commit SHA, discovered ROCm version,
  docker tag) remain as step outputs
- Change default AMDGPU family from gfx950-dcgpu to gfx94X-dcgpu
- Add GFX arch and Python version to nightly matrix job names
- Add comments explaining that run-name and job name fields cannot
  access the env context (GitHub Actions limitation)
@rocm-repo-management-api
Copy link
Copy Markdown

rocm-repo-management-api Bot commented Apr 27, 2026

Jenkins build for 376dc82bd4b24a394e14c4603a3691dc500a0ea7 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@ethanwee1 ethanwee1 marked this pull request as ready for review April 28, 2026 14:27
jithunnair-amd and others added 4 commits May 6, 2026 04:10
- Split the Docker workflow into a parent file for scheduling and manual triggers, and a reusable callee for build logic.
- Added comments to clarify the purpose of each workflow and the handling of inputs.
- Ensured that all required inputs are supplied by the parent workflow for better maintainability.
- Replace inline Python script with a dedicated script (`discover_rocm_from_torch_index.py`) for better maintainability and clarity.
- Update environment variable handling to streamline the discovery process for ROCm versions in the Docker workflow.
- Ensure consistent usage of environment variables across different jobs for improved readability.
Reduce the parent Docker workflow to schedule and workflow_dispatch callers that pass all required inputs into the reusable workflow. This completes the Strategy B split so default literals live in the parent while build logic uses inputs.* in the callee.
@rocm-repo-management-api
Copy link
Copy Markdown

rocm-repo-management-api Bot commented May 11, 2026

Jenkins build for 8818b6db1cbb70255f361ebf57f898fbe932ad2e commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@@ -0,0 +1,231 @@
# Reusable workflow: build and push one Portable Linux PyTorch Docker image.
# Triggered only via workflow_call (see build_portable_linux_pytorch_dockers.yml parent).
# Strategy B — inputs are required on the callee (no default: here); callers supply literals.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Strategy B — inputs are required on the callee (no default: here); callers supply literals.
Workflow inputs are required (no defaults here); callers supply literals.

# Reusable workflow: build and push one Portable Linux PyTorch Docker image.
# Triggered only via workflow_call (see build_portable_linux_pytorch_dockers.yml parent).
# Strategy B — inputs are required on the callee (no default: here); callers supply literals.
# Use rocm_version: auto to run index discovery (parent normalizes empty dispatch input to auto).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethanwee1 If it's not infeasible for some reason, I'd prefer we do the "auto" resolution in the caller as well, so that the callee always sees a properly-formed ROCm version input. And since this workflow is meant to be dispatched via workflow_call anyway, it's not really a user-friendly feature.

echo "No prefix (nightly/main branch)"
fi

- name: Discover ROCm version from index
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to the caller workflow


build-docker:
if: github.event_name == 'workflow_dispatch'
name: "Build | ${{ inputs.amdgpu_family || 'gfx94X-dcgpu' }} | ${{ inputs.pytorch_repo || 'pytorch/pytorch' }}@${{ inputs.pytorch_branch || 'nightly' }}"
Copy link
Copy Markdown
Collaborator

@jithunnair-amd jithunnair-amd May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of every single hardcoded default here, because this entire section is only run for workflow_dispatch trigger, which specifies the same defaults for each of these inputs, except rocm_version, which I suggest defaulting to auto as mentioned in another comment. Let's keep it as DRY as we can.

Copy link
Copy Markdown
Collaborator

@jithunnair-amd jithunnair-amd May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please align the naming with the other job (order of inputs, include python_version for eg.). Please also include rocm_version. In case you have strong reasons to keep the order different, please elaborate in response here.

pytorch_branch: ${{ inputs.pytorch_branch || 'nightly' }}
python_version: ${{ inputs.python_version || '3.12' }}
amdgpu_family: ${{ inputs.amdgpu_family || 'gfx94X-dcgpu' }}
rocm_version: ${{ inputs.rocm_version || 'auto' }}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, once we move the discover step to this caller workflow, this would always take the output value from that step, IIUC.

- pytorch_repo: ROCm/pytorch
pytorch_branch: release/2.9
label: "2.9"
name: "Nightly | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: "Nightly | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu"
name: "Build | torch ${{ matrix.label }} | py3.12 | gfx94X-dcgpu | ROCm ${{ inputs.rocm_version }}"

Not sure why "Nightly" makes more sense in the job name here, it's just a schedule detail. If you really want it to dedup vs the next job name (once you align naming as suggested in other comment), keep it.

Copy link
Copy Markdown
Collaborator

@jithunnair-amd jithunnair-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how relevant all of this will be in light of the ufb PR, esp. since we will want to use the same Dockerfile here as well, which leads to a public->private repo crossing. :(

A few points that may justify the continued existence of these workflows (despite the extra maintenance burden):

  • This workflow could use single-arch index to have some canary testing of that, since ufb will use multi-arch index (unless single-arch index is being deprecated)
  • This could be what our devs use for on-demand docker image generation

Please discuss with @pragupta if above (or other) reasons are strong enough to keep these workflows, or if the ufb one will satisfy all our dev requirements, before putting more effort into this PR.

If you decide to abandon this PR, please clean up any unneeded files (workflows/Dockerfile etc.) from the ROCm/pytorch repo to avoid confusion.

As for the PR itself, looks good mostly. Some important changes I'd like to see are:

  • Moving the rocm_version auto-discovery logic here, so callee always sees properly-formed rocm_version
  • Mention rocm_version in job names
  • Remove hardcoded defaults in job for workflow_dispatch event

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants