fix(ci): separate CUDA buildcache tags and add ignore-error to cache push by xmfcx · Pull Request #6960 · autowarefoundation/autoware

xmfcx · 2026-03-28T08:20:11Z

Closes fix(ci): separate buildcache tags per variant and add ignore-error to prevent flaky failures #6959
- Separate CUDA builds into their own buildcache registry tag ({platform}-cuda-*) to prevent concurrent write races with non-CUDA/tools builds
- Add ignore-error=true to all cache-to lines so cache push failures don't fail the entire build
  - Reference: https://github.com/moby/buildkit/blob/8a412488df124db59682464791d8e3a29de6ecdd/README.md#registry-push-image-and-cache-separately
- Add missing jazzy and CUDA cache tags to keep-build-cache-small.yaml

Why

The CUDA and non-CUDA humble builds were writing to the same {platform}-main buildcache tag concurrently. This corrupted the cache manifest, causing GHCR to return 400 Bad Request on layer blob uploads. Once corrupted, subsequent builds also failed until the buildcache images were manually deleted. Adding ignore-error=true ensures that even transient GHCR issues during cache export never fail the build (since the actual images are already pushed at that point).
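The tag split described above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the helper names and the `ghcr.io/example/buildcache` image reference are hypothetical, and the naming rule (humble tags carry no distro segment, CUDA adds a `cuda` segment) is inferred from the tags mentioned in this PR. The `ignore-error=true` attribute is the BuildKit registry cache exporter option from the linked README.

```python
def cache_tag(platform: str, rosdistro: str = "humble", cuda: bool = False) -> str:
    """Return a buildcache tag such as 'amd64-cuda-main' (naming rule inferred from the PR)."""
    parts = [platform]
    if rosdistro != "humble":  # assumption: humble tags carry no distro segment
        parts.append(rosdistro)
    if cuda:
        parts.append("cuda")  # CUDA builds get their own tag, so writers never collide
    parts.append("main")
    return "-".join(parts)


def cache_to(image: str, tag: str) -> str:
    """Build a cache-to value whose push failures do not fail the build."""
    return f"type=registry,ref={image}:{tag},ignore-error=true"


if __name__ == "__main__":
    image = "ghcr.io/example/buildcache"  # hypothetical image name
    print(cache_to(image, cache_tag("amd64")))
    print(cache_to(image, cache_tag("amd64", cuda=True)))
```

With distinct tags per variant, the CUDA and non-CUDA jobs no longer export to the same registry reference at the same time.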

The keep-build-cache-small workflow was also only monitoring amd64-main and arm64-main, missing jazzy and CUDA caches entirely.

Test plan

- Trigger docker-build-and-push workflow manually and verify all jobs pass
  - Test run: https://github.com/autowarefoundation/autoware/actions/runs/23685481890
- Verify CUDA builds write to {platform}-cuda-main cache tags (visible in the docker buildx bake logs under cache-to)
- Verify non-CUDA and tools builds still write to {platform}-main
- Trigger keep-build-cache-small workflow manually and verify it checks all 8 cache tags (visible in workflow summary)

github-actions · 2026-03-28T08:20:20Z

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

You've checked our contribution guidelines.
Your PR follows our pull request guidelines.
All required CI checks pass before marking the PR ready for review.

xmfcx · 2026-03-28T08:29:07Z

+          - image_tag: amd64-jazzy-main
+            cache_threshold_gb: 7.0
+          - image_tag: arm64-jazzy-main
+            cache_threshold_gb: 9.0
+          - image_tag: amd64-jazzy-cuda-main
+            cache_threshold_gb: 7.0
+          - image_tag: arm64-jazzy-cuda-main
+            cache_threshold_gb: 9.0


This also fixes a bug where we forgot to keep the "jazzy" caches small.
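As a rough illustration of what a per-tag threshold means, the sketch below flags tags whose measured size exceeds the configured `cache_threshold_gb`. The thresholds are copied from the diff above; the function name, the sample sizes, and the strict "greater than" comparison are assumptions, and what the real workflow does once a tag is over its threshold is not shown in this PR.

```python
# Thresholds copied from the diff above. The humble and humble-CUDA entries are
# omitted because their values are not shown in this PR.
THRESHOLDS_GB = {
    "amd64-jazzy-main": 7.0,
    "arm64-jazzy-main": 9.0,
    "amd64-jazzy-cuda-main": 7.0,
    "arm64-jazzy-cuda-main": 9.0,
}


def tags_over_threshold(sizes_gb: dict[str, float]) -> list[str]:
    """Return the monitored tags whose measured size exceeds their threshold."""
    return sorted(
        tag
        for tag, limit in THRESHOLDS_GB.items()
        # Assumption: a tag with no measurement counts as 0 GB, and the
        # comparison is strict (exactly at the threshold is still acceptable).
        if sizes_gb.get(tag, 0.0) > limit
    )


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    measured = {"amd64-jazzy-main": 7.5, "arm64-jazzy-main": 8.9}
    print(tags_over_threshold(measured))
```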

xmfcx · 2026-03-28T10:03:37Z

The test run failed on docker-build-and-push-jazzy-cuda (amd64) with:

target universe-sensing-perception-cuda: failed to solve: DeadlineExceeded: failed to compute cache key:
failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

This is a different problem from what the original commit fixes. The original commit addresses cache-to (cache push) races and failures via separate CUDA tags and ignore-error=true. This failure is on cache-from (cache read) -- a transient GHCR timeout caused BuildKit's session to expire while resolving registry cache, and unlike cache-to, there is no ignore-error option for cache-from.

Added a second commit that retries CUDA builds without cache-from lines on failure. If the first attempt fails for any transient registry reason, the retry builds from scratch (skipping registry cache) so GHCR flakiness can't block the workflow. The cache-to with ignore-error=true is still included on retry so the cache gets updated on success.

New test run: https://github.com/autowarefoundation/autoware/actions/runs/23682906816

xmfcx · 2026-03-28T11:14:44Z

Extended the retry-without-cache pattern to all build jobs (not just CUDA).

A separate run on main (without this PR's changes) hit the same DeadlineExceeded / no active session error on docker-build-and-push-jazzy (amd64) -- a non-CUDA build on a GitHub-hosted runner:

target universe-devel: failed to solve: DeadlineExceeded: failed to compute cache key:
failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

This confirms the transient GHCR session timeout is not specific to CUDA or self-hosted runners. Since cache-from has no ignore-error option (unlike cache-to), the only way to handle it is to retry the build without registry cache. All five build jobs (humble, tools, humble-cuda, jazzy, jazzy-cuda) now have this fallback.
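The fallback logic can be sketched in a few lines. This is an illustrative model only: in the PR the retry lives in workflow steps, whose exact form is not shown here, and `run_build` is a hypothetical stand-in for invoking the build with the given cache-from and cache-to lines. The `keep_cache_to_on_retry` flag is also hypothetical; setting it to `False` models the fully no-cache retry described in a later commit of this PR.

```python
from typing import Callable, Sequence

# Hypothetical build callable: receives cache-from lines and cache-to lines,
# and raises an exception if the build fails.
BuildFn = Callable[[Sequence[str], Sequence[str]], None]


def build_with_fallback(
    run_build: BuildFn,
    cache_from: Sequence[str],
    cache_to: Sequence[str],
    keep_cache_to_on_retry: bool = True,
) -> str:
    """Try a cached build first; on any failure, retry without registry cache reads."""
    try:
        run_build(cache_from, cache_to)
        return "cached"
    except Exception:
        # cache-from has no ignore-error option, so the retry drops it entirely.
        retry_cache_to = cache_to if keep_cache_to_on_retry else ()
        run_build((), retry_cache_to)
        return "retried"
```

If the retry itself fails, the exception propagates, so a genuine build error still fails the job.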

Test run: https://github.com/autowarefoundation/autoware/actions/runs/23685481890

…push

CUDA and non-CUDA builds were writing to the same buildcache registry tag concurrently, corrupting the cache manifest and causing GHCR to return 400 Bad Request. Give CUDA its own cache tag and add ignore-error=true to all cache-to lines so cache push failures don't fail the build. Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

The test run showed a BuildKit session timeout on cache-from during the jazzy-cuda build (DeadlineExceeded / no active session). Unlike cache-to, cache-from has no ignore-error option. On failure, retry the build without cache-from lines so transient GHCR issues don't block the entire workflow. Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

Extend the cache-from retry pattern to all build jobs, not just CUDA. The transient GHCR session timeout (DeadlineExceeded / no active session) can hit any build since cache-from has no ignore-error option. On failure, retry without cache-from lines so the build succeeds regardless of registry availability. Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

Retry steps should be fully no-cache (no cache-from, no cache-to) since they exist as a fallback for transient GHCR failures. Writing to cache during a retry could propagate corrupt state. Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

The retry steps were referencing needs.load-env*.outputs.* which no longer exist after the load-env to derive-image-names refactor. Use steps.derive.outputs.* to match the initial build steps. Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

xmfcx · 2026-03-29T18:41:47Z

https://github.com/autowarefoundation/autoware/actions/runs/23685481890 has completed ✅.

xmfcx requested review from isamu-takagi, oguzkaganozt and youtalk as code owners March 28, 2026 08:20

xmfcx added the type:ci Continuous Integration (CI) processes and testing. label Mar 28, 2026

xmfcx self-assigned this Mar 28, 2026

xmfcx requested a review from mitsudome-r March 28, 2026 08:23

xmfcx added the run:health-check Run health-check label Mar 28, 2026

xmfcx mentioned this pull request Mar 28, 2026

refactor(ci): pass rosdistro instead of env files to workflows #6952

Merged

4 tasks

mitsudome-r approved these changes Mar 28, 2026

xmfcx force-pushed the fix/separate-buildcache-tags branch from 27af20c to 81ad548 Compare March 28, 2026 11:16

xmfcx mentioned this pull request Mar 28, 2026

refactor(ci): convert load-env workflow to derive-image-names composite action #6961

Merged

4 tasks

xmfcx added 3 commits March 28, 2026 14:21

xmfcx force-pushed the fix/separate-buildcache-tags branch from 81ad548 to 317ff9c Compare March 28, 2026 11:22

xmfcx added 2 commits March 28, 2026 14:43

xmfcx merged commit 7404aa9 into main Mar 29, 2026
32 of 33 checks passed

xmfcx deleted the fix/separate-buildcache-tags branch March 29, 2026 18:42

tier4-autoware-public-bot Bot mentioned this pull request Mar 30, 2026

chore: sync upstream tier4/autoware#5

Open

xmfcx mentioned this pull request Mar 30, 2026

fix(ci): separate buildcache tags per variant and add ignore-error to prevent flaky failures #6959

Closed

xmfcx merged 5 commits into main from fix/separate-buildcache-tags
