fix(ci): separate CUDA buildcache tags and add ignore-error to cache push #6960

Merged
xmfcx merged 5 commits into main from fix/separate-buildcache-tags on Mar 29, 2026

xmfcx commented Mar 28, 2026

Why

The CUDA and non-CUDA humble builds were writing to the same {platform}-main buildcache tag concurrently. This corrupted the cache manifest, causing GHCR to return 400 Bad Request on layer blob uploads. Once corrupted, subsequent builds also failed until the buildcache images were manually deleted. Adding ignore-error=true ensures that even transient GHCR issues during cache export never fail the build (since the actual images are already pushed at that point).


  • The keep-build-cache-small workflow was also only monitoring amd64-main and arm64-main, missing jazzy and CUDA caches entirely.
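
The tag split and the `ignore-error=true` option described above can be sketched as a single bake step. This is a minimal illustration under stated assumptions, not the workflow from this PR: the registry path `ghcr.io/example-org/example-buildcache`, the step name, and the target `example-cuda-target` are hypothetical, and the login and buildx setup steps are left out. Only the `{platform}-cuda-main` tag scheme and `ignore-error=true` come from the description.

```yaml
# Sketch only. The step name, bake target, and registry path are hypothetical.
steps:
  - name: Build CUDA images with a dedicated cache tag
    uses: docker/bake-action@v5
    with:
      targets: example-cuda-target
      push: true
      set: |
        *.cache-from=type=registry,ref=ghcr.io/example-org/example-buildcache:amd64-cuda-main
        *.cache-to=type=registry,ref=ghcr.io/example-org/example-buildcache:amd64-cuda-main,mode=max,ignore-error=true
```

A non-CUDA job would use `amd64-main` in both lines instead, so the two jobs never export to the same tag at the same time.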

Test plan

  • Trigger docker-build-and-push workflow manually and verify all jobs pass
  • Verify CUDA builds write to {platform}-cuda-main cache tags (visible in the docker buildx bake logs under cache-to)
  • Verify non-CUDA and tools builds still write to {platform}-main
  • Trigger keep-build-cache-small workflow manually and verify it checks all 8 cache tags (visible in workflow summary)

xmfcx added the type:ci Continuous Integration (CI) processes and testing. label Mar 28, 2026
xmfcx self-assigned this Mar 28, 2026
xmfcx requested a review from mitsudome-r March 28, 2026 08:23
xmfcx added the run:health-check Run health-check label Mar 28, 2026
Comment on lines +26 to +33
- image_tag: amd64-jazzy-main
  cache_threshold_gb: 7.0
- image_tag: arm64-jazzy-main
  cache_threshold_gb: 9.0
- image_tag: amd64-jazzy-cuda-main
  cache_threshold_gb: 7.0
- image_tag: arm64-jazzy-cuda-main
  cache_threshold_gb: 9.0

This also fixes a bug where we forgot to keep the "jazzy" caches small.

xmfcx commented Mar 28, 2026

The test run failed on docker-build-and-push-jazzy-cuda (amd64) with:

target universe-sensing-perception-cuda: failed to solve: DeadlineExceeded: failed to compute cache key:
failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

This is a different problem from what the original commit fixes. The original commit addresses cache-to (cache push) races and failures via separate CUDA tags and ignore-error=true. This failure is on cache-from (cache read) -- a transient GHCR timeout caused BuildKit's session to expire while resolving registry cache, and unlike cache-to, there is no ignore-error option for cache-from.

Added a second commit that retries CUDA builds without cache-from lines on failure. If the first attempt fails for any transient registry reason, the retry builds from scratch (skipping registry cache) so GHCR flakiness can't block the workflow. The cache-to with ignore-error=true is still included on retry so the cache gets updated on success.
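
The retry described in this comment can be sketched with a step-level `continue-on-error` and an `if:` condition on the first step's outcome. This is a rough illustration, not the actual workflow: the step names and the registry path `ghcr.io/example-org/example-buildcache` are assumptions. As the comment states, the retry drops `cache-from` but still keeps `cache-to` with `ignore-error=true`.

```yaml
# Sketch only. Step names and the registry path are hypothetical.
steps:
  - name: Build with registry cache
    id: build
    continue-on-error: true   # let the job continue so the retry step can run
    uses: docker/bake-action@v5
    with:
      push: true
      set: |
        *.cache-from=type=registry,ref=ghcr.io/example-org/example-buildcache:amd64-cuda-main
        *.cache-to=type=registry,ref=ghcr.io/example-org/example-buildcache:amd64-cuda-main,mode=max,ignore-error=true

  - name: Retry without cache-from
    if: steps.build.outcome == 'failure'
    uses: docker/bake-action@v5
    with:
      push: true
      set: |
        *.cache-to=type=registry,ref=ghcr.io/example-org/example-buildcache:amd64-cuda-main,mode=max,ignore-error=true
```

Because the first step sets `continue-on-error`, the job fails only if the retry also fails.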

New test run: https://github.com/autowarefoundation/autoware/actions/runs/23682906816

xmfcx commented Mar 28, 2026

Extended the retry-without-cache pattern to all build jobs (not just CUDA).

A separate run on main (without this PR's changes) hit the same DeadlineExceeded / no active session error on docker-build-and-push-jazzy (amd64) -- a non-CUDA build on a GitHub-hosted runner:

target universe-devel: failed to solve: DeadlineExceeded: failed to compute cache key:
failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

This confirms the transient GHCR session timeout is not specific to CUDA or self-hosted runners. Since cache-from has no ignore-error option (unlike cache-to), the only way to handle it is to retry the build without registry cache. All five build jobs (humble, tools, humble-cuda, jazzy, jazzy-cuda) now have this fallback.

Test run: https://github.com/autowarefoundation/autoware/actions/runs/23685481890

xmfcx added 3 commits March 28, 2026 14:21
…push

CUDA and non-CUDA builds were writing to the same buildcache registry
tag concurrently, corrupting the cache manifest and causing GHCR to
return 400 Bad Request. Give CUDA its own cache tag and add
ignore-error=true to all cache-to lines so cache push failures don't
fail the build.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

The test run showed a BuildKit session timeout on cache-from during the
jazzy-cuda build (DeadlineExceeded / no active session). Unlike
cache-to, cache-from has no ignore-error option. On failure, retry the
build without cache-from lines so transient GHCR issues don't block the
entire workflow.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

Extend the cache-from retry pattern to all build jobs, not just CUDA.
The transient GHCR session timeout (DeadlineExceeded / no active
session) can hit any build since cache-from has no ignore-error option.
On failure, retry without cache-from lines so the build succeeds
regardless of registry availability.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
xmfcx force-pushed the fix/separate-buildcache-tags branch from 81ad548 to 317ff9c March 28, 2026 11:22
xmfcx added 2 commits March 28, 2026 14:43
Retry steps should be fully no-cache (no cache-from, no cache-to) since
they exist as a fallback for transient GHCR failures. Writing to cache
during a retry could propagate corrupt state.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

The retry steps were referencing needs.load-env*.outputs.* which no
longer exist after the load-env to derive-image-names refactor. Use
steps.derive.outputs.* to match the initial build steps.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
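
The final shape of the retry, as described by the last two commit messages, might look roughly like the fragment below: the retry step carries no cache lines at all and reads its values from the same `derive` step as the initial build. This is a sketch under assumptions. The output name `image_name`, the image path, the tag, and the cache ref layout are hypothetical; only the step id `derive` and the fully no-cache retry come from the commit messages.

```yaml
# Sketch only. The output name `image_name`, the image path, and the tag are hypothetical.
steps:
  - name: Derive image names
    id: derive
    run: echo "image_name=ghcr.io/example-org/example-image" >> "$GITHUB_OUTPUT"

  - name: Build with registry cache
    id: build
    continue-on-error: true
    uses: docker/bake-action@v5
    with:
      push: true
      set: |
        *.tags=${{ steps.derive.outputs.image_name }}:example-tag
        *.cache-from=type=registry,ref=${{ steps.derive.outputs.image_name }}-buildcache:amd64-main
        *.cache-to=type=registry,ref=${{ steps.derive.outputs.image_name }}-buildcache:amd64-main,mode=max,ignore-error=true

  - name: Retry without any registry cache
    if: steps.build.outcome == 'failure'
    uses: docker/bake-action@v5
    with:
      push: true
      set: |
        *.tags=${{ steps.derive.outputs.image_name }}:example-tag
```

Keeping both steps on `steps.derive.outputs.*` means a later rename of the derive step's outputs shows up in the initial build as well, rather than only when the retry path runs.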

xmfcx commented Mar 29, 2026

xmfcx merged commit 7404aa9 into main Mar 29, 2026
32 of 33 checks passed
xmfcx deleted the fix/separate-buildcache-tags branch March 29, 2026 18:42