fix(ci): separate CUDA buildcache tags and add ignore-error to cache push#6960
- image_tag: amd64-jazzy-main
  cache_threshold_gb: 7.0
- image_tag: arm64-jazzy-main
  cache_threshold_gb: 9.0
- image_tag: amd64-jazzy-cuda-main
  cache_threshold_gb: 7.0
- image_tag: arm64-jazzy-cuda-main
  cache_threshold_gb: 9.0
This also fixes a bug where we forgot to keep "jazzy" cache small.
The test run failed on a BuildKit session timeout during cache-from (DeadlineExceeded / no active session). This is a different problem from what the original commit fixes. The original commit addresses cache-to push failures. Added a second commit that retries CUDA builds without cache-from.

New test run: https://github.com/autowarefoundation/autoware/actions/runs/23682906816
Extended the retry-without-cache pattern to all build jobs (not just CUDA). A separate run on main (without this PR's changes) hit the same session timeout. This confirms the transient GHCR session timeout is not specific to CUDA or self-hosted runners. Since cache-from has no ignore-error option, a failed build is retried without cache.

Test run: https://github.com/autowarefoundation/autoware/actions/runs/23685481890
Force-pushed from 27af20c to 81ad548.
…push

CUDA and non-CUDA builds were writing to the same buildcache registry tag concurrently, corrupting the cache manifest and causing GHCR to return 400 Bad Request. Give CUDA its own cache tag and add ignore-error=true to all cache-to lines so cache push failures don't fail the build.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
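The tag split and the ignore-error flag described in this commit can be sketched in shell. This is only an illustration: the repository path `ghcr.io/example/buildcache` and the function names `cache_tag` and `cache_to_arg` are hypothetical, and the real workflow sets these values for `docker buildx bake` rather than calling helper functions. The `ignore-error=true` attribute is the documented BuildKit cache export option.

```shell
#!/usr/bin/env bash
# Hypothetical registry repository; the real workflow derives its own name.
CACHE_REPO="ghcr.io/example/buildcache"

# Print the cache tag for a build.
# $1: platform (amd64 or arm64); $2: "cuda" for CUDA builds, empty otherwise.
cache_tag() {
  local platform="$1" variant="${2:-}"
  if [ "$variant" = "cuda" ]; then
    # CUDA builds get their own tag so they never share a manifest
    # with the non-CUDA build that runs concurrently.
    echo "${platform}-cuda-main"
  else
    echo "${platform}-main"
  fi
}

# Print a cache-to value. ignore-error=true keeps a failed cache
# export from failing the build itself.
cache_to_arg() {
  echo "type=registry,ref=${CACHE_REPO}:$(cache_tag "$@"),ignore-error=true"
}
```

Under these assumptions, `cache_to_arg amd64 cuda` and `cache_to_arg amd64` yield different registry refs, so the two builds no longer write to the same tag.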
The test run showed a BuildKit session timeout on cache-from during the jazzy-cuda build (DeadlineExceeded / no active session). Unlike cache-to, cache-from has no ignore-error option. On failure, retry the build without cache-from lines so transient GHCR issues don't block the entire workflow.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
Extend the cache-from retry pattern to all build jobs, not just CUDA. The transient GHCR session timeout (DeadlineExceeded / no active session) can hit any build since cache-from has no ignore-error option. On failure, retry without cache-from lines so the build succeeds regardless of registry availability.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
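The retry pattern from the two commits above can be sketched as a shell wrapper. This is not the workflow's actual implementation, which expresses the retry as separate workflow steps. The names `build_with_retry`, `docker_build`, and the `BUILD_CMD` hook are hypothetical, and the hook exists only so the logic can be exercised without Docker.

```shell
#!/usr/bin/env bash
# Default build command: pass --cache-from only when a value is given.
docker_build() {
  if [ -n "$1" ]; then
    docker buildx build --cache-from "$1" .
  else
    docker buildx build .
  fi
}

# Try the build with cache-from first; if it fails for any reason,
# retry once without it. $1: the cache-from value.
build_with_retry() {
  local cache_from="$1"
  if "${BUILD_CMD:-docker_build}" "$cache_from"; then
    return 0
  fi
  echo "build failed with cache-from; retrying without cache" >&2
  "${BUILD_CMD:-docker_build}" ""
}
```

The wrapper does not inspect the failure reason, so a genuine build error is also retried once before the failure is reported. That costs one extra build but avoids matching on error strings.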
Force-pushed from 81ad548 to 317ff9c.
Retry steps should be fully no-cache (no cache-from, no cache-to) since they exist as a fallback for transient GHCR failures. Writing to cache during a retry could propagate corrupt state.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
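A fully no-cache retry amounts to dropping every cache line from the build overrides. The sketch below assumes the overrides are a newline-separated list in the `targetpattern.key=value` form accepted by `docker buildx bake --set`; the function name `strip_cache_lines` and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Read bake overrides on stdin and drop every cache-from / cache-to
# entry, leaving all other overrides untouched.
strip_cache_lines() {
  # grep exits non-zero when nothing is left; treat that as success.
  grep -v -E '\.cache-(from|to)=' || true
}

# Example (not executed here):
#   printf '%s\n' '*.platform=linux/amd64' \
#     '*.cache-from=type=registry,ref=r:t' | strip_cache_lines
```

Filtering both directions means the retry neither reads a possibly unavailable cache nor writes to it.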
The retry steps were referencing needs.load-env*.outputs.* which no longer exist after the load-env to derive-image-names refactor. Use steps.derive.outputs.* to match the initial build steps.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
- Give CUDA builds their own buildcache tag ({platform}-cuda-*) to prevent concurrent write races with non-CUDA/tools builds
- Add ignore-error=true to all cache-to lines so cache push failures don't fail the entire build
- Add the jazzy and CUDA cache tags to keep-build-cache-small.yaml

Why
The CUDA and non-CUDA humble builds were writing to the same {platform}-main buildcache tag concurrently. This corrupted the cache manifest, causing GHCR to return 400 Bad Request on layer blob uploads. Once corrupted, subsequent builds also failed until the buildcache images were manually deleted. Adding ignore-error=true ensures that even transient GHCR issues during cache export never fail the build (since the actual images are already pushed at that point).

The keep-build-cache-small workflow was also only monitoring amd64-main and arm64-main, missing jazzy and CUDA caches entirely.

Test plan
- Run the docker-build-and-push workflow manually and verify all jobs pass
- Verify CUDA builds use {platform}-cuda-main cache tags (visible in the docker buildx bake logs under cache-to)
- Verify non-CUDA builds still use {platform}-main
- Run the keep-build-cache-small workflow manually and verify it checks all 8 cache tags (visible in workflow summary)