ci: retry submodule fetch on transient github.com 500s #5969
Merged
…ansient github.com 500s
…flow git submodule loop
2 tasks
…checkout # Conflicts: # .github/workflows/build-test-windows.yml
retry.sh landed on master with the same 3x/30s defaults the inline loop was hand-rolling. Calling it removes ~13 lines of bookkeeping and the long retry-rationale comment.
Grantim
approved these changes
Apr 27, 2026
oitel
approved these changes
Apr 27, 2026
Comment on lines -57 to -58
# Selective init -- parent Checkout drops submodules:true.
# https://github.com/actions/checkout/issues/1779
Contributor
Author
Restored — moved the comment back inside the run: block in 80d1f23.
thirdparty/cpp-httplib \
thirdparty/mrbind \
thirdparty/mrbind-pybind11
# mrbind needs deps/cppdecl; recurse only there
Contributor
Author
Good catch — restored in 80d1f23 right above the cppdecl line.
Per review (@oitel): keep the master-era "Selective init" / mrbind-deps comments inside the run: block where they were, not lifted to YAML level.
Came along with the original retry-loop draft as defensive cover for "partial clone left by a failed prior attempt". On a fresh CI runner that case is hypothetical, and dropping it narrows the diff vs master to exactly "wrap in retry.sh".
Same source as the dropped `--force`: came from the PR's original loop draft. Functionally identical to `--depth=1`, but master's selective init writes the space-separated form.
Same change as windows now applied to macos, ubuntu-x64, ubuntu-arm64, linux-vcpkg, emscripten, pip-build, and update-docs-manual: wrap the two submodule-update calls in bash scripts/retry.sh -- so a transient github.com 500 on any submodule clone retries 3x at 30s intervals instead of failing the job on first try. Also drops the leftover -c protocol.version=2 from the windows step so all 8 workflows share the exact same wrapper form.
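A rough sketch of the wrapper form this commit message describes, written as a dry run so it can be tried outside the repository. The three submodule paths are only the ones visible in the review excerpt above (the real lists may be longer), and the exact shape of the cppdecl call is an assumption; `wrapped_calls` and `SUBMODULES` are illustrative names, not part of the workflows.

```shell
#!/bin/sh
# Dry run: prints the wrapped commands instead of executing them.
# Inside a checkout that ships scripts/retry.sh, drop the leading "echo".

# Paths taken from the review excerpt; the real workflow lists may differ.
SUBMODULES="thirdparty/cpp-httplib thirdparty/mrbind thirdparty/mrbind-pybind11"

wrapped_calls() {
    # 1) Selective, shallow init of the listed submodules.
    #    $SUBMODULES is left unquoted on purpose so it splits into paths.
    echo bash scripts/retry.sh -- git submodule update --init --depth 1 $SUBMODULES
    # 2) mrbind needs deps/cppdecl; recurse only there (call shape assumed).
    echo bash scripts/retry.sh -- git -C thirdparty/mrbind submodule update --init --depth 1 deps/cppdecl
}

wrapped_calls
```

Everything after `--` is the unchanged git command, which is why the diff against master stays limited to the prefix.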
The wrapper-extension commit (8b80226) landed while the disable-build-* labels were still set, so non-windows workflows were skipped. The labels have since been removed; this empty commit re-triggers the matrix so the new retry.sh wrapper actually exercises macos / ubuntu / linux-vcpkg / emscripten / pip-build.
Summary
Submodule clones over github.com periodically fail with HTTP 500 in CI. Captured in run 24845654039 on Windows:
Six different third-party submodules failed within ~25 s on that run (openvdb, parallel-hashmap, tinygltf, tinyxml2, zlib-ng, openvdb/v10): pure github.com flakes, not infra on our side. The hit was on Windows, but the same risk exists on every workflow that does `git submodule update --init` for our third-party tree.
`actions/checkout@v6` does have built-in retry logic, but it only retries each submodule once, and that wasn't enough to ride out the ~25 s spike.
Change
Wrap the existing selective `git submodule update` calls in `scripts/retry.sh`, the 3x/30s retry helper that master already ships and that the Rocky vcpkg Dockerfiles already use.
Three attempts with a 30 s cooldown are enough to ride out the ~25 s github.com 500 spike observed in the failing run. The only behavioral delta vs master (#5992 for the selective-list pattern itself) is the `retry.sh` wrapper; the submodule lists, `--init --depth 1`, and the cppdecl recursion are all unchanged.
Scope
All 8 workflows that do a "Checkout third-party submodules" step:
- `build-test-windows.yml` (the originally-failing one)
- `build-test-macos.yml`
- `build-test-ubuntu-x64.yml`
- `build-test-ubuntu-arm64.yml`
- `build-test-linux-vcpkg.yml`
- `build-test-emscripten.yml`
- `pip-build.yml` (workflow_dispatch / release; not exercised by PR CI)
- `update-docs-manual.yml` (workflow_dispatch only; not exercised by PR CI)
Each gets the same two-line edit (one `# Retried via retry.sh: ...` comment, two `bash scripts/retry.sh --` prefixes). Wrapper form is identical across all 8.
Why not `Wandalen/wretry.action`
An earlier attempt on this PR wrapped the whole `actions/checkout` step in `Wandalen/wretry.action@v3.8.0`. It triggered a `startup_failure`: the workflow parser refused to schedule any job. Root cause appears to be that `wretry.action`'s outer composite-action layer dispatches to an inner `_js_action` running `node20`, but `actions/checkout@v6` uses `node24`; the handoff doesn't work and the whole workflow is rejected before any step runs. `wretry.action` is thinly maintained and has an open issue (#193) with no ETA for a fix, so an in-workflow retry via `scripts/retry.sh` is the right trade.
Test plan
- `cbefd973`, run 25015648390, all green: `retry.sh` warnings silent (i.e. no submodule needed a retry on this run; the wrapper is a no-op on the happy path).
- `a9cfc67a` validated the wrapper in isolation: run 25009155998, 4/4 windows legs green.
- `pip-build.yml` and `update-docs-manual.yml` aren't reached by PR CI; their edits are identical two-line copies of the verified pattern.
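To make the "no-op on the happy path" claim concrete, here is a minimal sketch of the retry semantics described under Change, exercised against a simulated flaky command. It is a stand-in, not the contents of `scripts/retry.sh`: the function name `retry_cmd`, the `RETRY_ATTEMPTS` / `RETRY_DELAY` variables, and the message wording are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for scripts/retry.sh; the real script may differ.
# RETRY_ATTEMPTS / RETRY_DELAY are illustrative names, defaulting to the
# 3 attempts / 30 s cooldown mentioned in the PR description.
retry_cmd() {
    if [ "$1" = "--" ]; then shift; fi   # accept the "--" separator
    attempts=${RETRY_ATTEMPTS:-3}
    delay=${RETRY_DELAY:-30}
    attempt=1
    while :; do
        rc=0
        "$@" || rc=$?                    # run the command, capture its exit code
        if [ "$rc" -eq 0 ]; then return 0; fi
        if [ "$attempt" -ge "$attempts" ]; then
            echo "retry: giving up after $attempt attempts (exit $rc)" >&2
            return "$rc"
        fi
        echo "retry: attempt $attempt/$attempts failed (exit $rc); sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

RETRY_DELAY=0   # no cooldown for the demo

# Simulated transient failure: fails on calls 1 and 2, succeeds on call 3.
calls=0
flaky() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}
retry_cmd -- flaky
echo "flaky command succeeded after $calls calls"

# Happy path: the command runs exactly once and nothing is logged.
happy_calls=0
happy() { happy_calls=$((happy_calls + 1)); }
retry_cmd -- happy
echo "happy-path command ran $happy_calls time(s)"
```

A command that never succeeds still fails the step, with the exit code of its last attempt, so a persistent outage is reported rather than masked.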