ci: retry submodule fetch on transient github.com 500s#5969

Merged
Fedr merged 9 commits into master from cicd/retry-windows-checkout on Apr 28, 2026

@Fedr Fedr commented Apr 23, 2026

Summary

Submodule clones over github.com periodically fail with HTTP 500 in CI. Captured in run 24845654039 on Windows:

error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
fatal: expected 'packfile'
fatal: clone of 'https://github.com/AcademySoftwareFoundation/openvdb' into submodule path 'thirdparty/openvdb/v9/openvdb' failed

Six different third-party submodules failed within ~25 s on that run (openvdb, parallel-hashmap, tinygltf, tinyxml2, zlib-ng, openvdb/v10): pure github.com flakes, not infra on our side. The failure hit Windows, but the same risk exists in every workflow that runs git submodule update --init for our third-party tree.

actions/checkout@v6 does have built-in retry logic, but it only retries each submodule once and that wasn't enough to ride out the ~25 s spike.
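
When triaging a failed run like the one quoted above, it can help to confirm that the failure was a server-side 5xx rather than a bad URL or missing repository. The following is a minimal sketch, not part of this PR; the function name is_transient_git_error is hypothetical, and the pattern is matched only against the error text shown in the log above.

```shell
# Hypothetical triage helper (not part of this PR).
# Returns 0 if the captured git stderr looks like a server-side HTTP 5xx failure.
is_transient_git_error() {
  # $1: stderr text captured from a failed git command
  printf '%s\n' "$1" | grep -Eq 'HTTP 5[0-9]{2}|returned error: 5[0-9]{2}'
}
```

A 404 or an authentication failure would not match, which is the distinction that matters when deciding whether a rerun is likely to help.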

Change

Wrap the existing selective git submodule update calls in scripts/retry.sh — the 3x/30s retry helper master already ships and that the Rocky vcpkg Dockerfiles already use:

- name: Checkout third-party submodules
  run: |
    # Selective init -- parent Checkout drops submodules:true.
    # https://github.com/actions/checkout/issues/1779
    # Retried via retry.sh: submodule endpoints occasionally 500.
    bash scripts/retry.sh -- git submodule update --init --depth 1 \
      thirdparty/imgui \
      ...
    # mrbind needs deps/cppdecl; recurse only there
    bash scripts/retry.sh -- git -C thirdparty/mrbind submodule update --init --depth 1 deps/cppdecl

Three attempts with a 30 s cooldown are enough to ride out the ~25 s github.com 500 spike observed in the failing run. The only behavioral delta vs master (see #5992 for the selective-list pattern itself) is the retry.sh wrapper; the submodule lists, --init --depth 1, and the cppdecl recursion are all unchanged.
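
For readers without the repository at hand, the retry behavior described above can be sketched as follows. This is an illustrative stand-in, not the contents of scripts/retry.sh, which this PR does not show. The function name retry and the RETRY_ATTEMPTS and RETRY_DELAY variables are assumptions; only the 3x/30s defaults and the leading -- separator come from the description.

```shell
# Hypothetical stand-in for scripts/retry.sh (the real script is not shown here).
# RETRY_ATTEMPTS and RETRY_DELAY are illustrative names; the defaults mirror 3x/30s.
retry() {
  local attempts="${RETRY_ATTEMPTS:-3}"
  local delay="${RETRY_DELAY:-30}"
  local n=1
  local status=1

  # Accept an optional "--" separator, matching the call form used in the workflow.
  if [ "${1:-}" = "--" ]; then
    shift
  fi

  while [ "$n" -le "$attempts" ]; do
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    # Sleep only between attempts, not after the final failure.
    if [ "$n" -lt "$attempts" ]; then
      echo "attempt $n/$attempts failed (exit $status); retrying in ${delay}s" >&2
      sleep "$delay"
    fi
    n=$((n + 1))
  done

  echo "all $attempts attempts failed" >&2
  return "$status"
}
```

With these defaults the last attempt starts at least 60 s after the first, which is why a ~25 s outage is comfortably covered. On the happy path the command runs once and the function returns immediately.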

Scope

All 8 workflows that do a "Checkout third-party submodules" step:

  • build-test-windows.yml (the originally-failing one)
  • build-test-macos.yml
  • build-test-ubuntu-x64.yml
  • build-test-ubuntu-arm64.yml
  • build-test-linux-vcpkg.yml
  • build-test-emscripten.yml
  • pip-build.yml (workflow_dispatch / release — not exercised by PR CI)
  • update-docs-manual.yml (workflow_dispatch only — not exercised by PR CI)

Each gets the same edit: one added # Retried via retry.sh: ... comment line, and a bash scripts/retry.sh -- prefix on each of the two submodule-update commands. The wrapper form is identical across all 8.
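
Since the same edit is repeated across eight files, a quick audit can confirm that no submodule update call was left unwrapped. The sketch below is not part of this PR; the function name find_unwrapped_submodule_updates is hypothetical, and it assumes the wrapper prefix sits on the same line as the git command, as in the snippet above.

```shell
# Hypothetical audit helper (not part of this PR).
# Prints file:line:content for every "submodule update" line lacking scripts/retry.sh.
find_unwrapped_submodule_updates() {
  # $1: directory holding the workflow files, e.g. .github/workflows
  grep -Hn 'submodule update' "$1"/*.yml | grep -v 'scripts/retry.sh' || true
}
```

Empty output means every matching line carries the wrapper. The git -C thirdparty/mrbind form is covered too, because the pattern matches on submodule update rather than on the start of the command.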

Why not Wandalen/wretry.action

An earlier attempt on this PR wrapped the whole actions/checkout step in Wandalen/wretry.action@v3.8.0. It triggered a startup_failure — the workflow parser refused to schedule any job. Root cause appears to be that wretry.action's outer composite-action layer dispatches to an inner _js_action running node20, but actions/checkout@v6 uses node24; the handoff doesn't work and the whole workflow is rejected before any step runs. wretry.action is thinly maintained and has an open issue (#193) with no ETA for a fix, so an in-workflow retry via scripts/retry.sh is the right trade.

Test plan

  • Full CI matrix on cbefd973 (run 25015648390), all green:
    • windows 4/4, macos 3/3, ubuntu-x64 3/3, ubuntu-arm64 2/2, linux-vcpkg 4/4, emscripten 3/3 (19/19 build-test legs).
    • First-attempt success on every leg — retry.sh warnings silent (i.e. no submodule needed a retry on this run; the wrapper is a no-op on the happy path).
  • Earlier windows-only commit a9cfc67a validated the wrapper in isolation — run 25009155998, 4/4 windows legs green.
  • pip-build.yml and update-docs-manual.yml aren't reached by PR CI; their edits are identical two-line copies of the verified pattern.
  • If a future CI run hits a github.com 500 during submodule fetch, retries kick in and the job still succeeds.
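
The green runs above only cover the happy path, so the last bullet remains an expectation rather than an observed result. One way to exercise the retry path locally is a stub command that fails a fixed number of times before succeeding. The sketch below is not part of this PR; make_flaky_stub and the stub's counter file are hypothetical, and the error text is copied from the failing CI log.

```shell
# Hypothetical test stub generator (not part of this PR).
# Writes a script that fails its first N runs with the logged HTTP 500 message,
# then succeeds on every later run.
make_flaky_stub() {
  # $1: path of the stub script to create; $2: number of initial failures
  cat > "$1" <<EOF
#!/bin/sh
count_file="$1.count"
failures=$2
n=\$(cat "\$count_file" 2>/dev/null || echo 0)
n=\$((n + 1))
echo "\$n" > "\$count_file"
if [ "\$n" -le "\$failures" ]; then
  echo "error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500" >&2
  exit 1
fi
echo "ok on attempt \$n"
EOF
  chmod +x "$1"
}
```

After make_flaky_stub /tmp/flaky 2, running bash scripts/retry.sh -- sh /tmp/flaky from a checkout should show two failed attempts followed by a success, assuming retry.sh retries on any non-zero exit. With the 30 s cooldown that takes about a minute.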

@Fedr Fedr changed the title from "ci(windows): retry Checkout via Wandalen/wretry.action" to "ci(windows): retry submodule fetch on transient github.com 500s" on Apr 23, 2026
Fedr added 2 commits April 27, 2026 20:08
…checkout

# Conflicts:
#	.github/workflows/build-test-windows.yml
retry.sh landed on master with the same 3x/30s defaults the inline
loop was hand-rolling. Calling it removes ~13 lines of bookkeeping
and the long retry-rationale comment.
Comment on lines -57 to -58
# Selective init -- parent Checkout drops submodules:true.
# https://github.com/actions/checkout/issues/1779
Contributor

Keep the old comment.

Contributor Author

Restored — moved the comment back inside the run: block in 80d1f23.

thirdparty/cpp-httplib \
thirdparty/mrbind \
thirdparty/mrbind-pybind11
# mrbind needs deps/cppdecl; recurse only there
Contributor

Where did this comment go?

Contributor Author

Good catch — restored in 80d1f23 right above the cppdecl line.

Fedr added 4 commits April 27, 2026 21:20
Per review (@oitel): keep the master-era "Selective init" / mrbind-deps
comments inside the run: block where they were, not lifted to YAML
level.

Came along with the original retry-loop draft as defensive cover for
"partial clone left by a failed prior attempt". On a fresh CI runner
that case is hypothetical, and dropping it narrows the diff vs master
to exactly "wrap in retry.sh".

Same source as the dropped `--force`: came from the PR's original
loop draft. Functionally identical to `--depth=1`, but master's
selective init writes the space-separated form.

Same change as windows now applied to macos, ubuntu-x64, ubuntu-arm64,
linux-vcpkg, emscripten, pip-build, and update-docs-manual: wrap the
two submodule-update calls in bash scripts/retry.sh -- so a transient
github.com 500 on any submodule clone retries 3x at 30s intervals
instead of failing the job on first try.

Also drops the leftover -c protocol.version=2 from the windows step
so all 8 workflows share the exact same wrapper form.
@Fedr Fedr changed the title from "ci(windows): retry submodule fetch on transient github.com 500s" to "ci: retry submodule fetch on transient github.com 500s" on Apr 27, 2026
The wrapper-extension commit (8b80226) landed while the disable-build-*
labels were still set, so non-windows workflows were skipped. The labels
have since been removed; this empty commit re-triggers the matrix so the
new retry.sh wrapper actually exercises macos / ubuntu / linux-vcpkg /
emscripten / pip-build.
@Fedr Fedr merged commit ee04e2f into master Apr 28, 2026
35 checks passed
@Fedr Fedr deleted the cicd/retry-windows-checkout branch April 28, 2026 06:54
3 participants