What
POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token returns 502 Bad Gateway under bursts of concurrent requests, with no clear retry hint. Self-hosted runners spawned from an autoscaler (in our case Modal-hosted ephemeral GHA runners) fail to register and exit immediately.
Observed 2026-05-04 ~16:00–17:00 UTC. ~21 ephemeral containers died mid-registration during a webhook-driven spawn burst, requiring manual cleanup of phantom Modal containers.
Repro
- Configure many ephemeral runners to register on the workflow_job.queued webhook (we do ~20 concurrent at peak).
- Observe the runner-side stack trace from _mint_registration_token:
    File "/root/app.py", line 247, in _mint_registration_token
      with urllib.request.urlopen(req, timeout=15) as resp:
    ...
    urllib.error.HTTPError: HTTP Error 502: Bad Gateway
- The container's runner.sh fails before config.sh registers; the slot is wasted.
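For anyone trying to reproduce this without an autoscaler, the burst can be approximated with a small standalone script. This is only a sketch: the helper names (mint_once, burst) are hypothetical and not taken from our app.py, and it assumes a token that is allowed to mint registration tokens for the repo.

```python
import collections
import concurrent.futures
import urllib.error
import urllib.request

API = "https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token"


def mint_once(owner, repo, token):
    """POST once to the registration-token endpoint and return the HTTP status code."""
    req = urllib.request.Request(
        API.format(owner=owner, repo=repo),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Error statuses (including the 502) arrive as exceptions; report the code.
        return e.code


def burst(call, n=20):
    """Run `call` n times concurrently and tally the status codes it returns."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(call) for _ in range(n)]
        return collections.Counter(f.result() for f in futures)
```

Calling burst(lambda: mint_once(owner, repo, token)) returns a Counter keyed by status code, so any 502s show up as their own entry. Whether 20 concurrent calls from a single host is enough to trigger the failure is not something we have verified.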
What's missing on the runner side
The _mint_registration_token flow in the reference runner script doesn't retry 5xx responses, so a single transient 502 from api.github.com kills the registration. For ephemeral spawns this is particularly painful because each failed registration burns a Modal container slot.
Suggested fix (runner-side)
Add bounded retry-with-backoff on 5xx responses to the registration-token call in the example runner setup script and document the pattern. Something like:
    for attempt in range(4):
        try:
            with urllib.request.urlopen(req, timeout=15) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as e:
            if 500 <= e.code < 600 and attempt < 3:
                time.sleep(2 ** attempt)
                continue
            raise
The github.com side may also want to investigate the 502 burst pattern, but a retry on the runner side dramatically reduces user-visible impact regardless.
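A slightly fuller, self-contained variant of the retry is sketched below. The function name and its parameters (open_json_with_retry, base_delay, opener, sleep) are our own illustration, not anything from the runner scripts. It adds two things the snippet above lacks, both as assumptions about what is desirable: random jitter, so that ~20 runners that failed together do not retry in lockstep, and retry on urllib.error.URLError for connection-level failures.

```python
import json
import random
import time
import urllib.error
import urllib.request


def open_json_with_retry(req, attempts=4, base_delay=1.0, timeout=15,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Open `req` and parse the JSON body, retrying 5xx and connection errors."""
    for attempt in range(attempts):
        last = attempt == attempts - 1
        try:
            with opener(req, timeout=timeout) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as e:
            # Non-5xx errors mean the request itself is wrong; retrying will not help.
            if last or not 500 <= e.code < 600:
                raise
        except urllib.error.URLError:
            # Connection-level failure (e.g. DNS, connection refused); treat as transient.
            if last:
                raise
        # Exponential backoff plus jitter so concurrent runners spread out their retries.
        sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The opener and sleep parameters exist only so the function can be exercised without network access or real delays; in the runner they would be left at their defaults.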
Impact
Empty merge queue for ~3 hours despite ample Modal A100 quota: the max_containers cap of 20 was held by phantom containers that had received a 502 during registration but whose slots Modal still counted.