Registration token endpoint returns 502 in bursts; ephemeral runners die before registering #4399

@vbtcl

Description

What

POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token returns 502 Bad Gateway under bursts of concurrent requests, with no clear retry hint. Self-hosted runners spawned from an autoscaler (in our case Modal-hosted ephemeral GHA runners) fail to register and exit immediately.

Observed 2026-05-04 ~16:00–17:00 UTC. ~21 ephemeral containers died mid-registration during a webhook-driven spawn burst, requiring manual cleanup of phantom Modal containers.

Repro

  1. Configure many ephemeral runners to register on workflow_job.queued webhook (we do ~20 concurrent at peak).
  2. Observe the runner-side stack trace from _mint_registration_token:
     File "/root/app.py", line 247, in _mint_registration_token
         with urllib.request.urlopen(req, timeout=15) as resp:
       ...
     urllib.error.HTTPError: HTTP Error 502: Bad Gateway
  3. The container's runner.sh fails before config.sh registers; the slot is wasted.
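
For context, here is a minimal sketch of how the `req` object in the traceback could be built. Our actual app.py is not reproduced here, so the helper name `build_registration_token_request` and its parameters are hypothetical; the URL is the endpoint named above, and the headers are the ones GitHub's REST API documentation recommends. The function only constructs the request and never touches the network.

```python
import urllib.request

API_ROOT = "https://api.github.com"


def build_registration_token_request(owner: str, repo: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that mints a runner registration token.

    `token` is assumed to be a credential with permission to manage
    self-hosted runners for the repository.
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",  # the endpoint takes no request body
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=15)` is the call that raises the 502 in the traceback above.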

What's missing on the runner side

The _mint_registration_token flow in the reference runner script doesn't retry on 5xx responses, so a single transient 502 from api.github.com kills the registration. For ephemeral spawns this is particularly painful because each failed registration burns a Modal container slot.

Suggested fix (runner-side)

Add bounded retry-with-backoff on 5xx responses to the registration-token call in the example runner setup script and document the pattern. Something like:

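# Assumes json, time, urllib.error and urllib.request are already imported.
for attempt in range(4):  # up to 4 attempts in total
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as e:
        # Retry only server-side (5xx) errors, and only while attempts remain.
        if 500 <= e.code < 600 and attempt < 3:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s
            continue
        raise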

The github.com side may also want to investigate the 502 burst pattern, but a retry on the runner side dramatically reduces user-visible impact regardless.
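
Because the failures arrive in bursts of ~20 concurrent requests, a fixed 1s/2s/4s schedule could make every container retry at the same instant. Below is a sketch of one possible variant that adds random ("full") jitter to the backoff. The names `backoff_delay` and `mint_with_retry`, and the injectable `opener` and `sleep` parameters (there only to make the function testable without network access), are illustrative assumptions, not part of any existing runner script.

```python
import json
import random
import time
import urllib.error
import urllib.request


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def mint_with_retry(req, max_attempts=4, opener=urllib.request.urlopen, sleep=time.sleep):
    """Send `req` and return the parsed JSON body.

    5xx responses are retried with jittered exponential backoff; any other
    HTTP error, or a 5xx on the final attempt, is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            with opener(req, timeout=15) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as e:
            retryable = 500 <= e.code < 600
            if not retryable or attempt == max_attempts - 1:
                raise
            # Spread concurrent retries out instead of retrying in lockstep.
            sleep(backoff_delay(attempt))
```

With the defaults this waits at most 1s + 2s + 4s across the three retries, the same upper bound as the snippet above, while spreading concurrent callers across that window.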

Impact

Empty merge queue for ~3 hours despite ample Modal A100 quota; the max_containers cap of 20 was held by phantom containers that hit a 502 during registration but whose slots Modal still counted.