Skip to content

fix: add retry logic for GitHub API calls in scale-down Lambda#1

Closed
shivdesh wants to merge 2 commits into
isovalent:pr/bl/nested-virtfrom
shivdesh:fix/scale-down-retry-logic
Closed

fix: add retry logic for GitHub API calls in scale-down Lambda#1
shivdesh wants to merge 2 commits into
isovalent:pr/bl/nested-virtfrom
shivdesh:fix/scale-down-retry-logic

Conversation

@shivdesh

@shivdesh shivdesh commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Add exponential backoff retry for transient GitHub API failures (5xx, 429) when de-registering runners during scale-down operations.

Problem

When the scale-down Lambda attempts to de-register a runner from GitHub, transient API failures (e.g., 502 Server Error) cause the operation to fail. The current code catches the error, logs it, but still terminates the EC2 instance. This leaves stale/offline runner entries in the GitHub org settings.

Solution

  • Add withRetry() helper with configurable max retries (3) and exponential backoff delays (1s, 2s, 4s)
  • Wrap deleteSelfHostedRunnerFromOrg/Repo calls with retry logic
  • Only terminate EC2 instance if GitHub de-registration succeeds
  • If de-registration fails after all retries, leave the instance running so the next scale-down cycle can retry

Changes

  • lambdas/functions/control-plane/src/scale-runners/scale-down.ts:
    • Added RETRY_CONFIG, sleep(), isRetryableError(), and withRetry() helper functions
    • Added deleteGitHubRunner() wrapper that uses retry logic
    • Modified removeRunner() to only terminate EC2 if all GitHub de-registrations succeed

Testing

  • Tested locally with simulated 502 errors
  • Retry logic handles 5xx errors and 429 rate limiting

@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from cc9ce66 to a4416d7 Compare March 9, 2026 20:47
@sekhar-isovalent

Copy link
Copy Markdown
Collaborator

Could you upstream this change ?

@sekhar-isovalent sekhar-isovalent self-requested a review March 10, 2026 00:42

@sekhar-isovalent sekhar-isovalent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try and research open issues upstream and fix that if found and we can sync the change to our module.

@brilong

brilong commented Mar 10, 2026

Copy link
Copy Markdown

FYI, in the spirit of upstreaming changes, my upstream PR for nested virtualization is almost approved. Working on one more tweak.

@brilong

brilong commented Mar 10, 2026

Copy link
Copy Markdown

Also, upstream recently released v7.4.1. We might want to compare and integrate their changes into our fork until we can get our upstream tweaks included.

@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from a4416d7 to 8b5e8ef Compare March 10, 2026 16:46
@shivdesh

Copy link
Copy Markdown
Collaborator Author

upstream change: github-aws-runners#5061
@sekhar-isovalent @brilong

@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch 5 times, most recently from aa2e75b to 1f66402 Compare March 13, 2026 21:07
When the scale-down Lambda fails to de-register a runner from GitHub
(even after automatic retries via @octokit/plugin-retry), the EC2
instance should NOT be terminated. This prevents stale runner entries
in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for
automatic retries. While that handles transient failures, this ensures
that if de-registration ultimately fails, we don't leave orphaned
GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle
- Rename githubAppClient to githubInstallationClient for clarity
- Refactor to split owner/repo once instead of multiple times
- Fix error logging to handle non-Error objects properly

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries
at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
…ed maximum

When pool and scale-up lambdas run concurrently, currentRunners can
temporarily exceed maximumRunners. This caused the calculation
`maximumRunners - currentRunners` to produce a negative value, which
was then passed to EC2 CreateFleet API, resulting in:

  InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative.

This fix wraps the calculation with Math.max(0, ...) to ensure we never
attempt to create a negative number of runners.

Fixes race condition between pool-lambda and scale-up-lambda.
@shivdesh shivdesh force-pushed the fix/scale-down-retry-logic branch from 1f66402 to 7a368bc Compare March 27, 2026 17:35
@shivdesh shivdesh closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants