fix: add retry logic for GitHub API calls in scale-down Lambda#1
Closed
shivdesh wants to merge 2 commits into
Conversation
cc9ce66 to
a4416d7
Compare
Collaborator
Could you upstream this change?
sekhar-isovalent
requested changes
Mar 10, 2026
sekhar-isovalent
left a comment
Collaborator
Let's research open issues upstream first. If we find a matching one, we can fix it there and then sync the change to our module.
FYI, in the spirit of upstreaming changes, my upstream PR for nested virtualization is almost approved. Working on one more tweak.
Also, upstream recently released v7.4.1. We might want to compare and integrate their changes into our fork until we can get our upstream tweaks included.
a4416d7 to
8b5e8ef
Compare
Collaborator
Author
Upstream change: github-aws-runners#5061
aa2e75b to
1f66402
Compare
When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings.

This change complements PR github-aws-runners#4990, which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle
- Rename githubAppClient to githubInstallationClient for clarity
- Refactor to split owner/repo once instead of multiple times
- Fix error logging to handle non-Error objects properly

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
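The gating described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR. The `DeleteFn` and `TerminateFn` types, the parameter lists, and the boolean return values are assumptions made so the example is self-contained. Only the function names `deleteGitHubRunner` and `removeRunner` come from the PR text.

```typescript
// Hypothetical callback types, introduced only for this sketch.
type DeleteFn = (runnerId: number) => Promise<void>;
type TerminateFn = (instanceId: string) => Promise<void>;

// Try to de-register one runner. Errors are caught per runner and reported
// as `false` instead of being rethrown.
async function deleteGitHubRunner(runnerId: number, deleteFn: DeleteFn): Promise<boolean> {
  try {
    await deleteFn(runnerId);
    return true;
  } catch (e) {
    // Thrown values are not always Error instances, so normalise before logging.
    const message = e instanceof Error ? e.message : String(e);
    console.warn(`Failed to de-register runner ${runnerId}: ${message}`);
    return false;
  }
}

// Terminate the EC2 instance only if every de-registration succeeded.
// Returns true when the instance was terminated.
async function removeRunner(
  instanceId: string,
  runnerIds: number[],
  deleteFn: DeleteFn,
  terminateFn: TerminateFn,
): Promise<boolean> {
  const results = await Promise.all(runnerIds.map((id) => deleteGitHubRunner(id, deleteFn)));
  if (results.every((ok) => ok)) {
    await terminateFn(instanceId);
    return true;
  }
  // Leave the instance running so the next scale-down cycle can try again.
  return false;
}
```

Passing the delete and terminate operations in as callbacks keeps the sketch independent of Octokit and the AWS SDK, which also makes the "do not terminate on failure" path easy to exercise with stubs.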
…ed maximum

When pool and scale-up lambdas run concurrently, currentRunners can temporarily exceed maximumRunners. This caused the calculation `maximumRunners - currentRunners` to produce a negative value, which was then passed to the EC2 CreateFleet API, resulting in:

InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative.

This fix wraps the calculation with Math.max(0, ...) to ensure we never attempt to create a negative number of runners.

Fixes race condition between pool-lambda and scale-up-lambda.
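A minimal sketch of the clamp described in this commit message. The helper name `runnersToCreate` is hypothetical. Only the expression `Math.max(0, maximumRunners - currentRunners)` is taken from the text, and any special handling the real code applies to `maximumRunners` is not modelled here.

```typescript
// Hypothetical helper: how many runners may still be created without
// exceeding the configured maximum. Never returns a negative number.
function runnersToCreate(maximumRunners: number, currentRunners: number): number {
  return Math.max(0, maximumRunners - currentRunners);
}

// Normal case: 3 slots left.
const remaining = runnersToCreate(10, 7); // 3

// Race case: the pool lambda already pushed the count past the maximum,
// so the result is clamped to 0 instead of -2.
const clamped = runnersToCreate(10, 12); // 0
```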
1f66402 to
7a368bc
Compare
Summary
Add exponential backoff retry for transient GitHub API failures (5xx, 429) when de-registering runners during scale-down operations.
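The retry behaviour summarised here could look roughly like the sketch below. The names `RETRY_CONFIG`, `sleep()`, `isRetryableError()`, and `withRetry()` appear in this PR description, but their bodies, the `baseDelayMs` field, and the injectable `sleepFn` parameter are assumptions for illustration. The sketch also assumes errors carry a numeric `status` property, as Octokit's `RequestError` does. Note that the later commit message in this PR says the final revision relies on @octokit/plugin-retry instead of custom retry logic.

```typescript
// Assumed error shape: an HTTP status code on the thrown value.
interface HttpError {
  status?: number;
}

// 3 retries with a 1s base delay gives waits of 1s, 2s, 4s.
const RETRY_CONFIG = { maxRetries: 3, baseDelayMs: 1000 };

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry only on rate limiting (429) and server errors (5xx).
function isRetryableError(error: unknown): boolean {
  const status = (error as HttpError | null)?.status;
  return typeof status === "number" && (status === 429 || status >= 500);
}

// Run `fn`, retrying retryable failures with exponentially growing delays.
// `sleepFn` is injectable so tests do not have to wait in real time.
async function withRetry<T>(
  fn: () => Promise<T>,
  config = RETRY_CONFIG,
  sleepFn: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up when retries are exhausted or the error is not transient.
      if (attempt >= config.maxRetries || !isRetryableError(error)) throw error;
      await sleepFn(config.baseDelayMs * Math.pow(2, attempt));
    }
  }
}
```

A call site would wrap the de-registration request, for example `await withRetry(() => deleteRunner(id))`, where `deleteRunner` stands in for whichever Octokit call is being retried.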
Problem
When the scale-down Lambda attempts to de-register a runner from GitHub, transient API failures (e.g., 502 Server Error) cause the operation to fail. The current code catches the error and logs it, but still terminates the EC2 instance. This leaves stale/offline runner entries in the GitHub org settings.
Solution
- withRetry() helper with configurable max retries (3) and exponential backoff delays (1s, 2s, 4s)
- deleteSelfHostedRunnerFromOrg/Repo calls with retry logic
Changes
lambdas/functions/control-plane/src/scale-runners/scale-down.ts:
- RETRY_CONFIG, sleep(), isRetryableError(), and withRetry() helper functions
- deleteGitHubRunner() wrapper that uses retry logic
- removeRunner() to only terminate EC2 if all GitHub de-registrations succeed
Testing