
fix(runners): wire job_retry.lambda_memory_size and lambda_timeout #5120

Merged: Brend-Smits merged 2 commits into github-aws-runners:main from oscarbc96:fix/job-retry-lambda-memory-and-timeout on Jun 11, 2026


@oscarbc96 oscarbc96 commented May 10, 2026

Contributor

Description

Both var.job_retry declarations (in modules/multi-runner/variables.tf and modules/runners/variables.tf) document lambda_memory_size and lambda_timeout as configuration fields, but local.job_retry in modules/runners/job-retry.tf never copies either field into the config map passed to the inner job-retry / lambda sub-modules. The inner lambda module then falls back to its own defaults (memory_size = 256, timeout = 60), so user-supplied values are silently dropped: tofu plan shows no diff and the deployed Lambda keeps its defaults.

The fix is a two-line addition to the local.job_retry map. It mirrors the pattern modules/runners/ssm-housekeeper.tf already uses for local.ssm_housekeeper.lambda_memory_size / local.ssm_housekeeper.lambda_timeout — that Lambda correctly threads the values through.
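A minimal sketch of what the two added lines might look like, assuming the key names memory_size and timeout given in the review overview. The other keys of local.job_retry are elided, and the actual expression in modules/runners/job-retry.tf may be shaped differently (for example, built with merge()).

```hcl
# modules/runners/job-retry.tf (sketch only; existing keys are elided)
locals {
  job_retry = {
    # ... existing keys unchanged ...

    # Forward the user-supplied values so the inner lambda module
    # no longer falls back to its own defaults.
    memory_size = var.job_retry.lambda_memory_size
    timeout     = var.job_retry.lambda_timeout
  }
}
```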

Motivation

Discovered in production: I pinned lambda_memory_size = 512 in multi_runner_config[*].runner_config.job_retry after observing the job-retry Lambdas at 87% memory utilisation (223 MB peak on the 256 MB default), and got No changes from tofu plan. Tracing the wiring confirmed the value never reaches the resource.

Reproduction

module "runners" {
  source  = "github-aws-runners/github-runner/aws//modules/multi-runner"
  version = "7.6.0"
  #
  multi_runner_config = {
    "example" = {
      matcherConfig = { … }
      runner_config = merge(local.default_config, {
        # … other config …
        job_retry = {
          enable             = true
          lambda_memory_size = 512  # ← silently ignored before this PR
          lambda_timeout     = 60   # ← silently ignored before this PR
        }
      })
    }
  }
}

After this fix, tofu plan shows the expected memory_size: 256 -> 512 change on the job-retry Lambda.

Verification

  • tofu fmt clean.
  • The variable type definitions in both modules/runners/variables.tf and modules/multi-runner/variables.tf already declare lambda_memory_size = optional(number, 256) and lambda_timeout = optional(number, 30), so the public surface does not change.
  • The inner modules/runners/modules/lambda accepts memory_size and timeout on its lambda input object (with the same defaults), so when the wiring is restored the values flow through naturally.
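
For reference, the declaration described in the second bullet could look roughly like the sketch below. Only the two lambda_* attributes and their defaults are taken from this description. The default shown for enable, the outer default = {}, and the omission of all other attributes are assumptions made for illustration.

```hcl
# Sketch of the job_retry variable type (not copied from the module)
variable "job_retry" {
  type = object({
    enable             = optional(bool, false) # default is an assumption
    lambda_memory_size = optional(number, 256) # as stated above
    lambda_timeout     = optional(number, 30)  # as stated above
    # ... other attributes elided ...
  })
  default = {} # assumption
}
```

Because both attributes are optional with defaults, callers that omit them still get concrete values, which is why wiring them through needs no change on the caller side.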

No impact when not set

Defaults remain memory_size = 256 (per the variable declaration) and timeout = 30 — same as today's effective values when nothing is overridden.

The job_retry variable on both the multi-runner and runners modules
declares lambda_memory_size and lambda_timeout, but the
local.job_retry map in modules/runners/job-retry.tf never copied
either field into the config passed to the inner job-retry / lambda
sub-modules. The inner lambda module fell back to its defaults
(memory_size = 256, timeout = 60), so user-supplied values were
silently dropped.

Mirrors the pattern already used by ssm-housekeeper.tf
(local.ssm_housekeeper.lambda_memory_size /
local.ssm_housekeeper.lambda_timeout) — the ssm-housekeeper Lambda
correctly threads the values through; the job-retry one didn't.

Observed in production: a deployment pinned to lambda_memory_size = 512
in multi_runner_config[*].runner_config.job_retry produced no plan
diff because the value never reached the resource. The job-retry
Lambdas were OOM-adjacent at 87% memory utilisation (223 MB peak on
the 256 MB default) on a fleet of three runners.
@oscarbc96 oscarbc96 requested a review from a team as a code owner May 10, 2026 10:43
@Brend-Smits Brend-Smits requested a review from Copilot June 11, 2026 07:36

Copilot AI left a comment

Contributor


Pull request overview

This PR fixes configuration wiring in the modules/runners Terraform module so that user-provided job_retry.lambda_memory_size and job_retry.lambda_timeout are actually passed into the internal job-retry/Lambda submodule rather than being silently dropped.

Changes:

  • Thread var.job_retry.lambda_memory_size through as memory_size in local.job_retry.
  • Thread var.job_retry.lambda_timeout through as timeout in local.job_retry.


Comment thread modules/runners/job-retry.tf

@Brend-Smits Brend-Smits left a comment

Contributor


Looks good, thanks for the fix!

@Brend-Smits Brend-Smits merged commit 404785e into github-aws-runners:main Jun 11, 2026
41 checks passed
Brend-Smits pushed a commit that referenced this pull request Jun 11, 2026
🤖 I have created a release *beep* *boop*
---


## [7.7.0](v7.6.1...v7.7.0) (2026-06-11)


### Features

* Add feature to enable dynamic ec2 config via workflow labels
([#5003](#5003))
([c68445d](c68445d))
* add support for macos runners
([#4930](#4930))
([3e179a3](3e179a3))
* Introduce Amazon Linux 2023 ARM image
([#4780](#4780))
([e572ae5](e572ae5))
* relax cpu_options schema and add amd_sev_snp + nested_virtualization
support
([#5039](#5039))
([5a3746d](5a3746d))
* **runner-role:** Enable using separate IAM role for runners
([#4875](#4875))
([6642e57](6642e57))


### Bug Fixes

* **ci:** sign auto-generated docs commits
([#5154](#5154))
([a6af4d2](a6af4d2))
* **runners:** wire job_retry.lambda_memory_size and lambda_timeout
([#5120](#5120))
([404785e](404785e))
* **scale-up:** Add ec2:TerminateInstances permission to scale-up Lambda
IAM policy
([#5152](#5152))
([94c4e12](94c4e12))
* **scale-up:** prevent negative TotalTargetCapacity when runners exceed
maximum
([#5062](#5062))
([9ab7410](9ab7410))
* **webhook:** Fix publish events to EventBridge
([#5143](#5143))
([a72b737](a72b737))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
