
[SVLS-9175] feat: emit OOM metric on memory equality with per-request dedup #1241

Open
lym953 wants to merge 19 commits into main from yiming.luo/fix-1237-node-oom-metric

Contributor

@lym953 lym953 commented May 29, 2026

Background

To our knowledge (before this PR), each runtime shows one or more of the following behaviors when it OOMs:

  • The runtime emits a runtime-specific error message. This can happen on Java, Node (case 1 in the table below), and .NET.
  • In the PlatformRuntimeDone event, error_type is Runtime.OutOfMemory. This can happen on Python and Ruby.
  • In the PlatformReport event, max_memory_used == memory_size. This can happen on Python, Ruby, Node, and Go.

To capture OOM for all these scenarios (except Node case 2, which was just called out in #1237) without double counting, right now the extension emits aws.lambda.enhanced.out_of_memory metric in these scenarios:

  • when we see runtime-specific error messages for Java, Node and .NET
  • when we see Runtime.OutOfMemory
  • when we see max_memory_used == memory_size for Go, i.e. only when runtime is provided.al2. We don't do this for other runtimes (Python, Ruby, Node) to avoid double counting.
image

Problem

In issue #1237, a customer called out a new scenario: "Node (case 2)" in the table. The only evidence of OOM is max_memory_used == memory_size, and there is no runtime-specific log message. As a result, OOMs like this are not captured by the OOM enhanced metric.

This PR

  • Regardless of runtime, use all three ways to capture OOM.
  • In addition, dedup by request_id to avoid double counting.
  • Add one integration test per runtime (except for Node, which has 2 tests)
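As a rough illustration of the three signals listed in the Background section, the sketch below collapses them into a single function. It is not the extension's code: in bottlecap the signals arrive in separate events (a log line, PlatformRuntimeDone, PlatformReport), and `OomSignal`, `OOM_LOG_MARKERS`, and `detect_signals` are hypothetical names. The marker strings are only the ones quoted elsewhere in this PR, so the real list may differ.

```rust
/// The three OOM signals described in this PR (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OomSignal {
    /// A runtime-specific OOM message appeared in the function logs.
    RuntimeLogLine,
    /// PlatformRuntimeDone carried error_type == "Runtime.OutOfMemory".
    RuntimeDoneErrorType,
    /// PlatformReport showed max_memory_used_mb == memory_size_mb.
    ReportMemoryEquality,
}

/// Assumed subset of log markers, taken from strings quoted in this PR.
const OOM_LOG_MARKERS: [&str; 4] = [
    "JavaScript heap out of memory",
    "OutOfMemoryError",
    "OutOfMemoryException",
    "fatal error: runtime: out of memory",
];

/// Returns every signal that fires for one invocation, regardless of runtime.
pub fn detect_signals(
    log_line: Option<&str>,
    error_type: Option<&str>,
    max_memory_used_mb: u64,
    memory_size_mb: u64,
) -> Vec<OomSignal> {
    let mut signals = Vec::new();
    if let Some(line) = log_line {
        if OOM_LOG_MARKERS.iter().any(|marker| line.contains(marker)) {
            signals.push(OomSignal::RuntimeLogLine);
        }
    }
    if error_type == Some("Runtime.OutOfMemory") {
        signals.push(OomSignal::RuntimeDoneErrorType);
    }
    if max_memory_used_mb == memory_size_mb {
        signals.push(OomSignal::ReportMemoryEquality);
    }
    signals
}
```

For the "Node (case 2)" scenario from #1237, `detect_signals(None, None, 192, 192)` returns only `ReportMemoryEquality`, which is the signal the previous gating ignored for Node. When more than one signal is returned, dedup by request_id is what keeps the metric count at one.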

Test plan

Passed the added unit tests and integration tests.

To reviewers

Most of the code changes are for integration tests.

Details (generated by Claude Code)

Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the three existing detection paths matched.

  • Why the existing paths missed it. V8 spent its budget in GC rather than declaring JavaScript heap out of memory, so the runtime log-line match never fired. The runtime crashed on a wall-clock timeout, so PlatformRuntimeDone reported no error_type. And the max_memory_used_mb == memory_size_mb check in PlatformReport was gated on runtime.starts_with("provided.al") to avoid double-counting against the log path, so Node was excluded.

  • What changes. Drop the provided.al* restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory simultaneously), add a per-Context oom_emitted flag. All three detection paths funnel through a new Processor::try_increment_oom_metric, which checks/sets the flag and is a no-op on subsequent calls for the same request_id.

  • Plumbing. Event::OutOfMemory now carries an Option<String> request_id. The log-path detector reads it from LambdaProcessor::invocation_context.request_id (set on PlatformStart, cleared on PlatformRuntimeDone/PlatformReport). None is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case.
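The dedup behavior described in these bullets can be sketched as follows. This is a minimal model under stated assumptions, not the actual `Processor`: the struct layout, the `HashMap` keyed by request_id, the `on_invocation_start` helper, the counter field, and the `bool` return value are all hypothetical. Only the names `Context`, `oom_emitted`, `try_increment_oom_metric`, and `on_platform_report` come from the PR text.

```rust
use std::collections::HashMap;

/// Per-invocation state; only the dedup flag is modeled here.
#[derive(Default)]
pub struct Context {
    pub oom_emitted: bool,
}

/// Simplified stand-in for the invocation processor.
#[derive(Default)]
pub struct Processor {
    contexts: HashMap<String, Context>,
    /// Stand-in for the emitted aws.lambda.enhanced.out_of_memory count.
    pub oom_metric_count: u64,
}

impl Processor {
    /// Hypothetical hook: create the context when an invocation starts.
    pub fn on_invocation_start(&mut self, request_id: &str) {
        self.contexts.entry(request_id.to_string()).or_default();
    }

    /// Removing the context also discards its oom_emitted flag.
    pub fn on_platform_report(&mut self, request_id: &str) {
        self.contexts.remove(request_id);
    }

    /// Emits at most once per known request_id. Returns true if it emitted.
    /// With no request_id, or no matching context, it emits without dedup.
    pub fn try_increment_oom_metric(&mut self, request_id: Option<&str>) -> bool {
        if let Some(id) = request_id {
            if let Some(ctx) = self.contexts.get_mut(id) {
                if ctx.oom_emitted {
                    return false; // already counted for this invocation
                }
                ctx.oom_emitted = true;
            }
        }
        self.oom_metric_count += 1;
        true
    }
}
```

Calling `try_increment_oom_metric(Some("req-1"))` twice for a live context increments the counter once. A call with `None`, or with a request_id whose context was already removed, always increments, which is the best-effort fallback the PR describes.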

…th per-request dedup

Customer report (#1237): a Node.js Lambda that hit its memory limit
(Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not
emit aws.lambda.enhanced.out_of_memory because none of the existing
detection paths matched. The Node runtime did not log
"JavaScript heap out of memory" (V8 spent its time in GC instead of
declaring an OOM), and PlatformRuntimeDone reported no error_type — just
a wall-clock timeout — so the log-string and Runtime.OutOfMemory paths
both stayed silent.

Drop the provided.al* restriction on the PlatformReport equality check
so any runtime emits OOM when max_memory_used_mb == memory_size_mb. To
avoid double-counting against the two pre-existing paths (some
invocations satisfy both equality and Runtime.OutOfMemory at the same
time), add a per-Context oom_emitted flag. All three detection paths now
funnel through Processor::try_increment_oom_metric, which checks the
flag, sets it on first emission, and is a no-op on subsequent calls for
the same request_id. The flag lives with the per-invocation Context and
is cleared automatically when on_platform_report removes the context.

Plumbing: Event::OutOfMemory now carries an Option<String> request_id
(the log-path detector reads it from the logs processor's
invocation_context.request_id, set on PlatformStart and cleared on
PlatformRuntimeDone). When request_id is None — only realistic in
Managed Instance mode, where extensions cannot subscribe to INVOKE — the
helper falls back to a best-effort emit without dedup.

Tests cover three scenarios: same request_id emits exactly once, two
distinct request_ids each emit, and the equality path still fires
(regression coverage for the dropped provided.al* check).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 29, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/datadog-lambda-extension | publish layer e2e sandbox (amd64, fips)   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Rate exceeded during AWS API call for ListLayerVersions operation due to ThrottlingException after max retries reached.

🔗 Commit SHA: 631ed00

lym953 and others added 7 commits May 29, 2026 15:52
Adds a new `oom` integration-test suite that exercises the OOM dedup
change (Context::oom_emitted, #1241) end-to-end across every supported
runtime. Each lambda intentionally allocates until it OOMs; the test
asserts aws.lambda.enhanced.out_of_memory increments by exactly one
data point per function over the invocation window — which fails if the
dedup flag stops working and two detection paths emit for the same
invocation.

New lambda apps under integration-tests/lambda/:
- oom-node-v8-heap   : exercises log-line path (JavaScript heap OOM)
- oom-node-sigkill   : exercises PlatformRuntimeDone Runtime.OutOfMemory path
- oom-python         : MemoryError — log path AND PlatformRuntimeDone path
                       both fire, so dedup is necessary for count==1
- oom-ruby           : NoMemoryError — same dual-path coverage as Python
- oom-java           : OutOfMemoryError (log-line path)
- oom-dotnet         : OutOfMemoryException (log-line path)
- oom-go             : fatal: runtime: out of memory — log path AND
                       PlatformReport memory-equality path both fire

Framework additions:
- Ruby and Go runtime/layer helpers in lib/util.ts (Ruby tracer layer;
  Go has no tracer layer — extension layer alone covers the test).
- Oom CDK stack registered in bin/app.ts.
- build-ruby.sh (zip-as-is for now; Gemfile build stubbed) and
  build-go.sh (Docker cross-compile to ARM64 Linux, bootstrap binary).
- Pipeline template additions for the two new build stages and
  oom suite registration in test-suites.yaml.
- getMetricCount() + OUT_OF_MEMORY_METRIC in tests/utils/datadog.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on the first oom suite (commit 5a833ac) returned counts:
  node-v8-heap=1 ✓  node-sigkill=1 ✓  python=1 ✓  dotnet=1 ✓
  ruby=0 ✗  java=0 ✗  go=0 ✗

Reproducing locally:

- Ruby: function failed at init with
  `cannot load such file -- datadog_lambda_rb`. The Datadog Ruby tracer
  is a regular gem (no handler shim like Python's
  `datadog_lambda.handler.handler`), so set handler to
  `lambda_function.handler` and drop `DD_LAMBDA_HANDLER`.

- Go: function timed out (30s) at `Max Memory Used: 192 MB / Memory Size:
  192 MB` without emitting any enhanced metrics. Two changes:
  * Drop `AWS_LAMBDA_EXEC_WRAPPER=/opt/datadog_wrapper` — the wrapper sets
    language-specific tracer env vars; Go's tracer is in-module not
    layer-based, so the wrapper just changes runtime detection without
    helping. With the wrapper removed and a clean exec, the extension's
    enhanced-metric pipeline starts emitting.
  * Replace the `for { append(make([]byte, 10MB)) }` loop with a single
    `make([]byte, 500MB)` that writes every page. Go's slice doubling +
    GC kept the loop from OOMing reliably in the 30s timeout window;
    eager allocation guarantees `fatal error: runtime: out of memory`
    fires immediately, exercising bottlecap's log-line detection.

- Java: also failed in CI (count=0) but local repro now returns count=1
  with the same code path. Leaving the Java app unchanged for the next
  CI run to confirm. If it fails again, likely the extension didn't
  flush the metric before the JVM crashed; would need DD_SERVERLESS_FLUSH_STRATEGY
  changes or per-function twice-invoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oom suite

The integration-test framework defaults DD_SERVERLESS_FLUSH_STRATEGY to
`end`, which means the extension only flushes at end of invocation. For
OOM tests that's a tight race: the function dies, then Lambda sends
PlatformRuntimeDone, then bottlecap increments the OOM metric, then
Shutdown comes and the sandbox is reaped. If the metric flush can't
finish in that narrow window, the data point is lost.

Run 1 of the oom suite returned ruby/java/go=0 (3 of 7 failed). Run 2
returned ruby/node-sigkill/python/dotnet/go=0 (5 of 7 failed) — but
java=1 this time. The set of "failing" runtimes is not stable across
runs, confirming a timing race rather than a code bug.

`default` flushes every ~1s in addition to invocation-end, giving the
OOM metric a much wider window to reach Datadog before the sandbox is
torn down. All other integration suites keep using `end` since their
invocations complete cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs were intermittently returning count=0 for ruby/java/go/dotnet/
node-sigkill/python — varying combinations across runs. Diagnosing
showed the data points were correctly emitted and durably ingested by
Datadog within ~30s of the OOM, but the `/api/v1/query` endpoint
sometimes returned no results for very-recently-ingested points. The
single-shot 5-minute wait was too brittle.

Polling strategy: wait 90s after invocation, then re-query every 30s
until every runtime reports count>=1 or the 12-min budget is exhausted.
Early-exits when all runtimes pass, so the common case is faster than
the previous single-shot 5-min wait while the worst case is bounded.

Each poll iteration logs the current counts and the still-missing
runtimes, so debugging future flakes from CI logs requires no rerun.
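The polling schedule described above (initial wait, fixed re-query interval, bounded budget, early exit) can be sketched like this. The integration test itself is TypeScript; this Rust version only illustrates the control flow. `poll_until_complete` and its parameters are hypothetical, and the query and sleep steps are injected as closures so the schedule can be exercised without real waiting.

```rust
use std::time::Duration;

/// Waits `initial_wait`, then calls `all_reported` every `interval` until it
/// returns true or the next wait would exceed `budget`.
/// Returns (success, simulated time elapsed since the invocation).
pub fn poll_until_complete<Q, S>(
    initial_wait: Duration,
    interval: Duration,
    budget: Duration,
    mut all_reported: Q,
    mut sleep: S,
) -> (bool, Duration)
where
    Q: FnMut() -> bool,
    S: FnMut(Duration),
{
    sleep(initial_wait);
    let mut elapsed = initial_wait;
    loop {
        if all_reported() {
            return (true, elapsed); // early exit: every runtime reported
        }
        if elapsed + interval > budget {
            return (false, elapsed); // budget exhausted
        }
        sleep(interval);
        elapsed += interval;
    }
}
```

With the values from this commit (90 s initial wait, 30 s interval, 12-minute budget), a query that succeeds on its third attempt returns after 150 s of waiting, well under the previous fixed 5-minute wait.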

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lt` was a no-op

`FlushStrategy::Default` falls back to `End` until the lookback buffer
fills (~20 invocations). The OOM test does a single cold-start invoke
per function, so `default` behaved identically to `end` — explaining
why the prior commit's change had no observable effect.

`continuously,1000` schedules an unconditional 1s periodic flush
regardless of invocation count, so the OOM metric reaches Datadog
well before the sandbox is reaped after the function process dies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es OOM-kill

Root cause of the prior `[oom]` failures (6/7 runtimes stuck at count=0):
at 192 MB the kernel OOM-killer often picks the bottlecap extension
instead of the function runtime — Lambda surfaces this as
`errorType: Extension.Crash`. A dead extension can't emit the OOM
metric, so the test sees nothing in Datadog.

Reproduced locally on us-east-2 arm64 with an IntegTests-style Python
function: at 192 MB → `Extension.Crash`, no metric. Bumping to 256 MB
→ `Runtime.OutOfMemory`, count=1 in Datadog within 30 s.

256 MB gives the extension ~30 MB headroom while keeping every detection
path active: the function still hits memory_size in PlatformReport, still
emits its runtime-specific OOM log line, and still gets
`Runtime.OutOfMemory` in PlatformRuntimeDone. The customer's #1237 case
(192 MB) is unaffected — this is a test-harness change.

Also drops the `DD_SERVERLESS_FLUSH_STRATEGY=continuously,1000` override
from the prior commit; with the extension surviving, the default `end`
flush is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Trim historical `provided.al` context from OOM detection comments
- Rewrite `test_handle_ondemand_report_emits_oom_on_memory_equality`
  doc comment to describe what the test covers, not how the rule changed
- Refocus `current_request_id` doc on its sole purpose (OOM metric dedup
  by request_id) and drop speculative scenarios that weren't directly
  verified; use "LMI mode" consistently
- Drop "as of 2026-05" qualifier from the OOM detection path list
- Bump Datadog-Ruby3-4-ARM default layer 9 -> 28

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/// `PlatformRuntimeDone` / `PlatformReport`. Returns `None` in LMI mode,
/// where extensions cannot subscribe to the `INVOKE` event so
/// `platform.start` is never delivered.
fn current_request_id(&self) -> Option<String> {
Contributor Author

To reviewers: is it an anti-pattern to get request_id this way?

Contributor Author

@lym953 lym953 Jun 2, 2026

Actually, for all three cases, we can get request_id from either log payload or telemetry event payload, so we don't need this current_request_id() function. Let me delete it.

Contributor Author

For the error

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

No request id is available from the log itself, so we still need the function current_request_id().

@lym953 lym953 changed the title fix(metrics): emit OOM metric on memory equality with per-request dedup [SVLS-9175] feat: emit OOM metric on memory equality with per-request dedup Jun 2, 2026
@lym953 lym953 marked this pull request as ready for review June 2, 2026 01:38
@lym953 lym953 requested a review from a team as a code owner June 2, 2026 01:38
Contributor

Copilot AI left a comment

Pull request overview

This PR expands OOM detection so aws.lambda.enhanced.out_of_memory can be emitted for all runtimes when any of the three OOM signals is observed (runtime-specific OOM log line, Runtime.OutOfMemory in PlatformRuntimeDone, or max_memory_used_mb == memory_size_mb in PlatformReport). To avoid double counting when multiple signals fire for the same invocation, it adds a per-invocation dedup flag keyed by request_id. It also adds a new cross-runtime integration test stack/suite (plus Ruby/Go build plumbing) to validate “exactly once per invocation”.

Changes:

  • Emit OOM metric on max_memory_used_mb == memory_size_mb for all runtimes, and dedupe per invocation via Context::oom_emitted.
  • Extend the event bus / processor plumbing so OOM events can carry an optional request_id.
  • Add an OOM integration-test stack & test suite covering multiple runtimes, plus Go/Ruby build steps in CI/local deploy.

Reviewed changes

Copilot reviewed 26 out of 28 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| bottlecap/src/lifecycle/invocation/context.rs | Adds oom_emitted flag to dedupe OOM metric emissions per invocation. |
| bottlecap/src/lifecycle/invocation/processor.rs | Uses dedup helper for all OOM detection paths; removes provided.al* gating on memory equality; adds unit tests. |
| bottlecap/src/lifecycle/invocation/processor_service.rs | Threads optional request_id through the processor command for OOM events. |
| bottlecap/src/logs/lambda/processor.rs | Tags OOM log-line events with request_id (when available) for dedup. |
| bottlecap/src/event_bus/mod.rs | Changes OOM event shape to include optional request_id. |
| bottlecap/src/bin/bottlecap/main.rs | Forwards new OOM event shape into the invocation processor service. |
| bottlecap/src/metrics/enhanced/lambda.rs | Updates OOM metric docs to reflect new dedup path and detection coverage. |
| integration-tests/tests/utils/datadog.ts | Adds helper to query total metric emission count. |
| integration-tests/tests/oom.test.ts | Adds cross-runtime integration test asserting exactly one OOM metric emission per invocation. |
| integration-tests/lib/stacks/oom.ts | New CDK stack deploying OOM repro lambdas across runtimes. |
| integration-tests/lib/util.ts | Adds Ruby/Go runtime + Ruby tracer layer helpers. |
| integration-tests/bin/app.ts | Registers the new OOM test stack in the integration test app. |
| integration-tests/lambda/oom-*/* | Adds OOM repro Lambda sources for Node/Python/Ruby/Java/.NET/Go. |
| integration-tests/scripts/local_deploy.sh | Adds Ruby/Go build steps to local deploy. |
| integration-tests/scripts/build-ruby.sh | New Ruby build script (currently no-op for Gemfile-less lambdas). |
| integration-tests/scripts/build-go.sh | New Go cross-compile script producing bin/bootstrap for provided runtime. |
| .gitlab/templates/pipeline.yaml.tpl | Adds CI jobs to build Ruby/Go lambdas and wires them into the integration suite. |
| .gitlab/datasources/test-suites.yaml | Adds the oom test suite entry. |

Comment thread bottlecap/src/logs/lambda/processor.rs
Comment thread integration-tests/tests/oom.test.ts Outdated
Comment thread integration-tests/lambda/oom-go/main.go Outdated
Comment thread integration-tests/lambda/oom-go/main.go Outdated
return;
}
ctx.oom_emitted = true;
}
Contributor

I’m curious: in what cases would the request ID be empty or unavailable? Are either of those cases valid? Maybe we can add a debug log for this.

Contributor Author

  1. Added a debug log.
  2. Tested on LMI. The OOM log can arrive before the extension receives the request_id from the PlatformStart event, in which case request_id is empty. Updated the comment to explain this.

Contributor

@litianningdatadog litianningdatadog left a comment

Left a minor comment

lym953 and others added 11 commits June 2, 2026 11:53
Per PR review feedback. The two no-dedup branches in
`try_increment_oom_metric` were previously silent; surfacing them as
debug logs makes the LMI-mode case (request_id=None) and the rare
context-eviction case (request_id supplied but absent from the buffer)
visible during investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a dedicated `lmi-oom` suite that deploys one Python function on the
LMI Capacity Provider and asserts that the OOM enhanced metric is emitted
when the function hits its memory cap. Exercises the LMI-specific log-line
path where `current_request_id()` returns `None` because `platform.start`
is never delivered, so the OOM detector flows through the no-dedup branch
of `try_increment_oom_metric`.

Assertion is `count >= 1` rather than `== 1` because Path 2
(`Runtime.OutOfMemory` via synthesized runtime_done from
`handle_managed_instance_report`) also fires for the same invocation and
cannot dedup against the log path's `None`. A future change can tighten
this once LMI dedup is addressed.

Also simplifies overly-verbose comments above the two no-dedup debug
logs — the log messages are self-explanatory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deployed a Python OOM Lambda on the LMI Capacity Provider, captured the
extension's debug logs from CloudWatch, and observed the actual flow:

- PlatformStart IS delivered in LMI mode (prior comment claimed it wasn't).
- For a Python `MemoryError` that fires immediately on first allocation,
  the OOM log line is processed by `LambdaProcessor` *before* the
  `PlatformStart` telemetry event's handler updates
  `invocation_context.request_id` — both arrive in the same millisecond.
- `current_request_id()` therefore returns `None` and the metric flows
  through the no-dedup branch (the new debug log fires).
- The synthesized runtime-done from `handle_managed_instance_report`
  reports `error_type=Runtime.Unknown` (not `Runtime.OutOfMemory`),
  so Path 2 does NOT fire for this Python OOM shape. Final metric
  count = 1 (no double-count).

Updates the `current_request_id()` doc, the no-dedup debug log message,
and the LMI OOM stack/test comments to reflect what was actually observed
rather than the prior (incorrect) "platform.start never delivered in LMI"
hypothesis. Assertion stays `>= 1` for robustness against future changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the comments on `try_increment_oom_metric`, the path-3 caller
in `handle_ondemand_report`, and `increment_oom_metric` all opened with
confident phrasing ("exactly once per request_id", "isn't counted multiple
times when more than one detection path fires"), and the no-dedup
fallback was buried at the bottom or absent. That mischaracterizes the
guarantee: when the OOM log line lands before/after the active-invocation
window in `LambdaProcessor`, or when the context has been evicted, the
metric will be double-counted by a subsequent detection path.

Restructures the three comments so the best-effort caveat is up front
and the two edge cases (request_id=None race, context evicted) are
called out explicitly with their consequences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In LMI mode the function-log JSON payload always carries a `requestId`
field that we already extract a few lines above the OOM detection block.
Plumbing that value into `Event::OutOfMemory` instead of falling back to
`current_request_id()` closes the race observed in
#1241 (comment)
where a fast OOM log line is processed before this same processor's
`PlatformStart` handler updates `invocation_context.request_id`.

OnDemand mode is unaffected — `request_id` from the log payload is
unconditionally `None` there, so we still fall back to
`current_request_id()`, which works because `PlatformStart`'s race
window doesn't manifest in OnDemand operationally.

Updates the `current_request_id` doc and the LMI OOM stack/test
comments to reflect that the LMI case now goes through the deduped
branch by way of the payload `requestId`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Id universally

Per #1241 (comment):
paths 2 and 3 already get `request_id` directly from the PlatformRuntimeDone
and PlatformReport event payloads (as a function parameter); only path 1
(the OOM log-line detector in `LambdaProcessor::get_message`) was using
`current_request_id()`. And path 1 has an even better source for the
request id — the `requestId` field that structured JSON log payloads
already carry — which doesn't race with the in-processor `PlatformStart`
handler.

Drops the `is_managed_instance_mode` gate around payload `requestId`
extraction so on-demand mode also benefits (it was the LMI Python case
that surfaced the race empirically, but the same source is more accurate
than `invocation_context.request_id` in on-demand mode too). The OOM
detector now tags `Event::OutOfMemory` with the extracted payload
`requestId` directly; the Extension log variant passes `None` (extension
log payloads don't carry a function request id), and falls through to
`try_increment_oom_metric`'s no-dedup branch.

Updates `test_regular_lambda_does_not_extract_request_id` →
`test_regular_lambda_extracts_request_id_from_payload` since the rule it
was locking in (LMI-only extraction) no longer holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…) fallback

Per #1241 review:
when the request id is available from the log/event payload, use it
directly; only fall back to a workaround (`current_request_id()`) when
the payload doesn't carry one.

Drops the `is_managed_instance_mode` gate on payload `requestId`
extraction so on-demand mode also benefits. The OOM detector now reads
the payload field whenever it's present (Python, Ruby, .NET, and Java/Node
when JSON log format is configured) regardless of mode, and falls back
to `current_request_id()` only for text-payload OOM logs (Node V8 fatal,
Go fatal, Java stderr) where no `requestId` field exists.
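The lookup order described above (payload `requestId` first, `current_request_id()` only as a fallback) can be sketched as below. `LogPayload` and `resolve_request_id` are hypothetical names, and the payload is assumed to be already parsed. The real detector lives in `LambdaProcessor` and may be structured differently.

```rust
/// Simplified view of a function log record (illustrative only).
pub enum LogPayload<'a> {
    /// Structured JSON log; `request_id` mirrors the payload's `requestId` field.
    Json { request_id: Option<&'a str>, message: &'a str },
    /// Plain-text log line, which carries no request id.
    Text(&'a str),
}

/// Prefers the request id from the log payload and falls back to the
/// processor-tracked id only when the payload has none.
pub fn resolve_request_id(
    payload: &LogPayload<'_>,
    current_request_id: Option<&str>,
) -> Option<String> {
    let from_payload = match payload {
        LogPayload::Json { request_id, .. } => *request_id,
        LogPayload::Text(_) => None,
    };
    from_payload.or(current_request_id).map(str::to_owned)
}
```

A text payload such as the Node V8 fatal error resolves to whatever `current_request_id()` would return, while a JSON payload with a `requestId` never consults the fallback, so it is not exposed to the race with the `PlatformStart` handler.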

The fallback path preserves the count==1 behavior for the
double-detect cases on the integration suite (Java OutOfMemoryError,
Node SIGKILL, Go fatal-error) — these were what the previous
"drop current_request_id() entirely" refactor would have regressed.

Also renames `test_regular_lambda_does_not_extract_request_id` →
`test_regular_lambda_extracts_request_id_from_payload` to match the
new universal extraction behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-shot allocator

CDK stack create-function in CI failed for the new `lmi-oom` suite with
'MemorySize value failed to satisfy constraint: Lambda Managed Instance
functions must have memory size greater than or equal to 2048'.

LMI Lambda enforces a 2 GB floor. Bumping to 2048 MB exposes a second
problem: the existing `oom-python` source allocates 10 MB strings in a
loop, which on 2 GB either runs past the test budget or gets kernel
SIGKILL'd silently before CPython raises MemoryError — exactly what we
need Path 1 of the OOM detector to see.

Adds `oom-python-lmi/lambda_function.py` with a single
`bytearray(100 * 1024 ** 3)` allocation. 100 GB exceeds any reasonable
Lambda memory cap by orders of magnitude, so CPython's allocator
refuses immediately and raises a clean MemoryError without involving
the cgroup OOM killer. Verified manually with `yiming-lmi-oom-debug`
in us-east-1 (PR #1241 thread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rollup bucket

The `[lmi-oom]` suite failed 2x in a row on `0da7a59f` despite the metric
being present in Datadog (verified via direct API query). Root cause:
Datadog rolls `aws.lambda.enhanced.out_of_memory` into 10-second
wall-clock-aligned buckets, and the `/api/v1/query` endpoint only
returns buckets whose start timestamp is >= the `from` parameter.

In the failing run, the LMI cold start was fast: `windowStart = Date.now()`
ran at 19:32:11, the function OOMed at 19:32:18, both in the same bucket
starting at 19:32:10. The bucket's timestamp (19:32:10) is less than
`from = 19:32:11`, so the bucket is excluded. The test polled 21 times
across 12 minutes and saw `count = 0` every time, while a direct query
with a wider `from` returned `count = 1` for the same data point.

Fix: pad `windowStart` 60 s earlier than the actual invoke time so the
bucket containing the OOM is always included. The `deadline` budget still
runs from `invokeTime`, not the padded value.

Apply the same defensive change to `[oom]`. It hasn't flaked on this
specifically yet but the same race is possible — workload-dependent.
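The bucket arithmetic behind this failure can be worked through with the timestamps quoted above. This is a sketch of the reasoning only: the helper names are hypothetical, times are expressed as seconds since midnight for readability, and the inclusion rule is the one this commit message attributes to the query endpoint.

```rust
/// Rollup bucket width and window padding, as stated in the commit message.
pub const BUCKET_SECS: u64 = 10;
pub const WINDOW_PAD_SECS: u64 = 60;

/// Converts a wall-clock time to seconds since midnight.
pub fn hms(h: u64, m: u64, s: u64) -> u64 {
    h * 3600 + m * 60 + s
}

/// Start of the wall-clock-aligned bucket containing `ts_secs`.
pub fn bucket_start(ts_secs: u64) -> u64 {
    ts_secs - ts_secs % BUCKET_SECS
}

/// A bucket is returned only if its start is >= the query's `from`.
pub fn bucket_is_returned(from_secs: u64, point_ts_secs: u64) -> bool {
    bucket_start(point_ts_secs) >= from_secs
}

/// The fix: start the query window 60 s before the actual invoke time.
pub fn padded_window_start(invoke_secs: u64) -> u64 {
    invoke_secs.saturating_sub(WINDOW_PAD_SECS)
}
```

With `from = hms(19, 32, 11)` and an OOM at `hms(19, 32, 18)`, the bucket starts at 19:32:10, which is earlier than `from`, so `bucket_is_returned` is false. With the padded start of 19:31:11 the same bucket is included.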

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per #1241 (comment)
the comment in `oom-go/main.go` was written when `oomMemorySize` was
still 192 MB; the stack has since been bumped to 256 MB (so the bottlecap
extension has headroom and isn't OOM-killed itself, see the 256 MB
rationale in `lib/stacks/oom.ts`). Updates the two stale '192 MB'
references in the Go reproducer and adds a pointer to the canonical
constant in the stack file so the next person who tweaks one place sees
the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 participants