[SVLS-9175] feat: emit OOM metric on memory equality with per-request dedup#1241
lym953 wants to merge 19 commits into
…th per-request dedup Customer report (#1237): a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the existing detection paths matched. The Node runtime did not log "JavaScript heap out of memory" (V8 spent its time in GC instead of declaring an OOM), and PlatformRuntimeDone reported no error_type — just a wall-clock timeout — so the log-string and Runtime.OutOfMemory paths both stayed silent. Drop the provided.al* restriction on the PlatformReport equality check so any runtime emits OOM when max_memory_used_mb == memory_size_mb. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory at the same time), add a per-Context oom_emitted flag. All three detection paths now funnel through Processor::try_increment_oom_metric, which checks the flag, sets it on first emission, and is a no-op on subsequent calls for the same request_id. The flag lives with the per-invocation Context and is cleared automatically when on_platform_report removes the context. Plumbing: Event::OutOfMemory now carries an Option<String> request_id (the log-path detector reads it from the logs processor's invocation_context.request_id, set on PlatformStart and cleared on PlatformRuntimeDone). When request_id is None — only realistic in Managed Instance mode, where extensions cannot subscribe to INVOKE — the helper falls back to a best-effort emit without dedup. Tests cover three scenarios: same request_id emits exactly once, two distinct request_ids each emit, and the equality path still fires (regression coverage for the dropped provided.al* check). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new `oom` integration-test suite that exercises the OOM dedup change (Context::oom_emitted, #1241) end-to-end across every supported runtime. Each lambda intentionally allocates until it OOMs; the test asserts aws.lambda.enhanced.out_of_memory increments by exactly one data point per function over the invocation window, which fails if the dedup flag stops working and two detection paths emit for the same invocation.
New lambda apps under integration-tests/lambda/:
- oom-node-v8-heap: exercises log-line path (JavaScript heap OOM)
- oom-node-sigkill: exercises PlatformRuntimeDone Runtime.OutOfMemory path
- oom-python: MemoryError; log path AND PlatformRuntimeDone path both fire, so dedup is necessary for count==1
- oom-ruby: NoMemoryError; same dual-path coverage as Python
- oom-java: OutOfMemoryError (log-line path)
- oom-dotnet: OutOfMemoryException (log-line path)
- oom-go: fatal: runtime: out of memory; log path AND PlatformReport memory-equality path both fire
Framework additions:
- Ruby and Go runtime/layer helpers in lib/util.ts (Ruby tracer layer; Go has no tracer layer, so the extension layer alone covers the test).
- Oom CDK stack registered in bin/app.ts.
- build-ruby.sh (zip-as-is for now; Gemfile build stubbed) and build-go.sh (Docker cross-compile to ARM64 Linux, bootstrap binary).
- Pipeline template additions for the two new build stages and oom suite registration in test-suites.yaml.
- getMetricCount() + OUT_OF_MEMORY_METRIC in tests/utils/datadog.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on the first oom suite (commit 5a833ac) returned counts: node-v8-heap=1 ✓, node-sigkill=1 ✓, python=1 ✓, dotnet=1 ✓, ruby=0 ✗, java=0 ✗, go=0 ✗.
Reproducing locally:
- Ruby: function failed at init with `cannot load such file -- datadog_lambda_rb`. The Datadog Ruby tracer is a regular gem (no handler shim like Python's `datadog_lambda.handler.handler`), so set handler to `lambda_function.handler` and drop `DD_LAMBDA_HANDLER`.
- Go: function timed out (30s) at `Max Memory Used: 192 MB / Memory Size: 192 MB` without emitting any enhanced metrics. Two changes:
  * Drop `AWS_LAMBDA_EXEC_WRAPPER=/opt/datadog_wrapper`. The wrapper sets language-specific tracer env vars; Go's tracer is in-module, not layer-based, so the wrapper just changes runtime detection without helping. With the wrapper removed and a clean exec, the extension's enhanced-metric pipeline starts emitting.
  * Replace the `for { append(make([]byte, 10MB)) }` loop with a single `make([]byte, 500MB)` that writes every page. Go's slice doubling + GC kept the loop from OOMing reliably in the 30s timeout window; eager allocation guarantees `fatal error: runtime: out of memory` fires immediately, exercising bottlecap's log-line detection.
- Java: also failed in CI (count=0), but local repro now returns count=1 with the same code path. Leaving the Java app unchanged for the next CI run to confirm. If it fails again, the extension likely didn't flush the metric before the JVM crashed; that would need DD_SERVERLESS_FLUSH_STRATEGY changes or per-function twice-invoke.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oom suite The integration-test framework defaults DD_SERVERLESS_FLUSH_STRATEGY to `end`, which means the extension only flushes at end of invocation. For OOM tests that's a tight race: the function dies, then Lambda sends PlatformRuntimeDone, then bottlecap increments the OOM metric, then Shutdown comes and the sandbox is reaped. If the metric flush can't finish in that narrow window, the data point is lost. Run 1 of the oom suite returned ruby/java/go=0 (3 of 7 failed). Run 2 returned ruby/node-sigkill/python/dotnet/go=0 (5 of 7 failed) — but java=1 this time. The set of "failing" runtimes is not stable across runs, confirming a timing race rather than a code bug. `default` flushes every ~1s in addition to invocation-end, giving the OOM metric a much wider window to reach Datadog before the sandbox is torn down. All other integration suites keep using `end` since their invocations complete cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs were intermittently returning count=0 for ruby/java/go/dotnet/ node-sigkill/python — varying combinations across runs. Diagnosing showed the data points were correctly emitted and durably ingested by Datadog within ~30s of the OOM, but the `/api/v1/query` endpoint sometimes returned no results for very-recently-ingested points. The single-shot 5-minute wait was too brittle. Polling strategy: wait 90s after invocation, then re-query every 30s until every runtime reports count>=1 or the 12-min budget is exhausted. Early-exits when all runtimes pass, so the common case is faster than the previous single-shot 5-min wait while the worst case is bounded. Each poll iteration logs the current counts and the still-missing runtimes, so debugging future flakes from CI logs requires no rerun. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
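The polling strategy in this commit message can be sketched roughly as follows. This is only an illustration in Rust, not the suite's actual test code: `poll_until`, its parameters, and the injected `sleep` callback are hypothetical names chosen so the loop can be exercised without real waiting. The 90 s / 30 s / 12 min values come from the message above.

```rust
use std::time::Duration;

/// Waits `initial_wait`, then checks `all_reported` every `interval` until it
/// returns true or the next wait would exceed `budget`.
/// Returns the elapsed time at which the check passed, or `None` on timeout.
/// `sleep` is injected (hypothetical design choice) so tests can skip real waiting.
pub fn poll_until<F, S>(
    mut all_reported: F,
    mut sleep: S,
    initial_wait: Duration,
    interval: Duration,
    budget: Duration,
) -> Option<Duration>
where
    F: FnMut() -> bool,
    S: FnMut(Duration),
{
    let mut elapsed = initial_wait;
    sleep(initial_wait);
    loop {
        // Early exit: the common case finishes as soon as every runtime reports.
        if all_reported() {
            return Some(elapsed);
        }
        // Worst case is bounded: stop once another interval would pass the budget.
        if elapsed + interval > budget {
            return None;
        }
        sleep(interval);
        elapsed += interval;
    }
}
```

With the values from the message, a call would pass `Duration::from_secs(90)`, `Duration::from_secs(30)`, and `Duration::from_secs(720)`; a real caller would also log the current counts and missing runtimes inside `all_reported`.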
…lt` was a no-op `FlushStrategy::Default` falls back to `End` until the lookback buffer fills (~20 invocations). The OOM test does a single cold-start invoke per function, so `default` behaved identically to `end` — explaining why the prior commit's change had no observable effect. `continuously,1000` schedules an unconditional 1s periodic flush regardless of invocation count, so the OOM metric reaches Datadog well before the sandbox is reaped after the function process dies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
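To make the difference between the strategies concrete, here is a minimal sketch of the behaviour described in this commit message. It is not bottlecap's implementation: the `Effective` enum, `parse`, `effective`, and the `LOOKBACK_INVOCATIONS` constant are hypothetical, the value 20 is taken from the "~20 invocations" figure above, and what `default` does once the buffer is full is deliberately left unmodelled.

```rust
/// Strategies mentioned in this thread; hypothetical mirror of the real enum.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FlushStrategy {
    Default,
    End,
    Continuously { interval_ms: u64 },
}

/// What the extension effectively does for a given invocation (illustrative).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Effective {
    EndOfInvocation,
    Periodic { interval_ms: u64 },
    /// `Default` with enough history; the actual decision is not modelled here.
    Adaptive,
}

/// Assumption: lookback size taken from "~20 invocations" in the commit message.
pub const LOOKBACK_INVOCATIONS: usize = 20;

/// Parses only the three values that appear in this thread.
pub fn parse(value: &str) -> Option<FlushStrategy> {
    match value.split_once(',') {
        None if value == "default" => Some(FlushStrategy::Default),
        None if value == "end" => Some(FlushStrategy::End),
        Some(("continuously", ms)) => ms
            .parse::<u64>()
            .ok()
            .map(|interval_ms| FlushStrategy::Continuously { interval_ms }),
        _ => None,
    }
}

pub fn effective(strategy: FlushStrategy, invocations_seen: usize) -> Effective {
    match strategy {
        FlushStrategy::End => Effective::EndOfInvocation,
        // Until the lookback buffer fills, `Default` behaves like `End`.
        FlushStrategy::Default if invocations_seen < LOOKBACK_INVOCATIONS => {
            Effective::EndOfInvocation
        }
        FlushStrategy::Default => Effective::Adaptive,
        // Unconditional periodic flush, regardless of invocation count.
        FlushStrategy::Continuously { interval_ms } => Effective::Periodic { interval_ms },
    }
}
```

Under these assumptions, a single cold-start invoke with `default` resolves to `EndOfInvocation`, which matches the observation that the earlier override had no effect.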
…es OOM-kill Root cause of the prior `[oom]` failures (6/7 runtimes stuck at count=0): at 192 MB the kernel OOM-killer often picks the bottlecap extension instead of the function runtime — Lambda surfaces this as `errorType: Extension.Crash`. A dead extension can't emit the OOM metric, so the test sees nothing in Datadog. Reproduced locally on us-east-2 arm64 with an IntegTests-style Python function: at 192 MB → `Extension.Crash`, no metric. Bumping to 256 MB → `Runtime.OutOfMemory`, count=1 in Datadog within 30 s. 256 MB gives the extension ~30 MB headroom while keeping every detection path active: the function still hits memory_size in PlatformReport, still emits its runtime-specific OOM log line, and still gets `Runtime.OutOfMemory` in PlatformRuntimeDone. The customer's #1237 case (192 MB) is unaffected — this is a test-harness change. Also drops the `DD_SERVERLESS_FLUSH_STRATEGY=continuously,1000` override from the prior commit; with the extension surviving, the default `end` flush is sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Trim historical `provided.al` context from OOM detection comments
- Rewrite `test_handle_ondemand_report_emits_oom_on_memory_equality` doc comment to describe what the test covers, not how the rule changed
- Refocus `current_request_id` doc on its sole purpose (OOM metric dedup by request_id) and drop speculative scenarios that weren't directly verified; use "LMI mode" consistently
- Drop "as of 2026-05" qualifier from the OOM detection path list
- Bump Datadog-Ruby3-4-ARM default layer 9 -> 28
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| /// `PlatformRuntimeDone` / `PlatformReport`. Returns `None` in LMI mode, | ||
| /// where extensions cannot subscribe to the `INVOKE` event so | ||
| /// `platform.start` is never delivered. | ||
| fn current_request_id(&self) -> Option<String> { |
To reviewers: is it anti-pattern to get request_id in this way?
Actually, for all three cases, we can get request_id from either log payload or telemetry event payload, so we don't need this current_request_id() function. Let me delete it.
For the error
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
No request id is available from the log itself, so we still need the function current_request_id().
Pull request overview
This PR expands OOM detection so aws.lambda.enhanced.out_of_memory can be emitted for all runtimes when any of the three OOM signals is observed (runtime-specific OOM log line, Runtime.OutOfMemory in PlatformRuntimeDone, or max_memory_used_mb == memory_size_mb in PlatformReport). To avoid double counting when multiple signals fire for the same invocation, it adds a per-invocation dedup flag keyed by request_id. It also adds a new cross-runtime integration test stack/suite (plus Ruby/Go build plumbing) to validate “exactly once per invocation”.
Changes:
- Emit OOM metric on `max_memory_used_mb == memory_size_mb` for all runtimes, and dedupe per invocation via `Context::oom_emitted`.
- Extend the event bus / processor plumbing so OOM events can carry an optional `request_id`.
- Add an OOM integration-test stack & test suite covering multiple runtimes, plus Go/Ruby build steps in CI/local deploy.
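As a rough illustration of the three signals listed in the overview, the sketch below checks each one against a set of observed values. It is a simplification, not bottlecap's code: in the extension the signals arrive in separate events at different times, whereas here they are folded into one hypothetical `Observed` struct. `OomSignal`, `fired_signals`, and the short `OOM_LOG_MARKERS` list (built only from messages quoted in this PR) are assumptions made for the example.

```rust
/// The three OOM signals named in the overview (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OomSignal {
    LogLine,
    RuntimeDoneErrorType,
    ReportMemoryEquality,
}

/// Illustrative substrings only, taken from messages quoted in this PR;
/// not the extension's actual match list.
const OOM_LOG_MARKERS: [&str; 3] = [
    "JavaScript heap out of memory",
    "fatal error: runtime: out of memory",
    "OutOfMemoryError",
];

/// Everything observed for one invocation, collapsed into one struct for the sketch.
pub struct Observed<'a> {
    pub log_line: Option<&'a str>,
    pub error_type: Option<&'a str>,
    pub max_memory_used_mb: u64,
    pub memory_size_mb: u64,
}

/// Returns every signal that fires; more than one can fire for one invocation,
/// which is why per-invocation dedup is needed.
pub fn fired_signals(obs: &Observed) -> Vec<OomSignal> {
    let mut fired = Vec::new();
    if let Some(line) = obs.log_line {
        if OOM_LOG_MARKERS.iter().any(|m| line.contains(m)) {
            fired.push(OomSignal::LogLine);
        }
    }
    if obs.error_type == Some("Runtime.OutOfMemory") {
        fired.push(OomSignal::RuntimeDoneErrorType);
    }
    // No runtime gating: the equality check applies to every runtime.
    if obs.max_memory_used_mb == obs.memory_size_mb {
        fired.push(OomSignal::ReportMemoryEquality);
    }
    fired
}
```

For the #1237 shape (no OOM log line, no `error_type`, 192 MB used of 192 MB), only `ReportMemoryEquality` fires in this sketch.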
Reviewed changes
Copilot reviewed 26 out of 28 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `bottlecap/src/lifecycle/invocation/context.rs` | Adds `oom_emitted` flag to dedupe OOM metric emissions per invocation. |
| `bottlecap/src/lifecycle/invocation/processor.rs` | Uses dedup helper for all OOM detection paths; removes `provided.al*` gating on memory equality; adds unit tests. |
| `bottlecap/src/lifecycle/invocation/processor_service.rs` | Threads optional `request_id` through the processor command for OOM events. |
| `bottlecap/src/logs/lambda/processor.rs` | Tags OOM log-line events with `request_id` (when available) for dedup. |
| `bottlecap/src/event_bus/mod.rs` | Changes OOM event shape to include optional `request_id`. |
| `bottlecap/src/bin/bottlecap/main.rs` | Forwards new OOM event shape into the invocation processor service. |
| `bottlecap/src/metrics/enhanced/lambda.rs` | Updates OOM metric docs to reflect new dedup path and detection coverage. |
| `integration-tests/tests/utils/datadog.ts` | Adds helper to query total metric emission count. |
| `integration-tests/tests/oom.test.ts` | Adds cross-runtime integration test asserting exactly one OOM metric emission per invocation. |
| `integration-tests/lib/stacks/oom.ts` | New CDK stack deploying OOM repro lambdas across runtimes. |
| `integration-tests/lib/util.ts` | Adds Ruby/Go runtime + Ruby tracer layer helpers. |
| `integration-tests/bin/app.ts` | Registers the new OOM test stack in the integration test app. |
| `integration-tests/lambda/oom-*/*` | Adds OOM repro Lambda sources for Node/Python/Ruby/Java/.NET/Go. |
| `integration-tests/scripts/local_deploy.sh` | Adds Ruby/Go build steps to local deploy. |
| `integration-tests/scripts/build-ruby.sh` | New Ruby build script (currently no-op for Gemfile-less lambdas). |
| `integration-tests/scripts/build-go.sh` | New Go cross-compile script producing `bin/bootstrap` for provided runtime. |
| `.gitlab/templates/pipeline.yaml.tpl` | Adds CI jobs to build Ruby/Go lambdas and wires them into the integration suite. |
| `.gitlab/datasources/test-suites.yaml` | Adds the `oom` test suite entry. |
| return; | ||
| } | ||
| ctx.oom_emitted = true; | ||
| } |
I’m curious: in what cases would the request ID be empty or unavailable? Are either of those cases valid? Maybe we can add a debug log for this.
- Added a debug log
- Tested on LMI. OOM log can arrive before the extension receives request_id from PlatformStart event. In this case, request_id is empty. Updated comment to explain this.
litianningdatadog
left a comment
Left a minor comment
Per PR review feedback. The two no-dedup branches in `try_increment_oom_metric` were previously silent; surfacing them as debug logs makes the LMI-mode case (request_id=None) and the rare context-eviction case (request_id supplied but absent from the buffer) visible during investigations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a dedicated `lmi-oom` suite that deploys one Python function on the LMI Capacity Provider and asserts that the OOM enhanced metric is emitted when the function hits its memory cap. Exercises the LMI-specific log-line path where `current_request_id()` returns `None` because `platform.start` is never delivered, so the OOM detector flows through the no-dedup branch of `try_increment_oom_metric`. Assertion is `count >= 1` rather than `== 1` because Path 2 (`Runtime.OutOfMemory` via synthesized runtime_done from `handle_managed_instance_report`) also fires for the same invocation and cannot dedup against the log path's `None`. A future change can tighten this once LMI dedup is addressed. Also simplifies overly-verbose comments above the two no-dedup debug logs — the log messages are self-explanatory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deployed a Python OOM Lambda on the LMI Capacity Provider, captured the extension's debug logs from CloudWatch, and observed the actual flow:
- PlatformStart IS delivered in LMI mode (prior comment claimed it wasn't).
- For a Python `MemoryError` that fires immediately on first allocation, the OOM log line is processed by `LambdaProcessor` *before* the `PlatformStart` telemetry event's handler updates `invocation_context.request_id`; both arrive in the same millisecond.
- `current_request_id()` therefore returns `None` and the metric flows through the no-dedup branch (the new debug log fires).
- The synthesized runtime-done from `handle_managed_instance_report` reports `error_type=Runtime.Unknown` (not `Runtime.OutOfMemory`), so Path 2 does NOT fire for this Python OOM shape. Final metric count = 1 (no double-count).
Updates the `current_request_id()` doc, the no-dedup debug log message, and the LMI OOM stack/test comments to reflect what was actually observed rather than the prior (incorrect) "platform.start never delivered in LMI" hypothesis. Assertion stays `>= 1` for robustness against future changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the comments on `try_increment_oom_metric`, the path-3 caller
in `handle_ondemand_report`, and `increment_oom_metric` all opened with
confident phrasing ("exactly once per request_id", "isn't counted multiple
times when more than one detection path fires"), and the no-dedup
fallback was buried at the bottom or absent. That mischaracterizes the
guarantee: when the OOM log line lands before/after the active-invocation
window in `LambdaProcessor`, or when the context has been evicted, the
metric will be double-counted by a subsequent detection path.
Restructures the three comments so the best-effort caveat is up front
and the two edge cases (request_id=None race, context evicted) are
called out explicitly with their consequences.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In LMI mode the function-log JSON payload always carries a `requestId` field that we already extract a few lines above the OOM detection block. Plumbing that value into `Event::OutOfMemory` instead of falling back to `current_request_id()` closes the race observed in #1241 (comment) where a fast OOM log line is processed before this same processor's `PlatformStart` handler updates `invocation_context.request_id`. OnDemand mode is unaffected — `request_id` from the log payload is unconditionally `None` there, so we still fall back to `current_request_id()`, which works because `PlatformStart`'s race window doesn't manifest in OnDemand operationally. Updates the `current_request_id` doc and the LMI OOM stack/test comments to reflect that the LMI case now goes through the deduped branch by way of the payload `requestId`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Id universally Per #1241 (comment): paths 2 and 3 already get `request_id` directly from the PlatformRuntimeDone and PlatformReport event payloads (as a function parameter); only path 1 (the OOM log-line detector in `LambdaProcessor::get_message`) was using `current_request_id()`. And path 1 has an even better source for the request id — the `requestId` field that structured JSON log payloads already carry — which doesn't race with the in-processor `PlatformStart` handler. Drops the `is_managed_instance_mode` gate around payload `requestId` extraction so on-demand mode also benefits (it was the LMI Python case that surfaced the race empirically, but the same source is more accurate than `invocation_context.request_id` in on-demand mode too). The OOM detector now tags `Event::OutOfMemory` with the extracted payload `requestId` directly; the Extension log variant passes `None` (extension log payloads don't carry a function request id), and falls through to `try_increment_oom_metric`'s no-dedup branch. Updates `test_regular_lambda_does_not_extract_request_id` → `test_regular_lambda_extracts_request_id_from_payload` since the rule it was locking in (LMI-only extraction) no longer holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… requestId universally" This reverts commit 1c95f5a.
…) fallback Per #1241 review: when the request id is available from the log/event payload, use it directly; only fall back to a workaround (`current_request_id()`) when the payload doesn't carry one. Drops the `is_managed_instance_mode` gate on payload `requestId` extraction so on-demand mode also benefits. The OOM detector now reads the payload field whenever it's present (Python, Ruby, .NET, and Java/Node when JSON log format is configured) regardless of mode, and falls back to `current_request_id()` only for text-payload OOM logs (Node V8 fatal, Go fatal, Java stderr) where no `requestId` field exists. The fallback path preserves the count==1 behavior for the double-detect cases on the integration suite (Java OutOfMemoryError, Node SIGKILL, Go fatal-error) — these were what the previous "drop current_request_id() entirely" refactor would have regressed. Also renames `test_regular_lambda_does_not_extract_request_id` → `test_regular_lambda_extracts_request_id_from_payload` to match the new universal extraction behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
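The "payload first, `current_request_id()` only as fallback" rule from this commit message amounts to a simple precedence. The sketch below is a hedged illustration, not the extension's code: `resolve_request_id` and `RequestIdSource` are hypothetical names, and the two inputs stand in for the payload `requestId` field and the value held in `invocation_context.request_id`.

```rust
/// Where the resolved request id came from (hypothetical, useful for debug logs).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum RequestIdSource {
    Payload,
    InvocationContext,
    Unavailable,
}

/// Prefers the `requestId` carried by the log payload; falls back to the id
/// tracked by the logs processor only when the payload has none
/// (for example plain-text fatal errors from Node V8 or Go).
pub fn resolve_request_id(
    payload_request_id: Option<&str>,
    current_request_id: Option<&str>,
) -> (Option<String>, RequestIdSource) {
    match (payload_request_id, current_request_id) {
        (Some(id), _) => (Some(id.to_owned()), RequestIdSource::Payload),
        (None, Some(id)) => (Some(id.to_owned()), RequestIdSource::InvocationContext),
        // Neither source has an id: the caller emits best-effort without dedup.
        (None, None) => (None, RequestIdSource::Unavailable),
    }
}
```

Returning the source alongside the id is a design choice of this sketch; it makes the fallback and no-dedup cases easy to surface in a debug log.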
…e-shot allocator CDK stack create-function in CI failed for the new `lmi-oom` suite with 'MemorySize value failed to satisfy constraint: Lambda Managed Instance functions must have memory size greater than or equal to 2048'. LMI Lambda enforces a 2 GB floor. Bumping to 2048 MB exposes a second problem: the existing `oom-python` source allocates 10 MB strings in a loop, which on 2 GB either runs past the test budget or gets kernel SIGKILL'd silently before CPython raises MemoryError — exactly what we need Path 1 of the OOM detector to see. Adds `oom-python-lmi/lambda_function.py` with a single `bytearray(100 * 1024 ** 3)` allocation. 100 GB exceeds any reasonable Lambda memory cap by orders of magnitude, so CPython's allocator refuses immediately and raises a clean MemoryError without involving the cgroup OOM killer. Verified manually with `yiming-lmi-oom-debug` in us-east-1 (PR #1241 thread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rollup bucket The `[lmi-oom]` suite failed 2x in a row on `0da7a59f` despite the metric being present in Datadog (verified via direct API query). Root cause: Datadog rolls `aws.lambda.enhanced.out_of_memory` into 10-second wall-clock-aligned buckets, and the `/api/v1/query` endpoint only returns buckets whose start timestamp is >= the `from` parameter. In the failing run, the LMI cold start was fast: `windowStart = Date.now()` ran at 19:32:11, the function OOMed at 19:32:18, both in the same bucket starting at 19:32:10. The bucket's timestamp (19:32:10) is less than `from = 19:32:11`, so the bucket is excluded. The test polled 21 times across 12 minutes and saw `count = 0` every time, while a direct query with a wider `from` returned `count = 1` for the same data point. Fix: pad `windowStart` 60 s earlier than the actual invoke time so the bucket containing the OOM is always included. The `deadline` budget still runs from `invokeTime`, not the padded value. Apply the same defensive change to `[oom]`. It hasn't flaked on this specifically yet but the same race is possible — workload-dependent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
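The bucket arithmetic behind this failure can be worked through in a few lines. This is a sketch of the reasoning in the commit message, not the test's actual code: the function names are hypothetical, timestamps are plain seconds, and the 10-second bucket width and 60-second padding are the values stated above.

```rust
/// Rollup width in seconds, as described in the commit message.
pub const BUCKET_SECS: u64 = 10;

/// Start of the wall-clock-aligned bucket that contains `ts_secs`.
pub fn bucket_start(ts_secs: u64) -> u64 {
    ts_secs - ts_secs % BUCKET_SECS
}

/// The query only returns buckets whose start timestamp is >= `from_secs`.
pub fn bucket_is_returned(point_ts_secs: u64, from_secs: u64) -> bool {
    bucket_start(point_ts_secs) >= from_secs
}

/// Pads the query window start earlier than the invoke time.
pub fn padded_window_start(invoke_ts_secs: u64, padding_secs: u64) -> u64 {
    invoke_ts_secs.saturating_sub(padding_secs)
}
```

Expressed as seconds since midnight, 19:32:11 is 70331 and 19:32:18 is 70338. The OOM point falls in the bucket starting at 70330, which is below `from = 70331` and therefore excluded; with 60 s of padding, `from` becomes 70271 and the bucket is included.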
Per #1241 (comment) the comment in `oom-go/main.go` was written when `oomMemorySize` was still 192 MB; the stack has since been bumped to 256 MB (so the bottlecap extension has headroom and isn't OOM-killed itself, see the 256 MB rationale in `lib/stacks/oom.ts`). Updates the two stale '192 MB' references in the Go reproducer and adds a pointer to the canonical constant in the stack file so the next person who tweaks one place sees the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
From our knowledge (before this PR), here's the behavior when each runtime OOMs:
- `PlatformRuntimeDone` event, `error_type` is `Runtime.OutOfMemory`. This can happen on Python and Ruby.
- `PlatformReport` event, `max_memory_used == memory_size`. This can happen on Python, Ruby, Node and Go.
To capture OOM for all these scenarios (except Node case 2, which was just called out in #1237) without double counting, right now the extension emits the `aws.lambda.enhanced.out_of_memory` metric in these scenarios:
- `Runtime.OutOfMemory`
- `max_memory_used == memory_size` for Go, i.e. only when runtime is `provided.al2`. We don't do this for other runtimes (Python, Ruby, Node) to avoid double counting.
Problem
In issue #1237, a customer called out a new scenario: "Node (case 2)" in the table. The only evidence of OOM is `max_memory_used == memory_size`, and there is no runtime-specific log message. As a result, OOMs like this are not captured by the OOM enhanced metric.
This PR
Test plan
Passed the added unit tests and integration tests.
To reviewers
Most of the code changes are for integration tests.
Details (generated by Claude Code)
Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (`Memory Size 192 MB / Max Memory Used 192 MB`, `Status: timeout`) did not emit `aws.lambda.enhanced.out_of_memory` because none of the three existing detection paths matched.
Why the existing paths missed it. V8 spent its budget in GC rather than declaring `JavaScript heap out of memory`, so the runtime log-line match never fired. The runtime crashed on a wall-clock timeout, so `PlatformRuntimeDone` reported no `error_type`. And the `max_memory_used_mb == memory_size_mb` check in `PlatformReport` was gated on `runtime.starts_with("provided.al")` to avoid double-counting against the log path, so Node was excluded.
What changes. Drop the `provided.al*` restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and `Runtime.OutOfMemory` simultaneously), add a per-`Context` `oom_emitted` flag. All three detection paths funnel through a new `Processor::try_increment_oom_metric`, which checks/sets the flag and is a no-op on subsequent calls for the same `request_id`.
Plumbing. `Event::OutOfMemory` now carries an `Option<String>` `request_id`. The log-path detector reads it from `LambdaProcessor::invocation_context.request_id` (set on `PlatformStart`, cleared on `PlatformRuntimeDone`/`PlatformReport`). `None` is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case.
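A minimal sketch of the dedup behaviour described above, under stated assumptions: the `Processor` here is a stand-in that keeps contexts in a `HashMap` and counts emissions in a plain `oom_metric_count` field, both of which are hypothetical simplifications rather than bottlecap's real data structures. Only the names `Context`, `oom_emitted`, and `try_increment_oom_metric` come from the PR.

```rust
use std::collections::HashMap;

/// Per-invocation state; only the dedup flag is modelled here.
#[derive(Default)]
pub struct Context {
    pub oom_emitted: bool,
}

/// Simplified stand-in for the invocation processor (hypothetical layout).
#[derive(Default)]
pub struct Processor {
    contexts: HashMap<String, Context>,
    /// Stand-in for the real metric emission.
    pub oom_metric_count: u64,
}

impl Processor {
    pub fn on_platform_start(&mut self, request_id: &str) {
        self.contexts.insert(request_id.to_string(), Context::default());
    }

    /// Removing the context also clears the flag for that invocation.
    pub fn on_platform_report(&mut self, request_id: &str) {
        self.contexts.remove(request_id);
    }

    /// Emits at most once per known request_id. Best-effort only: with no
    /// request_id, or with a request_id whose context is gone, it emits
    /// without dedup.
    pub fn try_increment_oom_metric(&mut self, request_id: Option<&str>) {
        if let Some(id) = request_id {
            if let Some(ctx) = self.contexts.get_mut(id) {
                if ctx.oom_emitted {
                    return; // already counted for this invocation
                }
                ctx.oom_emitted = true;
            }
        }
        self.oom_metric_count += 1;
    }
}
```

In this sketch, two detection paths calling the helper with the same `request_id` increment the counter once, while calls with `None` each increment it, which mirrors the no-dedup fallback.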