[TON-388] feat(aws_quickstart): grant instrumenter IAM permissions for selected resource types#306

Merged
gpalmz merged 4 commits into master from TON-388-instrumenter-iam-permissions
May 15, 2026

Conversation

@gpalmz gpalmz commented May 13, 2026

Summary

  • Add InstrumentationResourceTypes CFN parameter (CommaDelimitedList) to main_v2.yaml. When non-empty (e.g. aws:ec2:instance,aws:ecs:cluster,aws:eks:cluster), the integration-role permission-attach Lambda fetches the IAM actions required to instrument those resources from GET /api/unstable/instrumenter/aws/iam_permissions?resource_type=...&chunked=true and attaches each returned chunk as an additional managed policy on the Datadog integration role.
  • Failures to fetch or attach the extra permissions are non-blocking: a warning is logged and the integration install completes. The new policies are also cleaned up on stack delete, alongside the existing resource-collection ones.
  • Re-uses the existing JSON:API chunked parser already in place for the sibling /api/v2/integration/aws/iam_permissions/resource_collection?chunked=true endpoint; the response shape is identical.

Closes TON-388. Bumps aws_quickstart to v4.10.0.
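
A minimal sketch of the URL construction described in the summary, assuming the API host is `api.<site>` (the PR text only gives the path) and that each selected type is sent as its own repeated `resource_type` query parameter. The function name matches the `build_instrumentation_permissions_url` helper mentioned later in this PR, but the signature and body here are illustrative, not the merged code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_instrumentation_permissions_url(site: str, resource_types: list[str]) -> str:
    """Build the instrumenter IAM-permissions URL for a Datadog site.

    Assumption: the host is `api.<site>`; only the path comes from the PR.
    """
    # One (resource_type, value) pair per selected type, then the chunked flag.
    query = [("resource_type", rt) for rt in resource_types]
    query.append(("chunked", "true"))
    return (
        f"https://api.{site}/api/unstable/instrumenter/aws/iam_permissions"
        f"?{urlencode(query)}"
    )


url = build_instrumentation_permissions_url(
    "datadoghq.com", ["aws:ec2:instance", "aws:ecs:cluster"]
)
# Parse the query string back to confirm the repeated parameter survives.
params = parse_qs(urlsplit(url).query)
```

`urlencode` percent-encodes the colons in each type, so the check above parses the query string back rather than comparing raw strings.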

Test plan

  • Unit tests added (attach_integration_permissions_test.py, 14 tests, all green) covering: parameter parsing (CFN comma-string and JSON-list forms, whitespace, empties), URL construction (site, repeated resource_type params, chunked=true), happy path (per-chunk policy create + attach), fetch failure swallowed without raising, per-chunk attach failure continues with the remaining chunks, cleanup iterates both prefixes.
  • Smoke test against a real Datadog account by deploying main_v2.yaml with InstrumentationResourceTypes=aws:ec2:instance,aws:ecs:cluster,aws:eks:cluster once TON-387 lands and the /api/unstable/instrumenter/aws/iam_permissions endpoint is reachable.
  • Verify stack delete also removes the datadog-aws-integration-instrumentation-permissions-* policies.
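
The parameter-parsing cases listed in the test plan (comma-string and list forms, whitespace, empties) could be covered by a helper along these lines. `parse_resource_types` is a hypothetical name, and the assumption is that the Lambda receives the value either as one comma-joined string or as an already-split list.

```python
def parse_resource_types(raw) -> list[str]:
    """Normalize the InstrumentationResourceTypes value into a clean list."""
    if raw is None:
        return []
    # The value may arrive as one comma-joined string or as a list of strings.
    items = raw.split(",") if isinstance(raw, str) else list(raw)
    # Strip surrounding whitespace and drop empty entries ("a,,b", trailing comma).
    return [item.strip() for item in items if item and item.strip()]


from_string = parse_resource_types("aws:ec2:instance, aws:ecs:cluster,")
from_list = parse_resource_types(["aws:eks:cluster", " "])
from_empty = parse_resource_types("")
```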

🤖 Generated with Claude Code

…r selected resource types (v4.10.0)

Add an `InstrumentationResourceTypes` CFN parameter to the Quick Start template
that accepts a comma-separated list of UDM resource types (e.g.
`aws:ec2:instance, aws:ecs:cluster, aws:eks:cluster`). When non-empty, the
integration-role permission-attach Lambda calls
`GET /api/unstable/instrumenter/aws/iam_permissions?resource_type=...&chunked=true`
and attaches each returned permission chunk as an additional managed policy on
the integration role, so customers who install the Datadog Agent on those
resource types via the Quick Start don't have to wire up the extra IAM actions
themselves (ssm:SendCommand for EC2, EKS access-entry + lambda actions, ECS
service/task-definition actions, etc.).

Failure to fetch or attach the instrumenter permissions is non-blocking: a
warning is logged and the integration install completes successfully. The new
policies are also cleaned up alongside the existing resource-collection
policies on stack delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
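
The non-blocking contract described in this commit message can be sketched as a loop that logs and continues when a chunk fails. The IAM calls are replaced by an injected callable, and both `attach_chunks_best_effort` and the `-<index>` policy-name suffix are assumptions for illustration; only the policy-name prefix comes from the PR.

```python
import logging

LOGGER = logging.getLogger(__name__)
POLICY_PREFIX = "datadog-aws-integration-instrumentation-permissions"


def attach_chunks_best_effort(chunks, create_and_attach):
    """Attach each permission chunk; warn and continue if one chunk fails.

    `create_and_attach(policy_name, actions)` stands in for the real IAM calls.
    Returns the names of the policies that were attached successfully.
    """
    attached = []
    for index, actions in enumerate(chunks, start=1):
        policy_name = f"{POLICY_PREFIX}-{index}"  # suffix scheme is assumed
        try:
            create_and_attach(policy_name, actions)
        except Exception as exc:  # best-effort: never fail the install
            LOGGER.warning("Failed to attach %s: %s", policy_name, exc)
            continue
        attached.append(policy_name)
    return attached


def fake_attach(name, actions):
    # Simulate an IAM failure for the first chunk only.
    if "ssm:SendCommand" in actions:
        raise RuntimeError("simulated IAM error")


result = attach_chunks_best_effort(
    [["ssm:SendCommand"], ["eks:CreateAccessEntry"]], fake_attach
)
```

In this run the first chunk fails and is skipped, and the second chunk is still attached.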

gpalmz commented May 13, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73fb70935d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

f"Failed to fetch instrumentation permissions for {resource_types}: {e}. "
"Integration install will continue without these permissions."
)
return

[P2] Preserve existing instrumentation policies on fetch failure

When an Update runs after instrumentation policies were previously attached, handle_create_update deletes those policies via cleanup_existing_policies before calling this best-effort path. If the Datadog API or network is temporarily unavailable, this return makes the custom resource report success without recreating the already-requested instrumentation policies, so an unrelated stack update can silently revoke the Agent instrumentation permissions. Consider fetching before cleanup or failing/restoring on update failures.

- Extract `_create_and_attach_policy` helper so chunked-policy attach is shared
  between the resource-collection and instrumentation code paths instead of
  being duplicated byte-for-byte.
- Have `handle_delete`/`handle_create_update` unpack `ResourceProperties`
  themselves rather than threading 4–8 positional args from `handler`.
- Drop the `or "datadoghq.com"` fallback inside `build_instrumentation_permissions_url`
  — the handler already defaults `DatadogSite`, so the builder can trust its
  input.
- Prune docstrings/comments that only restated the function name or the next
  statement, per CLAUDE.md policy. Kept only the non-obvious "why" comments
  (CommaDelimitedList serialization, best-effort/non-raising contract,
  ignored-missing-entity semantics).
- Strip trailing whitespace from the embedded ZipFile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gpalmz commented May 13, 2026

@codex review

On a stack Update, the previous shape of `handle_create_update` deleted
existing instrumentation managed policies via `cleanup_existing_policies`
before attempting to refetch the latest set from Datadog. If the fetch
returned an error (transient API outage, network blip), the best-effort
path early-returned with the policies gone — so an unrelated stack update
could silently revoke the Agent's instrumentation permissions on the
integration role.

Fetch first, then clean up + reattach in a single step. If the fetch
fails, leave any previously-attached policies in place. An empty
`InstrumentationResourceTypes` list still cleans them up (user explicitly
opting out), and `handle_delete` still wipes them on stack deletion.

Surfaced by codex review on #306.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
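
The ordering fix in this commit (fetch first, then clean up and reattach) can be sketched as follows. `sync_instrumentation_policies` and its injected `fetch`, `cleanup`, and `attach` callables are hypothetical stand-ins for the real handler code.

```python
def sync_instrumentation_policies(resource_types, fetch, cleanup, attach):
    """Return a short status string describing what was done."""
    if not resource_types:
        cleanup()  # explicit opt-out: remove previously attached policies
        return "cleaned"
    try:
        chunks = fetch(resource_types)
    except Exception:
        # Fetch failed: leave whatever is already attached untouched.
        return "kept-existing"
    # Only now that the new set is in hand is it safe to replace the old one.
    cleanup()
    attach(chunks)
    return "replaced"


calls = []


def failing_fetch(resource_types):
    raise ConnectionError("simulated outage")


status = sync_instrumentation_policies(
    ["aws:ec2:instance"],
    failing_fetch,
    cleanup=lambda: calls.append("cleanup"),
    attach=lambda chunks: calls.append("attach"),
)
```

With the simulated outage, neither `cleanup` nor `attach` is called, so existing policies survive the update.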

gpalmz commented May 13, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!

@gpalmz gpalmz requested a review from raymondeah May 13, 2026 20:48
@gpalmz gpalmz marked this pull request as ready for review May 14, 2026 14:01
@gpalmz gpalmz requested a review from a team as a code owner May 14, 2026 14:01
# Fetch before cleanup so that a transient API failure on an Update leaves the
# previously-attached policies in place instead of silently revoking them.
if not resource_types:
    cleanup_instrumentation_policies(iam_client, role_name, account_id, partition)
Contributor

Open question: should we clean up the policies in this case? This adds a deletion path that is not tied to CloudFormation stack deletion.

Contributor Author

Good question. The behavior is intentional and matches what cleanup_existing_policies already does for the resource-collection toggle (handle_create_update unconditionally wipes the datadog-aws-integration-resource-collection-permissions-* policies, then only re-attaches them if ResourceCollectionPermissions=true) — so flipping a CFN parameter from on to off has always implied tearing down what it provisioned. The custom-resource pattern across this repo is declarative: stack state should match parameter state, not just stack existence.

That said, your point gets at a real wart — running the deletion path on every Create/Update even when the parameter was never set is wasteful (10 no-op detach + delete attempts per Create). Tightened in 8890d14: cleanup now only runs when the parameter actually went from non-empty to empty between updates (sourced from event['OldResourceProperties']). First-time Creates and steady-state Updates without instrumentation skip the IAM calls entirely.

`attach_instrumentation_permissions` now also takes the previous value of
`InstrumentationResourceTypes` (sourced from `event['OldResourceProperties']`
inside `handle_create_update`) and only runs the IAM cleanup when the
parameter actually transitioned from non-empty to empty between Updates.
This skips ~20 no-op IAM delete attempts on first-time stack Creates that
never opted in to instrumentation, while still cleaning up policies on
toggle-off Updates and on stack Delete.

Per @raymondeah's review comment on #306 — the previous shape ran the
deletion path on every Create/Update regardless of whether the parameter
had ever been non-empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
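
A sketch of the transition check this commit describes, assuming the custom-resource event carries `OldResourceProperties` only on Update requests (as CloudFormation does for custom resources). `should_cleanup_instrumentation` is a hypothetical name; the merged code passes the previous value into `attach_instrumentation_permissions` instead.

```python
def should_cleanup_instrumentation(event) -> bool:
    """True only when InstrumentationResourceTypes went from non-empty to empty."""

    def types(props):
        raw = (props or {}).get("InstrumentationResourceTypes") or []
        items = raw.split(",") if isinstance(raw, str) else raw
        return [item.strip() for item in items if item.strip()]

    previous = types(event.get("OldResourceProperties"))  # absent on Create
    current = types(event.get("ResourceProperties"))
    return bool(previous) and not current


create_event = {"RequestType": "Create", "ResourceProperties": {}}
toggle_off_event = {
    "RequestType": "Update",
    "OldResourceProperties": {"InstrumentationResourceTypes": "aws:ec2:instance"},
    "ResourceProperties": {"InstrumentationResourceTypes": ""},
}
```

Under this check a first-time Create skips the IAM cleanup calls, while the toggle-off Update triggers them.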
@gpalmz gpalmz requested a review from raymondeah May 14, 2026 19:53
@gpalmz gpalmz merged commit f8f6f31 into master May 15, 2026
6 checks passed
@gpalmz gpalmz deleted the TON-388-instrumenter-iam-permissions branch May 15, 2026 20:21
raymondeah added a commit that referenced this pull request May 29, 2026
…update events to Datadog for agent installation (#312)

* [TON-388] feat(aws_quickstart): port InstrumentationResourceTypes to main_extended and main_extended_workflow

v4.10.0 (PR #306) added the InstrumentationResourceTypes parameter only to
main_v2.yaml. main_extended.yaml and main_extended_workflow.yaml are the
templates UI launches actually use going forward, so the parameter and the
DatadogSite + InstrumentationResourceTypes passthrough to the role stack belong
there too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-388] chore(aws_quickstart): bump to v4.11.0 with changelog entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-388] docs(aws_quickstart): simplify 4.11.0 changelog entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-388] fix(aws_quickstart): shift InstrumentationResourceTypes port to main_workflow (drop main_extended)

main_extended.yaml isn't on the UI launch path; revert there and apply to
main_workflow.yaml instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] feat(aws_quickstart): forward CloudTrail events to Datadog instrumenter-events intake

Adds an EventBridge connection, API destination, invocation role, and EC2
CloudTrail rule as a new nested stack, conditionally deployed when
InstrumentationResourceTypes is set. Single-region by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] feat(aws_quickstart): gate forwarding rules per InstrumentationResourceTypes; add EKS

Add an EKS CloudTrail rule (CreateCluster, TagResource, UntagResource) and gate
each rule on whether its UDM type appears in InstrumentationResourceTypes.
Substring check is via Fn::Split / Fn::Join — CFN has no Conditions-level
substring intrinsic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] feat(aws_quickstart): filter tag events to target resource type

EC2 CreateTags/DeleteTags are scoped to instances via resourcesSet item resourceId
prefix "i-"; EKS TagResource/UntagResource are scoped to cluster ARNs via
wildcard match. Creation events (RunInstances, CreateCluster) bypass the filter
through EventBridge $or because their request payloads don't carry the filter
field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] feat(aws_quickstart): rename forwarding template, bump to v4.11.0, changelog

- Rename datadog_agent_install_forwarding.yaml to datadog_agent_resource_update_forwarding.yaml
  (the pipeline forwards resource update events; agent install is one consumer)
- DatadogAgentInstallForwardingStack -> DatadogAgentResourceUpdateForwardingStack in main_v2.yaml
- Bump version.txt to v4.11.0 + add 4.11.0 changelog entry
- Revert README — leave matching current prod

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] chore(aws_quickstart): scrub prose references and drop authored comments

Drop "instrumenter-events" from connection and rule descriptions, the main_v2
comment, and the changelog entry. Remove the explanatory comments I added under
Conditions and Resources (substring-trick and $or rationale). The intake URL
itself stays — it's the actual ApiDestination endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] docs(aws_quickstart): rewrite changelog entry at product level

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] docs(aws_quickstart): trim changelog entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] docs(aws_quickstart): rename "Agent install" to "Agent management"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] feat(aws_quickstart): wire forwarding stack into main_extended and main_extended_workflow

These two templates also need the InstrumentationResourceTypes parameter
(originally added only to main_v2 in v4.10.0) plus the same gating, role-stack
wiring, and conditional forwarding stack as main_v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466] fix(aws_quickstart): shift forwarding stack to main_workflow (drop main_extended)

main_extended.yaml isn't on the UI launch path; revert there and add the
forwarding wiring to main_workflow.yaml alongside main_v2 and
main_extended_workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [TON-466][TON-473] feat(aws_quickstart): forward EC2/EKS non-tag update events

Extends the forwarding pipeline (4.12.0) to also forward EC2
ModifyInstanceAttribute and EKS UpdateClusterConfig / UpdateClusterVersion
CloudTrail events. These represent queryable-field changes that affect
Agent management rule evaluation but previously only reached Datadog via
the hourly reconciler.

Bumps to v4.13.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [TON-466] chore(aws_quickstart): drop main_v2 wiring, collapse changelog to single entry

- Remove ShouldForwardEvents condition and DatadogAgentResourceUpdateForwardingStack
  resource from main_v2.yaml. The template is deprecated and no longer the UI launch
  path; forwarding ships via main_workflow.yaml and main_extended_workflow.yaml.
- Collapse the two staged CHANGELOG entries (forwarding pipeline + non-tag update
  events) into a single v4.13.0 entry; revert version.txt to v4.13.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [TON-466] fix(aws_quickstart): drop fixed RoleName so stack deploys in multiple regions

IAM role names are account-global. With an explicit RoleName the second-region
deploy of the same template fails with EntityAlreadyExists. Letting
CloudFormation auto-generate the name lets customers deploy the integration
in every region they want covered.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
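
For the tag-event filtering commit above, a rough sketch of what an EC2 rule pattern could look like, written as a Python dict that serializes to EventBridge pattern JSON. This is an assumption-laden illustration rather than the template's actual pattern: `EC2_PATTERN` is a hypothetical name, and the field path follows the CloudTrail `CreateTags` request shape (`requestParameters.resourcesSet.items[].resourceId`). The `$or` and `prefix` operators are standard EventBridge pattern syntax.

```python
import json

# Hypothetical EC2 rule pattern: creation events pass unfiltered, while tag
# events must reference at least one resource whose ID starts with "i-".
EC2_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "$or": [
            # RunInstances payloads do not carry resourcesSet, so no ID filter.
            {"eventName": ["RunInstances"]},
            {
                "eventName": ["CreateTags", "DeleteTags"],
                "requestParameters": {
                    "resourcesSet": {"items": {"resourceId": [{"prefix": "i-"}]}}
                },
            },
        ]
    },
}

pattern_json = json.dumps(EC2_PATTERN, indent=2)
```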