
SREP-4383: feat(e2e): add CloudTrail permission denied check for AWS tests#8168

Draft
joshbranham wants to merge 1 commit into openshift:main from joshbranham:cloudtrail-permission-denied-check

Conversation

@joshbranham
Contributor

@joshbranham joshbranham commented Apr 6, 2026

What this PR does / why we need it:

Add a post-test CloudTrail check that queries for AccessDenied and UnauthorizedOperation events associated with the HostedCluster's IAM roles. This surfaces permission gaps in ROSA managed policies during e2e CI runs without failing the test (log-only for now).

The check runs in the after() phase for AWS platform tests, using the LookupEvents API to scan CloudTrail within the test's time window. A JSON report is written to ARTIFACT_DIR when available.
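As a sketch of the event classification this kind of check needs — a hypothetical helper, not the PR's actual code — the raw CloudTrail event record returned by LookupEvents carries an `errorCode` field that identifies denied calls:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isPermissionDenied reports whether a raw CloudTrail event record carries
// one of the access-denial error codes named in the PR description. The
// errorCode field name follows the CloudTrail event record format; the
// helper itself is illustrative, not the PR's implementation.
func isPermissionDenied(rawEvent []byte) bool {
	var record struct {
		ErrorCode string `json:"errorCode"`
	}
	if err := json.Unmarshal(rawEvent, &record); err != nil {
		return false
	}
	switch record.ErrorCode {
	case "AccessDenied", "UnauthorizedOperation", "Client.UnauthorizedOperation":
		return true
	}
	return false
}

func main() {
	denied := []byte(`{"eventName":"DescribeInstanceTopology","errorCode":"Client.UnauthorizedOperation"}`)
	allowed := []byte(`{"eventName":"DescribeInstances"}`)
	fmt.Println(isPermissionDenied(denied), isPermissionDenied(allowed)) // true false
}
```

In the real check, each event's raw JSON (the `CloudTrailEvent` payload from LookupEvents) would be run through a filter like this before the findings are logged and written to the JSON report.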

Which issue(s) this PR fixes:

Fixes SREP-4383

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: bf3065b8-2da5-4407-beb3-eca20bc22888

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/needs-area and area/testing labels on Apr 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshbranham
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joshbranham
Contributor Author

/test e2e-aws

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 0% with 199 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.57%. Comparing base (bcf6ada) to head (9e2cb2e).
⚠️ Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
test/e2e/util/cloudtrail.go 0.00% 189 Missing ⚠️
test/e2e/util/hypershift_framework.go 0.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8168      +/-   ##
==========================================
+ Coverage   27.50%   27.57%   +0.07%     
==========================================
  Files        1096     1098       +2     
  Lines      107277   107787     +510     
==========================================
+ Hits        29503    29722     +219     
- Misses      75240    75518     +278     
- Partials     2534     2547      +13     
Files with missing lines Coverage Δ
test/e2e/util/hypershift_framework.go 0.00% <0.00%> (ø)
test/e2e/util/cloudtrail.go 0.00% <0.00%> (ø)

... and 11 files with indirect coverage changes


@cwbotbot

cwbotbot commented Apr 6, 2026

Test Results

e2e-aws

@joshbranham
Contributor Author

/test e2e-aws

@joshbranham
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash label (denotes a PR that should be squashed by tide when it merges) on Apr 6, 2026
@joshbranham
Contributor Author

/test e2e-aws

1 similar comment
@joshbranham
Contributor Author

/test e2e-aws

@joshbranham joshbranham changed the title from "feat(e2e): add CloudTrail permission denied check for AWS tests" to "SREP-4383: feat(e2e): add CloudTrail permission denied check for AWS tests" on Apr 7, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@joshbranham: This pull request references SREP-4383 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add a post-test CloudTrail check that queries for AccessDenied and UnauthorizedOperation events associated with the HostedCluster's IAM roles. This surfaces permission gaps in ROSA managed policies during e2e CI runs without failing the test (log-only for now).

The check runs in the after() phase for AWS platform tests, using the LookupEvents API to scan CloudTrail within the test's time window. A JSON report is written to ARTIFACT_DIR when available.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@joshbranham: This pull request references SREP-4383 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add a post-test CloudTrail check that queries for AccessDenied and UnauthorizedOperation events associated with the HostedCluster's IAM roles. This surfaces permission gaps in ROSA managed policies during e2e CI runs without failing the test (log-only for now).

The check runs in the after() phase for AWS platform tests, using the LookupEvents API to scan CloudTrail within the test's time window. A JSON report is written to ARTIFACT_DIR when available.

Which issue(s) this PR fixes:

Fixes SREP-4383

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Add a post-test CloudTrail check that queries for permission denied
events associated with the HostedCluster's IAM roles and any
management cluster roles discovered from HCP namespace pods/SAs.
Results are logged (non-failing) with Prow-highlighted warnings
and written as JSON to ARTIFACT_DIR. Uses sync.Once to run only
once per process to avoid CloudTrail API rate limits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joshbranham joshbranham force-pushed the cloudtrail-permission-denied-check branch from 78af02b to 9e2cb2e on April 8, 2026 02:55
@joshbranham
Contributor Author

/test e2e-aws

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

@joshbranham: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws 9e2cb2e link true /test e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 9, 2026

Test Failure Analysis Complete

Job Information


Test Failure Analysis

Error

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 4 remaining resources for guest cluster

Summary

The test TestCreateClusterRequestServingIsolation passed all functional subtests (ValidateHostedCluster, Main, EnsureHostedCluster — including the new NoticeCloudTrailPermissionDenied subtest added by PR #8168). The failure occurred exclusively in the Teardown phase during post-destroy AWS resource cleanup validation.

The teardown sequence was:

  1. Cluster destroy succeeded — the destroy.log confirms all infrastructure was deleted: VPC, subnets, security groups, route tables, IAM roles, S3 bucket, and all DNS records were cleaned up successfully with final message "Successfully destroyed cluster and infrastructure".
  2. Post-delete validation timed out — after the successful destroy, validateAWSGuestResourcesDeletedFunc (fixture.go:287-353) polls the AWS Resource Groups Tagging API every 20 seconds for 15 minutes, checking if tagged resources are gone. The poll timed out with 4 resources still appearing in the tagging API index.

Root Cause

AWS Resource Groups Tagging API eventual consistency. The 4 "remaining" resources reported by the validation function were already deleted by the cluster destroy operation, but the AWS Resource Groups Tagging API index had not yet been updated to reflect their removal:

  • vol-0ffadbd865418c3c6 (EC2 node volume): Deleted by CAPA, stale in tagging API
  • vol-0bfad6976b135737f (EC2 node volume): Deleted by CAPA, stale in tagging API
  • vol-02f1706335f2786fd (EC2 node volume): Deleted by CAPA, stale in tagging API
  • request-serving-isolation-m42mm-image-registry-us-east-1-dysxj (S3 bucket): Explicitly deleted in destroy.log (line 8: "Deleted S3 Bucket"), stale in tagging API

The S3 bucket is the resource that triggers the timeout: the hasGuestResources() function (fixture.go:363-382) skips EC2 volumes that lack a kubernetes.io/created-for/pv/name tag (these 3 are node volumes, not PV volumes), but returns true for any non-EC2 resource (the S3 bucket).
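A stdlib-only sketch of that filtering behavior (the `resource` struct and helper name here are illustrative simplifications, not the fixture's actual types):

```go
package main

import (
	"fmt"
	"strings"
)

// resource models the two fields the filter cares about from a Resource
// Groups Tagging API entry (a hypothetical simplification).
type resource struct {
	ARN  string
	Tags map[string]string
}

// hasGuestResources mirrors the logic described above: EC2 volumes
// without a kubernetes.io/created-for/pv/name tag are treated as node
// root volumes and skipped, while any non-EC2 resource (such as an S3
// bucket) counts as a remaining guest resource.
func hasGuestResources(resources []resource) bool {
	for _, r := range resources {
		if strings.HasPrefix(r.ARN, "arn:aws:ec2:") && strings.Contains(r.ARN, ":volume/") {
			if _, isPV := r.Tags["kubernetes.io/created-for/pv/name"]; !isPV {
				continue // node volume, not a PV volume: ignore
			}
		}
		return true // anything else still listed counts as remaining
	}
	return false
}

func main() {
	nodeVol := resource{ARN: "arn:aws:ec2:us-east-1:123456789012:volume/vol-abc", Tags: map[string]string{}}
	bucket := resource{ARN: "arn:aws:s3:::example-image-registry-bucket"}
	fmt.Println(hasGuestResources([]resource{nodeVol}), hasGuestResources([]resource{nodeVol, bucket})) // false true
}
```

This is why the three node volumes alone would not have tripped the validation, but the stale S3 bucket entry kept the poll returning "resources remain" until the deadline.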

Evidence

  1. destroy.log confirms complete infrastructure cleanup:

    • Line 8: "Deleted S3 Bucket" name="request-serving-isolation-m42mm-image-registry-us-east-1-dysxj"
    • Line 45: "Successfully destroyed cluster and infrastructure"
    • All VPC components, IAM roles, DNS zones, and security groups were deleted
  2. Teardown duration: 1330.02s (~22 minutes), which includes:

    • Cluster destroy (~21 minutes for HostedCluster finalization + infra deletion)
    • 15-minute timeout for post-delete tagging API validation
    • The validation timed out at the 15-minute mark
  3. All functional subtests passed (507 tests, 23 skipped, 2 failures — both failures are the same test's Teardown):

    • TestCreateClusterRequestServingIsolation/ValidateHostedCluster — PASS (1232.27s)
    • TestCreateClusterRequestServingIsolation/Main/EnsurePSANotPrivileged — PASS (0.21s)
    • TestCreateClusterRequestServingIsolation/EnsureHostedCluster — PASS (2.89s), including all 18 subtests
    • TestCreateClusterRequestServingIsolation/EnsureHostedCluster/NoticeCloudTrailPermissionDenied — PASS (0.00s)
  4. PR SREP-4383: feat(e2e): add CloudTrail permission denied check for AWS tests #8168 changes are not related to the failure: The NoticeCloudTrailPermissionDenied subtest passed in all test suites. The CloudTrail check for the TestCreateClusterPrivate cluster found 2 non-fatal permission denied events (ec2:DescribeInstanceTopology, ec2:DescribeAccountAttributes) and correctly reported them as non-fatal observations.

Relationship to PR #8168

This failure is unrelated to PR #8168. The PR adds the NoticeCloudTrailPermissionDenied e2e subtest which:

  • Checks AWS CloudTrail for permission denied events during hosted cluster lifecycle
  • Reports findings as non-fatal observations (uses t.Errorf with "non-fatal" prefix)
  • Passed in all 11 test suites where it ran (0.00s to 365.74s)
  • Does not modify any teardown/cleanup logic

The teardown failure is a pre-existing flaky behavior caused by AWS Resource Groups Tagging API eventual consistency — the 15-minute polling timeout in validateAWSGuestResourcesDeletedFunc is sometimes insufficient for the tagging index to catch up with actual resource deletions, particularly for S3 buckets.


Recommendations

  1. This failure should not block PR SREP-4383: feat(e2e): add CloudTrail permission denied check for AWS tests #8168 — the teardown timeout is a known infrastructure timing issue unrelated to the PR's CloudTrail permission denied check.

  2. To reduce flakiness of this validation: Consider either:

    • Increasing the timeout in validateAWSGuestResourcesDeletedFunc (currently 15 minutes)
    • Adding S3 bucket existence verification via the S3 API (HeadBucket) as a fallback before relying solely on the Resource Groups Tagging API
    • Excluding S3 buckets from the post-delete tagging check if they were explicitly deleted during the destroy phase
  3. codecov/patch failure is a separate concern — it indicates the new CloudTrail code lacks sufficient test coverage, not a functional issue.


Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2041711466364538880/logs/
  • Key files:
    • build-log.txt — Full test execution log
    • destroy.log — Cluster destroy log (confirms successful cleanup)
    • junit.xml — Test JUnit results
    • junit_operator.xml — CI operator JUnit results
    • cloudtrail-permission-denied.json — CloudTrail findings (from TestCreateClusterPrivate, not the failing test)

✅ Analysis complete!
📄 Report: .work/prow-job-analyze-test-failure/2041711466364538880/analysis.md

💡 Tip: The Markdown report can be copied directly into JIRA Description field


@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 13, 2026


Test Failure Analysis Complete

Job Information


Test Failure Analysis

Error

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 4 remaining resources for guest cluster

Summary

The TestCreateClusterRequestServingIsolation/Teardown subtest failed because 4 AWS infrastructure resources owned by the hosted cluster (request-serving-isolation-m42mm) were not cleaned up within the 15-minute validateAWSGuestResourcesDeletedFunc timeout in fixture.go. The resources that remained were:

  1. 3 EBS volumes — Node root volumes from machines in 3 availability zones (us-east-1a, us-east-1b, us-east-1c), tagged with sigs.k8s.io/cluster-api-provider-aws/role=node
  2. 1 S3 bucket — Image registry bucket (request-serving-isolation-m42mm-image-registry-us-east-1-dysxj)

The teardown validation (validateAWSGuestResourcesDeletedFunc) uses wait.PollUntilContextTimeout with a 15-minute deadline to poll AWS Resource Groups Tagging API every 20 seconds, checking if resources tagged kubernetes.io/cluster/<infraID>=owned still exist. When the timeout expired, 4 resources still existed.

Critically, the cluster destroy itself succeeded — the destroy.log shows all infrastructure (VPC, subnets, security groups, route tables, DNS zones, IAM roles, and even the S3 bucket) was successfully deleted. The S3 bucket at line 8 of destroy.log was deleted, even though it was reported as "remaining" by the pre-destroy validation. This confirms the validation timeout is a race condition: the hosted cluster controllers hadn't fully cleaned up resources before the 15-minute pre-destroy validation deadline, but the subsequent destroyCluster call cleaned everything up successfully.

This failure is unrelated to PR #8168. The PR adds a purely read-only, informational-only NoticeCloudTrailPermissionDenied check that:

  • Only uses t.Logf (never t.Errorf or t.Fatalf) — it cannot cause a test failure
  • Queries CloudTrail's LookupEvents API (read-only) — it never creates, modifies, or deletes any AWS resources
  • Runs in the after() phase, completely separate from the teardown/destroy path
  • Completed in 0.00s across all test suites (confirmed by build log)
  • Is guarded by sync.Once to run at most once per process

All other tests passed, including other tests that also ran NoticeCloudTrailPermissionDenied in their EnsureHostedCluster phase (TestCreateCluster, TestUpgradeControlPlane, TestAutoscaling, TestNodePool, etc.).

Evidence

Build log (line 2230–2237):

TestCreateClusterRequestServingIsolation/Teardown
    fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
    fixture.go:340: Failed to clean up 4 remaining resources for guest cluster
    fixture.go:347: Resource: arn:aws:ec2:us-east-1:820196288204:volume/vol-0ffadbd865418c3c6 (ec2, node volume, us-east-1c)
    fixture.go:347: Resource: arn:aws:ec2:us-east-1:820196288204:volume/vol-0bfad6976b135737f (ec2, node volume, us-east-1a)
    fixture.go:347: Resource: arn:aws:s3:::request-serving-isolation-m42mm-image-registry-us-east-1-dysxj (s3)
    fixture.go:347: Resource: arn:aws:ec2:us-east-1:820196288204:volume/vol-02f1706335f2786fd (ec2, node volume, us-east-1b)
    hypershift_framework.go:586: Destroyed cluster. Namespace: e2e-clusters-bckx9, name: request-serving-isolation-m42mm

Destroy log — cluster destroy succeeded:

"Deleted S3 Bucket","name":"request-serving-isolation-m42mm-image-registry-us-east-1-dysxj"
...
"Successfully destroyed cluster and infrastructure","namespace":"e2e-clusters-bckx9","name":"request-serving-isolation-m42mm"

Test results summary (507 tests, 23 skipped, 2 failures):

  • The 2 JUnit failures are the same test: TestCreateClusterRequestServingIsolation (parent) and TestCreateClusterRequestServingIsolation/Teardown (child)
  • All actual test assertions in TestCreateClusterRequestServingIsolation/Main and /EnsureHostedCluster passed
  • The test only failed in the teardown/cleanup phase

Additional Evidence

  • No interval files were found for this job
  • No symptom labels were detected
  • All other tests passed: TestCreateCluster, TestCreateClusterPrivate, TestCreateClusterPrivateWithRouteKAS, TestCreateClusterProxy, TestCreateClusterCustomConfig, TestUpgradeControlPlane, TestAutoscaling, TestNodePool (HostedCluster0 and HostedCluster2), TestNodePoolAutoscalingScaleFromZero, TestHAEtcdChaos — all completed successfully including their teardowns
  • The TestCreateClusterRequestServingIsolation test creates a HighlyAvailable hosted cluster with dedicated request-serving topology (3 node pools across 3 AZs + 2 request-serving node pools), making it the most resource-intensive test in the suite, which explains why its resource cleanup takes longer

Root Cause Hypothesis

Primary cause: Pre-existing flaky teardown race condition in fixture.go:validateAWSGuestResourcesDeletedFunc. The 15-minute timeout for AWS resource cleanup was insufficient for the TestCreateClusterRequestServingIsolation test's resources. This test creates a HighlyAvailable hosted cluster spanning 3 availability zones with 3 worker node volumes and an image registry S3 bucket — more resources than simpler tests. The hosted cluster controllers did not delete all tagged resources within 15 minutes, but the subsequent destroyCluster call cleaned everything up successfully.

Contributing factors:

  • The request-serving isolation test creates the most complex cluster topology (5 management node pools + HA hosted cluster across 3 AZs), producing more AWS resources that need cleanup
  • EBS volume detachment and deletion can experience AWS-side delays, especially under high API load
  • The validation polls every 20 seconds, and 15 minutes may be marginal for a cluster of this complexity

PR #8168 is NOT the cause. The NoticeCloudTrailPermissionDenied check added by this PR is purely observational (read-only, log-only, sync.Once guarded) and does not interact with the teardown or resource cleanup path in any way.

Recommendations

  1. Retry the CI job — This is a flaky infrastructure cleanup timeout, not a code defect. The test's actual assertions all passed.
  2. Consider increasing the timeout in fixture.go:validateAWSGuestResourcesDeletedFunc from 15 minutes to 20–25 minutes for HighlyAvailable clusters with many resources, or make the timeout proportional to the number of expected resources.
  3. Consider making the validation non-fatal — Since destroyCluster successfully cleans up all resources regardless, the pre-destroy validation timeout could be logged as a warning rather than a test failure.

Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2041711466364538880/logs/
  • Build log: .work/prow-job-analyze-test-failure/2041711466364538880/logs/build-log.txt
  • Destroy log: .work/prow-job-analyze-test-failure/2041711466364538880/logs/destroy.log
  • JUnit XML: .work/prow-job-analyze-test-failure/2041711466364538880/logs/junit.xml

✅ Analysis complete!

💡 Tip: The Markdown report can be copied directly into JIRA Description field


@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 14, 2026


Test Failure Analysis Complete (Multi-Step)

Job Information


Failed Step Analyses

Step: e2e-aws-hypershift-aws-run-e2e-nested (test phase)

Error

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 4 remaining resources for guest cluster

Failed Tests (2 of 507, 23 skipped)

Test Duration Failure
TestCreateClusterRequestServingIsolation/Teardown 1330.02s AWS infra resources not cleaned up within 15-minute timeout
TestCreateClusterRequestServingIsolation 2903.00s Parent test failed due to Teardown sub-test failure

Summary

The failure is exclusively in the Teardown phase of TestCreateClusterRequestServingIsolation. All functional sub-tests passed:

  • ValidateHostedCluster — passed (1232.27s)
  • Main/EnsurePSANotPrivileged — passed (0.21s)
  • EnsureHostedCluster — passed (2.89s), including NoticeCloudTrailPermissionDenied (0.00s)
  • Teardown — failed (1330.02s)

The teardown logic in fixture.go (line 303) polls the AWS Resource Groups Tagging API every 20 seconds for up to 15 minutes, checking whether tagged guest cluster resources have been deleted. After timeout, 4 resources remained:

Remaining AWS Resources:

  1. ec2:volume/vol-0ffadbd865418c3c6 — EBS volume for node request-serving-isolation-m42mm-us-east-1c-xr26p-258f5
  2. ec2:volume/vol-0bfad6976b135737f — EBS volume for node request-serving-isolation-m42mm-us-east-1a-f59zk-d98lt
  3. s3:::request-serving-isolation-m42mm-image-registry-us-east-1-dysxj — Image registry S3 bucket
  4. ec2:volume/vol-02f1706335f2786fd — EBS volume for node request-serving-isolation-m42mm-us-east-1b-zbsmj-mjg2v

The cluster was eventually destroyed successfully (per destroy.log: "Successfully destroyed cluster and infrastructure"), but the pre-destroy resource check timed out because EC2 volumes and the S3 bucket were still being cleaned up.

Evidence

The destroy.log confirms the cluster did complete full infrastructure destruction after the test timeout:

"msg":"Deleted S3 Bucket","name":"request-serving-isolation-m42mm-image-registry-us-east-1-dysxj"
"msg":"Deleted VPC","id":"vpc-06fd88d6e2c821a99"
"msg":"Successfully destroyed cluster and infrastructure"

This indicates a race condition in the teardown sequence: the validateAWSGuestResourcesDeletedFunc check (15-minute timeout) ran concurrently with or before the actual infrastructure destruction, and the EC2 volumes + S3 bucket deletion took longer than the timeout window to propagate through the AWS Tagging API.

Relationship to PR #8168

This failure is NOT caused by PR #8168. The PR adds a non-failing NoticeCloudTrailPermissionDenied check that:

  • Uses t.Logf (not t.Errorf) — it cannot cause test failures
  • Is guarded by sync.Once — runs only once per process
  • Passed in all 12 test suites including TestCreateClusterRequestServingIsolation

The new NoticeCloudTrailPermissionDenied test did successfully detect 2 permission denied events in the TestCreateClusterPrivate run:

  • ec2:DescribeInstanceTopology — unauthorized for private-lm6jr-shared-role
  • ec2:DescribeAccountAttributes — unauthorized for private-lm6jr-shared-role

These were correctly reported as non-fatal informational findings and did not affect the test outcome.


PR #8168 CloudTrail Check Assessment

The new NoticeCloudTrailPermissionDenied check is working as designed:

  • ✅ Ran once per process (via sync.Once)
  • ✅ Detected real Client.UnauthorizedOperation events on 2 EC2 API calls
  • ✅ Wrote JSON report to cloudtrail-permission-denied.json artifact
  • ✅ Did not cause any test failures (non-fatal t.Logf only)
  • ✅ Passed in all test suites

Aggregated Root Cause

Failed Steps Summary

Step One-line Failure
TestCreateClusterRequestServingIsolation/Teardown AWS guest infra resource cleanup timed out after 15 minutes — 3 EBS volumes + 1 S3 bucket still tagged as owned

Root Cause Hypothesis

Primary cause: Pre-existing flaky teardown behavior in TestCreateClusterRequestServingIsolation. The validateAWSGuestResourcesDeletedFunc in fixture.go polls the AWS Resource Groups Tagging API with a 15-minute timeout, but the request-serving-isolation test creates a complex cluster with 5 NodePools (2 request-serving + 3 non-request-serving) across 3 availability zones. The volume of AWS resources (multiple EBS volumes per zone, S3 buckets, VPCs, subnets, security groups, etc.) sometimes takes longer than 15 minutes for the AWS Tagging API to reflect deletion — even though actual deletion succeeds (confirmed by destroy.log).

Contributing factors:

  • The request-serving-isolation test creates significantly more infrastructure than other tests (5 NodePools vs. typically 1-2), increasing teardown time
  • AWS Resource Groups Tagging API has eventual consistency — tag propagation for deletion events can lag behind the actual resource deletion
  • The 15-minute timeout in validateAWSGuestResourcesDeletedFunc may be insufficient for this test's resource scale

This is NOT a regression introduced by PR #8168. The PR only adds a read-only, non-failing CloudTrail lookup that completed in 0s for this test suite.

Recommendations

  • Retrigger the job — this is a transient AWS resource cleanup timing issue
  • Consider increasing the timeout in fixture.go validateAWSGuestResourcesDeletedFunc for tests that create multiple NodePools, or make the timeout proportional to the number of resources
  • The codecov/patch failure is a separate issue — insufficient test coverage on the new CloudTrail code added by the PR; this is unrelated to the e2e test failure

Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2041711466364538880/logs/
  • CloudTrail report: .work/prow-job-analyze-test-failure/2041711466364538880/logs/cloudtrail-permission-denied.json
  • Destroy log: .work/prow-job-analyze-test-failure/2041711466364538880/logs/destroy.log
  • JUnit XML: .work/prow-job-analyze-test-failure/2041711466364538880/logs/junit.xml


Labels

  • area/testing: Indicates the PR includes changes for e2e testing.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants