SREP-4383: feat(e2e): add CloudTrail permission denied check for AWS tests #8168
joshbranham wants to merge 1 commit into openshift:main
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: joshbranham
|
/test e2e-aws |
Codecov Report❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | main | #8168 | +/- |
|---|---|---|---|
| Coverage | 27.50% | 27.57% | +0.07% |
| Files | 1096 | 1098 | +2 |
| Lines | 107277 | 107787 | +510 |
| Hits | 29503 | 29722 | +219 |
| Misses | 75240 | 75518 | +278 |
| Partials | 2534 | 2547 | +13 |
Test Results
e2e-aws
|
|
/test e2e-aws |
|
/label tide/merge-method-squash |
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
|
@joshbranham: This pull request references SREP-4383, which is a valid Jira issue.
Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Add a post-test CloudTrail check that queries for permission denied events associated with the HostedCluster's IAM roles and any management cluster roles discovered from HCP namespace pods/SAs.

Results are logged (non-failing) with Prow-highlighted warnings and written as JSON to ARTIFACT_DIR. Uses sync.Once to run only once per process to avoid CloudTrail API rate limits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
78af02b to 9e2cb2e
|
/test e2e-aws |
|
@joshbranham: The following test failed, say
Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error
Summary
The test
The teardown sequence was:
Root Cause
AWS Resource Groups Tagging API eventual consistency. The 4 "remaining" resources reported by the validation function were already deleted by the cluster destroy operation, but the AWS Resource Groups Tagging API index had not yet been updated to reflect their removal:
The S3 bucket is the resource that triggers the timeout: the
Evidence
Relationship to PR #8168
This failure is unrelated to PR #8168. The PR adds the
The teardown failure is a pre-existing flaky behavior caused by AWS Resource Groups Tagging API eventual consistency: the 15-minute polling timeout in
Recommendations
Artifacts
✅ Analysis complete! 💡 Tip: The Markdown report can be copied directly into JIRA Description field |
|
Now I have all the evidence needed. Let me compile the final analysis report:
Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error
Summary
The
The teardown validation (
Critically, the cluster destroy itself succeeded: the
This failure is unrelated to PR #8168. The PR adds a purely read-only, informational-only
All other tests passed, including other tests that also ran
Evidence
Build log (line 2230–2237):
Destroy log (cluster destroy succeeded):
Test results summary (507 tests, 23 skipped, 2 failures):
Additional Evidence
Root Cause Hypothesis
Primary cause: Pre-existing flaky teardown race condition in
Contributing factors:
PR #8168 is NOT the cause. The
Recommendations
Artifacts
✅ Analysis complete! 💡 Tip: The Markdown report can be copied directly into JIRA Description field |
|
I now have the complete picture. Let me compile the final report.
Test Failure Analysis Complete (Multi-Step)
Job Information
Failed Step Analyses
Step:
| Test | Duration | Failure |
|---|---|---|
| TestCreateClusterRequestServingIsolation/Teardown | 1330.02s | AWS infra resources not cleaned up within 15-minute timeout |
| TestCreateClusterRequestServingIsolation | 2903.00s | Parent test failed due to Teardown sub-test failure |
Summary
The failure is exclusively in the Teardown phase of TestCreateClusterRequestServingIsolation. All functional sub-tests passed:
- ✅ ValidateHostedCluster: passed (1232.27s)
- ✅ Main/EnsurePSANotPrivileged: passed (0.21s)
- ✅ EnsureHostedCluster: passed (2.89s), including NoticeCloudTrailPermissionDenied (0.00s)
- ❌ Teardown: failed (1330.02s)
The teardown logic in fixture.go (line 303) polls the AWS Resource Groups Tagging API every 20 seconds for up to 15 minutes, checking whether tagged guest cluster resources have been deleted. After timeout, 4 resources remained:
Remaining AWS Resources:
- ec2:volume/vol-0ffadbd865418c3c6: EBS volume for node request-serving-isolation-m42mm-us-east-1c-xr26p-258f5
- ec2:volume/vol-0bfad6976b135737f: EBS volume for node request-serving-isolation-m42mm-us-east-1a-f59zk-d98lt
- s3:::request-serving-isolation-m42mm-image-registry-us-east-1-dysxj: Image registry S3 bucket
- ec2:volume/vol-02f1706335f2786fd: EBS volume for node request-serving-isolation-m42mm-us-east-1b-zbsmj-mjg2v
The cluster was eventually destroyed successfully (per destroy.log: "Successfully destroyed cluster and infrastructure"), but the pre-destroy resource check timed out because EC2 volumes and the S3 bucket were still being cleaned up.
Evidence
The destroy.log confirms the cluster did complete full infrastructure destruction after the test timeout:
"msg":"Deleted S3 Bucket","name":"request-serving-isolation-m42mm-image-registry-us-east-1-dysxj"
"msg":"Deleted VPC","id":"vpc-06fd88d6e2c821a99"
"msg":"Successfully destroyed cluster and infrastructure"
This indicates a race condition in the teardown sequence: the validateAWSGuestResourcesDeletedFunc check (15-minute timeout) ran concurrently with or before the actual infrastructure destruction, and the EC2 volumes + S3 bucket deletion took longer than the timeout window to propagate through the AWS Tagging API.
Relationship to PR #8168
This failure is NOT caused by PR #8168. The PR adds a non-failing NoticeCloudTrailPermissionDenied check that:
- Uses t.Logf (not t.Errorf), so it cannot cause test failures
- Is guarded by sync.Once, so it runs only once per process
- Passed in all 12 test suites, including TestCreateClusterRequestServingIsolation
The new NoticeCloudTrailPermissionDenied check detected 2 permission denied events in the TestCreateClusterPrivate run:
- ec2:DescribeInstanceTopology: unauthorized for private-lm6jr-shared-role
- ec2:DescribeAccountAttributes: unauthorized for private-lm6jr-shared-role
These were correctly reported as non-fatal informational findings and did not affect the test outcome.
PR #8168 CloudTrail Check Assessment
The new NoticeCloudTrailPermissionDenied check is working as designed:
- ✅ Ran once per process (via sync.Once)
- ✅ Detected real Client.UnauthorizedOperation events on 2 EC2 API calls
- ✅ Wrote JSON report to cloudtrail-permission-denied.json artifact
- ✅ Did not cause any test failures (non-fatal t.Logf only)
- ✅ Passed in all test suites
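A minimal sketch of the "run once, log only" pattern this assessment describes, under stated assumptions: `notifier`, `logger`, and the `lookup` callback are hypothetical names, and the PR may well use a package-level sync.Once rather than a struct field. The `logger` interface is the subset of *testing.T that the sketch needs.

```go
package main

import (
	"fmt"
	"sync"
)

// logger is the subset of *testing.T used here (hypothetical interface).
type logger interface {
	Logf(format string, args ...any)
}

// notifier guards the lookup with sync.Once so repeated calls are no-ops.
type notifier struct {
	once sync.Once
}

// notice runs lookup at most once and only logs what it finds. It never
// calls Errorf or Fatalf, so it cannot fail the calling test.
func (n *notifier) notice(t logger, lookup func() ([]string, error)) {
	n.once.Do(func() {
		events, err := lookup()
		if err != nil {
			t.Logf("CloudTrail lookup failed (ignored): %v", err)
			return
		}
		for _, e := range events {
			t.Logf("WARNING: permission denied event: %s", e)
		}
	})
}

// stdoutLogger lets the sketch run outside of `go test`.
type stdoutLogger struct{}

func (stdoutLogger) Logf(format string, args ...any) {
	fmt.Printf(format+"\n", args...)
}

func main() {
	var n notifier
	calls := 0
	lookup := func() ([]string, error) {
		calls++
		return []string{"ec2:DescribeInstanceTopology"}, nil
	}
	n.notice(stdoutLogger{}, lookup)
	n.notice(stdoutLogger{}, lookup) // no-op: the lookup already ran
	fmt.Println("lookups performed:", calls)
}
```

Because the guard swallows lookup errors as well, a throttled or failed CloudTrail call is logged and otherwise ignored, which keeps the check informational.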
Aggregated Root Cause
Failed Steps Summary
| Step | One-line Failure |
|---|---|
| TestCreateClusterRequestServingIsolation/Teardown | AWS guest infra resource cleanup timed out after 15 minutes: 3 EBS volumes + 1 S3 bucket still tagged as owned |
Root Cause Hypothesis
Primary cause: Pre-existing flaky teardown behavior in TestCreateClusterRequestServingIsolation. The validateAWSGuestResourcesDeletedFunc in fixture.go polls the AWS Resource Groups Tagging API with a 15-minute timeout, but the request-serving-isolation test creates a complex cluster with 5 NodePools (2 request-serving + 3 non-request-serving) across 3 availability zones. The volume of AWS resources (multiple EBS volumes per zone, S3 buckets, VPCs, subnets, security groups, etc.) sometimes takes longer than 15 minutes for the AWS Tagging API to reflect deletion — even though actual deletion succeeds (confirmed by destroy.log).
Contributing factors:
- The request-serving-isolation test creates significantly more infrastructure than other tests (5 NodePools vs. typically 1-2), increasing teardown time
- AWS Resource Groups Tagging API has eventual consistency — tag propagation for deletion events can lag behind the actual resource deletion
- The 15-minute timeout in validateAWSGuestResourcesDeletedFunc may be insufficient for this test's resource scale
This is NOT a regression introduced by PR #8168. The PR only adds a read-only, non-failing CloudTrail lookup that completed in 0s for this test suite.
Recommendations
- Retrigger the job: this is a transient AWS resource cleanup timing issue
- Consider increasing the timeout in fixture.go validateAWSGuestResourcesDeletedFunc for tests that create multiple NodePools, or make the timeout proportional to the number of resources
- The codecov/patch failure is a separate issue: insufficient test coverage on the new CloudTrail code added by the PR; this is unrelated to the e2e test failure
Artifacts
- Test artifacts: .work/prow-job-analyze-test-failure/2041711466364538880/logs/
- CloudTrail report: .work/prow-job-analyze-test-failure/2041711466364538880/logs/cloudtrail-permission-denied.json
- Destroy log: .work/prow-job-analyze-test-failure/2041711466364538880/logs/destroy.log
- JUnit XML: .work/prow-job-analyze-test-failure/2041711466364538880/logs/junit.xml
What this PR does / why we need it:
Add a post-test CloudTrail check that queries for AccessDenied and UnauthorizedOperation events associated with the HostedCluster's IAM roles. This surfaces permission gaps in ROSA managed policies during e2e CI runs without failing the test (log-only for now).
The check runs in the after() phase for AWS platform tests, using the LookupEvents API to scan CloudTrail within the test's time window. A JSON report is written to ARTIFACT_DIR when available.
Which issue(s) this PR fixes:
Fixes SREP-4383
Special notes for your reviewer:
Checklist: