Skip to content

fix: keep polling service-linked-role delete on read-after-write lag#1149

Open
james00012 wants to merge 1 commit into
masterfrom
james/lib-5166-slr-followup
Open

fix: keep polling service-linked-role delete on read-after-write lag#1149
james00012 wants to merge 1 commit into
masterfrom
james/lib-5166-slr-followup

Conversation

@james00012

@james00012 james00012 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Follow-up to #1147 (merged). Based on master.

Summary

Addresses review feedback on #1147 and one correctness gap found while reviewing that change.

Intent

#1147 fixed false failures when a service-linked role deletes faster than the first status poll, by verifying the role via GetRole on a NoSuchEntity from GetServiceLinkedRoleDeletionStatus. Reviewing it surfaced a residual: the "task gone but role still present" branch returned a hard error, and PollUntil treats any returned error as terminal. IAM is eventually consistent, so a GetRole issued right after a fast delete can still see the role for a moment, which would turn read-after-write lag into the same false failure #1147 set out to remove.

Decision

Treat "task record gone, role still visible" as propagation lag and keep polling, the same way an in-progress deletion is handled (return false, nil). AWS has already accepted the deletion when the task record is purged, so a persistent "still present" state is implausible, and the existing 5 minute poll timeout still bounds a genuinely stuck delete. This fully closes the false-failure class rather than narrowing it.

Alternative considered: keep the fail-fast error. Rejected because the most likely cause of "still present" here is lag, not a real failure, so failing fast reintroduces the bug in a smaller window.

What changed

  • aws/resources/iam_service_linked_role.go: the "role still present" branch now logs and returns false, nil to continue polling instead of returning an error. Added a note that the verification requires iam:GetRole.
  • aws/resources/iam_service_linked_role_test.go: replaced the "still present returns error" test with one that models read-after-write lag (GetRole sees the role once, then NoSuchEntity) and asserts success. Mock GetRole gained a per-call result sequence. Lowercased a throttling error string per review (Go convention).

Illustration

status poll -> NoSuchEntity (task purged)
  -> GetRole
       NoSuchEntity        -> success
       role still visible  -> keep polling (lag), bounded by timeout   <- changed
       other error         -> propagate

Validation

  • Unit tests: go test ./aws/resources/ -run TestIAMServiceLinkedRoles -v, PASS (5). The lag test takes ~3s, confirming it retried across a poll interval and then succeeded rather than passing trivially.
  • gofmt, go vet ./aws/resources/, go build ./...: clean (re-run after rebasing onto current master).
  • Local read-only E2E against phxdevops (087285199408): inspect-aws --resource-type iam-service-linked-role listed ~40 service-linked roles including AWSServiceRoleForOrganizations, AWSServiceRoleForSSO, AWSServiceRoleForSupport, AWSServiceRoleForTrustedAdvisor. List/identify path works with no regression.
  • Not run: live nuke. The fast-delete race is non-deterministic and destructive in a shared account, so the delete branch is covered by unit tests.

Known limitations / follow-ups

  • A genuinely stuck "role still present" state now surfaces only at the 5 minute timeout rather than immediately. This matches existing in-progress handling and is the intended tradeoff.

Summary by CodeRabbit

Bug Fixes

  • Improved IAM role deletion verification – Enhanced the deletion process to gracefully handle temporary inconsistencies when verifying role state, accounting for AWS IAM's eventual consistency. The system now retries rather than immediately failing when encountering transient delays, improving reliability of role deletion operations.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This change updates IAM service-linked role deletion polling to tolerate temporary role visibility after the deletion task record disappears, and refreshes tests and mocks to cover that lag-aware verification path.

Changes

IAM service-linked role deletion polling

Layer / File(s) Summary
Deletion verification polling
aws/resources/iam_service_linked_role.go
When GetServiceLinkedRoleDeletionStatus returns NoSuchEntity, the code now verifies the role with GetRole and keeps polling with false, nil if the role is still present.
Lag-aware test coverage
aws/resources/iam_service_linked_role_test.go
The mock now supports sequenced GetRole results, the deletion test now expects success when the role briefly remains visible before disappearing, and the throttling message assertion is updated.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

  • gruntwork-io/cloud-nuke#1143 — The PR implements the described fix by verifying the role with GetRole when deletion status lookup returns NoSuchEntity.

Possibly related PRs

  • gruntwork-io/cloud-nuke#1147 — Both PRs touch deleteIAMServiceLinkedRole and its tests around GetRole-based verification after a missing deletion status record.

Suggested reviewers

  • yhakbar
  • denis256

Poem

A role said, “I’m gone!” but IAM said, “Hmm… maybe.”
So polling got patient, calm, and wavy.
First seen, then gone, in lag’s little dance,
The tests now follow that fleeting chance.
Debug logs wink: “Just wait one more glance.”

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main change: continuing polling when service-linked-role deletion sees read-after-write lag.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is detailed and on-topic, but it omits several template sections like TODOs, release notes, and migration guide.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch james/lib-5166-slr-followup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@james00012 james00012 marked this pull request as ready for review June 13, 2026 23:12
When the deletion-status poll returns NoSuchEntity, the verify-via-GetRole
branch now treats a still-visible role as IAM read-after-write lag and keeps
polling rather than failing, consistent with how the in-progress status is
handled. AWS has already accepted the deletion at that point, so a persistent
"still present" state is implausible and the surrounding timeout still bounds a
genuinely stuck delete.

Also document the new iam:GetRole requirement in the code comment and lowercase
a test error string per review.
@james00012 james00012 changed the base branch from fix-service-linked-role-fast-delete to master June 13, 2026 23:14
@james00012 james00012 force-pushed the james/lib-5166-slr-followup branch from 47669c0 to 82ea741 Compare June 13, 2026 23:14

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@aws/resources/iam_service_linked_role_test.go`:
- Around line 38-46: Guard the GetRole sequence handling in the IAM mock against
a nil getRoleCalls pointer: in the GetRoleErrSeq branch, add a nil check before
dereferencing or incrementing m.getRoleCalls so future tests that set
GetRoleErrSeq without initializing it won’t panic. Update the sequence logic in
the mock method that reads GetRoleErrSeq and advances getRoleCalls to fall back
safely when the counter is unset.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8b44fb8e-6f02-439e-8ba3-eb0058efeaa6

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed0ab0 and 82ea741.

📒 Files selected for processing (2)
  • aws/resources/iam_service_linked_role.go
  • aws/resources/iam_service_linked_role_test.go

Comment on lines +38 to +46
err := m.GetRoleErr
if len(m.GetRoleErrSeq) > 0 {
i := *m.getRoleCalls
if i >= len(m.GetRoleErrSeq) {
i = len(m.GetRoleErrSeq) - 1
}
*m.getRoleCalls++
err = m.GetRoleErrSeq[i]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard against nil pointer dereference in the GetRole sequence logic.

If a future test sets GetRoleErrSeq without initializing getRoleCalls, line 40 will panic when dereferencing nil. The current test at line 174 correctly initializes it, but the mock should be more defensive.

🛡️ Suggested fix to add a nil check
 func (m mockedIAMServiceLinkedRoles) GetRole(ctx context.Context, params *iam.GetRoleInput, optFns ...func(*iam.Options)) (*iam.GetRoleOutput, error) {
 	err := m.GetRoleErr
-	if len(m.GetRoleErrSeq) > 0 {
+	if len(m.GetRoleErrSeq) > 0 && m.getRoleCalls != nil {
 		i := *m.getRoleCalls
 		if i >= len(m.GetRoleErrSeq) {
 			i = len(m.GetRoleErrSeq) - 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
err := m.GetRoleErr
if len(m.GetRoleErrSeq) > 0 {
i := *m.getRoleCalls
if i >= len(m.GetRoleErrSeq) {
i = len(m.GetRoleErrSeq) - 1
}
*m.getRoleCalls++
err = m.GetRoleErrSeq[i]
}
err := m.GetRoleErr
if len(m.GetRoleErrSeq) > 0 && m.getRoleCalls != nil {
i := *m.getRoleCalls
if i >= len(m.GetRoleErrSeq) {
i = len(m.GetRoleErrSeq) - 1
}
*m.getRoleCalls++
err = m.GetRoleErrSeq[i]
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aws/resources/iam_service_linked_role_test.go` around lines 38 - 46, Guard
the GetRole sequence handling in the IAM mock against a nil getRoleCalls
pointer: in the GetRoleErrSeq branch, add a nil check before dereferencing or
incrementing m.getRoleCalls so future tests that set GetRoleErrSeq without
initializing it won’t panic. Update the sequence logic in the mock method that
reads GetRoleErrSeq and advances getRoleCalls to fall back safely when the
counter is unset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants