fix: keep polling service-linked-role delete on read-after-write lag #1149
james00012 wants to merge 1 commit into
Walkthrough
This change updates IAM service-linked role deletion polling to tolerate temporary role visibility after the deletion task record disappears, and refreshes tests and mocks to cover that lag-aware verification path.
Changes
IAM service-linked role deletion polling
Estimated code review effort: 2 (Simple) | ~10 minutes
Pre-merge checks: 5 passed
When the deletion-status poll returns NoSuchEntity, the verify-via-GetRole branch now treats a still-visible role as IAM read-after-write lag and keeps polling rather than failing, consistent with how the in-progress status is handled. AWS has already accepted the deletion at that point, so a persistent "still present" state is implausible and the surrounding timeout still bounds a genuinely stuck delete. Also document the new iam:GetRole requirement in the code comment and lowercase a test error string per review.
47669c0 to 82ea741
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@aws/resources/iam_service_linked_role_test.go`:
- Around line 38-46: Guard the GetRole sequence handling in the IAM mock against
a nil getRoleCalls pointer: in the GetRoleErrSeq branch, add a nil check before
dereferencing or incrementing m.getRoleCalls so future tests that set
GetRoleErrSeq without initializing it won’t panic. Update the sequence logic in
the mock method that reads GetRoleErrSeq and advances getRoleCalls to fall back
safely when the counter is unset.
📒 Files selected for processing (2)
- aws/resources/iam_service_linked_role.go
- aws/resources/iam_service_linked_role_test.go
    err := m.GetRoleErr
    if len(m.GetRoleErrSeq) > 0 {
        i := *m.getRoleCalls
        if i >= len(m.GetRoleErrSeq) {
            i = len(m.GetRoleErrSeq) - 1
        }
        *m.getRoleCalls++
        err = m.GetRoleErrSeq[i]
    }
Guard against nil pointer dereference in the GetRole sequence logic.
If a future test sets GetRoleErrSeq without initializing getRoleCalls, line 40 will panic when dereferencing nil. The current test at line 174 correctly initializes it, but the mock should be more defensive.
🛡️ Suggested fix to add a nil check
      func (m mockedIAMServiceLinkedRoles) GetRole(ctx context.Context, params *iam.GetRoleInput, optFns ...func(*iam.Options)) (*iam.GetRoleOutput, error) {
          err := m.GetRoleErr
    -     if len(m.GetRoleErrSeq) > 0 {
    +     if len(m.GetRoleErrSeq) > 0 && m.getRoleCalls != nil {
              i := *m.getRoleCalls
              if i >= len(m.GetRoleErrSeq) {
                  i = len(m.GetRoleErrSeq) - 1
Committable suggestion
    err := m.GetRoleErr
    if len(m.GetRoleErrSeq) > 0 && m.getRoleCalls != nil {
        i := *m.getRoleCalls
        if i >= len(m.GetRoleErrSeq) {
            i = len(m.GetRoleErrSeq) - 1
        }
        *m.getRoleCalls++
        err = m.GetRoleErrSeq[i]
    }
Follow-up to #1147 (merged). Based on master.
Summary
Addresses review feedback on #1147 and one correctness gap found while reviewing that change.
Intent
#1147 fixed false failures when a service-linked role deletes faster than the first status poll, by verifying the role via GetRole when GetServiceLinkedRoleDeletionStatus returns NoSuchEntity. Reviewing it surfaced a residual gap: the "task gone but role still present" branch returned a hard error, and PollUntil treats any returned error as terminal. IAM is eventually consistent, so a GetRole issued right after a fast delete can still see the role for a moment. That would turn read-after-write lag into the same false failure #1147 set out to remove.
Decision
Treat "task record gone, role still visible" as propagation lag and keep polling, the same way an in-progress deletion is handled (return false, nil). AWS has already accepted the deletion when the task record is purged, so a persistent "still present" state is implausible, and the existing 5-minute poll timeout still bounds a genuinely stuck delete. This closes the false-failure class entirely rather than narrowing it.
Alternative considered: keep the fail-fast error. Rejected because the most likely cause of "still present" here is lag, not a real failure, so failing fast reintroduces the bug in a smaller window.
What changed
Illustration
Validation
Known limitations / follow-ups