fix(vpc): retry transient API errors on one-shot create calls #2
Closed
cchristous wants to merge 2 commits into
Transient API failures (HTTP 5xx/429 or network blips) on one-shot create/mutating VPC calls previously aborted the whole bake on the first error. PR IBM#156 added transient-error tolerance to the status-poll loops but left the create calls unprotected, so a single 502 on e.g. SSH-key creation failed the build (job 01bd515f-538b-4a09-8864-d9eddcbb3522).
Add a retryTransient helper that reuses IBM#156's isTransientPollError classifier and maxConsecutiveTransientPollFailures cap, retrying transient failures with the existing pollInterval backoff while still failing fast on fatal 4xx errors.
Wrap the one-shot create/mutating calls: CreateKey, CreateInstanceAction, CreateFloatingIP, CreateSecurityGroup, CreateSecurityGroupRule, CreateSecurityGroupTargetBinding, CreateImage, CreateImageExportJob and CreateInstance.
- Fold retry-exhaustion into the returned error (mirroring the poll loop) so an exhausted-transient failure is distinguishable from a fast-failed 4xx in the build output, instead of only being logged.
- Extract createInstanceWithRetry and route all four instance-prototype branches through it, removing the byte-identical copy-paste blocks.
- Extract effectivePollInterval shared by pollUntil and retryTransient.
- Document the duplicate-create consideration for non-idempotent calls (fixed resource names make a true duplicate fail fast as a 4xx).
- Add tests: network-blip (nil response) retry, and a fatal error encountered mid-retry short-circuits without running out the cap.
Superseded by IBM#156, which now delivers transient-error tolerance for all VPC calls — including the one-shot creates this PR wrapped — via the IBM Cloud SDK's built-in request retries ( That approach is a strict superset of this one (it also covers status polls, honors |
Summary
Transient API failures (HTTP 5xx/429 or network blips) on one-shot create/mutating VPC calls previously aborted the whole image bake on the first error. This wraps those calls in a bounded retry, reusing the transient/fatal classifier introduced in IBM#156 for the polling loops.
This is a follow-on to IBM#156 and is stacked on its branch (fix/vpc-image-poll-transient-tolerance) so it can reuse isTransientPollError/maxConsecutiveTransientPollFailures rather than duplicate them. Rebase onto master once IBM#156 merges.
Motivating failure: an s390x agent-image bake failed when the SSH-key create call hit a transient 502 Bad Gateway and the plugin gave up after 2.6s with no retry (failing job 01bd515f-538b-4a09-8864-d9eddcbb3522, 2026-06-26). The bake runs automatically per-PR/on-merge, so one transient 502 fails the whole build.
Rationale: why this is the right change
IBM#156 added transient-error tolerance only inside the status-poll loops (pollUntil, isResourceReady/isResourceDown, the image-export poll), via a transient = 5xx/429/network, fatal = 4xx classifier (isTransientPollError) and a consecutive-failure cap (maxConsecutiveTransientPollFailures = 5). It deliberately did not touch the one-shot create calls, so CreateKey, CreateInstance, etc. still failed on the first transient error.
This PR closes that gap using the same mechanism, so the create path and the poll path treat the same errors the same way:
- Add retryTransient(state, action, op) next to the poll machinery in client.go. It reuses the existing isTransientPollError classifier and maxConsecutiveTransientPollFailures cap, retries transient failures with the existing pollInterval backoff, and fails fast on fatal 4xx. On exhaustion it folds the give-up reason into the returned error (mirroring the poll loop) so an operator can tell a flaky/throttled failure apart from a hard 4xx.
- Wrap the one-shot create/mutating calls: CreateKey, CreateInstanceAction, CreateFloatingIP, CreateSecurityGroup, CreateSecurityGroupRule, CreateSecurityGroupTargetBinding (client.go), CreateImage (step_capture_image.go), CreateImageExportJob (step_image_export.go), and CreateInstance (all four instance-prototype branches in step_create_instance.go, collapsed into a shared createInstanceWithRetry helper).
- Extract effectivePollInterval() so the poll and create paths share one interval-resolution seam.
Alternatives considered:
- SDK EnableRetries on the shared vpcv1.VpcV1 client (transport-level retries, honors Retry-After, idempotency-aware network-error exclusions). This is a better long-term altitude and would delete retryTransient, but it replaces IBM#156's hand-rolled approach with a competing pattern, changes retry behavior for the poll GETs too, and is a larger change than this targeted fix warrants. Recommended as a follow-up to converge the poll and create paths onto one primitive.
- Retrying Delete* calls. Out of scope: those run in Cleanup and don't abort the bake.
Risk
- The success path is unchanged (op() returns immediately, identical to the prior one-shot call). The only behavioral change is on error: transient errors now retry up to 5× with backoff before surfacing the same error; fatal 4xx behavior and StateTimeout are unchanged.
- Tests, go vet, and gofmt are green.
Known trade-off (documented in code)
isTransientPollError treats an error with no HTTP response (nil DetailedResponse) as transient. That is required to retry genuine network blips, but it also means a deterministic SDK client-side validation error (also nil response) would be retried for ~50s before failing. The impact is bounded (capped, returns the same error) and only occurs on a malformed request that would fail the build regardless. For non-idempotent creates, a network blip after the server processed the request could in principle duplicate a resource; in practice these calls use fixed resource names, so a true duplicate retry hits a name conflict and fails fast as a 4xx rather than orphaning resources. Both points are noted in the retryTransient doc comment.
Why this is safe
The mitigations cover the identified risk directly:
- retryTransient calls op() once and returns on err == nil, so a healthy build sees no new behavior and no added latency.
- Errors still flow through the existing fmt.Errorf("[ERROR] ...") wrapping and state.Put("error", ...) handling.
- Each retry is reported via ui.Say, and exhaustion produces an explicit "giving up after N consecutive transient errors" message in the build output.
- Tests exercise a createSecurityGroup call against an httptest server, mirroring client_test.go's existing style.
The change reuses the same classifier and cap that already passed review in IBM#156, so the create path inherits that path's vetted transient/fatal semantics rather than introducing a second, divergent definition.
Pre-Submission Review
- Ran /pr-review-toolkit:review-pr (6 parallel agents: 3× code-reviewer, silent-failure-hunter, pr-test-analyzer, comment-analyzer). Findings addressed: (1) folded retry-exhaustion into the returned error so it's visible in the UI, not just logged; (2) added network-blip and fatal-mid-retry test cases; (3) documented the SDK-validation-error / duplicate-create trade-off. The "move to SDK EnableRetries" altitude finding is intentionally deferred (see Alternatives) and called out above as a follow-up.
- Ran /simplify (4 parallel cleanup agents: reuse, simplification, efficiency, altitude). Findings addressed: collapsed all four CreateInstance branches into createInstanceWithRetry (removing copy-paste) and extracted the shared effectivePollInterval() helper.