fix(vpc): tolerate transient API errors via SDK request retries #156
Merged
deepakibms merged 6 commits on Jul 1, 2026
The resource-status poll loops aborted the whole build on a single transient IBM VPC API error (5xx/429 or a network blip), even during long-running (~48 min) bakes. A flaky GetImage during the post-capture "wait for AVAILABLE" poll would fail an otherwise-successful build.

Classify poll errors as transient (5xx/429, or no/zero-status response = network-level failure) vs fatal (4xx). Transient errors are retried up to maxConsecutiveTransientPollFailures (5) consecutive times, resetting on any successful poll; fatal errors still abort immediately. The overall StateTimeout remains the outer bound.

Applied to all three poll loops, which previously treated any error as fatal: waitForResourceReady/isResourceReady and waitForResourceDown/isResourceDown (now share a single pollUntil helper), and waitForExportJobToSucceed (the 5-45 min image-export poll). The previously-discarded *core.DetailedResponse is captured to read the status code. A pollInterval seam was added so the loops are unit-testable without real 10s sleeps.

Signed-off-by: Corey Christous <cchristous@confluent.io>
- Drop ui.Error/log.Println from isResourceDown's error path so a retried transient blip no longer prints a misleading [ERROR] line before the "retrying" message (matches isResourceReady, which only returns the error).
- Extract sleepOrDone helper to remove the duplicated sleep+select between pollUntil and the export-job loop.
- Generalize the maxConsecutiveTransientPollFailures/classifyPollError doc comments (they no longer name waitForResourceReady specifically; mention 429) and fix the "wrapped wrapped" typo.
- Align the export loop's error formatting to %s.
- Add tests: streak-reset (5 transient -> success -> 5 transient -> ready, pinning both the cap at 5 and reset-on-success) and transient/fatal classification for the floating_ips and subnets branches.

Signed-off-by: Corey Christous <cchristous@confluent.io>
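The streak-and-reset behavior described in these first two commits can be illustrated with a small sketch. This is not the plugin's code: pollError, isTransient, and pollWithStreak are hypothetical names, the exact give-up point (abort on the sixth consecutive transient error) is an assumption inferred from the "5 transient -> success -> 5 transient -> ready" test, and maxPolls stands in for the real loop's sleep and StateTimeout handling.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveTransient mirrors the cap of 5 named in the commit message.
const maxConsecutiveTransient = 5

// pollError is a hypothetical error type carrying the HTTP status of a failed
// poll. StatusCode 0 stands for "no response", i.e. a network-level failure.
type pollError struct{ StatusCode int }

func (e *pollError) Error() string {
	return fmt.Sprintf("poll failed with status %d", e.StatusCode)
}

// isTransient treats 5xx, 429, and status 0 as retryable; everything else,
// including errors of unknown type, is fatal in this sketch.
func isTransient(err error) bool {
	var pe *pollError
	if !errors.As(err, &pe) {
		return false
	}
	return pe.StatusCode == 0 || pe.StatusCode == 429 || pe.StatusCode >= 500
}

// pollWithStreak calls check until it reports ready, tolerating up to
// maxConsecutiveTransient transient errors in a row. maxPolls replaces the
// real loop's sleeping and timeout so the sketch stays fast.
func pollWithStreak(check func() (bool, error), maxPolls int) error {
	streak := 0
	for i := 0; i < maxPolls; i++ {
		ready, err := check()
		switch {
		case err == nil:
			streak = 0 // any successful poll resets the streak
			if ready {
				return nil
			}
		case isTransient(err):
			streak++
			if streak > maxConsecutiveTransient {
				return fmt.Errorf("too many consecutive transient errors (limit %d): %w",
					maxConsecutiveTransient, err)
			}
		default:
			return err // fatal (for example 4xx): abort immediately
		}
	}
	return errors.New("poll budget exhausted")
}
```

A sequence of five 502s, one clean poll, and five more 502s passes under this sketch, while six 502s in a row or a single 404 aborts.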
8f5b1c4 to c53378a
deepakibms approved these changes on Jun 25, 2026
deepakibms (Contributor) left a comment:
Overall the change looks good to me. I don’t see any blocking issues. One small nit: the comment in waitForExportJobToSucceed() says “Default to 45 minutes for image exports,” but the code actually defaults to 5 minutes. Might be worth correcting that comment for clarity.
Replace the hand-rolled poll-loop transient classifier with the IBM Cloud SDK's built-in request retries (go-sdk-core EnableRetries), enabled once at VPC service construction (StepCreateVPCServiceInstance, used by both the builder and the export post-processor). The SDK retries 429/5xx/network errors, honors Retry-After, and backs off exponentially, covering every VPC call -- one-shot creates (CreateKey, CreateInstance, ...) and status polls alike -- so a single transient 502 no longer aborts a bake.

Removes the now-redundant transientPollError / classifyPollError / isTransientPollError / maxConsecutiveTransientPollFailures machinery and the consecutive-failure counters in pollUntil and waitForExportJobToSucceed; those loops now just re-check status on the configured cadence and surface any genuine error. pollUntil's generalization (shared check func) is kept.

Tests now exercise the SDK retry behavior end-to-end (502->success, give-up after the configured cap, fatal 4xx fails fast).

Signed-off-by: Corey Christous <cchristous@confluent.io>
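To illustrate why retrying at the transport level covers every call at once, here is a minimal standard-library sketch. It is not the SDK's implementation: retryTransport, retryable, and fetchThroughFlakyServer are hypothetical names, the delays are shortened for the demo, and Retry-After handling is left out. The policy in retryable follows the PR description (network errors, 429, and 5xx other than 501).

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// retryTransport is a hypothetical stand-in for a retrying HTTP transport.
// It is only safe for requests without a body (such as GET), because this
// sketch does not rewind request bodies between attempts.
type retryTransport struct {
	next       http.RoundTripper
	maxRetries int
	baseDelay  time.Duration
	maxDelay   time.Duration
}

// retryable treats network-level errors, 429, and 5xx other than 501 as
// transient; every other outcome is final.
func retryable(resp *http.Response, err error) bool {
	if err != nil {
		return true
	}
	code := resp.StatusCode
	return code == http.StatusTooManyRequests ||
		(code >= 500 && code != http.StatusNotImplemented)
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	delay := t.baseDelay
	for attempt := 0; ; attempt++ {
		resp, err := t.next.RoundTrip(req)
		if !retryable(resp, err) || attempt >= t.maxRetries {
			return resp, err
		}
		if resp != nil {
			// Drain and close so the connection can be reused.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff, capped below
		if delay > t.maxDelay {
			delay = t.maxDelay
		}
	}
}

// fetchThroughFlakyServer starts a local test server that answers with the
// given statuses in order (repeating the last one) and issues one GET through
// a client built on retryTransport. It returns the final status and hit count.
func fetchThroughFlakyServer(statuses []int, maxRetries int) (int, int, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := int(atomic.AddInt32(&hits, 1))
		if n > len(statuses) {
			n = len(statuses)
		}
		w.WriteHeader(statuses[n-1])
	}))
	defer srv.Close()

	client := &http.Client{Transport: &retryTransport{
		next: http.DefaultTransport, maxRetries: maxRetries,
		baseDelay: time.Millisecond, maxDelay: 4 * time.Millisecond,
	}}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, int(atomic.LoadInt32(&hits)), err
	}
	defer resp.Body.Close()
	return resp.StatusCode, int(atomic.LoadInt32(&hits)), nil
}
```

Calling fetchThroughFlakyServer with statuses 502, 502, 200 mirrors the "502->success" test case named above; a server that always answers 502 shows the give-up-after-cap case, and a 404 shows the fail-fast case.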
- Add TestStepCreateVPCServiceInstanceEnablesRetries: runs the construction step and asserts the service has a retryablehttp transport, so removing the EnableRetries call (which would silently drop transient tolerance for every VPC call) fails the build. The behavioral retry tests enable retries on a fresh service themselves and would not catch that regression.
- Document why waitForExportJobToSucceed keeps a pollInterval seam while pollUntil hardcodes defaultPollInterval (its multi-poll loop is tested).

Signed-off-by: Corey Christous <cchristous@confluent.io>
…mment

Address review-pass findings:
- The SDK honors a server-sent Retry-After uncapped; vpcRetryMaxInterval caps only the exponential backoff. Reword the constant comment so it no longer implies Retry-After is bounded by vpcRetryMaxInterval.
- Fix a stale comment claiming a '45 minutes' default export timeout where the code uses 5 minutes.

Signed-off-by: Corey Christous <cchristous@confluent.io>
1a11fd8 to 0aba9e6
Address final-round review findings:
- Add TestWaitForResourceReadyAbortsOnAPIError and TestWaitForResourceReadyReturnsWhenReady so the production pollUntil loop is exercised end-to-end (the prior transient tests went through the SDK client directly). Both terminate on the first check, so no real poll wait.
- Reword pollUntil's comment: transient errors are retried per-request by the SDK (up to vpcRetryMaxAttempts each), not absorbed across the poll loop, so an error surfacing here means retries were exhausted or it's fatal.

Signed-off-by: Corey Christous <cchristous@confluent.io>
astha-jain approved these changes on Jun 30, 2026
Summary
Tolerate transient IBM Cloud VPC API errors (HTTP 429/5xx and network blips) so a single flaky response no longer aborts an image bake. This is done by enabling the IBM Cloud SDK's built-in request retries on the shared VPC service client, which covers every VPC call — one-shot creates (SSH key, instance, image, floating IP, security groups, image-export job) and status polls alike.
Motivating failure: an s390x agent-image bake failed when the SSH-key create call hit a transient 502 Bad Gateway and the plugin gave up after 2.6s with no retry (failing job 01bd515f-538b-4a09-8864-d9eddcbb3522, 2026-06-26). The bake runs automatically per-PR/on-merge, so one transient 502 fails the whole build.

Rationale: why this is the right change
The IBM Cloud Go SDK (go-sdk-core) already ships a transport-level retry mechanism, BaseService.EnableRetries(maxRetries, maxRetryInterval), whose policy (IBMCloudSDKRetryPolicy) retries exactly the transient cases we care about (429, 5xx except 501, and network-level failures) while also doing things a hand-rolled poll-loop classifier cannot:
- Honors the Retry-After header (parses both seconds and HTTP-date), so we back off as the API asks during throttling.
- Backs off exponentially, capped at vpcRetryMaxInterval (a server-sent Retry-After is honored as given and is not bounded by it).

Crucially, retries live at the single VPC service construction site (StepCreateVPCServiceInstance, used by both the builder and the export post-processor), so one line, vpcService.EnableRetries(vpcRetryMaxAttempts, vpcRetryMaxInterval), protects every create and every poll. The original failure was on a one-shot CreateKey, which a poll-loop-only fix would not have covered.

This let us delete the bespoke transientPollError / classifyPollError / isTransientPollError / maxConsecutiveTransientPollFailures machinery and the consecutive-failure counters in pollUntil and waitForExportJobToSucceed (net −400 / +107 lines in the rework). Those loops now simply re-check status on a fixed cadence and surface any genuine error; transient blips are absorbed beneath them by the SDK.

Alternatives considered:
Retry-After.

Risk
- EnableRetries affects all of them. But it's a single, well-understood SDK API and the change is otherwise a net deletion.
- 4xx still fail fast on the first attempt.
- vpcRetryMaxAttempts = 5; exponential backoff capped at vpcRetryMaxInterval = 30s, though a server-sent Retry-After is honored as given); StateTimeout is unchanged and still bounds the overall waits (the poll select returns on timeout even if an SDK retry is in flight); tests cover retry-then-success, give-up-after-cap, fatal-4xx-fast-fail, and the construction wiring; full suite + go vet + gofmt green.

Why this is safe
- EnableRetries only alters behavior when a request fails transiently; successful calls are unaffected, with no added latency.
- IBMCloudSDKRetryPolicy does not retry 4xx, so genuine bad requests surface immediately as before.
- StateTimeout preserved: pollUntil / waitForExportJobToSucceed run their check in a goroutine and select on time.After(timeout); a slow SDK-retrying GET cannot defeat the timeout, and the buffered result channel prevents any send-deadlock.
- vpcRetryMaxAttempts retries per call; exponential backoff capped at vpcRetryMaxInterval (a server-sent Retry-After is respected as given).
- TestStepCreateVPCServiceInstanceEnablesRetries fails if the EnableRetries call is ever removed, so the whole premise can't silently regress.

Pre-Submission Review
- /pr-review-toolkit:review-pr was run against the rework. The test-coverage analyzer (→ added the EnableRetries wiring test) and the simplification/altitude reviewer (→ confirmed keeping pollUntil's generalization; documented the export-poll test seam) completed and their findings were addressed. Two general code-review passes were also dispatched; they did not return in time, so their scope (overall correctness and the StateTimeout/retry-backoff interaction) was verified directly instead (single construction site covers all paths; timeout still bounds the waits; constants sized sensibly).
- /simplify was covered by the same parallel run above (reuse/simplification/efficiency/altitude); findings addressed (net deletion, no dead code, generalization retained).

Known behavior changes & caveats
- vpcRetryMaxAttempts with backoff, and if a single GET still fails, the wait aborts with that error. In practice this is equivalent-or-better (a blip is retried within seconds rather than waiting for the next 10s poll), but the surfaced error on a genuinely degraded API is the raw last SDK error rather than the old "giving up after N consecutive transient errors" message, which is slightly less self-describing. Enable vpc_log = "debug" to see the SDK's retry attempts.
- Retry-After is honored as the server sends it. vpcRetryMaxInterval (30s) caps the exponential backoff but not a server-sent integer Retry-After. The poll loops still return on their own StateTimeout/ExportTimeout (the timeout select fires regardless), so the timeout contract holds; a pathological large Retry-After would only leave the in-flight poll goroutine lingering briefly in the background (bounded by vpcRetryMaxAttempts × Retry-After, with a buffered result channel, so no deadlock). Threading a deadline context into the poll GETs to bound that is a reasonable future hardening but is not needed for correctness here.