fix(vpc): tolerate transient API errors via SDK request retries by cchristous · Pull Request #156 · IBM/packer-plugin-ibmcloud

cchristous · 2026-06-23T15:38:15Z

Summary

Tolerate transient IBM Cloud VPC API errors (HTTP 429/5xx and network blips) so a single flaky response no longer aborts an image bake. This is done by enabling the IBM Cloud SDK's built-in request retries on the shared VPC service client, which covers every VPC call — one-shot creates (SSH key, instance, image, floating IP, security groups, image-export job) and status polls alike.

Motivating failure: an s390x agent-image bake failed when the SSH-key create call hit a transient 502 Bad Gateway and the plugin gave up after 2.6s with no retry (failing job 01bd515f-538b-4a09-8864-d9eddcbb3522, 2026-06-26). The bake runs automatically per-PR/on-merge, so one transient 502 fails the whole build.

Note: this PR was reworked. It previously hand-rolled a transient/fatal classifier and consecutive-failure counters inside the status-poll loops. It now delegates retries to the SDK (EnableRetries), which is a strict superset of that behavior and also covers the one-shot create calls — the place the original failure actually occurred.

Rationale — why this is the right change

The IBM Cloud Go SDK (go-sdk-core) already ships a transport-level retry mechanism, BaseService.EnableRetries(maxRetries, maxRetryInterval), whose policy (IBMCloudSDKRetryPolicy) retries exactly the transient cases we care about — 429, 5xx except 501, and network-level failures — while also doing things a hand-rolled poll-loop classifier cannot:

Honors the Retry-After header (parses both seconds and HTTP-date), so we back off as the API asks during throttling.
Exponential backoff for retries, capped at vpcRetryMaxInterval (a server-sent Retry-After is honored as given and is not bounded by it).
Excludes genuinely non-retryable network errors (bad scheme, invalid header, TLS-cert failures, too-many-redirects).

Crucially, retries live at the single VPC service construction site (StepCreateVPCServiceInstance, used by both the builder and the export post-processor), so one line — vpcService.EnableRetries(vpcRetryMaxAttempts, vpcRetryMaxInterval) — protects every create and every poll. The original failure was on a one-shot CreateKey, which a poll-loop-only fix would not have covered.

This let us delete the bespoke transientPollError / classifyPollError / isTransientPollError / maxConsecutiveTransientPollFailures machinery and the consecutive-failure counters in pollUntil and waitForExportJobToSucceed (net −400 / +107 lines in the rework). Those loops now simply re-check status on a fixed cadence and surface any genuine error — transient blips are absorbed beneath them by the SDK.

Alternatives considered:

Keep the hand-rolled poll-loop classifier (the original approach). Rejected: it only covered poll GETs, not the create calls where the bake actually failed; it duplicated retry logic the SDK already provides; and it ignored Retry-After.
Wrap each one-shot create call in a local retry helper. Rejected as a competing, lower-altitude pattern — it re-implements per call site what the SDK does once at the transport layer.

Risk

Blast radius: Medium. This plugin bakes agent images; every VPC call is on the critical build path, and EnableRetries affects all of them. But it's a single, well-understood SDK API and the change is otherwise a net deletion.
Change criticality: Low–Medium. The happy path is unchanged (a successful call returns on the first attempt). The behavior change is confined to the error path: transient errors are now retried with backoff before surfacing; fatal 4xx still fail fast on the first attempt.
Overall: Low–Medium.
Mitigations: Retries are bounded (vpcRetryMaxAttempts = 5; exponential backoff capped at vpcRetryMaxInterval = 30s, though a server-sent Retry-After is honored as given); StateTimeout is unchanged and still bounds the overall waits (the poll select returns on timeout even if an SDK retry is in flight); tests cover retry-then-success, give-up-after-cap, fatal-4xx-fast-fail, and the construction wiring; full suite + go vet + gofmt green.

Why this is safe

Happy path unchanged: EnableRetries only alters behavior when a request fails transiently; successful calls are unaffected, with no added latency.
Fatal errors still fail fast: IBMCloudSDKRetryPolicy does not retry 4xx, so genuine bad requests surface immediately as before.
StateTimeout preserved: pollUntil/waitForExportJobToSucceed run their check in a goroutine and select on time.After(timeout); a slow SDK-retrying GET cannot defeat the timeout, and the buffered result channel prevents any send-deadlock.
Bounded: at most vpcRetryMaxAttempts retries per call; exponential backoff capped at vpcRetryMaxInterval (a server-sent Retry-After is respected as given).
Wiring is guarded by a test: TestStepCreateVPCServiceInstanceEnablesRetries fails if the EnableRetries call is ever removed, so the whole premise can't silently regress.

Pre-Submission Review

/pr-review-toolkit:review-pr — run against the rework. The test-coverage analyzer (→ added the EnableRetries wiring test) and the simplification/altitude reviewer (→ confirmed keeping pollUntil's generalization; documented the export-poll test seam) completed and their findings were addressed. Two general code-review passes were also dispatched; they did not return in time, so their scope — overall correctness and the StateTimeout/retry-backoff interaction — was verified directly instead (single construction site covers all paths; timeout still bounds the waits; constants sized sensibly).
/simplify — covered by the same parallel run above (reuse/simplification/efficiency/altitude); findings addressed (net deletion, no dead code, generalization retained).

Known behavior changes & caveats

Per-request retry budget, not a loop-level streak. The removed hand-rolled poll loop tolerated up to 5 consecutive transient failures across the multi-minute poll (resetting on any success). SDK retries are per request: each status GET retries up to vpcRetryMaxAttempts with backoff, and if a single GET still fails, the wait aborts with that error. In practice this is equivalent-or-better (a blip is retried within seconds rather than waiting for the next 10s poll), but the surfaced error on a genuinely degraded API is the raw last SDK error rather than the old "giving up after N consecutive transient errors" message — slightly less self-describing. Enable vpc_log = "debug" to see the SDK's retry attempts.
Retry-After is honored as the server sends it. vpcRetryMaxInterval (30s) caps the exponential backoff but not a server-sent integer Retry-After. The poll loops still return on their own StateTimeout/ExportTimeout (the timeout select fires regardless), so the timeout contract holds; a pathological large Retry-After would only leave the in-flight poll goroutine lingering briefly in the background (bounded by vpcRetryMaxAttempts × Retry-After, buffered result channel — no deadlock). Threading a deadline context into the poll GETs to bound that is a reasonable future hardening but is not needed for correctness here.

The resource-status poll loops aborted the whole build on a single transient IBM VPC API error (5xx/429 or a network blip), even during long-running (~48 min) bakes. A flaky GetImage during the post-capture "wait for AVAILABLE" poll would fail an otherwise-successful build. Classify poll errors as transient (5xx/429, or no/zero-status response = network-level failure) vs fatal (4xx). Transient errors are retried up to maxConsecutiveTransientPollFailures (5) consecutive times, resetting on any successful poll; fatal errors still abort immediately. The overall StateTimeout remains the outer bound. Applied to all three poll loops, which previously treated any error as fatal: waitForResourceReady/isResourceReady and waitForResourceDown/ isResourceDown (now share a single pollUntil helper), and waitForExportJobToSucceed (the 5-45 min image-export poll). The previously-discarded *core.DetailedResponse is captured to read the status code. A pollInterval seam was added so the loops are unit-testable without real 10s sleeps. Signed-off-by: Corey Christous <cchristous@confluent.io>

- Drop ui.Error/log.Println from isResourceDown's error path so a retried transient blip no longer prints a misleading [ERROR] line before the "retrying" message (matches isResourceReady, which only returns the error). - Extract sleepOrDone helper to remove the duplicated sleep+select between pollUntil and the export-job loop. - Generalize the maxConsecutiveTransientPollFailures/classifyPollError doc comments (they no longer name waitForResourceReady specifically; mention 429) and fix the "wrapped wrapped" typo. - Align the export loop's error formatting to %s. - Add tests: streak-reset (5 transient -> success -> 5 transient -> ready, pinning both the cap at 5 and reset-on-success) and transient/fatal classification for the floating_ips and subnets branches. Signed-off-by: Corey Christous <cchristous@confluent.io>

deepakibms

Overall the change looks good to me. I don’t see any blocking issues. One small nit: the comment in waitForExportJobToSucceed() says “Default to 45 minutes for image exports,” but the code actually defaults to 5 minutes. Might be worth correcting that comment for clarity.

Replace the hand-rolled poll-loop transient classifier with the IBM Cloud SDK's built-in request retries (go-sdk-core EnableRetries), enabled once at VPC service construction (StepCreateVPCServiceInstance, used by both the builder and the export post-processor). The SDK retries 429/5xx/network errors, honors Retry-After, and backs off exponentially, covering every VPC call -- one-shot creates (CreateKey, CreateInstance, ...) and status polls alike -- so a single transient 502 no longer aborts a bake. Removes the now-redundant transientPollError / classifyPollError / isTransientPollError / maxConsecutiveTransientPollFailures machinery and the consecutive-failure counters in pollUntil and waitForExportJobToSucceed; those loops now just re-check status on the configured cadence and surface any genuine error. pollUntil's generalization (shared check func) is kept. Tests now exercise the SDK retry behavior end-to-end (502->success, give-up after the configured cap, fatal 4xx fails fast). Signed-off-by: Corey Christous <cchristous@confluent.io>

- Add TestStepCreateVPCServiceInstanceEnablesRetries: runs the construction step and asserts the service has a retryablehttp transport, so removing the EnableRetries call (which would silently drop transient tolerance for every VPC call) fails the build. The behavioral retry tests enable retries on a fresh service themselves and would not catch that regression. - Document why waitForExportJobToSucceed keeps a pollInterval seam while pollUntil hardcodes defaultPollInterval (its multi-poll loop is tested). Signed-off-by: Corey Christous <cchristous@confluent.io>

…mment Address review-pass findings: - The SDK honors a server-sent Retry-After uncapped; vpcRetryMaxInterval caps only the exponential backoff. Reword the constant comment so it no longer implies Retry-After is bounded by vpcRetryMaxInterval. - Fix a stale comment claiming a '45 minutes' default export timeout where the code uses 5 minutes. Signed-off-by: Corey Christous <cchristous@confluent.io>

Address final-round review findings: - Add TestWaitForResourceReadyAbortsOnAPIError and TestWaitForResourceReadyReturnsWhenReady so the production pollUntil loop is exercised end-to-end (the prior transient tests went through the SDK client directly). Both terminate on the first check, so no real poll wait. - Reword pollUntil's comment: transient errors are retried per-request by the SDK (up to vpcRetryMaxAttempts each), not absorbed across the poll loop, so an error surfacing here means retries were exhausted or it's fatal. Signed-off-by: Corey Christous <cchristous@confluent.io>

cchristous added 2 commits June 23, 2026 11:38

cchristous force-pushed the fix/vpc-image-poll-transient-tolerance branch from 8f5b1c4 to c53378a Compare June 23, 2026 15:39

cchristous marked this pull request as ready for review June 23, 2026 15:49

deepakibms approved these changes Jun 25, 2026

View reviewed changes

deepakibms requested review from astha-jain and sreekarbvibm June 25, 2026 11:40

cchristous mentioned this pull request Jun 26, 2026

fix(vpc): retry transient API errors on one-shot create calls cchristous/packer-plugin-ibmcloud#2

Closed

cchristous changed the title ~~fix(vpc): tolerate transient API errors while polling resource status~~ fix(vpc): tolerate transient API errors via SDK request retries Jun 26, 2026

cchristous added 3 commits June 26, 2026 10:57

cchristous force-pushed the fix/vpc-image-poll-transient-tolerance branch from 1a11fd8 to 0aba9e6 Compare June 26, 2026 14:58

astha-jain approved these changes Jun 30, 2026

View reviewed changes

deepakibms merged commit 9b54e96 into IBM:master Jul 1, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(vpc): tolerate transient API errors via SDK request retries#156

fix(vpc): tolerate transient API errors via SDK request retries#156
deepakibms merged 6 commits into
IBM:masterfrom
cchristous:fix/vpc-image-poll-transient-tolerance

cchristous commented Jun 23, 2026 •

edited

Loading

Uh oh!

deepakibms left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

cchristous commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Rationale — why this is the right change

Risk

Why this is safe

Pre-Submission Review

Known behavior changes & caveats

Uh oh!

deepakibms left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cchristous commented Jun 23, 2026 •

edited

Loading