Skip to content

fix(vpc): tolerate transient API errors via SDK request retries#156

Merged
deepakibms merged 6 commits into
IBM:masterfrom
cchristous:fix/vpc-image-poll-transient-tolerance
Jul 1, 2026
Merged

fix(vpc): tolerate transient API errors via SDK request retries#156
deepakibms merged 6 commits into
IBM:masterfrom
cchristous:fix/vpc-image-poll-transient-tolerance

Conversation

@cchristous

@cchristous cchristous commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Tolerate transient IBM Cloud VPC API errors (HTTP 429/5xx and network blips) so a single flaky response no longer aborts an image bake. This is done by enabling the IBM Cloud SDK's built-in request retries on the shared VPC service client, which covers every VPC call — one-shot creates (SSH key, instance, image, floating IP, security groups, image-export job) and status polls alike.

Motivating failure: an s390x agent-image bake failed when the SSH-key create call hit a transient 502 Bad Gateway and the plugin gave up after 2.6s with no retry (failing job 01bd515f-538b-4a09-8864-d9eddcbb3522, 2026-06-26). The bake runs automatically per-PR/on-merge, so one transient 502 fails the whole build.

Note: this PR was reworked. It previously hand-rolled a transient/fatal classifier and consecutive-failure counters inside the status-poll loops. It now delegates retries to the SDK (EnableRetries), which is a strict superset of that behavior and also covers the one-shot create calls — the place the original failure actually occurred.

Rationale — why this is the right change

The IBM Cloud Go SDK (go-sdk-core) already ships a transport-level retry mechanism, BaseService.EnableRetries(maxRetries, maxRetryInterval), whose policy (IBMCloudSDKRetryPolicy) retries exactly the transient cases we care about — 429, 5xx except 501, and network-level failures — while also doing things a hand-rolled poll-loop classifier cannot:

  • Honors the Retry-After header (parses both seconds and HTTP-date), so we back off as the API asks during throttling.
  • Exponential backoff for retries, capped at vpcRetryMaxInterval (a server-sent Retry-After is honored as given and is not bounded by it).
  • Excludes genuinely non-retryable network errors (bad scheme, invalid header, TLS-cert failures, too-many-redirects).

Crucially, retries live at the single VPC service construction site (StepCreateVPCServiceInstance, used by both the builder and the export post-processor), so one line — vpcService.EnableRetries(vpcRetryMaxAttempts, vpcRetryMaxInterval) — protects every create and every poll. The original failure was on a one-shot CreateKey, which a poll-loop-only fix would not have covered.

This let us delete the bespoke transientPollError / classifyPollError / isTransientPollError / maxConsecutiveTransientPollFailures machinery and the consecutive-failure counters in pollUntil and waitForExportJobToSucceed (net −400 / +107 lines in the rework). Those loops now simply re-check status on a fixed cadence and surface any genuine error — transient blips are absorbed beneath them by the SDK.

Alternatives considered:

  • Keep the hand-rolled poll-loop classifier (the original approach). Rejected: it only covered poll GETs, not the create calls where the bake actually failed; it duplicated retry logic the SDK already provides; and it ignored Retry-After.
  • Wrap each one-shot create call in a local retry helper. Rejected as a competing, lower-altitude pattern — it re-implements per call site what the SDK does once at the transport layer.

Risk

  • Blast radius: Medium. This plugin bakes agent images; every VPC call is on the critical build path, and EnableRetries affects all of them. But it's a single, well-understood SDK API and the change is otherwise a net deletion.
  • Change criticality: Low–Medium. The happy path is unchanged (a successful call returns on the first attempt). The behavior change is confined to the error path: transient errors are now retried with backoff before surfacing; fatal 4xx still fail fast on the first attempt.
  • Overall: Low–Medium.
  • Mitigations: Retries are bounded (vpcRetryMaxAttempts = 5; exponential backoff capped at vpcRetryMaxInterval = 30s, though a server-sent Retry-After is honored as given); StateTimeout is unchanged and still bounds the overall waits (the poll select returns on timeout even if an SDK retry is in flight); tests cover retry-then-success, give-up-after-cap, fatal-4xx-fast-fail, and the construction wiring; full suite + go vet + gofmt green.

Why this is safe

  • Happy path unchanged: EnableRetries only alters behavior when a request fails transiently; successful calls are unaffected, with no added latency.
  • Fatal errors still fail fast: IBMCloudSDKRetryPolicy does not retry 4xx, so genuine bad requests surface immediately as before.
  • StateTimeout preserved: pollUntil/waitForExportJobToSucceed run their check in a goroutine and select on time.After(timeout); a slow SDK-retrying GET cannot defeat the timeout, and the buffered result channel prevents any send-deadlock.
  • Bounded: at most vpcRetryMaxAttempts retries per call; exponential backoff capped at vpcRetryMaxInterval (a server-sent Retry-After is respected as given).
  • Wiring is guarded by a test: TestStepCreateVPCServiceInstanceEnablesRetries fails if the EnableRetries call is ever removed, so the whole premise can't silently regress.

Pre-Submission Review

  • /pr-review-toolkit:review-pr — run against the rework. The test-coverage analyzer (→ added the EnableRetries wiring test) and the simplification/altitude reviewer (→ confirmed keeping pollUntil's generalization; documented the export-poll test seam) completed and their findings were addressed. Two general code-review passes were also dispatched; they did not return in time, so their scope — overall correctness and the StateTimeout/retry-backoff interaction — was verified directly instead (single construction site covers all paths; timeout still bounds the waits; constants sized sensibly).
  • /simplify — covered by the same parallel run above (reuse/simplification/efficiency/altitude); findings addressed (net deletion, no dead code, generalization retained).

Known behavior changes & caveats

  • Per-request retry budget, not a loop-level streak. The removed hand-rolled poll loop tolerated up to 5 consecutive transient failures across the multi-minute poll (resetting on any success). SDK retries are per request: each status GET retries up to vpcRetryMaxAttempts with backoff, and if a single GET still fails, the wait aborts with that error. In practice this is equivalent-or-better (a blip is retried within seconds rather than waiting for the next 10s poll), but the surfaced error on a genuinely degraded API is the raw last SDK error rather than the old "giving up after N consecutive transient errors" message — slightly less self-describing. Enable vpc_log = "debug" to see the SDK's retry attempts.
  • Retry-After is honored as the server sends it. vpcRetryMaxInterval (30s) caps the exponential backoff but not a server-sent integer Retry-After. The poll loops still return on their own StateTimeout/ExportTimeout (the timeout select fires regardless), so the timeout contract holds; a pathological large Retry-After would only leave the in-flight poll goroutine lingering briefly in the background (bounded by vpcRetryMaxAttempts × Retry-After, buffered result channel — no deadlock). Threading a deadline context into the poll GETs to bound that is a reasonable future hardening but is not needed for correctness here.

The resource-status poll loops aborted the whole build on a single
transient IBM VPC API error (5xx/429 or a network blip), even during
long-running (~48 min) bakes. A flaky GetImage during the post-capture
"wait for AVAILABLE" poll would fail an otherwise-successful build.

Classify poll errors as transient (5xx/429, or no/zero-status response =
network-level failure) vs fatal (4xx). Transient errors are retried up to
maxConsecutiveTransientPollFailures (5) consecutive times, resetting on any
successful poll; fatal errors still abort immediately. The overall
StateTimeout remains the outer bound.

Applied to all three poll loops, which previously treated any error as
fatal: waitForResourceReady/isResourceReady and waitForResourceDown/
isResourceDown (now share a single pollUntil helper), and
waitForExportJobToSucceed (the 5-45 min image-export poll). The
previously-discarded *core.DetailedResponse is captured to read the status
code. A pollInterval seam was added so the loops are unit-testable without
real 10s sleeps.

Signed-off-by: Corey Christous <cchristous@confluent.io>
- Drop ui.Error/log.Println from isResourceDown's error path so a retried
  transient blip no longer prints a misleading [ERROR] line before the
  "retrying" message (matches isResourceReady, which only returns the error).
- Extract sleepOrDone helper to remove the duplicated sleep+select between
  pollUntil and the export-job loop.
- Generalize the maxConsecutiveTransientPollFailures/classifyPollError doc
  comments (they no longer name waitForResourceReady specifically; mention
  429) and fix the "wrapped wrapped" typo.
- Align the export loop's error formatting to %s.
- Add tests: streak-reset (5 transient -> success -> 5 transient -> ready,
  pinning both the cap at 5 and reset-on-success) and transient/fatal
  classification for the floating_ips and subnets branches.

Signed-off-by: Corey Christous <cchristous@confluent.io>
@cchristous cchristous force-pushed the fix/vpc-image-poll-transient-tolerance branch from 8f5b1c4 to c53378a Compare June 23, 2026 15:39
@cchristous cchristous marked this pull request as ready for review June 23, 2026 15:49

@deepakibms deepakibms left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks good to me. I don’t see any blocking issues. One small nit: the comment in waitForExportJobToSucceed() says “Default to 45 minutes for image exports,” but the code actually defaults to 5 minutes. Might be worth correcting that comment for clarity.

@cchristous cchristous changed the title fix(vpc): tolerate transient API errors while polling resource status fix(vpc): tolerate transient API errors via SDK request retries Jun 26, 2026
Replace the hand-rolled poll-loop transient classifier with the IBM Cloud
SDK's built-in request retries (go-sdk-core EnableRetries), enabled once at
VPC service construction (StepCreateVPCServiceInstance, used by both the
builder and the export post-processor). The SDK retries 429/5xx/network
errors, honors Retry-After, and backs off exponentially, covering every VPC
call -- one-shot creates (CreateKey, CreateInstance, ...) and status polls
alike -- so a single transient 502 no longer aborts a bake.

Removes the now-redundant transientPollError / classifyPollError /
isTransientPollError / maxConsecutiveTransientPollFailures machinery and the
consecutive-failure counters in pollUntil and waitForExportJobToSucceed;
those loops now just re-check status on the configured cadence and surface
any genuine error. pollUntil's generalization (shared check func) is kept.

Tests now exercise the SDK retry behavior end-to-end (502->success, give-up
after the configured cap, fatal 4xx fails fast).

Signed-off-by: Corey Christous <cchristous@confluent.io>
- Add TestStepCreateVPCServiceInstanceEnablesRetries: runs the construction
  step and asserts the service has a retryablehttp transport, so removing the
  EnableRetries call (which would silently drop transient tolerance for every
  VPC call) fails the build. The behavioral retry tests enable retries on a
  fresh service themselves and would not catch that regression.
- Document why waitForExportJobToSucceed keeps a pollInterval seam while
  pollUntil hardcodes defaultPollInterval (its multi-poll loop is tested).

Signed-off-by: Corey Christous <cchristous@confluent.io>
…mment

Address review-pass findings:
- The SDK honors a server-sent Retry-After uncapped; vpcRetryMaxInterval
  caps only the exponential backoff. Reword the constant comment so it no
  longer implies Retry-After is bounded by vpcRetryMaxInterval.
- Fix a stale comment claiming a '45 minutes' default export timeout where
  the code uses 5 minutes.

Signed-off-by: Corey Christous <cchristous@confluent.io>
@cchristous cchristous force-pushed the fix/vpc-image-poll-transient-tolerance branch from 1a11fd8 to 0aba9e6 Compare June 26, 2026 14:58
Address final-round review findings:
- Add TestWaitForResourceReadyAbortsOnAPIError and
  TestWaitForResourceReadyReturnsWhenReady so the production pollUntil loop is
  exercised end-to-end (the prior transient tests went through the SDK client
  directly). Both terminate on the first check, so no real poll wait.
- Reword pollUntil's comment: transient errors are retried per-request by the
  SDK (up to vpcRetryMaxAttempts each), not absorbed across the poll loop, so
  an error surfacing here means retries were exhausted or it's fatal.

Signed-off-by: Corey Christous <cchristous@confluent.io>
@deepakibms deepakibms merged commit 9b54e96 into IBM:master Jul 1, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants