Skip to content

fix(vpc): retry transient API errors on one-shot create calls#2

Closed
cchristous wants to merge 2 commits into
fix/vpc-image-poll-transient-tolerancefrom
fix/vpc-create-transient-retry
Closed

fix(vpc): retry transient API errors on one-shot create calls#2
cchristous wants to merge 2 commits into
fix/vpc-image-poll-transient-tolerancefrom
fix/vpc-create-transient-retry

Conversation

@cchristous

Copy link
Copy Markdown
Owner

Summary

Transient API failures (HTTP 5xx/429 or network blips) on one-shot create/mutating VPC calls previously aborted the whole image bake on the first error. This wraps those calls in a bounded retry, reusing the transient/fatal classifier introduced in IBM#156 for the polling loops.

This is a follow-on to IBM#156 and is stacked on its branch (fix/vpc-image-poll-transient-tolerance) so it can reuse isTransientPollError / maxConsecutiveTransientPollFailures rather than duplicate them. Rebase onto master once IBM#156 merges.

Motivating failure: an s390x agent-image bake failed when the SSH-key create call hit a transient 502 Bad Gateway and the plugin gave up after 2.6s with no retry (failing job 01bd515f-538b-4a09-8864-d9eddcbb3522, 2026-06-26). The bake runs automatically per-PR/on-merge, so one transient 502 fails the whole build.

==> Creating a new SSH key for VPC...
[ERROR] Error sending the HTTP request that creates the SSH Key for VPC. Error: Bad Gateway
Build 'ibmcloud-vpc.agent' errored after 2 seconds 665 milliseconds

Rationale — why this is the right change

IBM#156 added transient-error tolerance only inside the status-poll loops (pollUntil, isResourceReady/isResourceDown, the image-export poll), via a transient = 5xx/429/network, fatal = 4xx classifier (isTransientPollError) and a consecutive-failure cap (maxConsecutiveTransientPollFailures = 5). It deliberately did not touch the one-shot create calls, so CreateKey, CreateInstance, etc. still failed on the first transient error.

This PR closes that gap using the same mechanism, so the create path and the poll path treat the same errors the same way:

  • Adds retryTransient(state, action, op) next to the poll machinery in client.go. It reuses the existing isTransientPollError classifier and maxConsecutiveTransientPollFailures cap, retries transient failures with the existing pollInterval backoff, and fails fast on fatal 4xx. On exhaustion it folds the give-up reason into the returned error (mirroring the poll loop) so an operator can tell a flaky/throttled failure apart from a hard 4xx.
  • Wraps the one-shot create/mutating calls: CreateKey, CreateInstanceAction, CreateFloatingIP, CreateSecurityGroup, CreateSecurityGroupRule, CreateSecurityGroupTargetBinding (client.go), CreateImage (step_capture_image.go), CreateImageExportJob (step_image_export.go), and CreateInstance (all four instance-prototype branches in step_create_instance.go, collapsed into a shared createInstanceWithRetry helper).
  • Extracts effectivePollInterval() so the poll and create paths share one interval-resolution seam.

Alternatives considered:

  • SDK-native EnableRetries on the shared vpcv1.VpcV1 client (transport-level retries, honors Retry-After, idempotency-aware network-error exclusions). This is a better long-term altitude and would delete retryTransient, but it replaces fix(vpc): tolerate transient API errors via SDK request retries IBM/packer-plugin-ibmcloud#156's hand-rolled approach with a competing pattern, changes retry behavior for the poll GETs too, and is a larger change than this targeted fix warrants. Recommended as a follow-up to converge the poll and create paths onto one primitive.
  • Wrapping the cleanup Delete* calls. Out of scope: those run in Cleanup and don't abort the bake.

Risk

  • Blast radius: Medium. The IBM Cloud VPC Packer plugin is the tool that bakes agent images; every wrapped call is on the critical build path. But the change is confined to one package and is additive — it only alters the behavior of the error path.
  • Change criticality: Low–Medium. The happy path is unchanged (on success op() returns immediately, identical to the prior one-shot call). The only behavioral change is on error: transient errors now retry up to 5× with backoff before surfacing the same error; fatal 4xx behavior and StateTimeout are unchanged.
  • Overall: Low–Medium.
  • Mitigations: Reuses the already-reviewed fix(vpc): tolerate transient API errors via SDK request retries IBM/packer-plugin-ibmcloud#156 classifier/cap; fatal 4xx still fails fast on first occurrence; retries are bounded (1 initial + 5 retries) with a clear give-up error; unit + httptest integration tests cover retry-then-success, fatal-fast-fail, the cap, network-blip (nil response), and a fatal-mid-retry short-circuit; full suite (133 tests), go vet, and gofmt are green.

Known trade-off (documented in code)

isTransientPollError treats an error with no HTTP response (nil DetailedResponse) as transient — that's required to retry genuine network blips, but it also means a deterministic SDK client-side validation error (also nil response) would be retried ~50s before failing. The impact is bounded (capped, returns the same error) and only occurs on a malformed request that would fail the build regardless. For non-idempotent creates, a network blip after the server processed the request could in principle duplicate a resource; in practice these calls use fixed resource names, so a true duplicate retry hits a name conflict and fails fast as a 4xx rather than orphaning resources. Both points are noted in the retryTransient doc comment.

Why this is safe

The mitigations cover the identified risk directly:

  • Happy path is byte-for-byte equivalent to before — retryTransient calls op() once and returns on err == nil, so a healthy build sees no new behavior and no added latency.
  • Fatal errors are unchanged — non-transient (4xx) errors return on the first occurrence, preserving the existing fast-fail and error messages; each caller keeps its existing fmt.Errorf("[ERROR] ...") wrapping and state.Put("error", ...) handling.
  • Bounded and observable — retries are capped at the same constant fix(vpc): tolerate transient API errors via SDK request retries IBM/packer-plugin-ibmcloud#156 uses; each retry emits a ui.Say, and exhaustion produces an explicit "giving up after N consecutive transient errors" message in the build output.
  • Tested — the retry contract is pinned by unit tests and proven end-to-end through a real createSecurityGroup call against an httptest server, mirroring client_test.go's existing style.

The change reuses the same classifier and cap that already passed review in IBM#156, so the create path inherits that path's vetted transient/fatal semantics rather than introducing a second, divergent definition.

Pre-Submission Review

  • /pr-review-toolkit:review-pr — run (6 parallel agents: 3× code-reviewer, silent-failure-hunter, pr-test-analyzer, comment-analyzer). Findings addressed: (1) folded retry-exhaustion into the returned error so it's visible in the UI, not just logged; (2) added network-blip and fatal-mid-retry test cases; (3) documented the SDK-validation-error / duplicate-create trade-off. The "move to SDK EnableRetries" altitude finding is intentionally deferred (see Alternatives) and called out above as a follow-up.
  • /simplify — run (4 parallel cleanup agents: reuse, simplification, efficiency, altitude). Findings addressed: collapsed all four CreateInstance branches into createInstanceWithRetry (removing copy-paste) and extracted the shared effectivePollInterval() helper.

Transient API failures (HTTP 5xx/429 or network blips) on one-shot
create/mutating VPC calls previously aborted the whole bake on the first
error. PR IBM#156 added transient-error tolerance to the status-poll loops
but left the create calls unprotected, so a single 502 on e.g. SSH-key
creation failed the build (job 01bd515f-538b-4a09-8864-d9eddcbb3522).

Add a retryTransient helper that reuses IBM#156's isTransientPollError
classifier and maxConsecutiveTransientPollFailures cap, retrying
transient failures with the existing pollInterval backoff while still
failing fast on fatal 4xx errors. Wrap the one-shot create/mutating
calls: CreateKey, CreateInstanceAction, CreateFloatingIP,
CreateSecurityGroup, CreateSecurityGroupRule,
CreateSecurityGroupTargetBinding, CreateImage, CreateImageExportJob and
CreateInstance.
- Fold retry-exhaustion into the returned error (mirroring the poll
  loop) so an exhausted-transient failure is distinguishable from a
  fast-failed 4xx in the build output, instead of only being logged.
- Extract createInstanceWithRetry and route all four instance-prototype
  branches through it, removing the byte-identical copy-paste blocks.
- Extract effectivePollInterval shared by pollUntil and retryTransient.
- Document the duplicate-create consideration for non-idempotent calls
  (fixed resource names make a true duplicate fail fast as a 4xx).
- Add tests: network-blip (nil response) retry, and a fatal error
  encountered mid-retry short-circuits without running out the cap.
@cchristous cchristous closed this Jun 26, 2026
@cchristous cchristous deleted the fix/vpc-create-transient-retry branch June 26, 2026 14:16
@cchristous

Copy link
Copy Markdown
Owner Author

Superseded by IBM#156, which now delivers transient-error tolerance for all VPC calls — including the one-shot creates this PR wrapped — via the IBM Cloud SDK's built-in request retries (EnableRetries), rather than a hand-rolled retryTransient helper.

That approach is a strict superset of this one (it also covers status polls, honors Retry-After, and backs off exponentially) and removes ~400 lines of bespoke retry machinery instead of adding more. Closing this in favor of IBM#156.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant