feat: retry annotation value selects retry mode (retry|skip)#42
Merged
Conversation
Add rollout.kuberik.com/retry annotation to re-run a failed deployment: - Rollout controller resets History[0].BakeStatus from Failed to Deploying and records LastRetryTimestamp - HealthCheck controller resets status when LastRetryTimestamp is newer than LastErrorTime / LastChangeTime - KustomizationHealth reclassifies stale pre-retry failure conditions (LastTransitionTime < LastRetryTimestamp) as Pending, preventing bogus errors during the brief unwind window Double-retry is a no-op: handler only acts when BakeStatus == Failed.
The annotation value now carries how to handle stale-failed RolloutTests: "retry" (default): re-run failed tests "skip": mark failed tests as Skipped, advance without re-running Mode is persisted on History[0].LastRetryMode so downstream controllers (openkruise stepgate) can read it after the annotation is cleared. Also fold the Phase 4 integration tests (end-to-end retry, multi-cycle retries) into the retry test suite.
kustomizationhealth now stores actual condition timestamps in LastErrorTime instead of "now". Rollout controller computes errorCutoff = max(deployTime, retryTime) and compares LastErrorTime against that, removing the need for isStaleFailure / findLastRetryTimestampForHealthCheck in kustomizationhealth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove RetryMode from API; retry annotation is now value-agnostic The mode of retry (re-run vs skip) belongs to the domain that owns the tests, not to the rollout controller. Removing LastRetryMode field and RetryModeRetry/Skip constants makes the rollout-controller API generic — it only stamps LastRetryTimestamp, which is the cross-controller contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: use errorCutoff (not deployTime) for deploy-timeout and canStartBake gates After retry, errorCutoff = max(deployTime, lastRetryTimestamp). Three call sites were still using raw deployTime, causing incorrect behavior post-retry: 1. rollout_controller.go:1625 - deploy-timeout deadline: if the original deploy was already past DeployTimeout when the retry happened, the next reconcile immediately marked the rollout Failed with "deploy timeout" before any health check had a chance to re-evaluate. 2. rollout_controller.go:1743 - canStartBake gate: health checks with LastChangeTime between deployTime and lastRetryTimestamp passed the gate, meaning bake could start based on a pre-retry health report. 3. rollout_controller.go:1643 - collectUnhealthyHealthChecks: FailedHealthChecks on the deploy-timeout path included pre-retry stale failures in the record. Also fix kustomizationhealth_controller.go:354 - when a kustomization is Unhealthy but getFailureConditionTime returns nil (no conditions on the managed resource), LastErrorTime was not updated, leaving a stale or nil timestamp. The rollout controller compares LastErrorTime against errorCutoff to detect fresh failures; a stale timestamp causes the failure to be silently ignored until the next healthy health-check cycle that does return a timestamp. Fall back to now when errorTime is nil so there is always a valid timestamp. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test: cover errorCutoff gates and nil-errorTime fallback Three new errorCutoff tests (retry_test.go): - deploy-timeout window resets after retry — no instant failure if original deadline elapsed - canStartBake blocks when health check LastChangeTime predates retry timestamp - FailedHealthChecks records health checks whose LastChangeTime predates retry Two nil-errorTime tests (kustomizationhealth_controller_test.go): - LastErrorTime is set to now when errorTime is nil (resource has no condition timestamps) - LastErrorTime uses provided errorTime when non-nil Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * lint --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.