Skip to content

feat: retry annotation value selects retry mode (retry|skip)#42

Merged
LittleChimera merged 5 commits into
mainfrom
feat/retry-annotation
Apr 28, 2026
Merged

feat: retry annotation value selects retry mode (retry|skip)#42
LittleChimera merged 5 commits into
mainfrom
feat/retry-annotation

Conversation

@LittleChimera
Copy link
Copy Markdown
Contributor

No description provided.

LittleChimera and others added 5 commits April 22, 2026 07:51
Add rollout.kuberik.com/retry annotation to re-run a failed deployment:
- Rollout controller resets History[0].BakeStatus from Failed to Deploying
  and records LastRetryTimestamp
- HealthCheck controller resets status when LastRetryTimestamp is newer
  than LastErrorTime / LastChangeTime
- KustomizationHealth reclassifies stale pre-retry failure conditions
  (LastTransitionTime < LastRetryTimestamp) as Pending, preventing
  bogus errors during the brief unwind window

Double-retry is a no-op: handler only acts when BakeStatus == Failed.
The annotation value now carries how to handle stale-failed RolloutTests:
  "retry" (default): re-run failed tests
  "skip":            mark failed tests as Skipped, advance without re-running

Mode is persisted on History[0].LastRetryMode so downstream controllers
(openkruise stepgate) can read it after the annotation is cleared.

Also fold the Phase 4 integration tests (end-to-end retry, multi-cycle
retries) into the retry test suite.
kustomizationhealth now stores actual condition timestamps in LastErrorTime
instead of "now". Rollout controller computes errorCutoff = max(deployTime,
retryTime) and compares LastErrorTime against that, removing the need for
isStaleFailure / findLastRetryTimestampForHealthCheck in kustomizationhealth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove RetryMode from API; retry annotation is now value-agnostic

The mode of retry (re-run vs skip) belongs to the domain that owns the
tests, not to the rollout controller. Removing LastRetryMode field and
RetryModeRetry/Skip constants makes the rollout-controller API generic —
it only stamps LastRetryTimestamp, which is the cross-controller contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use errorCutoff (not deployTime) for deploy-timeout and canStartBake gates

After retry, errorCutoff = max(deployTime, lastRetryTimestamp). Three call
sites were still using raw deployTime, causing incorrect behavior post-retry:

1. rollout_controller.go:1625 - deploy-timeout deadline: if the original
   deploy was already past DeployTimeout when the retry happened, the next
   reconcile immediately marked the rollout Failed with "deploy timeout"
   before any health check had a chance to re-evaluate.

2. rollout_controller.go:1743 - canStartBake gate: health checks with
   LastChangeTime between deployTime and lastRetryTimestamp passed the gate,
   meaning bake could start based on a pre-retry health report.

3. rollout_controller.go:1643 - collectUnhealthyHealthChecks: FailedHealthChecks
   on the deploy-timeout path included pre-retry stale failures in the record.

Also fix kustomizationhealth_controller.go:354 - when a kustomization is
Unhealthy but getFailureConditionTime returns nil (no conditions on the managed
resource), LastErrorTime was not updated, leaving a stale or nil timestamp.
The rollout controller compares LastErrorTime against errorCutoff to detect
fresh failures; a stale timestamp causes the failure to be silently ignored
until the next healthy health-check cycle that does return a timestamp.
Fall back to now when errorTime is nil so there is always a valid timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: cover errorCutoff gates and nil-errorTime fallback

Three new errorCutoff tests (retry_test.go):
- deploy-timeout window resets after retry — no instant failure if original deadline elapsed
- canStartBake blocks when health check LastChangeTime predates retry timestamp
- FailedHealthChecks records health checks whose LastChangeTime predates retry

Two nil-errorTime tests (kustomizationhealth_controller_test.go):
- LastErrorTime is set to now when errorTime is nil (resource has no condition timestamps)
- LastErrorTime uses provided errorTime when non-nil

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* lint

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@LittleChimera LittleChimera merged commit 100ab0e into main Apr 28, 2026
3 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant