🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded#2610
🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded#2610camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Updates the ClusterObjectSet controller’s update-event predicate so that once a ClusterObjectSet hits ProgressDeadlineExceeded, spec-driven updates (identified via metadata.generation changes) are still reconciled while status-only updates remain suppressed—preventing resources from getting stuck and unable to transition (e.g., to Archived).
Changes:
- Adjust
skipProgressDeadlineExceededPredicateto allow updates whenObjectOld.Generation != ObjectNew.Generation. - Preserve the existing behavior of suppressing status-only update churn after
ProgressDeadlineExceeded.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2610 +/- ##
==========================================
+ Coverage 68.84% 68.88% +0.03%
==========================================
Files 139 139
Lines 9902 9895 -7
==========================================
- Hits 6817 6816 -1
+ Misses 2573 2570 -3
+ Partials 512 509 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6b78b23 to
426beb1
Compare
| func (c *ClusterObjectSetReconciler) SetupWithManager(mgr ctrl.Manager) error { | ||
| skipProgressDeadlineExceededPredicate := predicate.Funcs{ | ||
| UpdateFunc: func(e event.UpdateEvent) bool { | ||
| rev, ok := e.ObjectNew.(*ocv1.ClusterObjectSet) |
There was a problem hiding this comment.
I would extract the whole update func into standalone function, not just the logic under shouldAllowProgressDeadlineExceededUpdate
There was a problem hiding this comment.
@pedjak
I think we might can remove the predicate.
See the current proposal now.
WDYT? I tried to update the PR desc as well
426beb1 to
59eaf3d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Maybe I misunderstand the progress deadline exceeded stuff. But I thought exceeding the deadline literally only meant that we set Progressing=False? But otherwise all other functionality and decision making around reconciliation of the CE works the same as it does prior to the deadline exceeding. If that's not the case, why? |
59eaf3d to
b2da731
Compare
b2da731 to
bf9e524
Compare
|
Hi @joelanford I changed the proposal here. I think it might be a better way to solve it. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterobjectset_controller_internal_test.go
Outdated
Show resolved
Hide resolved
Generated-by: Cursor:Claude
bf9e524 to
38e7024
Compare
pedjak
left a comment
There was a problem hiding this comment.
Three suggestions:
-
Add a comment explaining the
ReasonSucceededcarve-out. The guard usesreason != ReasonSucceeded, which means any new reason added in the future is silently blocked by default. That's the safe default, but it's a hidden constraint — a short comment explaining why onlySucceededis exempt would prevent a future contributor from accidentally breaking this. -
Add a debug log in the early return. When the guard fires, it silently swallows a state transition with zero signal. A
V(1)log line like"skipping markAsProgressing: ProgressDeadlineExceeded is sticky"would help debugging when someone is troubleshooting why a COS condition isn't updating. -
Consider an integration test for the archival scenario (follow-up). The unit tests cover
markAsProgressingwell, but the motivating bug (archival blocked after deadline exceeded) isn't tested e2e. A test simulating deadline exceeded →lifecycleState: Archived→ verify reconcile proceeds would cover the actual user-facing scenario.
How it works on main
The COS reconciler checks if a revision took too long to roll out. If it did, it sets
Progressing=False/ProgressDeadlineExceeded.To avoid a reconcile loop after that,
mainhas a watch predicate that blocks allupdates to COS objects with ProgressDeadlineExceeded. The predicate only allows
deletions through.
What is the problem?
The predicate blocks too much. It blocks all updates, not just status updates.
Example: a new revision rolls out and tries to archive the old stuck one by patching
lifecycleState: Archived. The predicate sees ProgressDeadlineExceeded and drops theevent. The old revision never gets archived.
Example scenario
ProgressDeadlineMinutes, the reconciler setsProgressing=False/ProgressDeadlineExceeded.lifecycleState: Archivedso the old revision gets cleaned up.
COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.
Why does the reconcile loop happen?
Every time the reconciler runs on a deadline-exceeded COS, it calls
markAsProgressing()which sets
Progressing=True. Then the deadline check immediately sets it back toProgressing=False/ProgressDeadlineExceeded. This back-and-forth produces a statusupdate, which triggers another reconcile, and so on.
The predicate was added to break this loop. But it also breaks archiving.
The fix
Make
markAsProgressing()not overwriteProgressDeadlineExceeded. Once the deadlineis exceeded, the Progressing condition stays put.
No flip-flop means no unnecessary status updates. No unnecessary status updates means
no loop. No loop means the predicate is not needed. So it is removed.
Everything else works the same: archiving, deletion, object reconciliation, error retries.
Alternative solution:
Keep the predicate but made it smarter: it extracted it into a named function skipProgressDeadlineExceededUpdateFunc and added a generation check. If the generation changed (meaning a spec change like lifecycleState: Archived), it allowed the event through. Only status-only updates (same generation) were suppressed.
Motivation
Addresses feedback from:
openshift/operator-framework-operator-controller#687 (comment)