🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded #2610

Open
camilamacedo86 wants to merge 1 commit into operator-framework:main from camilamacedo86:fix-transitions

Conversation

Contributor

@camilamacedo86 camilamacedo86 commented Mar 31, 2026

How it works on main

The COS reconciler checks if a revision took too long to roll out. If it did, it sets
Progressing=False/ProgressDeadlineExceeded.

To avoid a reconcile loop after that, main has a watch predicate that blocks all
updates to COS objects with ProgressDeadlineExceeded. The predicate only allows
deletions through.

What is the problem?

The predicate blocks too much. It blocks all updates, not just status updates.

Example: a new revision rolls out and tries to archive the old stuck one by patching
lifecycleState: Archived. The predicate sees ProgressDeadlineExceeded and drops the
event. The old revision never gets archived.
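The over-blocking can be illustrated with a minimal sketch. This is not the code on main: `ObjectSet` and `UpdateEvent` are hypothetical stand-ins for `ocv1.ClusterObjectSet` and controller-runtime's `event.UpdateEvent`, and `blockAllAfterDeadline` is an invented name, so that the example runs on its own.

```go
package main

import "fmt"

// ObjectSet is a hypothetical stand-in for ocv1.ClusterObjectSet.
type ObjectSet struct {
	Name              string
	LifecycleState    string // spec field, e.g. "Active" or "Archived"
	ProgressingReason string // reason of the Progressing condition
}

// UpdateEvent is a hypothetical stand-in for controller-runtime's event.UpdateEvent.
type UpdateEvent struct {
	Old, New ObjectSet
}

const reasonProgressDeadlineExceeded = "ProgressDeadlineExceeded"

// blockAllAfterDeadline mimics the predicate described above: once the new
// object carries ProgressDeadlineExceeded, every update event is dropped,
// regardless of whether status or spec changed.
func blockAllAfterDeadline(e UpdateEvent) bool {
	return e.New.ProgressingReason != reasonProgressDeadlineExceeded
}

func main() {
	stuck := ObjectSet{
		Name:              "cos-rev-1",
		LifecycleState:    "Active",
		ProgressingReason: reasonProgressDeadlineExceeded,
	}
	archived := stuck
	archived.LifecycleState = "Archived" // spec change made by the newer revision

	e := UpdateEvent{Old: stuck, New: archived}
	fmt.Printf("archival update for %s reconciled: %v\n", e.New.Name, blockAllAfterDeadline(e))
}
```

In this sketch the archival update is filtered out even though it changes the spec, which is the behavior this PR sets out to fix.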

Example scenario

  1. User installs a ClusterExtension. The CE controller creates COS-rev-1.
  2. COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After
    ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
  3. User updates the ClusterExtension. The CE controller creates COS-rev-2.
  4. COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived
    so the old revision gets cleaned up.
  5. The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
    COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.

Why does the reconcile loop happen?

Every time the reconciler runs on a deadline-exceeded COS, it calls markAsProgressing()
which sets Progressing=True. Then the deadline check immediately sets it back to
Progressing=False/ProgressDeadlineExceeded. This back-and-forth produces a status
update, which triggers another reconcile, and so on.
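A rough simulation of that back-and-forth, under stated assumptions: `Condition` and `reconcileOnce` are hypothetical, the reason name `RollingOut` is invented for illustration, and the sketch only counts in-memory changes to the condition rather than real status writes.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the Progressing status condition.
type Condition struct {
	Status string
	Reason string
}

// reconcileOnce mimics one reconcile pass on a deadline-exceeded object and
// returns how many times the Progressing condition changed during the pass.
func reconcileOnce(c *Condition, deadlineExceeded bool) int {
	changes := 0
	set := func(status, reason string) {
		if c.Status != status || c.Reason != reason {
			c.Status, c.Reason = status, reason
			changes++
		}
	}
	// Step 1: the unconditional "mark as progressing" call.
	set("True", "RollingOut") // reason name is an assumption
	// Step 2: the deadline check flips the condition straight back.
	if deadlineExceeded {
		set("False", "ProgressDeadlineExceeded")
	}
	return changes
}

func main() {
	cond := &Condition{Status: "False", Reason: "ProgressDeadlineExceeded"}
	for i := 1; i <= 3; i++ {
		n := reconcileOnce(cond, true)
		fmt.Printf("reconcile %d: %d condition changes, ends as %s/%s\n", i, n, cond.Status, cond.Reason)
	}
}
```

Every pass flips the condition twice and ends where it started, so the cycle never settles on its own.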

The predicate was added to break this loop. But it also breaks archiving.

The fix

Make markAsProgressing() not overwrite ProgressDeadlineExceeded. Once the deadline
is exceeded, the Progressing condition stays put.

No flip-flop means no unnecessary status updates. No unnecessary status updates means
no loop. No loop means the predicate is not needed. So it is removed.

Everything else works the same: archiving, deletion, object reconciliation, error retries.
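A minimal sketch of the sticky behavior, assuming a simplified condition type. The signature of `markAsProgressing` here is hypothetical and does not match the real helper; the `RollingOut` reason is likewise invented.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the Progressing status condition.
type Condition struct {
	Status string
	Reason string
}

const reasonProgressDeadlineExceeded = "ProgressDeadlineExceeded"

// markAsProgressing sets Progressing=True with the given reason, unless the
// deadline was already exceeded. It reports whether the condition changed.
func markAsProgressing(c *Condition, reason string) bool {
	if c.Status == "False" && c.Reason == reasonProgressDeadlineExceeded {
		// Sticky: leave the condition untouched, so no status update
		// is produced and no follow-up reconcile is triggered.
		return false
	}
	c.Status, c.Reason = "True", reason
	return true
}

func main() {
	stuck := &Condition{Status: "False", Reason: reasonProgressDeadlineExceeded}
	fmt.Println("stuck revision updated:", markAsProgressing(stuck, "RollingOut"), "->", *stuck)

	fresh := &Condition{}
	fmt.Println("fresh revision updated:", markAsProgressing(fresh, "RollingOut"), "->", *fresh)
}
```

With the guard in place, repeated reconciles of a deadline-exceeded object leave the condition unchanged, which is what makes the predicate unnecessary.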

Alternative considered:

An earlier iteration kept the predicate but made it smarter: the update logic was extracted into a named function, skipProgressDeadlineExceededUpdateFunc, with an added generation check. If the generation changed (meaning a spec change such as lifecycleState: Archived), the event was allowed through. Only status-only updates (same generation) were suppressed.
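For comparison, the generation check could look roughly like the sketch below. `ObjectSet`, `UpdateEvent`, and `allowUpdate` are hypothetical stand-ins rather than the real controller-runtime and `ocv1` types, and the sketch relies on `metadata.generation` changing only when the spec changes.

```go
package main

import "fmt"

// ObjectSet is a hypothetical stand-in for ocv1.ClusterObjectSet.
type ObjectSet struct {
	Generation        int64  // mirrors metadata.generation
	ProgressingReason string // reason of the Progressing condition
}

// UpdateEvent is a hypothetical stand-in for controller-runtime's event.UpdateEvent.
type UpdateEvent struct {
	Old, New ObjectSet
}

const reasonProgressDeadlineExceeded = "ProgressDeadlineExceeded"

// allowUpdate reports whether an update event should reach the reconciler.
func allowUpdate(e UpdateEvent) bool {
	if e.New.ProgressingReason != reasonProgressDeadlineExceeded {
		return true // deadline not exceeded: nothing to suppress
	}
	// Deadline exceeded: let spec changes through, drop status-only churn.
	return e.Old.Generation != e.New.Generation
}

func main() {
	stuck := ObjectSet{Generation: 1, ProgressingReason: reasonProgressDeadlineExceeded}

	statusOnly := UpdateEvent{Old: stuck, New: stuck}
	specChange := UpdateEvent{Old: stuck, New: ObjectSet{Generation: 2, ProgressingReason: reasonProgressDeadlineExceeded}}

	fmt.Println("status-only update allowed:", allowUpdate(statusOnly))
	fmt.Println("spec change allowed:", allowUpdate(specChange))
}
```

This variant keeps the loop protection but still needs the predicate, whereas the fix above removes the cause of the loop.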

Motivation

Addresses feedback from:
openshift/operator-framework-operator-controller#687 (comment)

Copilot AI review requested due to automatic review settings March 31, 2026 05:30

netlify bot commented Mar 31, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 38e7024
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69ccba585084f6000802cdda
😎 Deploy Preview https://deploy-preview-2610--olmv1.netlify.app


openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Mar 31, 2026
Contributor

Copilot AI left a comment

Pull request overview

Updates the ClusterObjectSet controller’s update-event predicate so that once a ClusterObjectSet hits ProgressDeadlineExceeded, spec-driven updates (identified via metadata.generation changes) are still reconciled while status-only updates remain suppressed—preventing resources from getting stuck and unable to transition (e.g., to Archived).

Changes:

  • Adjust skipProgressDeadlineExceededPredicate to allow updates when ObjectOld.Generation != ObjectNew.Generation.
  • Preserve the existing behavior of suppressing status-only update churn after ProgressDeadlineExceeded.


codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.88%. Comparing base (c304741) to head (38e7024).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2610      +/-   ##
==========================================
+ Coverage   68.84%   68.88%   +0.03%     
==========================================
  Files         139      139              
  Lines        9902     9895       -7     
==========================================
- Hits         6817     6816       -1     
+ Misses       2573     2570       -3     
+ Partials      512      509       -3     

| Flag | Coverage Δ |
| --- | --- |
| e2e | 37.71% <0.00%> (+0.05%) ⬆️ |
| experimental-e2e | 52.21% <100.00%> (-0.02%) ⬇️ |
| unit | 53.62% <100.00%> (+0.09%) ⬆️ |

func (c *ClusterObjectSetReconciler) SetupWithManager(mgr ctrl.Manager) error {
	skipProgressDeadlineExceededPredicate := predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			rev, ok := e.ObjectNew.(*ocv1.ClusterObjectSet)
Contributor

I would extract the whole update func into a standalone function, not just the logic under shouldAllowProgressDeadlineExceededUpdate.

Contributor Author

Done

Contributor Author

@pedjak
I think we might be able to remove the predicate entirely.
See the current proposal.
WDYT? I tried to update the PR description as well.

Copilot AI review requested due to automatic review settings March 31, 2026 07:32
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


@joelanford
Member

Maybe I misunderstand the progress deadline exceeded stuff. But I thought exceeding the deadline literally only meant that we set Progressing=False?

But otherwise all other functionality and decision making around reconciliation of the CE works the same as it does prior to the deadline exceeding.

If that's not the case, why?

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 1, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
@camilamacedo86 camilamacedo86 changed the title WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 1, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
Copilot AI review requested due to automatic review settings April 1, 2026 06:16
@camilamacedo86
Contributor Author

Hi @joelanford

I changed the proposal here. I think it might be a better way to solve it.
Also, with Claude's help, I updated the PR description to explain the new approach.
I tried to answer your questions there. Thank you for your time.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Contributor

@pedjak pedjak left a comment

Three suggestions:

  1. Add a comment explaining the ReasonSucceeded carve-out. The guard uses reason != ReasonSucceeded, which means any new reason added in the future is silently blocked by default. That's the safe default, but it's a hidden constraint — a short comment explaining why only Succeeded is exempt would prevent a future contributor from accidentally breaking this.

  2. Add a debug log in the early return. When the guard fires, it silently swallows a state transition with zero signal. A V(1) log line like "skipping markAsProgressing: ProgressDeadlineExceeded is sticky" would help debugging when someone is troubleshooting why a COS condition isn't updating.

  3. Consider an integration test for the archival scenario (follow-up). The unit tests cover markAsProgressing well, but the motivating bug (archival blocked after deadline exceeded) isn't tested e2e. A test simulating deadline exceeded → lifecycleState: Archived → verify reconcile proceeds would cover the actual user-facing scenario.
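Suggestions 1 and 2 might look roughly like the following sketch. The signature of `markAsProgressing`, the `debugf` callback (standing in for a `V(1)` log call), and the stated rationale for exempting Succeeded are all assumptions made for illustration, not taken from the PR's code.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the Progressing status condition.
type Condition struct {
	Status string
	Reason string
}

const (
	reasonProgressDeadlineExceeded = "ProgressDeadlineExceeded"
	reasonSucceeded                = "Succeeded"
)

// markAsProgressing updates the Progressing condition unless the deadline
// was already exceeded. debugf stands in for a V(1) debug logger.
func markAsProgressing(c *Condition, status, reason string, debugf func(format string, args ...any)) {
	// ProgressDeadlineExceeded is sticky: overwriting it would bring back
	// the status flip-flop and the reconcile loop. Only Succeeded is exempt
	// (assumed rationale: a revision that finally rolls out must still be
	// able to report success). Any reason added later is blocked by default.
	if c.Reason == reasonProgressDeadlineExceeded && reason != reasonSucceeded {
		debugf("skipping markAsProgressing: ProgressDeadlineExceeded is sticky (requested reason %q)", reason)
		return
	}
	c.Status, c.Reason = status, reason
}

func main() {
	logf := func(format string, args ...any) {
		fmt.Printf("V(1) "+format+"\n", args...)
	}

	cond := &Condition{Status: "False", Reason: reasonProgressDeadlineExceeded}
	markAsProgressing(cond, "True", "RollingOut", logf) // blocked and logged
	fmt.Println("after RollingOut:", *cond)

	markAsProgressing(cond, "True", reasonSucceeded, logf) // allowed through
	fmt.Println("after Succeeded:", *cond)
}
```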

4 participants