fix(operator): halt remaining siblings on stop under continue #349

Open
Kiran01bm wants to merge 2 commits into kiran01bm/operation-state-first-persistence from kiran01bm/stop-halts-continue-siblings

@Kiran01bm Kiran01bm commented Jun 15, 2026

Collaborator

What & why

Stacked on #345 (→ #343). Makes a user stop actually halt a rollout running under on_failure: continue.

The bug: under continue, a failed earlier deployment no longer blocks later siblings — but stop failed to stop them. The operation claim predicate had no stop check, so a continue-exempted pending sibling was still started after the user asked to stop.
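The missing stop check can be illustrated with a small in-memory model. This is only a sketch: the `Apply` struct, the `StopPending` field, and both predicate functions are hypothetical names chosen for illustration, and the real predicate lives in SQL rather than in Go. It also assumes that, outside `continue`, an earlier failure blocks later siblings.

```go
package main

import "fmt"

// State is a hypothetical operation state; the values mirror the PR text.
type State string

const (
	Pending State = "pending"
	Failed  State = "failed"
)

// Apply is a hypothetical in-memory stand-in for an apply and its operations.
type Apply struct {
	OnFailure   string  // "continue" exempts later siblings from an earlier failure
	StopPending bool    // true once the user has asked to stop
	Ops         []State // sibling operation states, in rollout order
}

// claimableBefore models the old predicate: it never looks at the stop.
func claimableBefore(a Apply, i int) bool {
	if a.Ops[i] != Pending {
		return false
	}
	if a.OnFailure == "continue" {
		return true // continue-exempted: earlier failures do not block
	}
	for _, s := range a.Ops[:i] {
		if s == Failed {
			return false // assumed default: an earlier failure blocks siblings
		}
	}
	return true
}

// claimableAfter adds the stop gate in front of the old predicate.
func claimableAfter(a Apply, i int) bool {
	return !a.StopPending && claimableBefore(a, i)
}

func main() {
	a := Apply{OnFailure: "continue", StopPending: true, Ops: []State{Failed, Pending, Pending}}
	fmt.Println(claimableBefore(a, 1), claimableAfter(a, 1)) // true false
}
```

With a stop pending, the old predicate still reports op2 as claimable, which is the bug; the gated predicate does not.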

apply (on_failure: continue),  user issues STOP
op1 = failed     op2 = pending     op3 = pending

BEFORE                                      AFTER
──────                                      ─────
STOP ignored by claim gate                  STOP gates the pending claim
        │                                           │
        ▼                                           ▼
op2 claimed & STARTED  ✗                    op2 / op3 → stopped  (never started)
op3 claimed & STARTED  ✗                            │
        │                                           ▼
        └─ or strands: parent held          parent derives terminal (failed)
           running forever, stop never              │
           completes  ✗                             ▼
                                            stop request completed  ✓

The fix:

  1. Claim gate — a pending operation is un-claimable while its apply has a pending stop (added to the FindNextApplyOperation SELECT and the pending→running UPDATE). Stale-active recovery is untouched, so an in-flight drive still resumes and observes the stop.
  2. Stop reconciliation — terminalize the still-pending siblings to stopped so the aggregate settles terminal, then complete the stop. Driven inline in recoverApplyOperation when an op is being driven, and via a dedicated recoverApplyPendingStop claim path when nothing is active (which would otherwise strand a failed-continue apply).

Only pending rows are touched (never running or terminal ones), completed_at stays nil (stopped is resumable), and writes are apply-lease guarded. This path is dormant until multi-deployment fan-out lands (there is one operation per apply today).
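The stop reconciliation step and the row-level guarantees above can be sketched in memory as follows. The `Op` struct, `reconcileStop`, and `aggregate` are hypothetical names, and the aggregate precedence (any pending or running op keeps the parent running, failed outranks stopped) is an assumption inferred from the diagram, not the operator's actual derivation. Lease guarding is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Op is a hypothetical in-memory stand-in for an operation row.
type Op struct {
	State       string
	CompletedAt *time.Time // left nil for stopped, since stopped is resumable
}

// reconcileStop moves only pending ops to stopped and reports how many changed.
// Running and terminal ops are never touched, and CompletedAt is never set.
func reconcileStop(ops []Op) int {
	changed := 0
	for i := range ops {
		if ops[i].State == "pending" {
			ops[i].State = "stopped"
			changed++
		}
	}
	return changed
}

// aggregate derives the parent state under the assumed precedence.
func aggregate(ops []Op) string {
	sawFailed, sawStopped := false, false
	for _, op := range ops {
		switch op.State {
		case "pending", "running":
			return "running" // not settled while anything can still run
		case "failed":
			sawFailed = true
		case "stopped":
			sawStopped = true
		}
	}
	switch {
	case sawFailed:
		return "failed"
	case sawStopped:
		return "stopped"
	default:
		return "completed"
	}
}

func main() {
	ops := []Op{{State: "failed"}, {State: "pending"}, {State: "pending"}}
	fmt.Println(aggregate(ops)) // running: the gate alone would strand the apply here
	n := reconcileStop(ops)
	fmt.Println(n, aggregate(ops), ops[1].CompletedAt == nil) // 2 failed true
}
```

The first print shows why the claim gate alone is not enough: the gated siblings stay pending, so the parent never settles until reconciliation moves them to stopped.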

When the rollout settles to failed, surface the reason from the first
failed operation row instead of leaving whatever message the last-driven
operation wrote. Under on_failure "continue" the last driver may be a
successful sibling, which would otherwise leave the failed verdict with
no matching reason.

Under on_failure "continue" a stop did not stop the rollout: the
operation claim predicate had no stop check, so a continue-exempted
pending sibling was still started. Gate the pending claim on a pending
stop, then terminalize the pending siblings so the apply settles instead
of being stranded in running with siblings the gate keeps from ever
starting.
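The failure-reason rule from the first commit message above (surface the reason from the first failed operation row when the rollout settles to failed) can be sketched as a small selection function. `Op`, `failureReason`, and the fallback to the last-driven message are hypothetical; the real code presumably reads operation rows from the database.

```go
package main

import "fmt"

// Op is a hypothetical in-memory stand-in for an operation row.
type Op struct {
	State   string
	Message string
}

// failureReason returns the message of the first failed op, in rollout order.
// It falls back to the last-driven message only when no op has failed.
func failureReason(ops []Op, lastDriven string) string {
	for _, op := range ops {
		if op.State == "failed" {
			return op.Message
		}
	}
	return lastDriven
}

func main() {
	ops := []Op{
		{State: "failed", Message: "op1: schema change rejected"},
		{State: "completed", Message: "op2: done"},
	}
	// Under continue, the last driver may be the successful sibling op2.
	fmt.Println(failureReason(ops, "op2: done")) // op1: schema change rejected
}
```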
@Kiran01bm Kiran01bm marked this pull request as ready for review June 15, 2026 07:02
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 15, 2026 07:02
Base automatically changed from kiran01bm/aggregate-error-message to kiran01bm/operation-state-first-persistence June 15, 2026 11:54