fix(operator): persist operation state from its own tasks on drive#343
Open
Kiran01bm wants to merge 2 commits into
Open
fix(operator): persist operation state from its own tasks on drive#343Kiran01bm wants to merge 2 commits into
Kiran01bm wants to merge 2 commits into
Conversation
Under on_failure "continue" the parent apply is held running while siblings settle, so mirroring the parent down left a failed operation claimable and never recorded its failure. Derive the operation's result from its own tasks instead, then re-derive the parent from the operation rows. Fan-out-gated fix surfaced by #337 review.
There was a problem hiding this comment.
Pull request overview
Adjusts the operator drive/recovery flow so each apply_operations row is persisted from the operation’s own task outcomes (instead of mirroring the parent apply state), preventing re-claim/re-drive loops and stale sibling-gating under on_failure: continue when the parent apply projection is intentionally held running.
Changes:
- Persist operation state from its own tasks via
markOperationFromOwnResult, then reload and re-derive the parent apply state from operation rows. - Extract shared state→row-write mapping into
persistOperationState, used by both the drive path and terminal-parent reconciliation path. - Add unit tests covering (a) failed op recorded failed independent of parent state and (b) non-terminal op left claimable.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/api/operator.go | Drive path now persists apply_operations from the operation’s task-derived result before re-deriving the parent; adds shared persistence helper + task-error extraction. |
| pkg/api/operator_test.go | Adds unit tests validating failed/non-terminal behavior for the new own-result persistence path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Task rows carry task-vocabulary states; compare them with state.Task.* instead of state.Apply.* so they don't silently drift if the vocabularies diverge.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What and Why ?
When an apply fans out to multiple deployments under
on_failure: continue, the parent apply is heldrunninguntil every sibling settles. The drive path mirrored the parent apply state onto the just-driven operation row, so while the parent was heldrunninga terminally failed deployment hit the non-terminal "leave claimable" branch: its failure was never recorded, the row was re-claimed and re-driven on every poll, and the deployment-order gate (which keys off an earlier sibling'sfailedstate undercontinue) read a stale value.This drives each operation off its own result. After a successful resume the operation row is persisted from the aggregate of its own tasks; the parent apply state is then re-derived from the operation rows via the policy-aware projection.
The drive path now records each operation's own outcome instead of mirroring the parent apply down, so a failed deployment is durably persisted even while the
on_failure: continueprojection holds the parent running.Correct for single-op today (the operation result equals the aggregate), and required before the multi-deployment fan-out is wired.
Changes
markOperationFromOwnResult: derive the operation's state from its tasks (Tasks().GetByApplyOperationID) and persist the row.persistOperationState;markOperationFromApplyStatenow delegates to it.Testing / validation
go build ./...,go vet ./pkg/api,gofmtclean, fullpkg/apiunit suite green.failed(with its own task's message) independent of the parent state; a still-running operation is left claimable.References
Fan-out-gated fix surfaced by the #337 review (operation-state-first persistence).