fix(operator): persist operation state from its own tasks on drive #343

Open
Kiran01bm wants to merge 2 commits into main from kiran01bm/operation-state-first-persistence

Kiran01bm (Collaborator) commented Jun 15, 2026

What and Why?

When an apply fans out to multiple deployments under on_failure: continue, the parent apply is held running until every sibling settles. The drive path mirrored the parent apply state onto the just-driven operation row. While the parent was held running, a terminally failed deployment therefore hit the non-terminal "leave claimable" branch, with three consequences: its failure was never recorded, the row was re-claimed and re-driven on every poll, and the deployment-order gate (which keys off an earlier sibling's failed state under continue) read a stale value.

This change drives each operation off its own result instead of mirroring the parent apply down. After a successful resume, the operation row is persisted from the aggregate of its own tasks; the parent apply state is then re-derived from the operation rows via the policy-aware projection. As a result, a failed deployment is durably persisted even while the on_failure: continue projection holds the parent running.

drive op ─▶ resume ─▶ markOperationFromOwnResult(op)      ◀── derive op state from op's OWN tasks
                      (persistOperationState, under op lease)
        ─▶ reload parent ─▶ updateApplyStateFromOperations(parent)  ◀── policy-aware parent re-derive (#337)

This is correct for the single-operation case today (where the operation result equals the aggregate), and it is required before the multi-deployment fan-out is wired.
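To make the two-step flow concrete, here is a minimal, self-contained sketch. It is not the code in pkg/api/operator.go: the `State` type, `aggregateTasks`, `projectApply`, and the exact aggregation and projection rules are assumptions chosen for illustration. It only shows how deriving an operation from its own tasks lets a failure be recorded while the parent is still held running under on_failure: continue.

```go
package main

import "fmt"

// State is a simplified, hypothetical state vocabulary for this sketch.
type State string

const (
	Pending   State = "pending"
	Running   State = "running"
	Completed State = "completed"
	Failed    State = "failed"
)

func isTerminal(s State) bool { return s == Completed || s == Failed }

// aggregateTasks derives an operation's state from its own tasks only.
// Assumed rule: no tasks means pending, any unsettled task means running,
// otherwise failed if any task failed, else completed.
func aggregateTasks(tasks []State) State {
	if len(tasks) == 0 {
		return Pending
	}
	anyFailed := false
	for _, t := range tasks {
		if !isTerminal(t) {
			return Running
		}
		if t == Failed {
			anyFailed = true
		}
	}
	if anyFailed {
		return Failed
	}
	return Completed
}

// projectApply re-derives the parent state from the operation states.
// Assumed policy: with continueOnFailure the parent stays running until
// every operation settles; without it the first failure fails the parent.
func projectApply(ops []State, continueOnFailure bool) State {
	anyFailed, allSettled := false, true
	for _, op := range ops {
		if op == Failed {
			anyFailed = true
		}
		if !isTerminal(op) {
			allSettled = false
		}
	}
	switch {
	case anyFailed && (allSettled || !continueOnFailure):
		return Failed
	case !allSettled:
		return Running
	default:
		return Completed
	}
}

func main() {
	opA := aggregateTasks([]State{Completed, Failed}) // settled, one task failed
	opB := aggregateTasks([]State{Running})           // sibling still in flight
	parent := projectApply([]State{opA, opB}, true)   // on_failure: continue
	fmt.Println(opA, opB, parent)
}
```

In this sketch operation A is already `failed` while the parent projection is still `running`, which is the situation the old mirroring path could not record.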

Changes

  • Add markOperationFromOwnResult: derive the operation's state from its tasks (Tasks().GetByApplyOperationID) and persist the row.
  • Extract the state→row-write mapping into a shared persistOperationState; markOperationFromApplyState now delegates to it.
  • Drive path uses the own-result path and persists the operation before reloading the parent.
  • The unclaimable-parent reconciliation path keeps mirroring from the already-terminal parent (unchanged).
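The split between a shared row-write mapping and its two callers might look roughly like the sketch below. The function names come from the change list above, but their signatures, the `Task` and `OperationRow` types, the `Claimable` field, and the sample error message are all hypothetical; the real helpers work against the store and the operation lease, which this in-memory version omits.

```go
package main

import "fmt"

type State string

const (
	Running   State = "running"
	Completed State = "completed"
	Failed    State = "failed"
)

// Task and OperationRow are hypothetical stand-ins for a task row and an
// apply_operations row.
type Task struct {
	State State
	Err   string
}

type OperationRow struct {
	State     State
	Message   string
	Claimable bool
}

// persistOperationState is the shared state-to-row-write mapping (sketch).
// Terminal states are recorded and the row stops being claimable;
// non-terminal states leave the row claimable for a later poll.
func persistOperationState(row *OperationRow, s State, msg string) {
	switch s {
	case Completed, Failed:
		row.State, row.Message, row.Claimable = s, msg, false
	default:
		row.Claimable = true
	}
}

// markOperationFromOwnResult derives the state from the operation's own
// tasks and carries the first failed task's message (assumed rule).
func markOperationFromOwnResult(row *OperationRow, tasks []Task) {
	unsettled, failed, firstErr := len(tasks) == 0, false, ""
	for _, t := range tasks {
		switch t.State {
		case Completed:
		case Failed:
			failed = true
			if firstErr == "" {
				firstErr = t.Err
			}
		default:
			unsettled = true
		}
	}
	switch {
	case unsettled:
		persistOperationState(row, Running, "")
	case failed:
		persistOperationState(row, Failed, firstErr)
	default:
		persistOperationState(row, Completed, "")
	}
}

// markOperationFromApplyState mirrors the parent state onto the row; it is
// only appropriate when the parent is already terminal.
func markOperationFromApplyState(row *OperationRow, parent State, msg string) {
	persistOperationState(row, parent, msg)
}

func main() {
	tasks := []Task{{State: Failed, Err: "task failed (illustrative message)"}}

	mirrored := OperationRow{State: Running, Claimable: true}
	markOperationFromApplyState(&mirrored, Running, "") // parent held running

	own := OperationRow{State: Running, Claimable: true}
	markOperationFromOwnResult(&own, tasks)

	fmt.Printf("mirrored: %+v\nown:      %+v\n", mirrored, own)
}
```

Mirroring a held-running parent leaves the failed operation claimable, whereas the own-result path records it as failed with the task's message.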

Testing / validation

  • go build ./..., go vet ./pkg/api, gofmt clean, full pkg/api unit suite green.
  • New unit tests: a failed operation is recorded failed (with its own task's message) independent of the parent state; a still-running operation is left claimable.

References

Fan-out-gated fix surfaced by the #337 review (operation-state-first persistence).

Under on_failure "continue" the parent apply is held running while
siblings settle, so mirroring the parent down left a failed operation
claimable and never recorded its failure. Derive the operation's result
from its own tasks instead, then re-derive the parent from the operation
rows. Fan-out-gated fix surfaced by #337 review.
Copilot AI review requested due to automatic review settings June 15, 2026 01:23

Copilot AI left a comment

Pull request overview

Adjusts the operator drive/recovery flow so each apply_operations row is persisted from the operation’s own task outcomes (instead of mirroring the parent apply state), preventing re-claim/re-drive loops and stale sibling-gating under on_failure: continue when the parent apply projection is intentionally held running.

Changes:

  • Persist operation state from its own tasks via markOperationFromOwnResult, then reload and re-derive the parent apply state from operation rows.
  • Extract shared state→row-write mapping into persistOperationState, used by both the drive path and terminal-parent reconciliation path.
  • Add unit tests covering (a) failed op recorded failed independent of parent state and (b) non-terminal op left claimable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/api/operator.go | Drive path now persists apply_operations from the operation’s task-derived result before re-deriving the parent; adds shared persistence helper + task-error extraction. |
| pkg/api/operator_test.go | Adds unit tests validating failed/non-terminal behavior for the new own-result persistence path. |

Comment thread pkg/api/operator.go
Comment thread pkg/api/operator_test.go
Comment thread pkg/api/operator_test.go
Task rows carry task-vocabulary states; compare them with state.Task.*
instead of state.Apply.* so they don't silently drift if the
vocabularies diverge.
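The drift risk described in this commit message can be illustrated with a small sketch. The `Task` and `Apply` variables below are hypothetical stand-ins for state.Task.* and state.Apply.*; the real package layout and values are not shown in this PR, and the diverged `"apply_failed"` value is invented purely to show what would go wrong.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for the task state vocabulary.
var Task = struct{ Running, Completed, Failed string }{
	"running", "completed", "failed",
}

// Apply is a hypothetical stand-in for the apply state vocabulary, with its
// failed value deliberately diverged for illustration.
var Apply = struct{ Running, Completed, Failed string }{
	"running", "completed", "apply_failed",
}

// taskFailed compares a task row's state against the task vocabulary.
func taskFailed(taskState string) bool { return taskState == Task.Failed }

// taskFailedWrongVocabulary compares against the apply vocabulary; it keeps
// compiling after the vocabularies diverge but silently stops matching.
func taskFailedWrongVocabulary(taskState string) bool {
	return taskState == Apply.Failed
}

func main() {
	row := "failed" // state stored on a task row
	fmt.Println(taskFailed(row))                // true
	fmt.Println(taskFailedWrongVocabulary(row)) // false
}
```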
@Kiran01bm Kiran01bm marked this pull request as ready for review June 15, 2026 01:51
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 15, 2026 01:51