fix(operator): stamp apply ErrorMessage from the aggregate failure #345

Merged
morgo merged 1 commit into kiran01bm/operation-state-first-persistence from kiran01bm/aggregate-error-message
Jun 15, 2026
Conversation

@Kiran01bm

Collaborator

What

When the operator re-derives a parent apply's state from its apply_operations rows and the rollout settles to failed, the apply's ErrorMessage is now surfaced from the aggregate — the first failed operation — instead of whatever message the last-driven operation left behind.

Why

The bug: updateApplyStateFromOperations re-derives applies.state from the operation rows but copies the apply struct verbatim and never touches ErrorMessage. So the failure reason is whatever the last operation to write the apply happened to leave. Under on_failure: continue the rollout keeps driving siblings past a failed deployment, so the last driver is frequently a successful sibling — leaving the apply with a failed verdict and an empty or misleading reason. An operator triaging the failure sees the verdict but not why.

The fix: when the derived state is failed, stamp ErrorMessage from the first failed operation row (deployment order), formatted as deployment <name> failed: <reason>. The rollout's verdict is the first failure, so the first failed row wins. If no failed operation carries a message, the existing apply message is kept as a fallback rather than blanked.

Operation rows already carry their own correct failure message (each operation is persisted from its own tasks), so the aggregate has the right data to surface.

ops: [region-a failed "cutover failed", region-b completed] ─▶ derived = failed
└▶ apply.ErrorMessage = "deployment region-a failed: cutover failed"
(not region-b's empty / last-op message)
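
The stamping rule described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `Operation` and `Apply` types, their fields, the string state values, and the `stampAggregateError` helper are all hypothetical names introduced here. It also assumes that `ops` is already in deployment order, that "first failed operation" means the first failed row carrying a non-empty message, and that the message is left untouched when the derived state is not failed.

```go
package main

import "fmt"

// Operation is a hypothetical stand-in for one apply_operations row.
type Operation struct {
	Deployment   string
	State        string // assumed values: "failed", "completed", ...
	ErrorMessage string
}

// Apply is a hypothetical stand-in for the parent apply record.
type Apply struct {
	State        string
	ErrorMessage string
}

// stampAggregateError returns a copy of apply with State set to the derived
// state. When the derived state is "failed", ErrorMessage is taken from the
// first failed operation (in slice order) that carries a message. If no
// failed operation has a message, the existing ErrorMessage is kept rather
// than blanked.
func stampAggregateError(apply Apply, derived string, ops []Operation) Apply {
	apply.State = derived
	if derived != "failed" {
		// Assumption: non-failed outcomes leave the message as-is.
		return apply
	}
	for _, op := range ops {
		if op.State == "failed" && op.ErrorMessage != "" {
			apply.ErrorMessage = fmt.Sprintf("deployment %s failed: %s", op.Deployment, op.ErrorMessage)
			return apply
		}
	}
	return apply // fallback: keep the existing apply message
}
```

With the rows from the diagram (region-a failed with "cutover failed", region-b completed), calling `stampAggregateError(Apply{}, "failed", ops)` under these assumptions yields an `ErrorMessage` of `deployment region-a failed: cutover failed`, regardless of which operation last wrote the apply.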

Notes

Stacked on #343 (operation-state-first persistence) — review/merge that first. Scoped to the operator's aggregate writer; the tern per-op stamping and single-writer refactor are separate follow-ups.

When the rollout settles to failed, surface the reason from the first
failed operation row instead of leaving whatever message the last-driven
operation wrote. Under on_failure "continue" the last driver may be a
successful sibling, which would otherwise leave the failed verdict with
no matching reason.
@Kiran01bm Kiran01bm marked this pull request as ready for review June 15, 2026 02:21
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 15, 2026 02:21
@morgo morgo merged commit 975830d into kiran01bm/operation-state-first-persistence Jun 15, 2026
27 checks passed
@morgo morgo deleted the kiran01bm/aggregate-error-message branch June 15, 2026 11:54

3 participants