feat: NodeUpdate plans for image rollout orchestration (#94)
* feat: NodeUpdate plans for image rollout orchestration
When a Running node's spec.image differs from status.currentImage, the
planner detects the drift and builds a NodeUpdate plan to roll out the
update.
NodeUpdate plan tasks (see the Go sketch after this list):
1. apply-statefulset — pushes the new image to the StatefulSet via SSA
2. apply-service — keeps the headless Service in sync
3. observe-image — polls StatefulSet rollout until complete, stamps
status.currentImage in-memory
4. mark-ready — re-initializes the sidecar on the new pod
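A minimal Go sketch of the drift gate and plan composition described above. The `SeiNode`/`Plan`/`Task` shapes and function names are illustrative stand-ins, not the real `internal/planner/` API; only the task-name strings come from the plan itself.

```go
package planner

// Illustrative stand-ins for the real api/task types.
type Task struct{ Type string }

type Plan struct{ Tasks []Task }

type SeiNode struct {
	Spec   struct{ Image string }
	Status struct{ CurrentImage string }
}

const (
	TaskTypeApplyStatefulSet = "apply-statefulset" // push the new image via SSA
	TaskTypeApplyService     = "apply-service"     // keep the headless Service in sync
	TaskTypeObserveImage     = "observe-image"     // poll rollout, stamp currentImage
	TaskTypeMarkReady        = "mark-ready"        // re-initialize the sidecar
)

// resolveRunningNode is the drift gate: no drift means no plan (steady state).
func resolveRunningNode(node *SeiNode) *Plan {
	if node.Spec.Image == node.Status.CurrentImage {
		return nil
	}
	return buildNodeUpdatePlan(node)
}

// buildNodeUpdatePlan composes the four NodeUpdate tasks in rollout order.
func buildNodeUpdatePlan(_ *SeiNode) *Plan {
	return &Plan{Tasks: []Task{
		{Type: TaskTypeApplyStatefulSet},
		{Type: TaskTypeApplyService},
		{Type: TaskTypeObserveImage},
		{Type: TaskTypeMarkReady},
	}}
}
```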
Planner condition management:
- ResolvePlan sets NodeUpdateInProgress=True when building the plan
- On plan completion: clears with Reason=UpdateComplete
- On plan failure: clears with Reason=UpdateFailed, then immediately
retries (builds a new plan since drift persists)
- handleTerminalPlan clears completed/failed plans before building
the next one
Drift-gated plans:
- No drift (spec.image == status.currentImage) → no plan, steady state
- Image drift → NodeUpdate plan
Reconciler simplification:
- Removed observeCurrentImage method — replaced by observe-image task
- Running-phase post-plan block removed — fully plan-driven
- Deleted 7 observeCurrentImage unit tests (coverage moved to task tests)
New tests:
- observe-image task: rollout complete, rolling update, stale generation,
StatefulSet not found
- NodeUpdate planner: drift detection, plan composition, condition
lifecycle (start, complete, fail/retry), terminal plan handling
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: critical — NodeUpdate plans not nilled by executor, condition lifecycle
Fixes from adversarial review by Tide specialists:
CRITICAL: clearCompletedConvergencePlan was nilling NodeUpdate plans,
preventing handleTerminalPlan from ever observing the completed plan
and clearing NodeUpdateInProgress. Now renamed to clearCompletedPlanIfSafe
and checks isNodeUpdatePlan — NodeUpdate plans stay Complete for one
reconcile so the planner can observe them and clear conditions.
IMPORTANT: Document observe-image's Status() mutation of currentImage as
a sanctioned exception to the "tasks own resources" rule, explaining why
the mutation must happen at observation time.
MINOR: Fix stale "convergence plans" reference in isNodeUpdatePlan comment.
Update test to verify NodeUpdate plans stay Complete (not nilled).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: planner owns all plan lifecycle — executor no longer nils plans
Remove clearCompletedPlanIfSafe (formerly clearCompletedConvergencePlan)
from the executor entirely. The executor now only marks plans Complete or
Failed — it never nils them. The planner's handleTerminalPlan handles
ALL plan cleanup: clearing conditions and nilling terminal plans.
This is a cleaner separation: the executor drives tasks to completion,
the planner decides what to do with the result. No type-specific checks
(isNodeUpdatePlan) in the executor. No plan lifecycle management outside
the planner.
Also removes the now-unused currentPhase helper from the executor.
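A rough Go sketch of the resulting split, with simplified stand-in types — the real executor and planner in `internal/planner/` carry richer plan and task state:

```go
package planner

import "context"

type PlanPhase string

const (
	PlanComplete PlanPhase = "Complete"
	PlanFailed   PlanPhase = "Failed"
)

type Task struct{ Type string }

type Plan struct {
	Phase PlanPhase
	Tasks []Task
}

type SeiNode struct {
	Status struct{ Plan *Plan }
}

// ExecutePlan drives tasks and marks the plan terminal. It never nils
// node.Status.Plan and never touches conditions — no type-specific checks.
func ExecutePlan(ctx context.Context, runTask func(context.Context, *Task) error, plan *Plan) {
	for i := range plan.Tasks {
		if err := runTask(ctx, &plan.Tasks[i]); err != nil {
			plan.Phase = PlanFailed
			return
		}
	}
	plan.Phase = PlanComplete
}

// handleTerminalPlan is the only place a terminal plan is cleaned up: clear any
// conditions it set (elided here), then nil it so the next reconcile can plan again.
func handleTerminalPlan(node *SeiNode) {
	plan := node.Status.Plan
	if plan == nil || (plan.Phase != PlanComplete && plan.Phase != PlanFailed) {
		return
	}
	node.Status.Plan = nil
}
```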
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: move observe-image logic from Status() to Execute()
Address PR feedback: Execute() now polls the StatefulSet and stamps
currentImage when the rollout completes. Status() just returns the
cached result via DefaultStatus(). This keeps Execute as the action
method and Status as the query method.
When the rollout is not yet complete, Execute returns nil (no error).
The executor will re-invoke on the next reconcile since the task
remains Pending.
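A sketch of what that Execute/Status split could look like, assuming a simplified task struct: the `ObserveImage` fields, the NotFound handling, and the `RolloutComplete` accessor are assumptions, while the StatefulSet status fields and client calls are the standard Kubernetes/controller-runtime ones.

```go
package task

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SeiNode stands in for the real CRD type whose in-memory status gets stamped.
type SeiNode struct {
	Status struct{ CurrentImage string }
}

// ObserveImage is a simplified stand-in for the observe-image task.
type ObserveImage struct {
	Client      client.Client
	Key         client.ObjectKey // StatefulSet namespace/name
	TargetImage string           // node.Spec.Image being rolled out
	Node        *SeiNode         // in-memory status, flushed later by the reconciler
	complete    bool
}

// Execute is the action method: poll the StatefulSet and stamp currentImage
// once the rollout is done. Returning nil without completing leaves the task
// Pending, so the executor re-invokes it on the next reconcile.
func (t *ObserveImage) Execute(ctx context.Context) error {
	var sts appsv1.StatefulSet
	if err := t.Client.Get(ctx, t.Key, &sts); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to observe yet (assumed handling; the real task may differ)
		}
		return err
	}
	if sts.Status.ObservedGeneration < sts.Generation {
		return nil // stale generation: controller hasn't seen the new spec yet
	}
	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	if sts.Status.UpdatedReplicas == replicas &&
		sts.Status.ReadyReplicas == replicas &&
		sts.Status.UpdateRevision == sts.Status.CurrentRevision {
		t.Node.Status.CurrentImage = t.TargetImage // in-memory stamp; one Status().Patch() flushes it
		t.complete = true
	}
	return nil
}

// RolloutComplete is the query side: the real Status() just surfaces this
// cached flag via DefaultStatus() and performs no API calls.
func (t *ObserveImage) RolloutComplete() bool { return t.complete }
```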
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: conditions set directly at point of decision, no inference
Move condition setting from post-processing inference to direct mutation
at the point of decision:
- buildNodeUpdatePlan sets NodeUpdateInProgress=True directly on the
node when it builds the plan — no separate applyPlanStartConditions
step that scans tasks to infer the plan type
- handleTerminalPlan sets NodeUpdateInProgress=False via the shared
setNodeUpdateCondition helper — no clearNodeUpdateCondition that
checked whether the condition was already set
Deleted: isNodeUpdatePlan, applyPlanStartConditions, clearNodeUpdateCondition.
These inferred plan type from task contents and had special-case logic for
condition state. The new approach is simpler: the code that creates the
scenario sets the condition. The code that observes the outcome clears it.
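A sketch of the point-of-decision pattern using the standard `meta.SetStatusCondition` helper. The `setNodeUpdateCondition` name and the `UpdateComplete`/`UpdateFailed` reasons come from this PR; the `UpdateStarted` reason, the function shapes, and the SeiNode type are assumptions, and the real functions also build and nil the plan (elided here).

```go
package planner

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const ConditionNodeUpdateInProgress = "NodeUpdateInProgress"

// SeiNode stands in for the real CRD type.
type SeiNode struct {
	Generation int64
	Status     struct{ Conditions []metav1.Condition }
}

// setNodeUpdateCondition is the one shared writer for the condition.
func setNodeUpdateCondition(node *SeiNode, status metav1.ConditionStatus, reason string) {
	meta.SetStatusCondition(&node.Status.Conditions, metav1.Condition{
		Type:               ConditionNodeUpdateInProgress,
		Status:             status,
		Reason:             reason,
		ObservedGeneration: node.Generation,
	})
}

// The code that creates the scenario sets the condition ...
func buildNodeUpdatePlanSketch(node *SeiNode) {
	setNodeUpdateCondition(node, metav1.ConditionTrue, "UpdateStarted") // reason string assumed
	// ... compose the NodeUpdate plan (elided) ...
}

// ... and the code that observes the outcome clears it.
func handleTerminalPlanSketch(node *SeiNode, failed bool) {
	reason := "UpdateComplete"
	if failed {
		reason = "UpdateFailed"
	}
	setNodeUpdateCondition(node, metav1.ConditionFalse, reason)
	// ... nil the terminal plan (elided) ...
}
```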
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: only clear NodeUpdateInProgress if it was set
handleTerminalPlan was setting NodeUpdateInProgress=False for ALL
terminal plans, including init plans that never had the condition set.
This would add a spurious False condition after init completion.
Now checks hasNodeUpdateCondition before clearing — only clears the
condition if it was previously set to True by buildNodeUpdatePlan.
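A standalone sketch of the guarded clear. The real code routes through the shared `setNodeUpdateCondition` helper from the previous sketch; the types, the `Failed` field, and the inlined condition write are simplified assumptions.

```go
package planner

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Plan struct{ Failed bool } // simplified terminal-plan state

type SeiNode struct {
	Status struct {
		Plan       *Plan
		Conditions []metav1.Condition
	}
}

func hasNodeUpdateCondition(node *SeiNode) bool {
	return meta.IsStatusConditionTrue(node.Status.Conditions, "NodeUpdateInProgress")
}

// handleTerminalPlan only clears the condition when a NodeUpdate plan set it;
// a terminal init plan never gains a spurious False condition.
func handleTerminalPlan(node *SeiNode) {
	plan := node.Status.Plan
	if plan == nil {
		return
	}
	if hasNodeUpdateCondition(node) {
		reason := "UpdateComplete"
		if plan.Failed {
			reason = "UpdateFailed"
		}
		meta.SetStatusCondition(&node.Status.Conditions, metav1.Condition{
			Type:   "NodeUpdateInProgress",
			Status: metav1.ConditionFalse,
			Reason: reason,
		})
	}
	node.Status.Plan = nil // every terminal plan is nilled, condition or not
}
```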
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: reference TaskTypeMarkReady from task package consistently
Move the mark-ready sidecar task type constant to the task package
(re-exported from sidecar) so buildNodeUpdatePlan references all task
types via task.TaskType* consistently.
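Roughly what the re-export looks like; the module path and the sidecar package's constant name are assumed:

```go
package task

import "example.com/seinode/internal/sidecar" // illustrative module path

// TaskTypeMarkReady is re-exported so plan builders can reference every task
// type as task.TaskType*, regardless of which package implements the task.
const TaskTypeMarkReady = sidecar.TaskTypeMarkReady
```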
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- **SeiNode** creates StatefulSets (replicas=1), headless Services, and PVCs via server-side apply (fieldOwner: `seinode-controller`).
- **Plan-driven reconciliation** — Both controllers use ordered task plans (stored in `.status.plan`) to drive lifecycle. Plans are built by `internal/planner/` (`ResolvePlan` for nodes, `ForGroup` for deployments), executed by `planner.Executor`, with individual tasks in `internal/task/`. The reconcile loop is: `ResolvePlan → persist plan → ExecutePlan`. See `internal/planner/doc.go` for the full plan lifecycle.
- **Init plans** transition nodes from Pending → Running. They include infrastructure tasks (`ensure-data-pvc`, `apply-statefulset`, `apply-service`) followed by sidecar tasks (`configure-genesis`, `config-apply`, etc.).
- **NodeUpdate plans** roll out image changes on Running nodes, replacing the former **Convergence plans** (which kept Running nodes in sync with only `apply-statefulset` + `apply-service` and were nilled from status after completion). Built when `spec.image != status.currentImage`. Tasks: `apply-statefulset`, `apply-service`, `observe-image` (polls StatefulSet rollout, stamps `currentImage`), `mark-ready` (sidecar re-init). The planner sets the `NodeUpdateInProgress` condition on creation and clears it on completion/failure. When no drift is detected, no plan is built — the node sits in steady state.
- **Atomic plan creation** — New plans are persisted before any tasks execute. The reconciler flushes the plan, then requeues. Execution starts on the next reconcile. This guarantees external observers see the plan before side effects occur.
- **Condition ownership** — The planner owns all condition management on the owning resource. It sets conditions when creating plans (e.g., `NodeUpdateInProgress=True`) and when observing terminal plans (e.g., `NodeUpdateInProgress=False`). The executor does not set conditions — it only mutates plan/task state and phase transitions.
- **Single-patch model** — All status mutations (plan state, conditions, phase, currentImage) accumulate in-memory during a reconcile and are flushed in a single `Status().Patch()` at the end. Tasks mutate owned resources (StatefulSets, Services, PVCs); the executor mutates plan state in-memory; the reconciler flushes once.
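A minimal sketch of that flush, assuming the reconciler deep-copies the node before mutating it; only the controller-runtime calls are the real API.

```go
package controller

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// flushStatus issues the single status patch at the end of a reconcile.
// `node` carries every in-memory mutation (plan state, conditions, phase,
// currentImage); `base` is the DeepCopy taken before any of them.
func flushStatus(ctx context.Context, c client.Client, node, base client.Object) error {
	return c.Status().Patch(ctx, node, client.MergeFrom(base))
}
```

The reconciler takes the DeepCopy at the top of Reconcile, lets the planner, executor, and tasks mutate the live object in memory, and calls this once at the end.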