
Commit 23bf638

Authored by bdchathamclaude and a co-author
feat: plan-centric reconcile wiring (Steps 5-6) (#88)
* feat: plan-centric reconcile wiring (Steps 5-6)

Wire the plan foundations into the actual reconcile loop. This is the behavioral change that makes Plan the central abstraction.

Step 5: Phase-aware BuildPlan + infrastructure tasks in plans

- buildBasePlan prepends ensure-data-pvc, apply-statefulset, apply-service before sidecar tasks
- buildBootstrapPlan inserts ensure-data-pvc at start, apply-statefulset + apply-service after teardown-bootstrap
- buildGenesisPlan includes infrastructure tasks before ceremony tasks
- All plans set TargetPhase: Running, FailedPhase: Failed
- Each mode planner's BuildPlan checks node phase: Pending builds an init plan, Running builds a convergence plan (apply-statefulset + apply-service)

Step 6: Mutating ForNode + simplified reconcile loop

- ForNode(node) error — resumes the active plan or builds a new one, stamping it directly onto node.Status.Plan
- Reconcile loop: ensure finalizer → check Failed → reconcilePeers → ForNode → persist new plan → ExecutePlan → emit metrics → runtime tasks
- Deleted: reconcilePending, reconcileInitializing, reconcileRunning, transitionPhase, ensureNodeDataPVC, reconcileNodeStatefulSet, reconcileNodeService, fieldOwner constant
- The new plan is persisted via status patch before ExecutePlan runs (the patch base is captured before ForNode mutates the in-memory object)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: code quality cleanup from adversarial review

Address all findings from the Kubernetes and Platform specialist reviews.
Must fix:

- Document the deliberate empty FailedPhase on convergence plans (retry, not fail)
- Delete the dead TestBootstrapMode test (passed but tested nothing)
- Replace non-asserting phase transition tests with a proper executor test

Renames for clarity:

- ForNode → ResolvePlan (conveys mutation, not an accessor)
- hadPlan → planAlreadyActive (self-documenting branch condition)
- nilPlanIfConvergence → clearCompletedConvergencePlan (says what it does)

Deduplication:

- Remove duplicate NeedsLongStartup from planner (noderesource has it)
- Consolidate fieldOwner into task.go (was duplicated across task files)
- Remove unused *SeiNode param from configureGenesisParams
- Add DefaultStatus() to taskBase, use in all fire-and-forget tasks

Test quality:

- Replace hardcoded Tasks[3] indices with findPlannedTask name-based lookup
- Add TestExecutePlan_ConvergencePlan_NilsOnCompletion
- Add TestExecutePlan_TaskFailure_SetsPlanFailedCondition
- Add comments to magic for-range-N loops explaining plan progression

Documentation:

- Add planner/doc.go explaining the full plan lifecycle
- Document the ResolvePlan contract (caller must capture the patch base before calling)
- Fix the truncated ResultRequeueImmediate comment
- Update the CLAUDE.md Key Patterns section with plan-driven reconciliation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: address PR comments — observedPhase, genesis params factory

- Rename prevPhase to observedPhase in controller.go
- The genesis plan uses genesisParamsForTaskType for genesis-specific tasks and falls through to paramsForTaskType for shared tasks (infrastructure + common sidecar). This eliminates the inline switch that duplicated the params factory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: genesis plan uses shared paramsForTaskType factory

buildGenesisPlan now follows the same pattern as buildBasePlan: configApplyParams (with validator mode + overrides) is passed through to the shared paramsForTaskType factory.

genesisCeremonyParams only handles the 4 ceremony-specific tasks (generate-identity, generate-gentx, upload-genesis-artifacts, set-genesis-peers) and returns nil for everything else, falling through to the shared factory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: consolidate all task params into single paramsForTaskType factory

Move the genesis ceremony task params (generate-identity, generate-gentx, upload-genesis-artifacts, set-genesis-peers) into the shared paramsForTaskType factory and remove the separate genesisCeremonyParams function — one factory handles all task types. Genesis ceremony params are guarded by a nil check on Validator.GenesisCeremony, extracted into the genesisCeremonyTaskParams helper for readability. buildGenesisPlan now calls paramsForTaskType directly, same as every other plan builder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract shared buildSidecarProgression, insertBefore returns error

Two structural improvements to plan composition:

1. Extract buildSidecarProgression — the insertBefore calls that build the sidecar task sequence were duplicated between buildBasePlan and buildBootstrapProgression. Both now delegate to a single shared function, eliminating the drift risk.
2. insertBefore returns an error when the target task is not found, instead of silently dropping the insertion. This catches plan construction bugs at build time rather than producing silently incomplete plans that run without the missing task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: inline buildBootstrapProgression, document postBootstrap intent

Remove the one-line buildBootstrapProgression wrapper — buildBootstrapPlan now calls buildSidecarProgression directly. Add a comment to buildPostBootstrapProgression explaining it is intentionally separate from the shared sidecar progression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
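The insertBefore change described in the commit message (error on a missing target instead of a silent drop) can be sketched in a few lines. The Task struct and the task names here are illustrative stand-ins, not the repository's actual types:

```go
package main

import (
	"fmt"
	"slices"
)

// Task is a stand-in for the repository's planned-task type.
type Task struct{ Name string }

// insertBefore inserts task ahead of the task named target. Returning an
// error when the target is absent surfaces plan-construction bugs at build
// time instead of producing a plan that silently runs without the insertion.
func insertBefore(tasks []Task, target string, task Task) ([]Task, error) {
	for i, t := range tasks {
		if t.Name == target {
			return slices.Insert(tasks, i, task), nil
		}
	}
	return nil, fmt.Errorf("insertBefore: target task %q not found", target)
}

func main() {
	plan := []Task{{Name: "apply-statefulset"}, {Name: "apply-service"}}

	plan, err := insertBefore(plan, "apply-service", Task{Name: "config-apply"})
	if err != nil {
		panic(err)
	}
	fmt.Println(plan[1].Name) // config-apply

	// A typo in the target name is now a hard error, not a silent no-op.
	if _, err := insertBefore(plan, "no-such-task", Task{Name: "x"}); err == nil {
		panic("expected error for missing target")
	}
}
```

Under this scheme a plan builder checks the error once at build time, so a misspelled or removed task name fails the whole build rather than yielding an incomplete plan.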
1 parent b93d623 commit 23bf638

21 files changed

Lines changed: 620 additions & 515 deletions

CLAUDE.md

Lines changed: 4 additions & 1 deletion
@@ -55,7 +55,10 @@ make docker-push IMG=<image> # Push container image
 
 - **SeiNodeDeployment** creates and owns **SeiNode** resources. Groups orchestrate genesis ceremonies, manage deployments, and coordinate networking/monitoring.
 - **SeiNode** creates StatefulSets (replicas=1), headless Services, and PVCs via server-side apply (fieldOwner: `seinode-controller`).
+- **Plan-driven reconciliation** — Both controllers use ordered task plans (stored in `.status.plan`) to drive lifecycle. Plans are built by `internal/planner/` (`ResolvePlan` for nodes, `ForGroup` for deployments), executed by `planner.Executor`, with individual tasks in `internal/task/`. The reconcile loop is: `ResolvePlan → persist plan → ExecutePlan`. See `internal/planner/doc.go` for the full plan lifecycle.
+- **Init plans** transition nodes from Pending → Running. They include infrastructure tasks (`ensure-data-pvc`, `apply-statefulset`, `apply-service`) followed by sidecar tasks (`configure-genesis`, `config-apply`, etc.).
+- **Convergence plans** keep Running nodes in sync. They contain only `apply-statefulset` + `apply-service` and are nilled from status after completion.
+- **Resource generators** live in `internal/noderesource/` — pure functions that produce StatefulSets, Services, and PVCs from a SeiNode spec. Used by both the controller and plan tasks.
 - **Platform config** is fully environment-driven — all fields in `platform.Config` must be set via env vars (no defaults). See `internal/platform/platform.go` for the full list.
 - **Genesis resolution** is handled by the sidecar autonomously: embedded sei-config for well-known chains, S3 fallback at `{SEI_GENESIS_BUCKET}/{chainID}/genesis.json` for custom chains.
-- Sidecar bootstrap progression is driven by the node controller polling the sidecar HTTP API and submitting tasks in sequence.
 - Config keys in seid's `config.toml` use **hyphens** (e.g., `persistent-peers`, `trust-height`), not underscores.
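The Convergence plans bullet above says completed convergence plans are nilled from status. A minimal sketch of that behavior follows; the Plan and Status shapes are toy stand-ins for illustration, not the CRD's actual types, and the assumption that non-convergence plans are kept is inferred from the clearCompletedConvergencePlan name:

```go
package main

import "fmt"

// Plan and Status are toy stand-ins for the status types.
type Plan struct {
	Phase       string // "Active", "Complete", "Failed"
	Convergence bool   // true for apply-statefulset + apply-service plans
}

type Status struct{ Plan *Plan }

// clearCompletedConvergencePlan nils a finished convergence plan so the next
// reconcile builds a fresh one; other completed plans stay in status.
func clearCompletedConvergencePlan(s *Status) {
	if s.Plan != nil && s.Plan.Convergence && s.Plan.Phase == "Complete" {
		s.Plan = nil
	}
}

func main() {
	s := &Status{Plan: &Plan{Phase: "Complete", Convergence: true}}
	clearCompletedConvergencePlan(s)
	fmt.Println(s.Plan == nil) // true: completed convergence plans are cleared

	s = &Status{Plan: &Plan{Phase: "Active", Convergence: true}}
	clearCompletedConvergencePlan(s)
	fmt.Println(s.Plan != nil) // true: an active plan is left untouched
}
```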

internal/controller/node/controller.go

Lines changed: 44 additions & 151 deletions
@@ -30,7 +30,6 @@ const (
 	nodeFinalizerName     = "sei.io/seinode-finalizer"
 	seiNodeControllerName = "seinode"
 	statusPollInterval    = 30 * time.Second
-	fieldOwner            = client.FieldOwner("seinode-controller")
 )
 
 // PlatformConfig is an alias for platform.Config, used throughout the node
@@ -78,116 +77,74 @@ func (r *SeiNodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 		return ctrl.Result{}, err
 	}
 
-	p, err := planner.ForNode(node)
-	if err != nil {
-		return ctrl.Result{}, fmt.Errorf("resolving planner: %w", err)
-	}
-	if err := p.Validate(node); err != nil {
-		return ctrl.Result{}, fmt.Errorf("validating spec: %w", err)
+	// Failed is terminal — nothing to do.
+	if node.Status.Phase == seiv1alpha1.PhaseFailed {
+		r.Recorder.Eventf(node, corev1.EventTypeWarning, "NodeFailed",
+			"SeiNode is in Failed state. Delete and recreate the resource to retry.")
+		return ctrl.Result{}, nil
 	}
 
-	// TODO: reconcile peers should become a part of a plan if it needs to be performed.
+	// Pre-plan: resolve label-based peers so plan params have fresh data.
 	if err := r.reconcilePeers(ctx, node); err != nil {
 		return ctrl.Result{}, fmt.Errorf("reconciling peers: %w", err)
 	}
 
-	// TODO: this should be a part of a plan and not part of the main reconciliation flow.
-	if err := r.ensureNodeDataPVC(ctx, node); err != nil {
-		return ctrl.Result{}, fmt.Errorf("ensuring data PVC: %w", err)
+	// Resolve or resume plan. ResolvePlan either resumes an active plan or
+	// builds a new one based on the node's phase, stamping it onto
+	// node.Status.Plan (and transitioning Pending → Initializing).
+	observedPhase := node.Status.Phase
+	planAlreadyActive := node.Status.Plan != nil && node.Status.Plan.Phase == seiv1alpha1.TaskPlanActive
+	patch := client.MergeFromWithOptions(node.DeepCopy(), client.MergeFromWithOptimisticLock{})
+	if err := planner.ResolvePlan(node); err != nil {
+		return ctrl.Result{}, fmt.Errorf("resolving plan: %w", err)
 	}
 
-	// TODO: if the plan becomes the abstraction it should be, then p from planner.ForNode should just take an existing
-	// plan off the node if one exists, otherwise, create one based on the state it sees.
-	switch node.Status.Phase {
-	case "", seiv1alpha1.PhasePending:
-		return r.reconcilePending(ctx, node, p)
-	case seiv1alpha1.PhaseInitializing:
-		return r.reconcileInitializing(ctx, node)
-	case seiv1alpha1.PhaseRunning:
-		return r.reconcileRunning(ctx, node)
-	case seiv1alpha1.PhaseFailed:
-		r.Recorder.Eventf(node, corev1.EventTypeWarning, "NodeFailed",
-			"SeiNode is in Failed state. Delete and recreate the resource to retry.")
-		return ctrl.Result{}, nil
-	default:
+	if node.Status.Plan == nil {
 		return ctrl.Result{}, nil
 	}
-}
 
-// reconcilePending builds the unified Plan and transitions to Initializing.
-// For genesis ceremony nodes the plan includes artifact generation and assembly
-// steps. For bootstrap nodes the plan includes controller-side Job lifecycle
-// tasks followed by production config. All orchestration lives in the plan.
-func (r *SeiNodeReconciler) reconcilePending(ctx context.Context, node *seiv1alpha1.SeiNode, p planner.NodePlanner) (ctrl.Result, error) {
-	patch := client.MergeFromWithOptions(node.DeepCopy(), client.MergeFromWithOptimisticLock{})
-
-	plan, err := p.BuildPlan(node)
-	if err != nil {
-		return ctrl.Result{}, fmt.Errorf("building plan: %w", err)
+	// If ResolvePlan built a new plan, persist it before execution.
+	if !planAlreadyActive {
+		if err := r.Status().Patch(ctx, node, patch); err != nil {
+			return ctrl.Result{}, fmt.Errorf("persisting new plan: %w", err)
+		}
+		return planner.ResultRequeueImmediate, nil
 	}
-	node.Status.Plan = plan
-	node.Status.Phase = seiv1alpha1.PhaseInitializing
 
-	if err := r.Status().Patch(ctx, node, patch); err != nil {
-		return ctrl.Result{}, fmt.Errorf("initializing plans: %w", err)
+	// Execute the plan. The executor handles phase transitions via
+	// TargetPhase/FailedPhase and sets Conditions on failure.
+	result, err := r.PlanExecutor.ExecutePlan(ctx, node, node.Status.Plan)
+	if err != nil {
+		return result, err
 	}
 
-	ns, name := node.Namespace, node.Name
-	nodePhaseTransitions.WithLabelValues(ns, string(seiv1alpha1.PhasePending), string(seiv1alpha1.PhaseInitializing)).Inc()
-	emitNodePhase(ns, name, seiv1alpha1.PhaseInitializing)
-	r.Recorder.Eventf(node, corev1.EventTypeNormal, "PhaseTransition",
-		"Phase changed from %s to %s", seiv1alpha1.PhasePending, seiv1alpha1.PhaseInitializing)
+	// Emit metrics/events if the phase changed.
+	if node.Status.Phase != observedPhase {
+		ns, name := node.Namespace, node.Name
+		nodePhaseTransitions.WithLabelValues(ns, string(observedPhase), string(node.Status.Phase)).Inc()
+		emitNodePhase(ns, name, node.Status.Phase)
+		r.Recorder.Eventf(node, corev1.EventTypeNormal, "PhaseTransition",
+			"Phase changed from %s to %s", observedPhase, node.Status.Phase)
 
-	return planner.ResultRequeueImmediate, nil
-}
-
-// reconcileInitializing drives the unified Plan to completion. For
-// bootstrap nodes the StatefulSet and Service are created only after
-// bootstrap teardown is complete (to avoid RWO PVC conflicts). For
-// non-bootstrap nodes they are created immediately.
-func (r *SeiNodeReconciler) reconcileInitializing(ctx context.Context, node *seiv1alpha1.SeiNode) (ctrl.Result, error) {
-	plan := node.Status.Plan
-
-	// TODO: This should be a part of the plan, not a special case we need to filter out for the reconciliation before we
-	// go ahead with the actual plan.
-	if !planner.NeedsBootstrap(node) || planner.IsBootstrapComplete(plan) {
-		if err := r.reconcileNodeStatefulSet(ctx, node); err != nil {
-			return ctrl.Result{}, fmt.Errorf("reconciling statefulset: %w", err)
-		}
-		if err := r.reconcileNodeService(ctx, node); err != nil {
-			return ctrl.Result{}, fmt.Errorf("reconciling service: %w", err)
+		if node.Status.Phase == seiv1alpha1.PhaseRunning {
+			dur := time.Since(node.CreationTimestamp.Time).Seconds()
+			nodeInitDuration.WithLabelValues(ns, node.Spec.ChainID).Observe(dur)
+			nodeLastInitDuration.WithLabelValues(ns, name).Set(dur)
 		}
 	}
 
-	result, err := r.PlanExecutor.ExecutePlan(ctx, node, plan)
-	if err != nil {
-		return result, err
+	// Running phase: after the convergence plan completes, handle
+	// runtime tasks (image observation, monitor task polling).
+	if node.Status.Phase == seiv1alpha1.PhaseRunning {
+		return r.reconcileRunningTasks(ctx, node)
 	}
 
-	// TODO: transitioning should be a part of ExecutePlan, I would think. This way, when a plan is done, it has materialized
-	// the proper state and does some final patch to the node, basically leaving it in a state that other logic can observe and
-	// act on as well. For instance, if a plan succeeds, the plan knows what state to put it in. Same for if it fails.
-	if plan.Phase == seiv1alpha1.TaskPlanComplete {
-		return r.transitionPhase(ctx, node, seiv1alpha1.PhaseRunning)
-	}
-	if plan.Phase == seiv1alpha1.TaskPlanFailed {
-		return r.transitionPhase(ctx, node, seiv1alpha1.PhaseFailed)
-	}
 	return result, nil
 }
 
-// reconcileRunning converges owned resources and handles runtime tasks.
-func (r *SeiNodeReconciler) reconcileRunning(ctx context.Context, node *seiv1alpha1.SeiNode) (ctrl.Result, error) {
-
-	// TODO: ideally this becomes a task in a plan
-	if err := r.reconcileNodeStatefulSet(ctx, node); err != nil {
-		return ctrl.Result{}, fmt.Errorf("reconciling statefulset: %w", err)
-	}
-	// TODO: ideally this becomes a task in a plan
-	if err := r.reconcileNodeService(ctx, node); err != nil {
-		return ctrl.Result{}, fmt.Errorf("reconciling service: %w", err)
-	}
-
+// reconcileRunningTasks handles Running-phase work that is outside the plan:
+// image observation and sidecar monitor task polling.
+func (r *SeiNodeReconciler) reconcileRunningTasks(ctx context.Context, node *seiv1alpha1.SeiNode) (ctrl.Result, error) {
 	if err := r.observeCurrentImage(ctx, node); err != nil {
 		return ctrl.Result{}, fmt.Errorf("observing current image: %w", err)
 	}
@@ -226,36 +183,6 @@ func (r *SeiNodeReconciler) observeCurrentImage(ctx context.Context, node *seiv1
 	return nil
 }
 
-// transitionPhase transitions the node to a new phase and emits the associated
-// metric counter, phase gauge, and Kubernetes event.
-func (r *SeiNodeReconciler) transitionPhase(ctx context.Context, node *seiv1alpha1.SeiNode, phase seiv1alpha1.SeiNodePhase) (ctrl.Result, error) {
-	prev := node.Status.Phase
-	if prev == "" {
-		prev = seiv1alpha1.PhasePending
-	}
-
-	patch := client.MergeFromWithOptions(node.DeepCopy(), client.MergeFromWithOptimisticLock{})
-	node.Status.Phase = phase
-	if err := r.Status().Patch(ctx, node, patch); err != nil {
-		return ctrl.Result{}, fmt.Errorf("setting phase to %s: %w", phase, err)
-	}
-
-	ns, name := node.Namespace, node.Name
-	nodePhaseTransitions.WithLabelValues(ns, string(prev), string(phase)).Inc()
-	emitNodePhase(ns, name, phase)
-
-	if phase == seiv1alpha1.PhaseRunning {
-		dur := time.Since(node.CreationTimestamp.Time).Seconds()
-		nodeInitDuration.WithLabelValues(ns, node.Spec.ChainID).Observe(dur)
-		nodeLastInitDuration.WithLabelValues(ns, name).Set(dur)
-	}
-
-	r.Recorder.Eventf(node, corev1.EventTypeNormal, "PhaseTransition",
-		"Phase changed from %s to %s", prev, phase)
-
-	return planner.ResultRequeueImmediate, nil
-}
-
 // SetupWithManager sets up the controller with the Manager.
 func (r *SeiNodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
 	return ctrl.NewControllerManagedBy(mgr).
@@ -308,37 +235,3 @@ func (r *SeiNodeReconciler) deleteNodeDataPVC(ctx context.Context, node *seiv1al
 	}
 	return r.Delete(ctx, pvc)
 }
-
-func (r *SeiNodeReconciler) ensureNodeDataPVC(ctx context.Context, node *seiv1alpha1.SeiNode) error {
-	desired := noderesource.GenerateDataPVC(node, r.Platform)
-	if err := ctrl.SetControllerReference(node, desired, r.Scheme); err != nil {
-		return fmt.Errorf("setting owner reference: %w", err)
-	}
-
-	existing := &corev1.PersistentVolumeClaim{}
-	err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
-	if apierrors.IsNotFound(err) {
-		return r.Create(ctx, desired)
-	}
-	return err
-}
-
-func (r *SeiNodeReconciler) reconcileNodeStatefulSet(ctx context.Context, node *seiv1alpha1.SeiNode) error {
-	desired := noderesource.GenerateStatefulSet(node, r.Platform)
-	desired.SetGroupVersionKind(appsv1.SchemeGroupVersion.WithKind("StatefulSet"))
-	if err := ctrl.SetControllerReference(node, desired, r.Scheme); err != nil {
-		return fmt.Errorf("setting owner reference: %w", err)
-	}
-	//nolint:staticcheck // migrating to typed ApplyConfiguration is a separate effort
-	return r.Patch(ctx, desired, client.Apply, fieldOwner, client.ForceOwnership)
-}
-
-func (r *SeiNodeReconciler) reconcileNodeService(ctx context.Context, node *seiv1alpha1.SeiNode) error {
-	desired := noderesource.GenerateHeadlessService(node)
-	desired.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Service"))
-	if err := ctrl.SetControllerReference(node, desired, r.Scheme); err != nil {
-		return fmt.Errorf("setting owner reference: %w", err)
-	}
-	//nolint:staticcheck // migrating to typed ApplyConfiguration is a separate effort
-	return r.Patch(ctx, desired, client.Apply, fieldOwner, client.ForceOwnership)
-}
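The controller diff captures the merge-patch base (a DeepCopy) before ResolvePlan mutates the in-memory node; if the snapshot were taken after the mutation, the computed patch would be empty and the new plan would never be persisted. A dependency-free sketch of that ordering, with toy types standing in for controller-runtime's client.MergeFrom:

```go
package main

import "fmt"

// Node is a toy stand-in for the SeiNode object; PlanName models the bit of
// status that ResolvePlan stamps on.
type Node struct{ PlanName string }

// diffAgainst models a merge patch: it reports what changed relative to the
// captured base, the way client.MergeFrom computes a patch from a DeepCopy.
func diffAgainst(base, current *Node) map[string]string {
	patch := map[string]string{}
	if base.PlanName != current.PlanName {
		patch["planName"] = current.PlanName
	}
	return patch
}

// resolvePlan mutates the object in place, like planner.ResolvePlan.
func resolvePlan(n *Node) { n.PlanName = "init-plan" }

func main() {
	node := &Node{}

	// Correct order: snapshot first, then mutate, then diff.
	base := *node // stands in for node.DeepCopy()
	resolvePlan(node)
	fmt.Println(len(diffAgainst(&base, node))) // 1: the new plan is in the patch

	// Wrong order: snapshotting after the mutation yields an empty patch,
	// so the new plan would never reach the API server.
	late := *node
	fmt.Println(len(diffAgainst(&late, node))) // 0
}
```

This is why the documented ResolvePlan contract requires the caller to capture the patch base before calling it.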

internal/controller/node/monitor_test.go

Lines changed: 2 additions & 2 deletions
@@ -503,9 +503,9 @@ func TestReconcileRunning_MonitorMode_SubmitsMonitorTask(t *testing.T) {
 
 	r, c := newProgressionReconciler(t, mock, node)
 
-	_, err := r.reconcileRunning(context.Background(), node)
+	_, err := r.reconcileRunningTasks(context.Background(), node)
 	if err != nil {
-		t.Fatalf("reconcileRunning: %v", err)
+		t.Fatalf("reconcileRunningTasks: %v", err)
 	}
 
 	if len(mock.submitted) != 1 {
