Skip to content

feat(operator): make node drain behavior configurable#264

Open
AnouarMohamed wants to merge 6 commits into
NVIDIA:mainfrom
AnouarMohamed:fix/issue-259-drain-config
Open

feat(operator): make node drain behavior configurable#264
AnouarMohamed wants to merge 6 commits into
NVIDIA:mainfrom
AnouarMohamed:fix/issue-259-drain-config

Conversation

@AnouarMohamed

@AnouarMohamed AnouarMohamed commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #259.

Adds configurable drain behavior for interrupt-enabled Skyhooks through spec.drainConfig, including eviction vs direct delete, emptyDir handling, unmanaged pod handling, DaemonSet handling, drain timeout, and drain grace period.

Details

  • Adds the DrainConfig API field, validation, generated deepcopy, and CRD schema.
  • Replaces the hardcoded drain pod filter with a focused internal/drain decision package.
  • Persists drain start time on nodes so timeout can mark the node erroring and emit drain events.
  • Keeps podNonInterruptLabels as a pre-drain barrier before drain selection runs.
  • Updates reset cleanup to remove drain-start metadata.
  • Mirrors the CRD change into the Helm chart and documents defaults in docs/interrupt_flow.md.
  • Adds Chainsaw coverage for PDB bypass via direct delete and timeout surfacing.

Behavior Changes

  • Default drain matching now follows Kubernetes drain semantics more closely: already-terminating pods and mirror/static pods are ignored, unschedulable tolerations use Kubernetes ToleratesTaint matching, and DaemonSet pods are detected from their controller owner reference.
  • Drain timeout leaves the node cordoned and marks the node/Skyhook erroring. If the blocker later clears, reconcile can continue from current cluster state; the recommended production recovery remains clearing the blocker and running kubectl skyhook reset <skyhook> --confirm or kubectl skyhook node reset ....
  • CLI reset and node reset now select nodes with any resettable Skyhook metadata, including status, cordon, and drain-start metadata. Malformed nodeState annotations are reset with empty package state instead of causing the node to be skipped.

Validation

  • make fmt
  • go test -mod=vendor ./internal/drain
  • go test -mod=vendor -run '^$' ./internal/controller
  • ./bin/ginkgo -mod=vendor --focus 'DrainNode' --timeout=300s ./internal/controller
  • go test -mod=vendor -run '^$' ./...
  • git diff --check
  • Parsed k8s-tests/chainsaw/skyhook/drain-config/*.yaml with PyYAML

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds an optional spec.drainConfig to Skyhook (CRD/chart/webhook/deepcopy), a new operator/internal/drain package to decide per-pod drain actions, per-Skyhook drain-start annotations on nodes (wrapper + mocks + tests), controller changes to apply drain.Options with timeout/grace handling and to perform evict/delete per DecidePod results, CLI reset updates to remove drain metadata, and Chainsaw e2e tests plus documentation and changelog updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/nodewright#266: Modifies HandleConfigUpdates in the controller and touches similar controller code paths relevant to drain-related error wrapping and config update flows.

Suggested reviewers

  • mskalka
  • rice-riley
  • ayuskauskas
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately describes the main change: making node drain behavior configurable via spec.drainConfig, which is the central theme of this PR.
Linked Issues check ✅ Passed The PR successfully implements all core objectives from issue #259: DrainConfig API with validation and deepcopy, configurable eviction vs direct delete, timeout with drain-start tracking, drain grace period, emptyDir handling, DaemonSet skip, preserved golden-workload barrier, and comprehensive test coverage including unit tests and Chainsaw e2e tests for PDB bypass and timeout behavior.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #259 objectives: DrainConfig implementation, drain logic refactoring, node drain-start tracking, reset cleanup updates, and test coverage are all specified as in-scope requirements.
Description check ✅ Passed The PR description clearly explains the feature: configurable drain behavior for Skyhooks through spec.drainConfig, including specific implementation details and validation steps.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml`:
- Around line 20-22: The test manifest for the Chainsaw suite (metadata.name:
drain-config) is missing the required Chainsaw pool assignment; add an explicit
metadata.labels.pool entry under the resource metadata (e.g.,
metadata.labels.pool: "lifecycle" or another appropriate value from
core|interrupt|uninstall|lifecycle) so CI can route the test according to its
mutation profile; update the chainsaw-test.yaml metadata block to include the
labels object with the chosen pool value.
- Around line 116-125: The loop currently checks only for any event with
reason=Drain for involvedObject.name=drain-config-timeout; update the kubectl
check in the script to assert the timeout-specific event emitted by the
controller by matching the action (DrainTimeout) or the message text that
contains "DrainTimeout" instead of just reason=Drain. Locate the kubectl
invocation in the script (the line using kubectl -n skyhook get events
--field-selector
involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=Drain
-o name | grep .) and change the field-selector or subsequent grep to require
action=DrainTimeout (or grep the event message for "DrainTimeout") so the loop
only succeeds when the timeout-specific event is present.

In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml`:
- Around line 36-39: The workload container currently uses an unpinned external
tag ("image: busybox:latest") which violates the image-source policy; update the
container with name "workload" by replacing the image field to a pinned digest
or semver tag hosted under the approved registries (e.g.,
ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*) so it no longer
references busybox:latest — ensure the new image is a specific version or digest
and adjust any related pull policy if necessary.

In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 30-33: The manifest uses an unapproved floating image for the
container named "workload" (image: busybox:latest); replace it with a pinned
image from an approved NVIDIA registry (e.g.,
ghcr.io/nvidia/nodewright/<approved-image>:<fixed-tag> or
ghcr.io/nvidia/skyhook/<approved-image>:<fixed-tag>) and ensure the YAML 'image'
field is updated accordingly (keep the existing command). Optionally add
imagePullPolicy: IfNotPresent to ensure the pinned tag is used.

In `@operator/cmd/cli/app/node/node_reset.go`:
- Around line 54-61: The selection predicate currently only checks annotations
via hasResettableAnnotation, so nodes that have had their annotations cleared
but still retain the label (statusLabelPrefix+opts.skyhookName) won't be
reselected; update the predicate and/or hasResettableAnnotation to also inspect
node.Labels for the same resettable keys (specifically check for
statusLabelPrefix+opts.skyhookName and any other label-only keys referenced in
lines 155-157) so retries will find nodes that still have residual labels;
ensure you reference hasResettableAnnotation, statusLabelPrefix, and
opts.skyhookName when adding the label checks.
- Around line 159-167: The current logic in node_reset.go skips resetting a node
when json.Unmarshal into nodeState fails; instead, modify the error path so that
on malformed annotation (json.Unmarshal error) you log the warning (using
cliCtx.GlobalFlags.Verbose and fmt.Fprintf to cmd.ErrOrStderr()) but do not
continue—set nodeState to an empty v1alpha1.NodeState (treat package list as
empty) and proceed with the normal reset/clear of annotations and labels; keep
the annotationKey/node.Annotations lookup and only suppress the abort/continue
on unmarshal failure so the subsequent code that clears metadata still runs.

In `@operator/cmd/cli/app/reset.go`:
- Around line 64-70: The eligibility check currently in hasResettableAnnotation
only inspects annotations, so nodes with only the stale status label
(statusLabelPrefix+skyhookName) are missed; update the logic to also inspect
labels: either extend hasResettableAnnotation to accept labels map[string]string
(or add a new helper) and check for the presence of the same keys in the labels
map, and ensure the other call sites (including where
statusLabelPrefix+skyhookName is removed around the reset flow referenced near
the other usage) use the updated helper so nodes with label-only leftovers are
detected and reset.
- Around line 137-145: The current logic in reset.go skips the node entirely on
a malformed annotation by calling continue inside the json.Unmarshal error
branch; instead, remove the continue so the node still proceeds through reset
with an empty fallback v1alpha1.NodeState (nodeState variable) and emit the same
verbose warning via cliCtx.GlobalFlags.Verbose and fmt.Fprintf to
cmd.ErrOrStderr() when annotation parsing fails; ensure annotationKey is still
checked and that nodeState remains the zero value when unmarshalling errors
occur so drainStart_, status_ and other reset metadata get cleaned up.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The controller is calling controller-runtime
delete/eviction APIs directly (r.Client.SubResource("eviction").Create and
r.Delete) which bypasses the dal/wrapper seam; refactor these calls to use the
repository abstraction instead: add or reuse appropriate methods on the
dal/wrapper (e.g., PodDAL.Evict(ctx, pod, options) and PodDAL.Delete(ctx, pod,
options) or similarly-named wrapper methods) and replace the direct
r.Client.SubResource("eviction").Create and r.Delete usages in the reconcile
flow with calls to those dal/wrapper methods so tests can mock them.

In `@operator/internal/drain/drain_test.go`:
- Around line 51-195: The tests in drain_test.go should be converted from many
individual It blocks into a single Ginkgo DescribeTable: create a DescribeTable
that takes a pod (use basePod() variants), options (DefaultOptions() modified
per case), and the expected Decision (e.g., Decision{Action: ActionEvict,
Reason: ReasonEviction}), then add Entry(...) rows for each current case
(default behavior, eviction disabled, non-running phase, terminating,
unschedulable toleration, daemonset default and when IgnoreDaemonSets=false,
kube-system, mirror pod, unmanaged default and when Force=false, emptyDir
default and when DeleteEmptyDirData=false). Replace the individual It tests with
Entries calling DecidePod(pod, options) and Expect(...).Reference symbols:
DescribeTable/Entry, basePod, DecidePod, DefaultOptions, Decision,
ActionEvict/ActionDelete/ActionIgnore/ActionBlock and Reason* constants.

In `@operator/internal/drain/drain.go`:
- Around line 126-132: The toleratesUnschedulable function currently checks only
toleration.Key and must instead use Kubernetes' taint matching: in
toleratesUnschedulable, construct a corev1.Taint with Key:
corev1.TaintNodeUnschedulable and Effect: corev1.TaintEffectNoSchedule and call
each toleration's ToleratesTaint (e.g., toleration.ToleratesTaint(&taint,
false)) returning true if any toleration tolerates it; mirror the approach used
by CheckTaintToleration. Also add unit tests for toleratesUnschedulable covering
tolerations with mismatched Effect, Operator, and Value so key-only matches
cannot regress.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d40b1c75-9f43-46ea-928c-51d35c7d7ec4

📥 Commits

Reviewing files that changed from the base of the PR and between b264019 and bf419d9.

📒 Files selected for processing (27)
  • chart/templates/skyhook-crd.yaml
  • docs/interrupt_flow.md
  • k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/skyhook_webhook_test.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/cmd/cli/app/node/node_reset.go
  • operator/cmd/cli/app/node/node_reset_test.go
  • operator/cmd/cli/app/node/node_status.go
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/reset_test.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/drain/drain.go
  • operator/internal/drain/drain_suite_test.go
  • operator/internal/drain/drain_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/mock/SkyhookNodeOnly.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
Comment thread k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
Comment thread k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml
Comment thread k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
Comment thread operator/cmd/cli/app/node/node_reset.go Outdated
Comment thread operator/cmd/cli/app/reset.go Outdated
Comment thread operator/cmd/cli/app/reset.go
Comment thread operator/internal/controller/skyhook_controller.go Outdated
Comment thread operator/internal/drain/drain_test.go
Comment thread operator/internal/drain/drain.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (11)
operator/internal/controller/skyhook_controller.go (1)

1767-1775: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Move pod eviction/delete behind the DAL or wrapper seam.

This new drain path still mutates pods through raw controller-runtime client calls, which bypasses the repo’s controller abstraction and makes the flow harder to mock consistently in controller tests.

As per coding guidelines Use the dal and wrapper layers instead of reaching for controller-runtime client.Client directly from new controller code. These layers are the seam tests mock against.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/controller/skyhook_controller.go` around lines 1767 - 1775,
The drain code is directly mutating pods via
r.Client.SubResource("eviction").Create(...) and r.Delete(...), bypassing the
project's DAL/wrapper seam; instead, introduce and call the existing DAL/wrapper
methods (e.g., a PodRepository or PodWrapper method such as EvictPod(ctx, pod,
options) and DeletePod(ctx, pod, options) or add them if missing) and replace
the direct calls to policyv1.Eviction creation and r.Delete with those wrapper
calls (refer to the eviction creation block using policyv1.Eviction and the
drain.ActionDelete branch calling r.Delete) so tests can mock the DAL/wrapper
rather than controller-runtime client.Client directly.
operator/internal/drain/drain_test.go (1)

51-195: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Collapse these decision cases into a DescribeTable.

These cases are all the same test shape with different pod/options/expected outcomes, so keeping them as separate It blocks makes the suite harder to extend and drifts from the repo’s required Ginkgo style.

As per coding guidelines Use Ginkgo/Gomega for all unit tests, not stdlib t.Run. Match the testing file you're adding to. Use DescribeTable/Entry for table-driven tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/drain/drain_test.go` around lines 51 - 195, Collapse the
repeated It blocks into a single Ginkgo DescribeTable that iterates over test
cases by calling DecidePod for each input; create table entries that provide the
pod (use basePod() with per-entry modifications), the options (DefaultOptions()
with per-entry overrides), and the expected Decision values (Decision{Action:
..., Reason: ...}), and assert the Decision equals the expected value plus the
relevant boolean checks for BlocksDrain() and RequiresAction() where applicable.
Locate the repeated assertions around DecidePod, DefaultOptions, basePod, and
Decision (and constants like ActionEvict, ActionDelete, ActionIgnore,
ActionBlock and reasons such as ReasonEviction, ReasonDelete, ReasonPhase,
ReasonTerminating, ReasonUnschedulableToleration, ReasonDaemonSet,
ReasonKubeSystem, ReasonMirrorPod, ReasonUnmanaged, ReasonEmptyDir) and convert
each It case into a DescribeTable Entry with a descriptive name and the
pod/options/expected result to keep behavior identical. Ensure entries include
cases that also assert BlocksDrain() and RequiresAction() when the original test
did so.
operator/internal/drain/drain.go (1)

126-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use Kubernetes taint matching instead of a key-only check.

This still treats any toleration with Key == node.kubernetes.io/unschedulable as sufficient, even when the Effect, Operator, or Value would not actually tolerate the node’s unschedulable taint. That can incorrectly ignore pods and let drain finish early.

Suggested fix
 func toleratesUnschedulable(pod *corev1.Pod) bool {
+	taint := &corev1.Taint{
+		Key:    corev1.TaintNodeUnschedulable,
+		Effect: corev1.TaintEffectNoSchedule,
+	}
 	for _, toleration := range pod.Spec.Tolerations {
-		if toleration.Key == corev1.TaintNodeUnschedulable {
+		if toleration.ToleratesTaint(taint) {
 			return true
 		}
 	}
 	return false
 }
#!/bin/bash
set -euo pipefail

FILE="operator/internal/drain/drain.go"
VENDOR_TOL="operator/vendor/k8s.io/api/core/v1/toleration.go"
VENDOR_TAINT="operator/vendor/k8s.io/api/core/v1/well_known_taints.go"

echo "Current implementation:"
sed -n '120,135p' "$FILE" | cat -n

echo
echo "Kubernetes toleration matcher:"
rg -n "func \\(t \\*?Toleration\\) ToleratesTaint" "$VENDOR_TOL"
sed -n '1,120p' "$VENDOR_TOL" | cat -n

echo
echo "Unschedulable taint constant:"
rg -n "TaintNodeUnschedulable|NoSchedule" "$VENDOR_TAINT"
sed -n '1,80p' "$VENDOR_TAINT" | cat -n
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/drain/drain.go` around lines 126 - 129, The
toleratesUnschedulable function currently checks only toleration.Key; update it
to use Kubernetes' full taint matching by constructing the unschedulable taint
and calling the Toleration.ToleratesTaint matcher for each toleration.
Specifically, in toleratesUnschedulable iterate pod.Spec.Tolerations and return
true when toleration.ToleratesTaint(&corev1.Taint{Key:
corev1.TaintNodeUnschedulable, Effect: corev1.TaintEffectNoSchedule}) (or the
appropriate Effect constant) returns true so Operator/Effect/Value are
respected; otherwise return false. Ensure you reference the
Toleration.ToleratesTaint method and
corev1.TaintNodeUnschedulable/corev1.TaintEffectNoSchedule symbols.
operator/cmd/cli/app/reset.go (2)

64-71: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Selection misses nodes with label-only residue.

hasResettableAnnotation only inspects node.Annotations, but line 228 removes the statusLabelPrefix+skyhookName label. If a prior reset attempt clears annotations but fails during label removal, subsequent retries will never select that node again, leaving a permanent stale status label.

💡 Suggested fix

Extend hasResettableAnnotation to also check node.Labels for statusLabelPrefix+skyhookName, or introduce a separate hasResettableLabel helper and OR the results in the selection predicate at line 133.

-func hasResettableAnnotation(annotations map[string]string, annotationKeys []string) bool {
+func hasResettableMetadata(annotations, labels map[string]string, annotationKeys []string, skyhookName string) bool {
 	for _, annotationKey := range annotationKeys {
 		if _, ok := annotations[annotationKey]; ok {
 			return true
 		}
 	}
+	if _, ok := labels[statusLabelPrefix+skyhookName]; ok {
+		return true
+	}
 	return false
 }

Then update call site:

-	if !hasResettableAnnotation(node.Annotations, annotationKeys) {
+	if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, skyhookName) {

Also applies to: 133-134

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/cmd/cli/app/reset.go` around lines 64 - 71, hasResettableAnnotation
currently only checks node.Annotations, so nodes that were left with only the
status label (statusLabelPrefix+skyhookName) after a failed reset won’t be
selected; update hasResettableAnnotation to also inspect node.Labels for the
statusLabelPrefix+skyhookName (or add a small hasResettableLabel helper) and
ensure the selection predicate that calls hasResettableAnnotation (the node
selection used for resets) ORs the annotation and label checks so nodes with
label-only residue are included.

137-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let malformed nodeState block cleanup.

When json.Unmarshal fails (line 143), the continue skips this node entirely, leaving drainStart_, status_, cordon_, and other reset metadata behind. For a recovery command, the safer behavior is to warn (preserving the existing verbose message) but proceed with an empty NodeState so all Skyhook-related annotations and labels still get cleared.

🛡️ Suggested fix
 		nodeState := v1alpha1.NodeState{}
 		if annotation, ok := node.Annotations[annotationKey]; ok {
 			if err := json.Unmarshal([]byte(annotation), &nodeState); err != nil {
 				if cliCtx.GlobalFlags.Verbose {
 					_, _ = fmt.Fprintf(cmd.ErrOrStderr(), "Warning: skipping node %q - invalid annotation: %v\n", node.Name, err)
 				}
-				continue
+				// Proceed with empty NodeState so we still clear drain/status/cordon metadata
+				nodeState = v1alpha1.NodeState{}
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/cmd/cli/app/reset.go` around lines 137 - 145, The code currently
bails out on a malformed annotation during json.Unmarshal (in the nodeState
handling around annotationKey / v1alpha1.NodeState), which leaves Skyhook
metadata uncleared; instead, keep the existing verbose warning
(cliCtx.GlobalFlags.Verbose and fmt.Fprintf to cmd.ErrOrStderr()) but do not
return/continue on error — allow nodeState to remain the zero value and proceed
so the reset logic can still clear drainStart_, status_, cordon_,
labels/annotations etc.; remove the continue after json.Unmarshal error handling
and rely on the initialized nodeState.
operator/cmd/cli/app/node/node_reset.go (2)

54-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Selection misses nodes with label-only residue.

hasResettableAnnotation only checks node.Annotations, but line 227 removes the statusLabelPrefix+opts.skyhookName label. If a prior reset cleared annotations but failed on the label, retries will not reselect that node, leaving a stale status label.

💡 Suggested fix

Extend the helper to also inspect labels, or add a separate label check in the selection predicate. Example:

-func hasResettableAnnotation(annotations map[string]string, annotationKeys []string) bool {
+func hasResettableMetadata(annotations, labels map[string]string, annotationKeys []string, skyhookName string) bool {
 	for _, annotationKey := range annotationKeys {
 		if _, ok := annotations[annotationKey]; ok {
 			return true
 		}
 	}
+	if _, ok := labels[statusLabelPrefix+skyhookName]; ok {
+		return true
+	}
 	return false
 }

Then at line 155:

-		if !hasResettableAnnotation(node.Annotations, annotationKeys) {
+		if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, opts.skyhookName) {

Also applies to: 155-157

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/cmd/cli/app/node/node_reset.go` around lines 54 - 61, The selection
currently misses nodes that only have the stale status label because
hasResettableAnnotation only checks node.Annotations; update the selection to
also consider node.Labels by modifying hasResettableAnnotation (or creating a
new helper) to accept both annotations and labels and check for the same
annotationKeys plus the label key constructed as
statusLabelPrefix+opts.skyhookName (or add an additional check in the predicate
used when filtering nodes) so that nodes with either matching annotations or the
specific status label are returned by the selection logic.

159-167: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Malformed nodeState should not prevent a node reset.

The continue on line 165 skips resetting this node when json.Unmarshal fails, leaving drainStart_, status_, cordon_, and other reset metadata behind. Since this is a recovery command, the safer behavior is to warn (keeping the existing verbose message) but proceed with an empty NodeState so annotations and labels still get cleared.

🛡️ Suggested fix
 		nodeState := v1alpha1.NodeState{}
 		if annotation, ok := node.Annotations[annotationKey]; ok {
 			if err := json.Unmarshal([]byte(annotation), &nodeState); err != nil {
 				if cliCtx.GlobalFlags.Verbose {
 					_, _ = fmt.Fprintf(cmd.ErrOrStderr(), "Warning: skipping node %q - invalid annotation: %v\n", nodeName, err)
 				}
-				continue
+				// Proceed with empty NodeState to still clear drain/status/cordon metadata
+				nodeState = v1alpha1.NodeState{}
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/cmd/cli/app/node/node_reset.go` around lines 159 - 167, The code
currently bails out of resetting the node when json.Unmarshal into nodeState
fails; instead, keep nodeState as the zero value and proceed with the reset. In
the block around nodeState and json.Unmarshal (variables: nodeState, annotation,
node.Annotations, json.Unmarshal), remove the continue on error and retain the
existing verbose warning, so malformed annotations are logged but the rest of
the reset logic still runs to clear annotations/labels and other metadata.
k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml (1)

30-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a pinned image from an approved registry.

The workload container uses busybox:latest, which violates the image-source policy. Replace with a pinned image from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*.

As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml` around lines 30 - 33,
The container "workload" in timeout.yaml uses an unapproved image
"busybox:latest"; update the image field for the container named workload to a
pinned image from the approved registries (ghcr.io/nvidia/nodewright/* or
ghcr.io/nvidia/skyhook/*) using a specific tag or digest (not :latest), e.g.,
replace the image value with a concrete ghcr.io/nvidia/skyhook/<image>:<version>
or `@sha256`:<digest> to satisfy the image-source policy.
k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml (2)

116-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the timeout-specific drain event.

The loop checks for any reason=Drain event but doesn't verify it's the timeout-specific DrainTimeout action. This can pass without validating the timeout behavior. Please match on the timeout-specific action or message.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml` around lines 116
- 125, The current script loop only looks for events with reason=Drain which can
match non-timeout drains; update the kubectl check in the script block (the loop
under script: content) to assert the timeout-specific event by either changing
the field-selector to reason=DrainTimeout (i.e., --field-selector
involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=DrainTimeout)
or by querying events with -o jsonpath and asserting the .message contains the
timeout text (e.g., grep "timeout") so the test only succeeds when the
DrainTimeout-specific event is emitted.

20-22: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add an explicit Chainsaw pool label.

This test suite is missing metadata.labels.pool. CI needs the pool assignment to route the test according to its mutation profile (e.g., pool: interrupt or pool: lifecycle).

Based on learnings, "In Chainsaw e2e test YAMLs, the pool label (core, interrupt, uninstall, lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml` around lines 20 -
22, The manifest for the resource with metadata.name "drain-config" must include
an explicit metadata.labels.pool entry; add a metadata.labels block and set pool
to the correct Chainsaw test pool (one of: core, interrupt, uninstall,
lifecycle) that matches the cluster/runtime mutation profile this test exercises
(e.g., pool: interrupt or pool: lifecycle) so CI can route the test correctly.
k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml (1)

36-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a pinned image from an approved registry.

The workload container uses busybox:latest, which violates the repository's image-source policy. Switch to a pinned image under ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*.

As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml` around lines
36 - 39, The workload container currently uses an unapproved image
"busybox:latest"; update the image field for the container named "workload" in
disable-eviction.yaml to a pinned digest or tag from an approved registry
(ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*), e.g., replace
"busybox:latest" with a specific ghcr.io/nvidia/... image and tag/digest to
ensure reproducibility and policy compliance; leave the container name and
command unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1716-1735: The drain-timeout branch currently sets
skyhookNode.SetStatus(v1alpha1.StatusErroring) but returns (false, nil), which
keeps ProcessInterrupt/RunSkyhookPackages in the normal short requeue loop;
change the return to signal the caller to stop the regular retry loop — e.g.,
return true, nil — so that after drainTimedOut(...) and setting StatusErroring
the reconciler does not immediately requeue the node every 2s; update the branch
that checks drainTimedOut(drainStartedAt, drainConfig.Timeout, now.Time) to
return the appropriate stop-retry value (true, nil) and ensure callers of
ProcessInterrupt/RunSkyhookPackages honor that contract.

In `@operator/internal/wrapper/node.go`:
- Around line 556-557: Replace the inline-constructed drainStart annotation key
deletion with the existing helper: instead of delete(node.Annotations,
fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
call the drainStartAnnotationKey helper (used by StartDrain, DrainStartedAt,
ClearDrainStart) to compute the key and delete that entry; keep the other delete
for cordon_... as-is. This ensures consistent key formatting via
drainStartAnnotationKey().

---

Duplicate comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml`:
- Around line 116-125: The current script loop only looks for events with
reason=Drain which can match non-timeout drains; update the kubectl check in the
script block (the loop under script: content) to assert the timeout-specific
event by either changing the field-selector to reason=DrainTimeout (i.e.,
--field-selector
involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=DrainTimeout)
or by querying events with -o jsonpath and asserting the .message contains the
timeout text (e.g., grep "timeout") so the test only succeeds when the
DrainTimeout-specific event is emitted.
- Around line 20-22: The manifest for the resource with metadata.name
"drain-config" must include an explicit metadata.labels.pool entry; add a
metadata.labels block and set pool to the correct Chainsaw test pool (one of:
core, interrupt, uninstall, lifecycle) that matches the cluster/runtime mutation
profile this test exercises (e.g., pool: interrupt or pool: lifecycle) so CI can
route the test correctly.

In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml`:
- Around line 36-39: The workload container currently uses an unapproved image
"busybox:latest"; update the image field for the container named "workload" in
disable-eviction.yaml to a pinned digest or tag from an approved registry
(ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*), e.g., replace
"busybox:latest" with a specific ghcr.io/nvidia/... image and tag/digest to
ensure reproducibility and policy compliance; leave the container name and
command unchanged.

In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 30-33: The container "workload" in timeout.yaml uses an unapproved
image "busybox:latest"; update the image field for the container named workload
to a pinned image from the approved registries (ghcr.io/nvidia/nodewright/* or
ghcr.io/nvidia/skyhook/*) using a specific tag or digest (not :latest), e.g.,
replace the image value with a concrete ghcr.io/nvidia/skyhook/<image>:<version>
or `@sha256`:<digest> to satisfy the image-source policy.

In `@operator/cmd/cli/app/node/node_reset.go`:
- Around line 54-61: The selection currently misses nodes that only have the
stale status label because hasResettableAnnotation only checks node.Annotations;
update the selection to also consider node.Labels by modifying
hasResettableAnnotation (or creating a new helper) to accept both annotations
and labels and check for the same annotationKeys plus the label key constructed
as statusLabelPrefix+opts.skyhookName (or add an additional check in the
predicate used when filtering nodes) so that nodes with either matching
annotations or the specific status label are returned by the selection logic.
- Around line 159-167: The code currently bails out of resetting the node when
json.Unmarshal into nodeState fails; instead, keep nodeState as the zero value
and proceed with the reset. In the block around nodeState and json.Unmarshal
(variables: nodeState, annotation, node.Annotations, json.Unmarshal), remove the
continue on error and retain the existing verbose warning, so malformed
annotations are logged but the rest of the reset logic still runs to clear
annotations/labels and other metadata.

In `@operator/cmd/cli/app/reset.go`:
- Around line 64-71: hasResettableAnnotation currently only checks
node.Annotations, so nodes that were left with only the status label
(statusLabelPrefix+skyhookName) after a failed reset won’t be selected; update
hasResettableAnnotation to also inspect node.Labels for the
statusLabelPrefix+skyhookName (or add a small hasResettableLabel helper) and
ensure the selection predicate that calls hasResettableAnnotation (the node
selection used for resets) ORs the annotation and label checks so nodes with
label-only residue are included.
- Around line 137-145: The code currently bails out on a malformed annotation
during json.Unmarshal (in the nodeState handling around annotationKey /
v1alpha1.NodeState), which leaves Skyhook metadata uncleared; instead, keep the
existing verbose warning (cliCtx.GlobalFlags.Verbose and fmt.Fprintf to
cmd.ErrOrStderr()) but do not return/continue on error — allow nodeState to
remain the zero value and proceed so the reset logic can still clear
drainStart_, status_, cordon_, labels/annotations etc.; remove the continue
after json.Unmarshal error handling and rely on the initialized nodeState.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The drain code is directly mutating pods via
r.Client.SubResource("eviction").Create(...) and r.Delete(...), bypassing the
project's DAL/wrapper seam; instead, introduce and call the existing DAL/wrapper
methods (e.g., a PodRepository or PodWrapper method such as EvictPod(ctx, pod,
options) and DeletePod(ctx, pod, options) or add them if missing) and replace
the direct calls to policyv1.Eviction creation and r.Delete with those wrapper
calls (refer to the eviction creation block using policyv1.Eviction and the
drain.ActionDelete branch calling r.Delete) so tests can mock the DAL/wrapper
rather than controller-runtime client.Client directly.

In `@operator/internal/drain/drain_test.go`:
- Around line 51-195: Collapse the repeated It blocks into a single Ginkgo
DescribeTable that iterates over test cases by calling DecidePod for each input;
create table entries that provide the pod (use basePod() with per-entry
modifications), the options (DefaultOptions() with per-entry overrides), and the
expected Decision values (Decision{Action: ..., Reason: ...}), and assert the
Decision equals the expected value plus the relevant boolean checks for
BlocksDrain() and RequiresAction() where applicable. Locate the repeated
assertions around DecidePod, DefaultOptions, basePod, and Decision (and
constants like ActionEvict, ActionDelete, ActionIgnore, ActionBlock and reasons
such as ReasonEviction, ReasonDelete, ReasonPhase, ReasonTerminating,
ReasonUnschedulableToleration, ReasonDaemonSet, ReasonKubeSystem,
ReasonMirrorPod, ReasonUnmanaged, ReasonEmptyDir) and convert each It case into
a DescribeTable Entry with a descriptive name and the pod/options/expected
result to keep behavior identical. Ensure entries include cases that also assert
BlocksDrain() and RequiresAction() when the original test did so.

In `@operator/internal/drain/drain.go`:
- Around line 126-129: The toleratesUnschedulable function currently checks only
toleration.Key; update it to use Kubernetes' full taint matching by constructing
the unschedulable taint and calling the Toleration.ToleratesTaint matcher for
each toleration. Specifically, in toleratesUnschedulable iterate
pod.Spec.Tolerations and return true when
toleration.ToleratesTaint(&corev1.Taint{Key: corev1.TaintNodeUnschedulable,
Effect: corev1.TaintEffectNoSchedule}) (or the appropriate Effect constant)
returns true so Operator/Effect/Value are respected; otherwise return false.
Ensure you reference the Toleration.ToleratesTaint method and
corev1.TaintNodeUnschedulable/corev1.TaintEffectNoSchedule symbols.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1e03117b-a559-4846-a8b8-ed3b20d72464

📥 Commits

Reviewing files that changed from the base of the PR and between bf419d9 and c7adb91.

📒 Files selected for processing (27)
  • chart/templates/skyhook-crd.yaml
  • docs/interrupt_flow.md
  • k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/skyhook_webhook_test.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/cmd/cli/app/node/node_reset.go
  • operator/cmd/cli/app/node/node_reset_test.go
  • operator/cmd/cli/app/node/node_status.go
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/reset_test.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/drain/drain.go
  • operator/internal/drain/drain_suite_test.go
  • operator/internal/drain/drain_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/mock/SkyhookNodeOnly.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread operator/internal/controller/skyhook_controller.go Outdated
Comment thread operator/internal/wrapper/node.go Outdated
@AnouarMohamed AnouarMohamed force-pushed the fix/issue-259-drain-config branch from c7adb91 to 0440375 Compare June 2, 2026 14:39

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
operator/internal/controller/skyhook_controller.go (2)

1767-1775: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Drain actions bypass the DAL/wrapper seam.

The drain loop calls r.Client.SubResource("eviction").Create (line 1768) and r.Delete (line 1774) directly, which bypasses the project's test seam and diverges from the established controller pattern. As per coding guidelines, new controller code should use dal/wrapper layers so tests can mock these mutations without hitting the real client.

As per coding guidelines: "Use the dal and wrapper layers instead of reaching for controller-runtime client.Client directly from new controller code. These layers are the seam tests mock against."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/controller/skyhook_controller.go` around lines 1767 - 1775,
The drain loop is calling the controller-runtime client directly
(r.Client.SubResource("eviction").Create and r.Delete) which bypasses the
testable DAL/wrapper seam; replace those direct calls with the project
DAL/wrapper APIs (e.g., call the DAL's EvictPod/Eviction method instead of
r.Client.SubResource("eviction").Create and call the wrapper/DAL
DeletePod/Delete method instead of r.Delete), adjust or add small DAL/wrapper
functions that accept the same ctx, pod and options (e.g., EvictPod(ctx, pod,
evictionDeleteOptions) and DeletePod(ctx, pod, deleteOptions...)), update
imports and the calling site to use those methods, and ensure tests can mock the
DAL/wrapper rather than hitting controller-runtime directly.

1716-1736: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Timeout expiry continues requeueing without a recovery path.

After the drain timeout expires and the node is marked StatusErroring, the function returns (false, nil), which keeps ProcessInterruptRunSkyhookPackages in the normal 2-second requeue loop indefinitely. This contradicts the timeout contract and provides no path forward—either manual intervention to resolve the blocking pods or a backoff/max-retry limit should stop the tight requeue cycle once the timeout fires.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/controller/skyhook_controller.go` around lines 1716 - 1736,
The handler currently marks the node StatusErroring but returns (false, nil)
which keeps ProcessInterrupt → RunSkyhookPackages in a tight 2s requeue; change
the control flow so that once drainTimedOut(...) triggers and you call
skyhookNode.SetStatus(v1alpha1.StatusErroring) you either (a) return an error
(e.g. fmt.Errorf("drain timed out for skyhook %s node %s") ) to surface a
backoff/retry from the controller runtime, or (b) return a terminal/stop signal
to the caller (e.g. a boolean that prevents immediate requeue) so the
reconciliation uses a longer backoff or requires manual intervention; update the
branch where drainTimedOut(...) is handled (references: drainTimedOut,
skyhookNode.SetStatus, EventsReasonSkyhookDrain, ProcessInterrupt,
RunSkyhookPackages) to implement one of these two behaviors and add a clear
event message indicating manual intervention is required.
operator/internal/wrapper/node.go (1)

557-557: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Use drainStartAnnotationKey() helper for consistency.

Line 557 inline-constructs the drain-start key, but the helper at line 176 provides the same logic. Using the helper improves consistency with StartDrain, DrainStartedAt, and ClearDrainStart.

♻️ Suggested refactor
 	delete(node.Annotations, fmt.Sprintf("%s/cordon_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
-	delete(node.Annotations, fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
+	delete(node.Annotations, node.drainStartAnnotationKey())
 	delete(node.Annotations, fmt.Sprintf("%s/nodeState_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/wrapper/node.go` at line 557, Replace the inline
construction of the drain-start annotation key with the existing helper: instead
of calling delete(node.Annotations, fmt.Sprintf("%s/drainStart_%s",
v1alpha1.METADATA_PREFIX, node.skyhook.Name)) use the
drainStartAnnotationKey(...) helper to produce the key; update the delete call
to delete(node.Annotations, drainStartAnnotationKey(node.skyhook.Name)) so it
stays consistent with StartDrain, DrainStartedAt, and ClearDrainStart.
operator/internal/drain/drain_test.go (1)

51-231: 🧹 Nitpick | 🔵 Trivial

Convert remaining tests to DescribeTable.

The toleration matching tests (lines 106-140) already use DescribeTable, but the remaining test cases are still individual It blocks. Converting all decision tests to a single DescribeTable would improve maintainability and follow the repo's Ginkgo guidance.

As per coding guidelines, use DescribeTable/Entry for table-driven tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/drain/drain_test.go` around lines 51 - 231, Group the
remaining individual It tests into a single DescribeTable that drives DecidePod
with table entries defining: a test name, a base pod modified per case (use
basePod() then mutate fields like Status.Phase, DeletionTimestamp,
Spec.Tolerations, OwnerReferences, Namespace, Annotations, Spec.Volumes),
optional Options (DefaultOptions() modified such as DisableEviction,
IgnoreDaemonSets, Force, DeleteEmptyDirData), and the expected Decision value
(e.g., Decision{Action: ActionEvict, Reason: ReasonEviction}, Decision{Action:
ActionIgnore, Reason: ReasonPhase}, Decision{Action: ActionDelete, Reason:
ReasonDelete}, Decision{Action: ActionBlock, Reason: ReasonUnmanaged}, etc.
Replace each It block with an Entry that calls DecidePod(pod, options) and
asserts equality to the expected Decision and, when relevant, checks
BlocksDrain() and RequiresAction()—referencing DecidePod, DefaultOptions,
basePod, Decision, BlocksDrain, RequiresAction, and the option fields
(DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData) to locate and
implement each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The drain loop is calling the controller-runtime client
directly (r.Client.SubResource("eviction").Create and r.Delete) which bypasses
the testable DAL/wrapper seam; replace those direct calls with the project
DAL/wrapper APIs (e.g., call the DAL's EvictPod/Eviction method instead of
r.Client.SubResource("eviction").Create and call the wrapper/DAL
DeletePod/Delete method instead of r.Delete), adjust or add small DAL/wrapper
functions that accept the same ctx, pod and options (e.g., EvictPod(ctx, pod,
evictionDeleteOptions) and DeletePod(ctx, pod, deleteOptions...)), update
imports and the calling site to use those methods, and ensure tests can mock the
DAL/wrapper rather than hitting controller-runtime directly.
- Around line 1716-1736: The handler currently marks the node StatusErroring but
returns (false, nil) which keeps ProcessInterrupt → RunSkyhookPackages in a
tight 2s requeue; change the control flow so that once drainTimedOut(...)
triggers and you call skyhookNode.SetStatus(v1alpha1.StatusErroring) you either
(a) return an error (e.g. fmt.Errorf("drain timed out for skyhook %s node %s") )
to surface a backoff/retry from the controller runtime, or (b) return a
terminal/stop signal to the caller (e.g. a boolean that prevents immediate
requeue) so the reconciliation uses a longer backoff or requires manual
intervention; update the branch where drainTimedOut(...) is handled (references:
drainTimedOut, skyhookNode.SetStatus, EventsReasonSkyhookDrain,
ProcessInterrupt, RunSkyhookPackages) to implement one of these two behaviors
and add a clear event message indicating manual intervention is required.

In `@operator/internal/drain/drain_test.go`:
- Around line 51-231: Group the remaining individual It tests into a single
DescribeTable that drives DecidePod with table entries defining: a test name, a
base pod modified per case (use basePod() then mutate fields like Status.Phase,
DeletionTimestamp, Spec.Tolerations, OwnerReferences, Namespace, Annotations,
Spec.Volumes), optional Options (DefaultOptions() modified such as
DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData), and the expected
Decision value (e.g., Decision{Action: ActionEvict, Reason: ReasonEviction},
Decision{Action: ActionIgnore, Reason: ReasonPhase}, Decision{Action:
ActionDelete, Reason: ReasonDelete}, Decision{Action: ActionBlock, Reason:
ReasonUnmanaged}, etc. Replace each It block with an Entry that calls
DecidePod(pod, options) and asserts equality to the expected Decision and, when
relevant, checks BlocksDrain() and RequiresAction()—referencing DecidePod,
DefaultOptions, basePod, Decision, BlocksDrain, RequiresAction, and the option
fields (DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData) to locate
and implement each case.

In `@operator/internal/wrapper/node.go`:
- Line 557: Replace the inline construction of the drain-start annotation key
with the existing helper: instead of calling delete(node.Annotations,
fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
use the drainStartAnnotationKey(...) helper to produce the key; update the
delete call to delete(node.Annotations,
drainStartAnnotationKey(node.skyhook.Name)) so it stays consistent with
StartDrain, DrainStartedAt, and ClearDrainStart.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 61be648d-f7c5-430b-aa33-1386cc14acea

📥 Commits

Reviewing files that changed from the base of the PR and between c7adb91 and 0440375.

📒 Files selected for processing (27)
  • chart/templates/skyhook-crd.yaml
  • docs/interrupt_flow.md
  • k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/skyhook_webhook_test.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/cmd/cli/app/node/node_reset.go
  • operator/cmd/cli/app/node/node_reset_test.go
  • operator/cmd/cli/app/node/node_status.go
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/reset_test.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/drain/drain.go
  • operator/internal/drain/drain_suite_test.go
  • operator/internal/drain/drain_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/mock/SkyhookNodeOnly.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

@lockwobr

lockwobr commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

you should rebase, just merged a few pr's that should and seem too stabilize a few of the flaky tests.

Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed AnouarMohamed force-pushed the fix/issue-259-drain-config branch from 0440375 to 33874d1 Compare June 4, 2026 13:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/controller/skyhook_controller.go (1)

1253-1290: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Wrap raw errors with operation context in HandleConfigUpdates.

Several changed paths return bare err, which drops call-site context and makes failures harder to diagnose.

Proposed fix
-				return false, false, err
+				return false, false, fmt.Errorf("checking package pod existence on node %s: %w", node.GetNode().Name, err)
...
-							return false, false, err
+							return false, false, fmt.Errorf("listing package pods on node %s: %w", node.GetNode().Name, err)
...
-									return false, false, err
+									return false, false, fmt.Errorf("deleting erroring pod %s/%s: %w", pod.Namespace, pod.Name, err)

As per coding guidelines Use stdlib errors + fmt.Errorf("…: %w", err) with %w for error wrapping ... Never return a bare err from a function that does more than one thing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/controller/skyhook_controller.go` around lines 1253 - 1290,
In HandleConfigUpdates, several returns propagate bare errors (e.g., results
from r.PodExists, r.dal.GetPods and r.Delete) losing context; update each error
return to wrap the underlying error with fmt.Errorf and %w plus a short
operation context (for example "checking pod existence for node %s package %s:
%w", "listing pods for node %s package %s: %w", "deleting pod %s on node %s:
%w") referencing the unique symbols r.PodExists, node.IsPackageComplete,
node.PackageStatus, r.dal.GetPods and r.Delete so callers receive actionable
context. Ensure all branches that currently do `return ..., err` become `return
..., fmt.Errorf("...: %w", err)`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 50-59: The drain-timeout package currently omits explicit
uninstall settings; update spec.packages.drain-timeout to include an uninstall
block with explicit boolean fields (uninstall.enabled and uninstall.apply) so
lifecycle is deterministic in tests—for example add uninstall: { enabled: true,
apply: true } under drain-timeout in timeout.yaml (adjust the true/false values
if your test requires disabled behavior).

---

Outside diff comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1253-1290: In HandleConfigUpdates, several returns propagate bare
errors (e.g., results from r.PodExists, r.dal.GetPods and r.Delete) losing
context; update each error return to wrap the underlying error with fmt.Errorf
and %w plus a short operation context (for example "checking pod existence for
node %s package %s: %w", "listing pods for node %s package %s: %w", "deleting
pod %s on node %s: %w") referencing the unique symbols r.PodExists,
node.IsPackageComplete, node.PackageStatus, r.dal.GetPods and r.Delete so
callers receive actionable context. Ensure all branches that currently do
`return ..., err` become `return ..., fmt.Errorf("...: %w", err)`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f352df6e-d4b5-4904-b104-6b2fed9dc997

📥 Commits

Reviewing files that changed from the base of the PR and between 0440375 and 33874d1.

📒 Files selected for processing (27)
  • chart/templates/skyhook-crd.yaml
  • docs/interrupt_flow.md
  • k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yaml
  • k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/skyhook_webhook_test.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/cmd/cli/app/node/node_reset.go
  • operator/cmd/cli/app/node/node_reset_test.go
  • operator/cmd/cli/app/node/node_status.go
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/reset_test.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/drain/drain.go
  • operator/internal/drain/drain_suite_test.go
  • operator/internal/drain/drain_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/mock/SkyhookNodeOnly.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml
@AnouarMohamed AnouarMohamed changed the title Make node drain behavior configurable feat(operator): make node drain behavior configurable Jun 4, 2026
Comment thread operator/api/v1alpha1/skyhook_types.go Outdated
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>

@ayuskauskas ayuskauskas left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the internal/drain package is clean and the test coverage (unit + decision-table + chainsaw) is strong. One behavior I'd like documented before merge:

When a drain timeout expires, DrainNode marks the node erroring and stops issuing further evict/delete actions, leaving the node cordoned and the package stage halted. If the blocking condition doesn't clear on its own (e.g. force: false / deleteEmptyDirData: false, or a PDB that stays at zero allowed disruptions), the node will not recover automatically — it stays erroring until an operator intervenes.

Could you add a short "recovering from a drain timeout" note to docs/interrupt_flow.md covering:

  • that the node stays cordoned and the Skyhook goes erroring on timeout,
  • how to recover (e.g. kubectl skyhook reset <skyhook> --confirm, and/or removing the underlying blocker), and
  • whether a blocker that clears after the timeout auto-resumes the drain or still requires a reset.

This makes the failure mode actionable for users who hit it in production.

(Two smaller blocking items are left as inline comments: a missing license header and the CHANGELOG placement.)

Comment thread operator/CHANGELOG.md Outdated

### New Features

- Add `spec.drainConfig` so interrupt drains can tune eviction, direct deletion, emptyDir handling, unmanaged-pod handling, DaemonSet skipping, timeout, and grace-period behavior.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bullet was added under ## [operator/v0.16.0] (line 18), which is an already-released version — v0.16.1 (line 5) sits above it. Adding a new feature here retroactively rewrites the notes of a shipped release.

It should go in a new top section — ## [Unreleased] or the next version heading (e.g. ## [operator/v0.17.0]) — placed above the v0.16.1 entry.

@lockwobr lockwobr left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature itself looks largely sound — the drain decision logic is isolated and well unit-tested, and the timeout→erroring path actually works in CI (the node carries both drainStart_* and status_*: erroring). The reason e2e/interrupt is red on every k8s version is mostly a malformed assertion in the new test, not the operator. Details inline. Also two structural items: consolidate the drain helpers into the drain package with direct tests, and call out a couple of bundled behavior changes.

Note: e2e/interrupt fails on all four versions (1.32–1.35); e2e/core and unit-tests pass.

skyhook.nvidia.com/status_drain-config-timeout: erroring
annotations:
skyhook.nvidia.com/status_drain-config-timeout: erroring
(contains(@, 'skyhook.nvidia.com/drainStart_drain-config-timeout')): true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what's actually failing CI on every k8s version. chainsaw evaluates this and errors:

metadata.annotations.(contains(@, 'skyhook.nvidia.com/drainStart_drain-config-timeout')): Internal error: invalid type for: map[...], expected: []functions.JpType{"array", "string"}

contains(@, key) only works on an array or string, but @ here is the annotations map, so it's a type error — a hard assertion failure, not a retryable mismatch.

Important: the operator behavior is correct. The catch dump shows the node really does carry skyhook.nvidia.com/drainStart_drain-config-timeout: 2026-...Z and skyhook.nvidia.com/status_drain-config-timeout: erroring. So this is a test-assertion bug, not a product bug — the timeout→erroring + drainStart persistence path works. Rewrite the existence check in a valid form, e.g. assert the specific key is non-null:

annotations:
  skyhook.nvidia.com/status_drain-config-timeout: erroring
  ("skyhook.nvidia.com/drainStart_drain-config-timeout" in keys(@)): true

(or drop the dynamic check and assert the concrete key with a non-empty matcher).

status:
status: erroring
nodeStatus:
(values(@)):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Secondary: the Nodewright assert also ran to the 240s timeout. Once the Node assert above is fixed, double-check this (values(@)): [erroring] matches reality — with one node status.nodeStatus should be {node: erroring} → values [erroring], but verify there isn't a transient extra entry. The hard error on line 25 is the primary blocker; this one may just be collateral from the same step failing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I checked this after fixing the node assertion. I kept the nodeStatus values check and removed the cluster-scoped namespace expectation.

func ShouldEvict(pod *corev1.Pod) bool {
switch pod.Status.Phase {
case corev1.PodRunning, corev1.PodPending:
func resolvedDrainOptions(config *v1alpha1.DrainConfig) drain.Options {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidation: move resolvedDrainOptions, drainDeleteOptions, drainEvictionDeleteOptions, and drainTimedOut out of the controller and into internal/drain, so it's the single home for drain behavior. Depending on v1alpha1 and controller-runtime client from drain is fine (no cycle — v1alpha1 doesn't import drain). Suggested shapes:

  • drain.OptionsFromConfig(*v1alpha1.DrainConfig) Options
  • drain.TimedOut(startedAt *metav1.Time, timeout *metav1.Duration, now time.Time) bool
  • (o Options) DeleteOptions() []client.DeleteOption and (o Options) EvictionDeleteOptions() *metav1.DeleteOptions

The controller is then left with just IsDrained/DrainNode orchestration.

And please add direct unit tests for the moved logic (currently only hit indirectly via the envtest DrainNode specs):

  • OptionsFromConfig: nil config → defaults; each field override; grace-period rounding — 0s→0, 500ms→1, 1s→1, 1500ms→2 (the % time.Second round-up branch is untested).
  • TimedOut: nil start → false; nil/zero timeout → false; exactly at deadline → true; just before → false.
  • delete-option builders: nil grace → nil; set grace → carries the value.

They're pure functions, so they slot into drain_test.go next to the DecidePod table.

return d.Action == ActionEvict || d.Action == ActionDelete
}

func DecidePod(pod *corev1.Pod, options Options) Decision {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default (no drainConfig) path is subtly not byte-for-byte the old ShouldEvict, even though the PR frames defaults as preserving existing behavior:

  • unschedulable toleration now uses real ToleratesTaint (NoSchedule effect) vs the old "any toleration whose key is unschedulable" — stricter;
  • terminating pods (DeletionTimestamp \!= nil) and mirror/static pods are now ignored — new;
  • DaemonSet detection moved from the len(ownerReferences) > 1 heuristic to GetControllerOf(...).Kind == "DaemonSet".

These are all more correct, but they change the default drain filter that existing interrupt Nodewright already exercise. Worth a one-line "behavior change" note in the docs so upgraders aren't surprised. Should also make sure something about this is the change log.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, thanks. I added this as a behavior change in docs/interrupt_flow.md, operator/CHANGELOG.md, and the PR body.

for _, node := range nodeList.Items {
annotation, ok := node.Annotations[annotationKey]
if !ok {
if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, labelKeys) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI changes here are contract-required (the operator added the drainStart_<skyhook> node annotation, so reset/node reset must clean it up — good). But this also bundles a behavior change to reset that isn't about drain config: selection moved from "node has the nodeState_ annotation" to "node has any resettable metadata", and a malformed nodeState no longer skips the node (it's reset with empty package state). See the test churn where a case flips from "No nodes have state" to "Successfully reset 1 node". This is arguably more correct, but please call it out in the PR description's behavior-changes (and ideally note it in the CLI changelog).

@AnouarMohamed AnouarMohamed Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I added this to the PR behavior changes and operator/cmd/cli/CHANGELOG.md.

Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@github-actions github-actions Bot added doc Documentation change (PR path label; doc issues use the Documentation type) component/operator Skyhook operator (controller-manager) component/chart Helm chart component/cli kubectl-skyhook CLI plugin component/ci CI workflows, GitHub Actions, and repo tooling component/tests End-to-end / chainsaw test suites (k8s-tests) labels Jun 5, 2026
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed

AnouarMohamed commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

I pushed the fixes and resolved the conflicts. Could you take another look when you have a chance?

@AnouarMohamed AnouarMohamed requested a review from lockwobr June 5, 2026 20:12
@AnouarMohamed AnouarMohamed requested a review from ayuskauskas June 5, 2026 20:12
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed AnouarMohamed requested a review from a team June 15, 2026 01:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/chart Helm chart component/ci CI workflows, GitHub Actions, and repo tooling component/cli kubectl-skyhook CLI plugin component/operator Skyhook operator (controller-manager) component/tests End-to-end / chainsaw test suites (k8s-tests) doc Documentation change (PR path label; doc issues use the Documentation type)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make node-drain behavior configurable (mirror kubectl drain options)

3 participants