feat(operator): make node drain behavior configurable#264
feat(operator): make node drain behavior configurable#264AnouarMohamed wants to merge 6 commits into
Conversation
bf419d9 to
c7adb91
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds an optional spec.drainConfig to Skyhook (CRD/chart/webhook/deepcopy), a new operator/internal/drain package to decide per-pod drain actions, per-Skyhook drain-start annotations on nodes (wrapper + mocks + tests), controller changes to apply drain.Options with timeout/grace handling and to perform evict/delete per DecidePod results, CLI reset updates to remove drain metadata, and Chainsaw e2e tests plus documentation and changelog updates. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml`:
- Around line 20-22: The test manifest for the Chainsaw suite (metadata.name:
drain-config) is missing the required Chainsaw pool assignment; add an explicit
metadata.labels.pool entry under the resource metadata (e.g.,
metadata.labels.pool: "lifecycle" or another appropriate value from
core|interrupt|uninstall|lifecycle) so CI can route the test according to its
mutation profile; update the chainsaw-test.yaml metadata block to include the
labels object with the chosen pool value.
- Around line 116-125: The loop currently checks only for any event with
reason=Drain for involvedObject.name=drain-config-timeout; update the kubectl
check in the script to assert the timeout-specific event emitted by the
controller by matching the action (DrainTimeout) or the message text that
contains "DrainTimeout" instead of just reason=Drain. Locate the kubectl
invocation in the script (the line using kubectl -n skyhook get events
--field-selector
involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=Drain
-o name | grep .) and change the field-selector or subsequent grep to require
action=DrainTimeout (or grep the event message for "DrainTimeout") so the loop
only succeeds when the timeout-specific event is present.
In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml`:
- Around line 36-39: The workload container currently uses an unpinned external
tag ("image: busybox:latest") which violates the image-source policy; update the
container with name "workload" by replacing the image field to a pinned digest
or semver tag hosted under the approved registries (e.g.,
ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*) so it no longer
references busybox:latest — ensure the new image is a specific version or digest
and adjust any related pull policy if necessary.
In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 30-33: The manifest uses an unapproved floating image for the
container named "workload" (image: busybox:latest); replace it with a pinned
image from an approved NVIDIA registry (e.g.,
ghcr.io/nvidia/nodewright/<approved-image>:<fixed-tag> or
ghcr.io/nvidia/skyhook/<approved-image>:<fixed-tag>) and ensure the YAML 'image'
field is updated accordingly (keep the existing command). Optionally add
imagePullPolicy: IfNotPresent to ensure the pinned tag is used.
In `@operator/cmd/cli/app/node/node_reset.go`:
- Around line 54-61: The selection predicate currently only checks annotations
via hasResettableAnnotation, so nodes that have had their annotations cleared
but still retain the label (statusLabelPrefix+opts.skyhookName) won't be
reselected; update the predicate and/or hasResettableAnnotation to also inspect
node.Labels for the same resettable keys (specifically check for
statusLabelPrefix+opts.skyhookName and any other label-only keys referenced in
lines 155-157) so retries will find nodes that still have residual labels;
ensure you reference hasResettableAnnotation, statusLabelPrefix, and
opts.skyhookName when adding the label checks.
- Around line 159-167: The current logic in node_reset.go skips resetting a node
when json.Unmarshal into nodeState fails; instead, modify the error path so that
on malformed annotation (json.Unmarshal error) you log the warning (using
cliCtx.GlobalFlags.Verbose and fmt.Fprintf to cmd.ErrOrStderr()) but do not
continue—set nodeState to an empty v1alpha1.NodeState (treat package list as
empty) and proceed with the normal reset/clear of annotations and labels; keep
the annotationKey/node.Annotations lookup and only suppress the abort/continue
on unmarshal failure so the subsequent code that clears metadata still runs.
In `@operator/cmd/cli/app/reset.go`:
- Around line 64-70: The eligibility check currently in hasResettableAnnotation
only inspects annotations, so nodes with only the stale status label
(statusLabelPrefix+skyhookName) are missed; update the logic to also inspect
labels: either extend hasResettableAnnotation to accept labels map[string]string
(or add a new helper) and check for the presence of the same keys in the labels
map, and ensure the other call sites (including where
statusLabelPrefix+skyhookName is removed around the reset flow referenced near
the other usage) use the updated helper so nodes with label-only leftovers are
detected and reset.
- Around line 137-145: The current logic in reset.go skips the node entirely on
a malformed annotation by calling continue inside the json.Unmarshal error
branch; instead, remove the continue so the node still proceeds through reset
with an empty fallback v1alpha1.NodeState (nodeState variable) and emit the same
verbose warning via cliCtx.GlobalFlags.Verbose and fmt.Fprintf to
cmd.ErrOrStderr() when annotation parsing fails; ensure annotationKey is still
checked and that nodeState remains the zero value when unmarshalling errors
occur so drainStart_, status_ and other reset metadata get cleaned up.
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The controller is calling controller-runtime
delete/eviction APIs directly (r.Client.SubResource("eviction").Create and
r.Delete) which bypasses the dal/wrapper seam; refactor these calls to use the
repository abstraction instead: add or reuse appropriate methods on the
dal/wrapper (e.g., PodDAL.Evict(ctx, pod, options) and PodDAL.Delete(ctx, pod,
options) or similarly-named wrapper methods) and replace the direct
r.Client.SubResource("eviction").Create and r.Delete usages in the reconcile
flow with calls to those dal/wrapper methods so tests can mock them.
In `@operator/internal/drain/drain_test.go`:
- Around line 51-195: The tests in drain_test.go should be converted from many
individual It blocks into a single Ginkgo DescribeTable: create a DescribeTable
that takes a pod (use basePod() variants), options (DefaultOptions() modified
per case), and the expected Decision (e.g., Decision{Action: ActionEvict,
Reason: ReasonEviction}), then add Entry(...) rows for each current case
(default behavior, eviction disabled, non-running phase, terminating,
unschedulable toleration, daemonset default and when IgnoreDaemonSets=false,
kube-system, mirror pod, unmanaged default and when Force=false, emptyDir
default and when DeleteEmptyDirData=false). Replace the individual It tests with
Entries calling DecidePod(pod, options) and Expect(...).Reference symbols:
DescribeTable/Entry, basePod, DecidePod, DefaultOptions, Decision,
ActionEvict/ActionDelete/ActionIgnore/ActionBlock and Reason* constants.
In `@operator/internal/drain/drain.go`:
- Around line 126-132: The toleratesUnschedulable function currently checks only
toleration.Key and must instead use Kubernetes' taint matching: in
toleratesUnschedulable, construct a corev1.Taint with Key:
corev1.TaintNodeUnschedulable and Effect: corev1.TaintEffectNoSchedule and call
each toleration's ToleratesTaint (e.g., toleration.ToleratesTaint(&taint,
false)) returning true if any toleration tolerates it; mirror the approach used
by CheckTaintToleration. Also add unit tests for toleratesUnschedulable covering
tolerations with mismatched Effect, Operator, and Value so key-only matches
cannot regress.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d40b1c75-9f43-46ea-928c-51d35c7d7ec4
📒 Files selected for processing (27)
chart/templates/skyhook-crd.yamldocs/interrupt_flow.mdk8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout.yamloperator/CHANGELOG.mdoperator/api/v1alpha1/skyhook_types.gooperator/api/v1alpha1/skyhook_webhook.gooperator/api/v1alpha1/skyhook_webhook_test.gooperator/api/v1alpha1/zz_generated.deepcopy.gooperator/cmd/cli/app/node/node_reset.gooperator/cmd/cli/app/node/node_reset_test.gooperator/cmd/cli/app/node/node_status.gooperator/cmd/cli/app/reset.gooperator/cmd/cli/app/reset_test.gooperator/config/crd/bases/skyhook.nvidia.com_skyhooks.yamloperator/internal/controller/mock/SkyhookNodes.gooperator/internal/controller/skyhook_controller.gooperator/internal/controller/skyhook_controller_test.gooperator/internal/drain/drain.gooperator/internal/drain/drain_suite_test.gooperator/internal/drain/drain_test.gooperator/internal/wrapper/mock/SkyhookNode.gooperator/internal/wrapper/mock/SkyhookNodeOnly.gooperator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (11)
operator/internal/controller/skyhook_controller.go (1)
1767-1775: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftMove pod eviction/delete behind the DAL or wrapper seam.
This new drain path still mutates pods through raw controller-runtime client calls, which bypasses the repo’s controller abstraction and makes the flow harder to mock consistently in controller tests.
As per coding guidelines
Use the dal and wrapper layers instead of reaching for controller-runtime client.Client directly from new controller code. These layers are the seam tests mock against.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/controller/skyhook_controller.go` around lines 1767 - 1775, The drain code is directly mutating pods via r.Client.SubResource("eviction").Create(...) and r.Delete(...), bypassing the project's DAL/wrapper seam; instead, introduce and call the existing DAL/wrapper methods (e.g., a PodRepository or PodWrapper method such as EvictPod(ctx, pod, options) and DeletePod(ctx, pod, options) or add them if missing) and replace the direct calls to policyv1.Eviction creation and r.Delete with those wrapper calls (refer to the eviction creation block using policyv1.Eviction and the drain.ActionDelete branch calling r.Delete) so tests can mock the DAL/wrapper rather than controller-runtime client.Client directly.operator/internal/drain/drain_test.go (1)
51-195: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winCollapse these decision cases into a
DescribeTable.These cases are all the same test shape with different pod/options/expected outcomes, so keeping them as separate
Itblocks makes the suite harder to extend and drifts from the repo’s required Ginkgo style.As per coding guidelines
Use Ginkgo/Gomega for all unit tests, not stdlib t.Run. Match the testing file you're adding to. Use DescribeTable/Entry for table-driven tests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/drain/drain_test.go` around lines 51 - 195, Collapse the repeated It blocks into a single Ginkgo DescribeTable that iterates over test cases by calling DecidePod for each input; create table entries that provide the pod (use basePod() with per-entry modifications), the options (DefaultOptions() with per-entry overrides), and the expected Decision values (Decision{Action: ..., Reason: ...}), and assert the Decision equals the expected value plus the relevant boolean checks for BlocksDrain() and RequiresAction() where applicable. Locate the repeated assertions around DecidePod, DefaultOptions, basePod, and Decision (and constants like ActionEvict, ActionDelete, ActionIgnore, ActionBlock and reasons such as ReasonEviction, ReasonDelete, ReasonPhase, ReasonTerminating, ReasonUnschedulableToleration, ReasonDaemonSet, ReasonKubeSystem, ReasonMirrorPod, ReasonUnmanaged, ReasonEmptyDir) and convert each It case into a DescribeTable Entry with a descriptive name and the pod/options/expected result to keep behavior identical. Ensure entries include cases that also assert BlocksDrain() and RequiresAction() when the original test did so.operator/internal/drain/drain.go (1)
126-129:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse Kubernetes taint matching instead of a key-only check.
This still treats any toleration with
Key == node.kubernetes.io/unschedulableas sufficient, even when theEffect,Operator, orValuewould not actually tolerate the node’s unschedulable taint. That can incorrectly ignore pods and let drain finish early.Suggested fix
func toleratesUnschedulable(pod *corev1.Pod) bool { + taint := &corev1.Taint{ + Key: corev1.TaintNodeUnschedulable, + Effect: corev1.TaintEffectNoSchedule, + } for _, toleration := range pod.Spec.Tolerations { - if toleration.Key == corev1.TaintNodeUnschedulable { + if toleration.ToleratesTaint(taint) { return true } } return false }#!/bin/bash set -euo pipefail FILE="operator/internal/drain/drain.go" VENDOR_TOL="operator/vendor/k8s.io/api/core/v1/toleration.go" VENDOR_TAINT="operator/vendor/k8s.io/api/core/v1/well_known_taints.go" echo "Current implementation:" sed -n '120,135p' "$FILE" | cat -n echo echo "Kubernetes toleration matcher:" rg -n "func \\(t \\*?Toleration\\) ToleratesTaint" "$VENDOR_TOL" sed -n '1,120p' "$VENDOR_TOL" | cat -n echo echo "Unschedulable taint constant:" rg -n "TaintNodeUnschedulable|NoSchedule" "$VENDOR_TAINT" sed -n '1,80p' "$VENDOR_TAINT" | cat -n🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/drain/drain.go` around lines 126 - 129, The toleratesUnschedulable function currently checks only toleration.Key; update it to use Kubernetes' full taint matching by constructing the unschedulable taint and calling the Toleration.ToleratesTaint matcher for each toleration. Specifically, in toleratesUnschedulable iterate pod.Spec.Tolerations and return true when toleration.ToleratesTaint(&corev1.Taint{Key: corev1.TaintNodeUnschedulable, Effect: corev1.TaintEffectNoSchedule}) (or the appropriate Effect constant) returns true so Operator/Effect/Value are respected; otherwise return false. Ensure you reference the Toleration.ToleratesTaint method and corev1.TaintNodeUnschedulable/corev1.TaintEffectNoSchedule symbols.operator/cmd/cli/app/reset.go (2)
64-71:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSelection misses nodes with label-only residue.
hasResettableAnnotationonly inspectsnode.Annotations, but line 228 removes thestatusLabelPrefix+skyhookNamelabel. If a prior reset attempt clears annotations but fails during label removal, subsequent retries will never select that node again, leaving a permanent stale status label.💡 Suggested fix
Extend
hasResettableAnnotationto also checknode.LabelsforstatusLabelPrefix+skyhookName, or introduce a separatehasResettableLabelhelper and OR the results in the selection predicate at line 133.-func hasResettableAnnotation(annotations map[string]string, annotationKeys []string) bool { +func hasResettableMetadata(annotations, labels map[string]string, annotationKeys []string, skyhookName string) bool { for _, annotationKey := range annotationKeys { if _, ok := annotations[annotationKey]; ok { return true } } + if _, ok := labels[statusLabelPrefix+skyhookName]; ok { + return true + } return false }Then update call site:
- if !hasResettableAnnotation(node.Annotations, annotationKeys) { + if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, skyhookName) {Also applies to: 133-134
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/cmd/cli/app/reset.go` around lines 64 - 71, hasResettableAnnotation currently only checks node.Annotations, so nodes that were left with only the status label (statusLabelPrefix+skyhookName) after a failed reset won’t be selected; update hasResettableAnnotation to also inspect node.Labels for the statusLabelPrefix+skyhookName (or add a small hasResettableLabel helper) and ensure the selection predicate that calls hasResettableAnnotation (the node selection used for resets) ORs the annotation and label checks so nodes with label-only residue are included.
137-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't let malformed nodeState block cleanup.
When
json.Unmarshalfails (line 143), thecontinueskips this node entirely, leavingdrainStart_,status_,cordon_, and other reset metadata behind. For a recovery command, the safer behavior is to warn (preserving the existing verbose message) but proceed with an emptyNodeStateso all Skyhook-related annotations and labels still get cleared.🛡️ Suggested fix
nodeState := v1alpha1.NodeState{} if annotation, ok := node.Annotations[annotationKey]; ok { if err := json.Unmarshal([]byte(annotation), &nodeState); err != nil { if cliCtx.GlobalFlags.Verbose { _, _ = fmt.Fprintf(cmd.ErrOrStderr(), "Warning: skipping node %q - invalid annotation: %v\n", node.Name, err) } - continue + // Proceed with empty NodeState so we still clear drain/status/cordon metadata + nodeState = v1alpha1.NodeState{} } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/cmd/cli/app/reset.go` around lines 137 - 145, The code currently bails out on a malformed annotation during json.Unmarshal (in the nodeState handling around annotationKey / v1alpha1.NodeState), which leaves Skyhook metadata uncleared; instead, keep the existing verbose warning (cliCtx.GlobalFlags.Verbose and fmt.Fprintf to cmd.ErrOrStderr()) but do not return/continue on error — allow nodeState to remain the zero value and proceed so the reset logic can still clear drainStart_, status_, cordon_, labels/annotations etc.; remove the continue after json.Unmarshal error handling and rely on the initialized nodeState.operator/cmd/cli/app/node/node_reset.go (2)
54-61:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSelection misses nodes with label-only residue.
hasResettableAnnotationonly checksnode.Annotations, but line 227 removes thestatusLabelPrefix+opts.skyhookNamelabel. If a prior reset cleared annotations but failed on the label, retries will not reselect that node, leaving a stale status label.💡 Suggested fix
Extend the helper to also inspect labels, or add a separate label check in the selection predicate. Example:
-func hasResettableAnnotation(annotations map[string]string, annotationKeys []string) bool { +func hasResettableMetadata(annotations, labels map[string]string, annotationKeys []string, skyhookName string) bool { for _, annotationKey := range annotationKeys { if _, ok := annotations[annotationKey]; ok { return true } } + if _, ok := labels[statusLabelPrefix+skyhookName]; ok { + return true + } return false }Then at line 155:
- if !hasResettableAnnotation(node.Annotations, annotationKeys) { + if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, opts.skyhookName) {Also applies to: 155-157
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/cmd/cli/app/node/node_reset.go` around lines 54 - 61, The selection currently misses nodes that only have the stale status label because hasResettableAnnotation only checks node.Annotations; update the selection to also consider node.Labels by modifying hasResettableAnnotation (or creating a new helper) to accept both annotations and labels and check for the same annotationKeys plus the label key constructed as statusLabelPrefix+opts.skyhookName (or add an additional check in the predicate used when filtering nodes) so that nodes with either matching annotations or the specific status label are returned by the selection logic.
159-167:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMalformed nodeState should not prevent a node reset.
The
continueon line 165 skips resetting this node whenjson.Unmarshalfails, leavingdrainStart_,status_,cordon_, and other reset metadata behind. Since this is a recovery command, the safer behavior is to warn (keeping the existing verbose message) but proceed with an emptyNodeStateso annotations and labels still get cleared.🛡️ Suggested fix
nodeState := v1alpha1.NodeState{} if annotation, ok := node.Annotations[annotationKey]; ok { if err := json.Unmarshal([]byte(annotation), &nodeState); err != nil { if cliCtx.GlobalFlags.Verbose { _, _ = fmt.Fprintf(cmd.ErrOrStderr(), "Warning: skipping node %q - invalid annotation: %v\n", nodeName, err) } - continue + // Proceed with empty NodeState to still clear drain/status/cordon metadata + nodeState = v1alpha1.NodeState{} } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/cmd/cli/app/node/node_reset.go` around lines 159 - 167, The code currently bails out of resetting the node when json.Unmarshal into nodeState fails; instead, keep nodeState as the zero value and proceed with the reset. In the block around nodeState and json.Unmarshal (variables: nodeState, annotation, node.Annotations, json.Unmarshal), remove the continue on error and retain the existing verbose warning, so malformed annotations are logged but the rest of the reset logic still runs to clear annotations/labels and other metadata.k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml (1)
30-33:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a pinned image from an approved registry.
The workload container uses
busybox:latest, which violates the image-source policy. Replace with a pinned image fromghcr.io/nvidia/nodewright/*orghcr.io/nvidia/skyhook/*.As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml` around lines 30 - 33, The container "workload" in timeout.yaml uses an unapproved image "busybox:latest"; update the image field for the container named workload to a pinned image from the approved registries (ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*) using a specific tag or digest (not :latest), e.g., replace the image value with a concrete ghcr.io/nvidia/skyhook/<image>:<version> or `@sha256`:<digest> to satisfy the image-source policy.k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml (2)
116-125:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAssert the timeout-specific drain event.
The loop checks for any
reason=Drainevent but doesn't verify it's the timeout-specificDrainTimeoutaction. This can pass without validating the timeout behavior. Please match on the timeout-specific action or message.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml` around lines 116 - 125, The current script loop only looks for events with reason=Drain which can match non-timeout drains; update the kubectl check in the script block (the loop under script: content) to assert the timeout-specific event by either changing the field-selector to reason=DrainTimeout (i.e., --field-selector involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=DrainTimeout) or by querying events with -o jsonpath and asserting the .message contains the timeout text (e.g., grep "timeout") so the test only succeeds when the DrainTimeout-specific event is emitted.
20-22:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd an explicit Chainsaw
poollabel.This test suite is missing
metadata.labels.pool. CI needs the pool assignment to route the test according to its mutation profile (e.g.,pool: interruptorpool: lifecycle).Based on learnings, "In Chainsaw e2e test YAMLs, the
poollabel (core,interrupt,uninstall,lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml` around lines 20 - 22, The manifest for the resource with metadata.name "drain-config" must include an explicit metadata.labels.pool entry; add a metadata.labels block and set pool to the correct Chainsaw test pool (one of: core, interrupt, uninstall, lifecycle) that matches the cluster/runtime mutation profile this test exercises (e.g., pool: interrupt or pool: lifecycle) so CI can route the test correctly.k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml (1)
36-39:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a pinned image from an approved registry.
The workload container uses
busybox:latest, which violates the repository's image-source policy. Switch to a pinned image underghcr.io/nvidia/nodewright/*orghcr.io/nvidia/skyhook/*.As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml` around lines 36 - 39, The workload container currently uses an unapproved image "busybox:latest"; update the image field for the container named "workload" in disable-eviction.yaml to a pinned digest or tag from an approved registry (ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*), e.g., replace "busybox:latest" with a specific ghcr.io/nvidia/... image and tag/digest to ensure reproducibility and policy compliance; leave the container name and command unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1716-1735: The drain-timeout branch currently sets
skyhookNode.SetStatus(v1alpha1.StatusErroring) but returns (false, nil), which
keeps ProcessInterrupt/RunSkyhookPackages in the normal short requeue loop;
change the return to signal the caller to stop the regular retry loop — e.g.,
return true, nil — so that after drainTimedOut(...) and setting StatusErroring
the reconciler does not immediately requeue the node every 2s; update the branch
that checks drainTimedOut(drainStartedAt, drainConfig.Timeout, now.Time) to
return the appropriate stop-retry value (true, nil) and ensure callers of
ProcessInterrupt/RunSkyhookPackages honor that contract.
In `@operator/internal/wrapper/node.go`:
- Around line 556-557: Replace the inline-constructed drainStart annotation key
deletion with the existing helper: instead of delete(node.Annotations,
fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
call the drainStartAnnotationKey helper (used by StartDrain, DrainStartedAt,
ClearDrainStart) to compute the key and delete that entry; keep the other delete
for cordon_... as-is. This ensures consistent key formatting via
drainStartAnnotationKey().
---
Duplicate comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yaml`:
- Around line 116-125: The current script loop only looks for events with
reason=Drain which can match non-timeout drains; update the kubectl check in the
script block (the loop under script: content) to assert the timeout-specific
event by either changing the field-selector to reason=DrainTimeout (i.e.,
--field-selector
involvedObject.kind=Skyhook,involvedObject.name=drain-config-timeout,reason=DrainTimeout)
or by querying events with -o jsonpath and asserting the .message contains the
timeout text (e.g., grep "timeout") so the test only succeeds when the
DrainTimeout-specific event is emitted.
- Around line 20-22: The manifest for the resource with metadata.name
"drain-config" must include an explicit metadata.labels.pool entry; add a
metadata.labels block and set pool to the correct Chainsaw test pool (one of:
core, interrupt, uninstall, lifecycle) that matches the cluster/runtime mutation
profile this test exercises (e.g., pool: interrupt or pool: lifecycle) so CI can
route the test correctly.
In `@k8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yaml`:
- Around line 36-39: The workload container currently uses an unapproved image
"busybox:latest"; update the image field for the container named "workload" in
disable-eviction.yaml to a pinned digest or tag from an approved registry
(ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*), e.g., replace
"busybox:latest" with a specific ghcr.io/nvidia/... image and tag/digest to
ensure reproducibility and policy compliance; leave the container name and
command unchanged.
In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 30-33: The container "workload" in timeout.yaml uses an unapproved
image "busybox:latest"; update the image field for the container named workload
to a pinned image from the approved registries (ghcr.io/nvidia/nodewright/* or
ghcr.io/nvidia/skyhook/*) using a specific tag or digest (not :latest), e.g.,
replace the image value with a concrete ghcr.io/nvidia/skyhook/<image>:<version>
or `@sha256`:<digest> to satisfy the image-source policy.
In `@operator/cmd/cli/app/node/node_reset.go`:
- Around line 54-61: The selection currently misses nodes that only have the
stale status label because hasResettableAnnotation only checks node.Annotations;
update the selection to also consider node.Labels by modifying
hasResettableAnnotation (or creating a new helper) to accept both annotations
and labels and check for the same annotationKeys plus the label key constructed
as statusLabelPrefix+opts.skyhookName (or add an additional check in the
predicate used when filtering nodes) so that nodes with either matching
annotations or the specific status label are returned by the selection logic.
- Around line 159-167: The code currently bails out of resetting the node when
json.Unmarshal into nodeState fails; instead, keep nodeState as the zero value
and proceed with the reset. In the block around nodeState and json.Unmarshal
(variables: nodeState, annotation, node.Annotations, json.Unmarshal), remove the
continue on error and retain the existing verbose warning, so malformed
annotations are logged but the rest of the reset logic still runs to clear
annotations/labels and other metadata.
In `@operator/cmd/cli/app/reset.go`:
- Around line 64-71: hasResettableAnnotation currently only checks
node.Annotations, so nodes that were left with only the status label
(statusLabelPrefix+skyhookName) after a failed reset won’t be selected; update
hasResettableAnnotation to also inspect node.Labels for the
statusLabelPrefix+skyhookName (or add a small hasResettableLabel helper) and
ensure the selection predicate that calls hasResettableAnnotation (the node
selection used for resets) ORs the annotation and label checks so nodes with
label-only residue are included.
- Around line 137-145: The code currently bails out on a malformed annotation
during json.Unmarshal (in the nodeState handling around annotationKey /
v1alpha1.NodeState), which leaves Skyhook metadata uncleared; instead, keep the
existing verbose warning (cliCtx.GlobalFlags.Verbose and fmt.Fprintf to
cmd.ErrOrStderr()) but do not return/continue on error — allow nodeState to
remain the zero value and proceed so the reset logic can still clear
drainStart_, status_, cordon_, labels/annotations etc.; remove the continue
after json.Unmarshal error handling and rely on the initialized nodeState.
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The drain code is directly mutating pods via
r.Client.SubResource("eviction").Create(...) and r.Delete(...), bypassing the
project's DAL/wrapper seam; instead, introduce and call the existing DAL/wrapper
methods (e.g., a PodRepository or PodWrapper method such as EvictPod(ctx, pod,
options) and DeletePod(ctx, pod, options) or add them if missing) and replace
the direct calls to policyv1.Eviction creation and r.Delete with those wrapper
calls (refer to the eviction creation block using policyv1.Eviction and the
drain.ActionDelete branch calling r.Delete) so tests can mock the DAL/wrapper
rather than controller-runtime client.Client directly.
In `@operator/internal/drain/drain_test.go`:
- Around line 51-195: Collapse the repeated It blocks into a single Ginkgo
DescribeTable that iterates over test cases by calling DecidePod for each input;
create table entries that provide the pod (use basePod() with per-entry
modifications), the options (DefaultOptions() with per-entry overrides), and the
expected Decision values (Decision{Action: ..., Reason: ...}), and assert the
Decision equals the expected value plus the relevant boolean checks for
BlocksDrain() and RequiresAction() where applicable. Locate the repeated
assertions around DecidePod, DefaultOptions, basePod, and Decision (and
constants like ActionEvict, ActionDelete, ActionIgnore, ActionBlock and reasons
such as ReasonEviction, ReasonDelete, ReasonPhase, ReasonTerminating,
ReasonUnschedulableToleration, ReasonDaemonSet, ReasonKubeSystem,
ReasonMirrorPod, ReasonUnmanaged, ReasonEmptyDir) and convert each It case into
a DescribeTable Entry with a descriptive name and the pod/options/expected
result to keep behavior identical. Ensure entries include cases that also assert
BlocksDrain() and RequiresAction() when the original test did so.
In `@operator/internal/drain/drain.go`:
- Around line 126-129: The toleratesUnschedulable function currently checks only
toleration.Key; update it to use Kubernetes' full taint matching by constructing
the unschedulable taint and calling the Toleration.ToleratesTaint matcher for
each toleration. Specifically, in toleratesUnschedulable iterate
pod.Spec.Tolerations and return true when
toleration.ToleratesTaint(&corev1.Taint{Key: corev1.TaintNodeUnschedulable,
Effect: corev1.TaintEffectNoSchedule}) (or the appropriate Effect constant)
returns true so Operator/Effect/Value are respected; otherwise return false.
Ensure you reference the Toleration.ToleratesTaint method and
corev1.TaintNodeUnschedulable/corev1.TaintEffectNoSchedule symbols.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1e03117b-a559-4846-a8b8-ed3b20d72464
📒 Files selected for processing (27)
chart/templates/skyhook-crd.yamldocs/interrupt_flow.mdk8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout.yamloperator/CHANGELOG.mdoperator/api/v1alpha1/skyhook_types.gooperator/api/v1alpha1/skyhook_webhook.gooperator/api/v1alpha1/skyhook_webhook_test.gooperator/api/v1alpha1/zz_generated.deepcopy.gooperator/cmd/cli/app/node/node_reset.gooperator/cmd/cli/app/node/node_reset_test.gooperator/cmd/cli/app/node/node_status.gooperator/cmd/cli/app/reset.gooperator/cmd/cli/app/reset_test.gooperator/config/crd/bases/skyhook.nvidia.com_skyhooks.yamloperator/internal/controller/mock/SkyhookNodes.gooperator/internal/controller/skyhook_controller.gooperator/internal/controller/skyhook_controller_test.gooperator/internal/drain/drain.gooperator/internal/drain/drain_suite_test.gooperator/internal/drain/drain_test.gooperator/internal/wrapper/mock/SkyhookNode.gooperator/internal/wrapper/mock/SkyhookNodeOnly.gooperator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
c7adb91 to
0440375
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (4)
operator/internal/controller/skyhook_controller.go (2)
1767-1775: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftDrain actions bypass the DAL/wrapper seam.
The drain loop calls
r.Client.SubResource("eviction").Create(line 1768) andr.Delete(line 1774) directly, which bypasses the project's test seam and diverges from the established controller pattern. As per coding guidelines, new controller code should use dal/wrapper layers so tests can mock these mutations without hitting the real client.As per coding guidelines: "Use the
dalandwrapperlayers instead of reaching forcontroller-runtimeclient.Clientdirectly from new controller code. These layers are the seam tests mock against."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/controller/skyhook_controller.go` around lines 1767 - 1775, The drain loop is calling the controller-runtime client directly (r.Client.SubResource("eviction").Create and r.Delete) which bypasses the testable DAL/wrapper seam; replace those direct calls with the project DAL/wrapper APIs (e.g., call the DAL's EvictPod/Eviction method instead of r.Client.SubResource("eviction").Create and call the wrapper/DAL DeletePod/Delete method instead of r.Delete), adjust or add small DAL/wrapper functions that accept the same ctx, pod and options (e.g., EvictPod(ctx, pod, evictionDeleteOptions) and DeletePod(ctx, pod, deleteOptions...)), update imports and the calling site to use those methods, and ensure tests can mock the DAL/wrapper rather than hitting controller-runtime directly.
1716-1736:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftTimeout expiry continues requeueing without a recovery path.
After the drain timeout expires and the node is marked
StatusErroring, the function returns(false, nil), which keepsProcessInterrupt→RunSkyhookPackagesin the normal 2-second requeue loop indefinitely. This contradicts the timeout contract and provides no path forward—either manual intervention to resolve the blocking pods or a backoff/max-retry limit should stop the tight requeue cycle once the timeout fires.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/controller/skyhook_controller.go` around lines 1716 - 1736, The handler currently marks the node StatusErroring but returns (false, nil) which keeps ProcessInterrupt → RunSkyhookPackages in a tight 2s requeue; change the control flow so that once drainTimedOut(...) triggers and you call skyhookNode.SetStatus(v1alpha1.StatusErroring) you either (a) return an error (e.g. fmt.Errorf("drain timed out for skyhook %s node %s") ) to surface a backoff/retry from the controller runtime, or (b) return a terminal/stop signal to the caller (e.g. a boolean that prevents immediate requeue) so the reconciliation uses a longer backoff or requires manual intervention; update the branch where drainTimedOut(...) is handled (references: drainTimedOut, skyhookNode.SetStatus, EventsReasonSkyhookDrain, ProcessInterrupt, RunSkyhookPackages) to implement one of these two behaviors and add a clear event message indicating manual intervention is required.operator/internal/wrapper/node.go (1)
557-557: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueUse drainStartAnnotationKey() helper for consistency.
Line 557 inline-constructs the drain-start key, but the helper at line 176 provides the same logic. Using the helper improves consistency with StartDrain, DrainStartedAt, and ClearDrainStart.
♻️ Suggested refactor
delete(node.Annotations, fmt.Sprintf("%s/cordon_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name)) - delete(node.Annotations, fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name)) + delete(node.Annotations, node.drainStartAnnotationKey()) delete(node.Annotations, fmt.Sprintf("%s/nodeState_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/wrapper/node.go` at line 557, Replace the inline construction of the drain-start annotation key with the existing helper: instead of calling delete(node.Annotations, fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name)) use the drainStartAnnotationKey(...) helper to produce the key; update the delete call to delete(node.Annotations, drainStartAnnotationKey(node.skyhook.Name)) so it stays consistent with StartDrain, DrainStartedAt, and ClearDrainStart.operator/internal/drain/drain_test.go (1)
51-231: 🧹 Nitpick | 🔵 TrivialConvert remaining tests to DescribeTable.
The toleration matching tests (lines 106-140) already use DescribeTable, but the remaining test cases are still individual It blocks. Converting all decision tests to a single DescribeTable would improve maintainability and follow the repo's Ginkgo guidance.
As per coding guidelines, use DescribeTable/Entry for table-driven tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/drain/drain_test.go` around lines 51 - 231, Group the remaining individual It tests into a single DescribeTable that drives DecidePod with table entries defining: a test name, a base pod modified per case (use basePod() then mutate fields like Status.Phase, DeletionTimestamp, Spec.Tolerations, OwnerReferences, Namespace, Annotations, Spec.Volumes), optional Options (DefaultOptions() modified such as DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData), and the expected Decision value (e.g., Decision{Action: ActionEvict, Reason: ReasonEviction}, Decision{Action: ActionIgnore, Reason: ReasonPhase}, Decision{Action: ActionDelete, Reason: ReasonDelete}, Decision{Action: ActionBlock, Reason: ReasonUnmanaged}, etc. Replace each It block with an Entry that calls DecidePod(pod, options) and asserts equality to the expected Decision and, when relevant, checks BlocksDrain() and RequiresAction()—referencing DecidePod, DefaultOptions, basePod, Decision, BlocksDrain, RequiresAction, and the option fields (DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData) to locate and implement each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1767-1775: The drain loop is calling the controller-runtime client
directly (r.Client.SubResource("eviction").Create and r.Delete) which bypasses
the testable DAL/wrapper seam; replace those direct calls with the project
DAL/wrapper APIs (e.g., call the DAL's EvictPod/Eviction method instead of
r.Client.SubResource("eviction").Create and call the wrapper/DAL
DeletePod/Delete method instead of r.Delete), adjust or add small DAL/wrapper
functions that accept the same ctx, pod and options (e.g., EvictPod(ctx, pod,
evictionDeleteOptions) and DeletePod(ctx, pod, deleteOptions...)), update
imports and the calling site to use those methods, and ensure tests can mock the
DAL/wrapper rather than hitting controller-runtime directly.
- Around line 1716-1736: The handler currently marks the node StatusErroring but
returns (false, nil) which keeps ProcessInterrupt → RunSkyhookPackages in a
tight 2s requeue; change the control flow so that once drainTimedOut(...)
triggers and you call skyhookNode.SetStatus(v1alpha1.StatusErroring) you either
(a) return an error (e.g. fmt.Errorf("drain timed out for skyhook %s node %s") )
to surface a backoff/retry from the controller runtime, or (b) return a
terminal/stop signal to the caller (e.g. a boolean that prevents immediate
requeue) so the reconciliation uses a longer backoff or requires manual
intervention; update the branch where drainTimedOut(...) is handled (references:
drainTimedOut, skyhookNode.SetStatus, EventsReasonSkyhookDrain,
ProcessInterrupt, RunSkyhookPackages) to implement one of these two behaviors
and add a clear event message indicating manual intervention is required.
In `@operator/internal/drain/drain_test.go`:
- Around line 51-231: Group the remaining individual It tests into a single
DescribeTable that drives DecidePod with table entries defining: a test name, a
base pod modified per case (use basePod() then mutate fields like Status.Phase,
DeletionTimestamp, Spec.Tolerations, OwnerReferences, Namespace, Annotations,
Spec.Volumes), optional Options (DefaultOptions() modified such as
DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData), and the expected
Decision value (e.g., Decision{Action: ActionEvict, Reason: ReasonEviction},
Decision{Action: ActionIgnore, Reason: ReasonPhase}, Decision{Action:
ActionDelete, Reason: ReasonDelete}, Decision{Action: ActionBlock, Reason:
ReasonUnmanaged}, etc. Replace each It block with an Entry that calls
DecidePod(pod, options) and asserts equality to the expected Decision and, when
relevant, checks BlocksDrain() and RequiresAction()—referencing DecidePod,
DefaultOptions, basePod, Decision, BlocksDrain, RequiresAction, and the option
fields (DisableEviction, IgnoreDaemonSets, Force, DeleteEmptyDirData) to locate
and implement each case.
In `@operator/internal/wrapper/node.go`:
- Line 557: Replace the inline construction of the drain-start annotation key
with the existing helper: instead of calling delete(node.Annotations,
fmt.Sprintf("%s/drainStart_%s", v1alpha1.METADATA_PREFIX, node.skyhook.Name))
use the drainStartAnnotationKey(...) helper to produce the key; update the
delete call to delete(node.Annotations,
drainStartAnnotationKey(node.skyhook.Name)) so it stays consistent with
StartDrain, DrainStartedAt, and ClearDrainStart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 61be648d-f7c5-430b-aa33-1386cc14acea
📒 Files selected for processing (27)
chart/templates/skyhook-crd.yamldocs/interrupt_flow.mdk8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout.yamloperator/CHANGELOG.mdoperator/api/v1alpha1/skyhook_types.gooperator/api/v1alpha1/skyhook_webhook.gooperator/api/v1alpha1/skyhook_webhook_test.gooperator/api/v1alpha1/zz_generated.deepcopy.gooperator/cmd/cli/app/node/node_reset.gooperator/cmd/cli/app/node/node_reset_test.gooperator/cmd/cli/app/node/node_status.gooperator/cmd/cli/app/reset.gooperator/cmd/cli/app/reset_test.gooperator/config/crd/bases/skyhook.nvidia.com_skyhooks.yamloperator/internal/controller/mock/SkyhookNodes.gooperator/internal/controller/skyhook_controller.gooperator/internal/controller/skyhook_controller_test.gooperator/internal/drain/drain.gooperator/internal/drain/drain_suite_test.gooperator/internal/drain/drain_test.gooperator/internal/wrapper/mock/SkyhookNode.gooperator/internal/wrapper/mock/SkyhookNodeOnly.gooperator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
|
you should rebase, just merged a few pr's that should and seem too stabilize a few of the flaky tests. |
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
0440375 to
33874d1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
operator/internal/controller/skyhook_controller.go (1)
1253-1290:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winWrap raw errors with operation context in
HandleConfigUpdates.Several changed paths return bare
err, which drops call-site context and makes failures harder to diagnose.Proposed fix
- return false, false, err + return false, false, fmt.Errorf("checking package pod existence on node %s: %w", node.GetNode().Name, err) ... - return false, false, err + return false, false, fmt.Errorf("listing package pods on node %s: %w", node.GetNode().Name, err) ... - return false, false, err + return false, false, fmt.Errorf("deleting erroring pod %s/%s: %w", pod.Namespace, pod.Name, err)As per coding guidelines
Use stdlib errors + fmt.Errorf("…: %w", err) with %w for error wrapping ... Never return a bare err from a function that does more than one thing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/controller/skyhook_controller.go` around lines 1253 - 1290, In HandleConfigUpdates, several returns propagate bare errors (e.g., results from r.PodExists, r.dal.GetPods and r.Delete) losing context; update each error return to wrap the underlying error with fmt.Errorf and %w plus a short operation context (for example "checking pod existence for node %s package %s: %w", "listing pods for node %s package %s: %w", "deleting pod %s on node %s: %w") referencing the unique symbols r.PodExists, node.IsPackageComplete, node.PackageStatus, r.dal.GetPods and r.Delete so callers receive actionable context. Ensure all branches that currently do `return ..., err` become `return ..., fmt.Errorf("...: %w", err)`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@k8s-tests/chainsaw/skyhook/drain-config/timeout.yaml`:
- Around line 50-59: The drain-timeout package currently omits explicit
uninstall settings; update spec.packages.drain-timeout to include an uninstall
block with explicit boolean fields (uninstall.enabled and uninstall.apply) so
lifecycle is deterministic in tests—for example add uninstall: { enabled: true,
apply: true } under drain-timeout in timeout.yaml (adjust the true/false values
if your test requires disabled behavior).
---
Outside diff comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1253-1290: In HandleConfigUpdates, several returns propagate bare
errors (e.g., results from r.PodExists, r.dal.GetPods and r.Delete) losing
context; update each error return to wrap the underlying error with fmt.Errorf
and %w plus a short operation context (for example "checking pod existence for
node %s package %s: %w", "listing pods for node %s package %s: %w", "deleting
pod %s on node %s: %w") referencing the unique symbols r.PodExists,
node.IsPackageComplete, node.PackageStatus, r.dal.GetPods and r.Delete so
callers receive actionable context. Ensure all branches that currently do
`return ..., err` become `return ..., fmt.Errorf("...: %w", err)`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f352df6e-d4b5-4904-b104-6b2fed9dc997
📒 Files selected for processing (27)
chart/templates/skyhook-crd.yamldocs/interrupt_flow.mdk8s-tests/chainsaw/skyhook/drain-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/drain-config/disable-eviction.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout-assert.yamlk8s-tests/chainsaw/skyhook/drain-config/timeout.yamloperator/CHANGELOG.mdoperator/api/v1alpha1/skyhook_types.gooperator/api/v1alpha1/skyhook_webhook.gooperator/api/v1alpha1/skyhook_webhook_test.gooperator/api/v1alpha1/zz_generated.deepcopy.gooperator/cmd/cli/app/node/node_reset.gooperator/cmd/cli/app/node/node_reset_test.gooperator/cmd/cli/app/node/node_status.gooperator/cmd/cli/app/reset.gooperator/cmd/cli/app/reset_test.gooperator/config/crd/bases/skyhook.nvidia.com_skyhooks.yamloperator/internal/controller/mock/SkyhookNodes.gooperator/internal/controller/skyhook_controller.gooperator/internal/controller/skyhook_controller_test.gooperator/internal/drain/drain.gooperator/internal/drain/drain_suite_test.gooperator/internal/drain/drain_test.gooperator/internal/wrapper/mock/SkyhookNode.gooperator/internal/wrapper/mock/SkyhookNodeOnly.gooperator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
ayuskauskas
left a comment
There was a problem hiding this comment.
Thanks for this — the internal/drain package is clean and the test coverage (unit + decision-table + chainsaw) is strong. One behavior I'd like documented before merge:
When a drain timeout expires, DrainNode marks the node erroring and stops issuing further evict/delete actions, leaving the node cordoned and the package stage halted. If the blocking condition doesn't clear on its own (e.g. force: false / deleteEmptyDirData: false, or a PDB that stays at zero allowed disruptions), the node will not recover automatically — it stays erroring until an operator intervenes.
Could you add a short "recovering from a drain timeout" note to docs/interrupt_flow.md covering:
- that the node stays cordoned and the Skyhook goes
erroringon timeout, - how to recover (e.g.
kubectl skyhook reset <skyhook> --confirm, and/or removing the underlying blocker), and - whether a blocker that clears after the timeout auto-resumes the drain or still requires a reset.
This makes the failure mode actionable for users who hit it in production.
(Two smaller blocking items are left as inline comments: a missing license header and the CHANGELOG placement.)
|
|
||
| ### New Features | ||
|
|
||
| - Add `spec.drainConfig` so interrupt drains can tune eviction, direct deletion, emptyDir handling, unmanaged-pod handling, DaemonSet skipping, timeout, and grace-period behavior. |
There was a problem hiding this comment.
This bullet was added under ## [operator/v0.16.0] (line 18), which is an already-released version — v0.16.1 (line 5) sits above it. Adding a new feature here retroactively rewrites the notes of a shipped release.
It should go in a new top section — ## [Unreleased] or the next version heading (e.g. ## [operator/v0.17.0]) — placed above the v0.16.1 entry.
lockwobr
left a comment
There was a problem hiding this comment.
The feature itself looks largely sound — the drain decision logic is isolated and well unit-tested, and the timeout→erroring path actually works in CI (the node carries both drainStart_* and status_*: erroring). The reason e2e/interrupt is red on every k8s version is mostly a malformed assertion in the new test, not the operator. Details inline. Also two structural items: consolidate the drain helpers into the drain package with direct tests, and call out a couple of bundled behavior changes.
Note: e2e/interrupt fails on all four versions (1.32–1.35); e2e/core and unit-tests pass.
| skyhook.nvidia.com/status_drain-config-timeout: erroring | ||
| annotations: | ||
| skyhook.nvidia.com/status_drain-config-timeout: erroring | ||
| (contains(@, 'skyhook.nvidia.com/drainStart_drain-config-timeout')): true |
There was a problem hiding this comment.
This is what's actually failing CI on every k8s version. chainsaw evaluates this and errors:
metadata.annotations.(contains(@, 'skyhook.nvidia.com/drainStart_drain-config-timeout')): Internal error: invalid type for: map[...], expected: []functions.JpType{"array", "string"}
contains(@, key) only works on an array or string, but @ here is the annotations map, so it's a type error — a hard assertion failure, not a retryable mismatch.
Important: the operator behavior is correct. The catch dump shows the node really does carry skyhook.nvidia.com/drainStart_drain-config-timeout: 2026-...Z and skyhook.nvidia.com/status_drain-config-timeout: erroring. So this is a test-assertion bug, not a product bug — the timeout→erroring + drainStart persistence path works. Rewrite the existence check in a valid form, e.g. assert the specific key is non-null:
annotations:
skyhook.nvidia.com/status_drain-config-timeout: erroring
("skyhook.nvidia.com/drainStart_drain-config-timeout" in keys(@)): true(or drop the dynamic check and assert the concrete key with a non-empty matcher).
| status: | ||
| status: erroring | ||
| nodeStatus: | ||
| (values(@)): |
There was a problem hiding this comment.
Secondary: the Nodewright assert also ran to the 240s timeout. Once the Node assert above is fixed, double-check this (values(@)): [erroring] matches reality — with one node status.nodeStatus should be {node: erroring} → values [erroring], but verify there isn't a transient extra entry. The hard error on line 25 is the primary blocker; this one may just be collateral from the same step failing.
There was a problem hiding this comment.
Thanks, I checked this after fixing the node assertion. I kept the nodeStatus values check and removed the cluster-scoped namespace expectation.
| func ShouldEvict(pod *corev1.Pod) bool { | ||
| switch pod.Status.Phase { | ||
| case corev1.PodRunning, corev1.PodPending: | ||
| func resolvedDrainOptions(config *v1alpha1.DrainConfig) drain.Options { |
There was a problem hiding this comment.
Consolidation: move resolvedDrainOptions, drainDeleteOptions, drainEvictionDeleteOptions, and drainTimedOut out of the controller and into internal/drain, so it's the single home for drain behavior. Depending on v1alpha1 and controller-runtime client from drain is fine (no cycle — v1alpha1 doesn't import drain). Suggested shapes:
drain.OptionsFromConfig(*v1alpha1.DrainConfig) Optionsdrain.TimedOut(startedAt *metav1.Time, timeout *metav1.Duration, now time.Time) bool(o Options) DeleteOptions() []client.DeleteOptionand(o Options) EvictionDeleteOptions() *metav1.DeleteOptions
The controller is then left with just IsDrained/DrainNode orchestration.
And please add direct unit tests for the moved logic (currently only hit indirectly via the envtest DrainNode specs):
OptionsFromConfig: nil config → defaults; each field override; grace-period rounding —0s→0,500ms→1,1s→1,1500ms→2(the% time.Secondround-up branch is untested).TimedOut: nil start → false; nil/zero timeout → false; exactly at deadline → true; just before → false.- delete-option builders: nil grace → nil; set grace → carries the value.
They're pure functions, so they slot into drain_test.go next to the DecidePod table.
| return d.Action == ActionEvict || d.Action == ActionDelete | ||
| } | ||
|
|
||
| func DecidePod(pod *corev1.Pod, options Options) Decision { |
There was a problem hiding this comment.
The default (no drainConfig) path is subtly not byte-for-byte the old ShouldEvict, even though the PR frames defaults as preserving existing behavior:
- unschedulable toleration now uses real
ToleratesTaint(NoSchedule effect) vs the old "any toleration whose key is unschedulable" — stricter; - terminating pods (
DeletionTimestamp \!= nil) and mirror/static pods are now ignored — new; - DaemonSet detection moved from the
len(ownerReferences) > 1heuristic toGetControllerOf(...).Kind == "DaemonSet".
These are all more correct, but they change the default drain filter that existing interrupt Nodewright already exercise. Worth a one-line "behavior change" note in the docs so upgraders aren't surprised. Should also make sure something about this is the change log.
There was a problem hiding this comment.
Agreed, thanks. I added this as a behavior change in docs/interrupt_flow.md, operator/CHANGELOG.md, and the PR body.
| for _, node := range nodeList.Items { | ||
| annotation, ok := node.Annotations[annotationKey] | ||
| if !ok { | ||
| if !hasResettableMetadata(node.Annotations, node.Labels, annotationKeys, labelKeys) { |
There was a problem hiding this comment.
The CLI changes here are contract-required (the operator added the drainStart_<skyhook> node annotation, so reset/node reset must clean it up — good). But this also bundles a behavior change to reset that isn't about drain config: selection moved from "node has the nodeState_ annotation" to "node has any resettable metadata", and a malformed nodeState no longer skips the node (it's reset with empty package state). See the test churn where a case flips from "No nodes have state" to "Successfully reset 1 node". This is arguably more correct, but please call it out in the PR description's behavior-changes (and ideally note it in the CLI changelog).
There was a problem hiding this comment.
Good point, I added this to the PR behavior changes and operator/cmd/cli/CHANGELOG.md.
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
|
I pushed the fixes and resolved the conflicts. Could you take another look when you have a chance? |
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Summary
Fixes #259.
Adds configurable drain behavior for interrupt-enabled Skyhooks through
spec.drainConfig, including eviction vs direct delete, emptyDir handling, unmanaged pod handling, DaemonSet handling, drain timeout, and drain grace period.Details
DrainConfigAPI field, validation, generated deepcopy, and CRD schema.internal/draindecision package.erroringand emit drain events.podNonInterruptLabelsas a pre-drain barrier before drain selection runs.docs/interrupt_flow.md.Behavior Changes
ToleratesTaintmatching, and DaemonSet pods are detected from their controller owner reference.erroring. If the blocker later clears, reconcile can continue from current cluster state; the recommended production recovery remains clearing the blocker and runningkubectl skyhook reset <skyhook> --confirmorkubectl skyhook node reset ....resetandnode resetnow select nodes with any resettable Skyhook metadata, including status, cordon, and drain-start metadata. MalformednodeStateannotations are reset with empty package state instead of causing the node to be skipped.Validation
make fmtgo test -mod=vendor ./internal/draingo test -mod=vendor -run '^$' ./internal/controller./bin/ginkgo -mod=vendor --focus 'DrainNode' --timeout=300s ./internal/controllergo test -mod=vendor -run '^$' ./...git diff --checkk8s-tests/chainsaw/skyhook/drain-config/*.yamlwith PyYAML