fix(controller): preserve shared Skyhook cordons#275
Conversation
|
Welcome to NodeWright, @fallintoplace! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
b7eb775 to
b3189d3
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR refactors node cordon/uncordon handling to support multiple skyhooks managing cordon state independently. It adds two helpers—cordonAnnotationKey and hasSkyhookCordon—then updates Uncordon to delete only the caller's cordon annotation and clear Spec.Unschedulable only if no other skyhooks remain. Cordon and Reset now use the centralized annotation key. Tests add three Uncordon scenarios: sole owner, co-owner, and non-owner. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
operator/internal/wrapper/node.go (1)
478-490: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winConsider adding a brief comment explaining the multi-skyhook coordination pattern.
The conditional clearing of
node.Spec.Unschedulablebased onhasSkyhookCordon(lines 485-487) implements a subtle multi-owner pattern that might not be immediately obvious to future maintainers. As per coding guidelines, code that is "unusual, surprising, or breaks a pattern" should include a comment explaining why.Suggested comment above line 485:
// Multiple skyhooks can cordon the same node; only mark it schedulable when all cordon owners have released.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@operator/internal/wrapper/node.go` around lines 478 - 490, Add a short clarifying comment above the conditional that clears node.Spec.Unschedulable in skyhookNode.Uncordon to explain the multi-owner cordon coordination: note that multiple skyhooks may set cordon annotations (use cordonAnnotationKey/hasSkyhookCordon) and we should only set Spec.Unschedulable = false when hasSkyhookCordon(...) returns false; place the comment immediately before the hasSkyhookCordon check to make the intent clear to future maintainers.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@operator/internal/wrapper/node.go`:
- Around line 478-490: Add a short clarifying comment above the conditional that
clears node.Spec.Unschedulable in skyhookNode.Uncordon to explain the
multi-owner cordon coordination: note that multiple skyhooks may set cordon
annotations (use cordonAnnotationKey/hasSkyhookCordon) and we should only set
Spec.Unschedulable = false when hasSkyhookCordon(...) returns false; place the
comment immediately before the hasSkyhookCordon check to make the intent clear
to future maintainers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d7afc486-bf99-4c82-af6f-9df06c441317
📒 Files selected for processing (2)
operator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@operator/internal/wrapper/node_test.go`:
- Around line 181-244: Tests in the Uncordon context duplicate the raw
annotation key pattern "skyhook.nvidia.com/cordon_*"; update the three It blocks
to use the shared helper/constant instead of hard-coded strings by calling
cordonAnnotationKey("my-skyhook") and cordonAnnotationKey("other-skyhook") (or
define local constants) when constructing node.ObjectMeta.Annotations and when
asserting Expect(...).To(HaveKeyWithValue(...)) so all references match the
canonical key format used by NewSkyhookNodeOnly and Uncordon.
In `@operator/internal/wrapper/node.go`:
- Around line 469-474: The Cordon() method writes to node.Annotations[...]
without ensuring the map exists, causing a panic if Annotations is nil; modify
Cordon() to check if node.Annotations == nil and if so initialize it
(make(map[string]string)) before assigning cordonAnnotationKey(node.skyhookName)
and setting node.Spec.Unschedulable and node.updated so that the map write is
safe; update the logic inside skyhookNode.Cordon to perform this
nil-check/initialization before any writes to node.Annotations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 458295f6-a03e-459f-8045-cc56af8cba42
📒 Files selected for processing (2)
operator/internal/wrapper/node.gooperator/internal/wrapper/node_test.go
b3189d3 to
8b59e04
Compare
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
8b59e04 to
4d8d5dc
Compare
There was a problem hiding this comment.
Review
Solid, correctly-scoped bug fix. A node selected by two Skyhooks that both cordon it would be prematurely uncordoned when the first one completed — this fixes that by only clearing spec.unschedulable once no skyhook.nvidia.com/cordon_* annotation remains. Notes below.
Suggestions (minor)
- Unit test gap worth closing — the whole point of the change is that
hasSkyhookCordonmatches thecordon_prefix specifically, not the genericskyhook.nvidia.com/prefix. Please add a case: node owned only bymy-skyhook's cordon plus an unrelated annotation (e.g.status_other/nodeState_other) →Uncordon()should still setunschedulable=false. This pins the behavior that a non-cordon Skyhook annotation doesn't keep the node cordoned. - DRY nit (low priority): the
cordon_literal now lives in three places — the CLI constcordonAnnotationPrefix(cmd/cli/app/reset.go), the newcordonAnnotationKey(), and the inline prefix insidehasSkyhookCordon. Consider definingcordonAnnotationKeyandhasSkyhookCordonagainst a single shared prefix constant in the wrapper package so the literal appears once.
Please add an e2e test verifying the fix
The unit tests exercise the wrapper in isolation, but the bug is fundamentally about two Skyhooks racing over one node's schedulability. Please add a chainsaw e2e (under k8s-tests/chainsaw/skyhook/) that:
- Deploys two Skyhooks, both requiring interrupts, selecting the same node.
- Asserts the node stays cordoned (
spec.unschedulable: true) after the first Skyhook reachescomplete. - Asserts the node only becomes schedulable (
spec.unschedulable: false, nocordon_*annotations) after both Skyhooks complete.
This is the regression guard that proves the fix end-to-end; a passing unit suite alone wouldn't have caught the original bug at the rollout level.
Documentation
Orphaned cordon annotation = stuck-unschedulable. With this change, a stale cordon_<gone-skyhook> annotation (left by a force-delete that bypasses the finalizer, or a failed cleanup) will now keep the node unschedulable indefinitely from the operator's view — hasSkyhookCordon keeps seeing the key. Previously any completing Skyhook would (incorrectly, but as a side-effect) free it. Recovery exists (kubectl skyhook reset / node reset strip cordon annotations), so this is an acceptable trade for correctness. Please update the docs/interrupt_flow.md to document this as well as high level logic introduced in this PR.
Summary
spec.unschedulable.skyhook.nvidia.com/cordon_*annotation remains.Why
Previously, one Skyhook completing could clear
spec.unschedulableeven though another Skyhook still had a cordon annotation on the same node. While touching that path,Cordon()also needed to tolerate nodes without annotations so recording cordon ownership cannot panic.Validation
go test ./internal/wrappergo test ./...