fix(controller): preserve shared Skyhook cordons #275

Open
fallintoplace wants to merge 1 commit into NVIDIA:main from fallintoplace:fix/cordon-ownership

Conversation

@fallintoplace fallintoplace commented Jun 12, 2026

Summary

  • Remove only the current Skyhook cordon annotation before deciding whether to clear spec.unschedulable.
  • Keep a node unschedulable while any skyhook.nvidia.com/cordon_* annotation remains.
  • Initialize node annotations before recording cordon state.
  • Add wrapper tests for cordon initialization plus sole-owner, shared-owner, and non-owner uncordon cases.

Why

Previously, one Skyhook completing could clear spec.unschedulable even though another Skyhook still had a cordon annotation on the same node. While touching that path, Cordon() also needed to tolerate nodes without annotations so recording cordon ownership cannot panic.
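The ownership rule described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `node` struct, the function bodies, the `"true"` annotation value, and the no-op behavior for a non-owner are all assumptions. Only the `skyhook.nvidia.com/cordon_*` key pattern and the helper names come from this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// Key prefix assumed from the skyhook.nvidia.com/cordon_* pattern in the summary.
const cordonPrefix = "skyhook.nvidia.com/cordon_"

// node is a hypothetical stand-in for the wrapper's node type.
type node struct {
	Annotations   map[string]string
	Unschedulable bool
}

func cordonAnnotationKey(skyhookName string) string {
	return cordonPrefix + skyhookName
}

// hasSkyhookCordon reports whether any Skyhook still holds a cordon annotation.
func hasSkyhookCordon(annotations map[string]string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, cordonPrefix) {
			return true
		}
	}
	return false
}

// cordon records ownership and marks the node unschedulable.
// Writing to a nil map panics in Go, so the map is initialized first.
func cordon(n *node, skyhookName string) {
	if n.Annotations == nil {
		n.Annotations = make(map[string]string)
	}
	n.Annotations[cordonAnnotationKey(skyhookName)] = "true" // value is an assumption
	n.Unschedulable = true
}

// uncordon releases only the caller's cordon; the node becomes schedulable
// again only once no cordon annotation remains.
func uncordon(n *node, skyhookName string) {
	key := cordonAnnotationKey(skyhookName)
	if _, owned := n.Annotations[key]; !owned {
		return // assumed behavior: a non-owner changes nothing
	}
	delete(n.Annotations, key)
	if !hasSkyhookCordon(n.Annotations) {
		n.Unschedulable = false
	}
}

func main() {
	n := &node{} // Annotations starts nil
	cordon(n, "my-skyhook")
	cordon(n, "other-skyhook")

	uncordon(n, "my-skyhook")
	fmt.Println(n.Unschedulable) // true: other-skyhook still holds a cordon

	uncordon(n, "other-skyhook")
	fmt.Println(n.Unschedulable) // false: last owner released
}
```

The sketch treats `spec.unschedulable` as derived state: it is the logical OR over all remaining cordon annotations, so each Skyhook only ever edits its own key.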

Validation

  • go test ./internal/wrapper
  • go test ./...

@github-actions

Welcome to NodeWright, @fallintoplace! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO (git commit -s)
  • Commits follow Conventional Commits
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions github-actions Bot added the component/operator (Skyhook operator, controller-manager) and component/ci (CI workflows, GitHub Actions, and repo tooling) labels Jun 12, 2026
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from b7eb775 to b3189d3 Compare June 12, 2026 18:41
@fallintoplace fallintoplace changed the title from "Preserve shared Skyhook cordons on uncordon" to "fix(controller): preserve shared Skyhook cordons" Jun 12, 2026
coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5d92f2ff-3713-4872-afb8-7cae750c8e90

📥 Commits

Reviewing files that changed from the base of the PR and between 8b59e04 and 4d8d5dc.

📒 Files selected for processing (2)
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

📝 Walkthrough

This PR refactors node cordon/uncordon handling to support multiple skyhooks managing cordon state independently. It adds two helpers—cordonAnnotationKey and hasSkyhookCordon—then updates Uncordon to delete only the caller's cordon annotation and clear Spec.Unschedulable only if no other skyhooks remain. Cordon and Reset now use the centralized annotation key. Tests add three Uncordon scenarios: sole owner, co-owner, and non-owner.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 passed

  • Title check: ✅ Passed. The title accurately captures the main change: fixing cordon handling to preserve shared Skyhook cordons across multiple Skyhooks on the same node.
  • Description check: ✅ Passed. The description clearly relates to the changeset, explaining why the fix is needed (preventing one Skyhook from clearing spec.unschedulable when another still has a cordon) and validating the changes.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/wrapper/node.go (1)

478-490: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding a brief comment explaining the multi-skyhook coordination pattern.

The conditional clearing of node.Spec.Unschedulable based on hasSkyhookCordon (lines 485-487) implements a subtle multi-owner pattern that might not be immediately obvious to future maintainers. As per coding guidelines, code that is "unusual, surprising, or breaks a pattern" should include a comment explaining why.

Suggested comment above line 485:

// Multiple skyhooks can cordon the same node; only mark it schedulable when all cordon owners have released.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/wrapper/node.go` around lines 478 - 490, Add a short
clarifying comment above the conditional that clears node.Spec.Unschedulable in
skyhookNode.Uncordon to explain the multi-owner cordon coordination: note that
multiple skyhooks may set cordon annotations (use
cordonAnnotationKey/hasSkyhookCordon) and we should only set Spec.Unschedulable
= false when hasSkyhookCordon(...) returns false; place the comment immediately
before the hasSkyhookCordon check to make the intent clear to future
maintainers.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@operator/internal/wrapper/node.go`:
- Around line 478-490: Add a short clarifying comment above the conditional that
clears node.Spec.Unschedulable in skyhookNode.Uncordon to explain the
multi-owner cordon coordination: note that multiple skyhooks may set cordon
annotations (use cordonAnnotationKey/hasSkyhookCordon) and we should only set
Spec.Unschedulable = false when hasSkyhookCordon(...) returns false; place the
comment immediately before the hasSkyhookCordon check to make the intent clear
to future maintainers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d7afc486-bf99-4c82-af6f-9df06c441317

📥 Commits

Reviewing files that changed from the base of the PR and between 98fe42d and b7eb775.

📒 Files selected for processing (2)
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@operator/internal/wrapper/node_test.go`:
- Around line 181-244: Tests in the Uncordon context duplicate the raw
annotation key pattern "skyhook.nvidia.com/cordon_*"; update the three It blocks
to use the shared helper/constant instead of hard-coded strings by calling
cordonAnnotationKey("my-skyhook") and cordonAnnotationKey("other-skyhook") (or
define local constants) when constructing node.ObjectMeta.Annotations and when
asserting Expect(...).To(HaveKeyWithValue(...)) so all references match the
canonical key format used by NewSkyhookNodeOnly and Uncordon.

In `@operator/internal/wrapper/node.go`:
- Around line 469-474: The Cordon() method writes to node.Annotations[...]
without ensuring the map exists, causing a panic if Annotations is nil; modify
Cordon() to check if node.Annotations == nil and if so initialize it
(make(map[string]string)) before assigning cordonAnnotationKey(node.skyhookName)
and setting node.Spec.Unschedulable and node.updated so that the map write is
safe; update the logic inside skyhookNode.Cordon to perform this
nil-check/initialization before any writes to node.Annotations.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 458295f6-a03e-459f-8045-cc56af8cba42

📥 Commits

Reviewing files that changed from the base of the PR and between b7eb775 and b3189d3.

📒 Files selected for processing (2)
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread operator/internal/wrapper/node_test.go
Comment thread operator/internal/wrapper/node.go
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from b3189d3 to 8b59e04 Compare June 12, 2026 18:50
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from 8b59e04 to 4d8d5dc Compare June 12, 2026 18:56

@ayuskauskas ayuskauskas left a comment

Collaborator

Review

Solid, correctly-scoped bug fix. A node selected by two Skyhooks that both cordon it would be prematurely uncordoned when the first one completed — this fixes that by only clearing spec.unschedulable once no skyhook.nvidia.com/cordon_* annotation remains. Notes below.

Suggestions (minor)

  1. Unit test gap worth closing — the whole point of the change is that hasSkyhookCordon matches the cordon_ prefix specifically, not the generic skyhook.nvidia.com/ prefix. Please add a case: node owned only by my-skyhook's cordon plus an unrelated annotation (e.g. status_other / nodeState_other) → Uncordon() should still set unschedulable=false. This pins the behavior that a non-cordon Skyhook annotation doesn't keep the node cordoned.
  2. DRY nit (low priority): the cordon_ literal now lives in three places — the CLI const cordonAnnotationPrefix (cmd/cli/app/reset.go), the new cordonAnnotationKey(), and the inline prefix inside hasSkyhookCordon. Consider defining cordonAnnotationKey and hasSkyhookCordon against a single shared prefix constant in the wrapper package so the literal appears once.
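Both suggestions can be illustrated together. The sketch below is an assumption about how the helpers might be laid out, not the code under review: it defines one shared prefix constant (the name `annotationDomain` is hypothetical) and checks that a non-cordon Skyhook annotation such as `status_other` does not count as a cordon.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical layout: the "cordon_" literal appears exactly once.
const (
	annotationDomain       = "skyhook.nvidia.com/"
	cordonAnnotationPrefix = annotationDomain + "cordon_"
)

// cordonAnnotationKey builds the per-Skyhook cordon key from the shared prefix.
func cordonAnnotationKey(skyhookName string) string {
	return cordonAnnotationPrefix + skyhookName
}

// hasSkyhookCordon matches the cordon prefix specifically, not the whole domain.
func hasSkyhookCordon(annotations map[string]string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, cordonAnnotationPrefix) {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		name        string
		annotations map[string]string
		want        bool
	}{
		{"no annotations", nil, false},
		{"unrelated Skyhook annotation only",
			map[string]string{annotationDomain + "status_other": "complete"}, false},
		{"cordon held by another Skyhook",
			map[string]string{cordonAnnotationKey("other-skyhook"): "true"}, true},
	}
	for _, c := range cases {
		got := hasSkyhookCordon(c.annotations)
		fmt.Printf("%-36s got=%v want=%v\n", c.name, got, c.want)
	}
}
```

The middle case is the one the review asks to pin down: if the helper matched the generic `skyhook.nvidia.com/` prefix instead, that case would return true and the node would stay cordoned.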

Please add an e2e test verifying the fix

The unit tests exercise the wrapper in isolation, but the bug is fundamentally about two Skyhooks racing over one node's schedulability. Please add a chainsaw e2e (under k8s-tests/chainsaw/skyhook/) that:

  • Deploys two Skyhooks, both requiring interrupts, selecting the same node.
  • Asserts the node stays cordoned (spec.unschedulable: true) after the first Skyhook reaches complete.
  • Asserts the node only becomes schedulable (spec.unschedulable: false, no cordon_* annotations) after both Skyhooks complete.

This is the regression guard that proves the fix end-to-end; a passing unit suite alone wouldn't have caught the original bug at the rollout level.

Documentation

Orphaned cordon annotation = stuck unschedulable. With this change, a stale cordon_<gone-skyhook> annotation (left by a force-delete that bypasses the finalizer, or by a failed cleanup) keeps the node unschedulable indefinitely from the operator's view, because hasSkyhookCordon keeps seeing the key. Previously, any completing Skyhook would free it (incorrectly, as a side effect). Recovery exists (kubectl skyhook reset / node reset strip cordon annotations), so this is an acceptable trade for correctness. Please update docs/interrupt_flow.md to document this, along with the high-level logic introduced in this PR.
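To make the orphan case concrete, here is a hypothetical diagnostic helper that is not part of this PR or of the CLI. Given a node's annotations and the set of Skyhook names that still exist, it lists cordon owners that no longer have a live Skyhook. The function name, the `live` set, and the annotation values are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Prefix assumed from the skyhook.nvidia.com/cordon_* pattern discussed in this PR.
const cordonPrefix = "skyhook.nvidia.com/cordon_"

// orphanedCordonOwners returns, in sorted order, the Skyhook names that hold a
// cordon annotation on the node but are absent from the live set.
func orphanedCordonOwners(annotations map[string]string, live map[string]bool) []string {
	var orphans []string
	for key := range annotations {
		if !strings.HasPrefix(key, cordonPrefix) {
			continue // not a cordon annotation
		}
		name := strings.TrimPrefix(key, cordonPrefix)
		if !live[name] {
			orphans = append(orphans, name)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	annotations := map[string]string{
		cordonPrefix + "gone-skyhook":          "true",
		cordonPrefix + "my-skyhook":            "true",
		"skyhook.nvidia.com/status_my-skyhook": "in_progress",
	}
	live := map[string]bool{"my-skyhook": true}

	fmt.Println(orphanedCordonOwners(annotations, live)) // [gone-skyhook]
}
```

A non-empty result means the node would stay unschedulable under the new logic until the stale key is stripped, for example through the reset paths mentioned above.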

Labels

  • component/ci: CI workflows, GitHub Actions, and repo tooling
  • component/operator: Skyhook operator (controller-manager)

3 participants