Skip to content

fix(slinky): consolidate and guard pod-to-SLURM node name resolution#380

Draft
faganihajizada wants to merge 2 commits into
NVIDIA:mainfrom
faganihajizada:fix/slinky-empty-host-guard
Draft

fix(slinky): consolidate and guard pod-to-SLURM node name resolution#380
faganihajizada wants to merge 2 commits into
NVIDIA:mainfrom
faganihajizada:fix/slinky-empty-host-guard

Conversation

@faganihajizada

Copy link
Copy Markdown
Member

Description

Cleans up and hardens how the Slinky engine maps a slurmd pod to its SLURM node name.

The label-then-hostname resolution (the slurm.node.name label, falling back to pod.Spec.Hostname) was duplicated inline in both getClusterNodes and listPartitionNodes. This PR consolidates it into a single resolveSlurmNodeName helper and fixes a latent bug the duplication hid.

Refactor (behavior-preserving): extract resolveSlurmNodeName(pod) as the single source of truth for the label/hostname resolution; both call sites now use it. Each caller keeps its own surrounding logic, so behavior is unchanged.

Fix: getClusterNodes previously mapped a Ready pod's Kubernetes node to an empty SLURM node name when the pod had neither the slurm.node.name label nor a hostname, leaving a node -> "" entry in the node map (which surfaced downstream as a bogus instance -> "" in the compute-instance mapping). It now skips such pods with a warning, mirroring the existing guard in listPartitionNodes. Keeping empty values out of the map at the source means all downstream consumers (compute-instance mapping, GPU-clique domains, node reconciliation) can rely on every mapped node having a real SLURM name.

Tests:

  • TestResolveSlurmNodeName: label precedence, present-but-empty label (returns "", no hostname fallback), hostname fallback, and the empty case.
  • TestGetClusterNodes: label mapping, hostname fallback, the empty-name skip (regression guard for the fix), and the not-Ready skip.

No user-facing behavior or configuration changes; scontrol-based discovery is unchanged.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

Consolidate the duplicated pod-to-SLURM-node-name resolution (the
slurm.node.name label with a fallback to pod.Spec.Hostname), previously
inlined in both getClusterNodes and listPartitionNodes, into a single
resolveSlurmNodeName helper.

Behavior-preserving: each caller keeps its own surrounding logic, so
listPartitionNodes still skips empty host names and getClusterNodes is
unchanged. Add a table-driven unit test covering label precedence, the
present-but-empty label semantic, hostname fallback, and the empty case.

Signed-off-by: Fagani Hajizada <fhajizada@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR consolidates the duplicated slurm.node.name-label-then-hostname resolution logic into a single resolveSlurmNodeName helper and fixes a latent bug in getClusterNodes where a pod with no resolvable SLURM name would silently insert a k8s-node → "" entry into nodeMap, propagating a bogus empty instance downstream.

  • resolveSlurmNodeName is extracted as the canonical resolution function and called by both getClusterNodes and listPartitionNodes; the former now skips unresolvable pods with a warning, matching the guard that already existed in the latter.
  • TestResolveSlurmNodeName covers label precedence, present-but-empty label (no hostname fallback), hostname fallback, and the fully-empty case; TestGetClusterNodes adds an integration-level regression guard for the empty-name skip.

Confidence Score: 4/5

Safe to merge; the refactor is behavior-preserving at both call sites and the bug fix closes a real gap where empty SLURM names could propagate to compute-instance mapping.

The logic is straightforward and the new tests cover the primary cases. The only gap is that TestGetClusterNodes does not exercise the label-present-but-empty branch of resolveSlurmNodeName, leaving one skip condition untested at the integration level.

pkg/engines/slinky/engine_test.go — the TestGetClusterNodes helper pod factory only adds the label key when the value is non-empty, so the present-but-empty-label scenario is not exercised there.

Important Files Changed

Filename Overview
pkg/engines/slinky/engine.go Extracts resolveSlurmNodeName helper and adds a guard in getClusterNodes to skip pods with an unresolvable SLURM name, preventing empty-key entries in nodeMap; both call sites now use the shared helper.
pkg/engines/slinky/engine_test.go Adds TestResolveSlurmNodeName (4 cases) and TestGetClusterNodes (4 pod scenarios); the present-but-empty-label skip path is tested at the helper level but not exercised in the integration test.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod list from k8s API] --> B{IsPodReady?}
    B -- No --> C[skip pod]
    B -- Yes --> D[resolveSlurmNodeName pod]
    D --> E{KeySlurmNodeName label present?}
    E -- Yes --> F[return label value - may be empty]
    E -- No --> G[return pod.Spec.Hostname - may be empty]
    F --> H{host == empty?}
    G --> H
    H -- Yes --> I[log Warning, skip pod]
    H -- No --> J[nodeMap pod.Spec.NodeName = host]
    J --> K[getComputeInstances / downstream consumers]
    K --> L{nodeMap lookup ok?}
    L -- No --> M[skip k8s node]
    L -- Yes --> N[build compute instance mapping]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[Pod list from k8s API] --> B{IsPodReady?}
    B -- No --> C[skip pod]
    B -- Yes --> D[resolveSlurmNodeName pod]
    D --> E{KeySlurmNodeName label present?}
    E -- Yes --> F[return label value - may be empty]
    E -- No --> G[return pod.Spec.Hostname - may be empty]
    F --> H{host == empty?}
    G --> H
    H -- Yes --> I[log Warning, skip pod]
    H -- No --> J[nodeMap pod.Spec.NodeName = host]
    J --> K[getComputeInstances / downstream consumers]
    K --> L{nodeMap lookup ok?}
    L -- No --> M[skip k8s node]
    L -- Yes --> N[build compute instance mapping]
Loading

Reviews (1): Last reviewed commit: "fix(slinky): skip pods without a resolva..." | Re-trigger Greptile

Comment thread pkg/engines/slinky/engine_test.go
@faganihajizada faganihajizada marked this pull request as draft July 2, 2026 12:24
getClusterNodes mapped a pod's Kubernetes node to an empty SLURM node
name when the pod had neither the slurm.node.name label nor a hostname,
producing a bogus instance->"" entry in the compute-instance mapping.
Guard against an empty resolved name (skip + warn), mirroring the
existing behavior in listPartitionNodes.

Add TestGetClusterNodes covering the label mapping, hostname fallback,
the empty-name skip (both absent-label and present-but-empty-label),
and the not-Ready skip.

Signed-off-by: Fagani Hajizada <fhajizada@nvidia.com>
@faganihajizada faganihajizada force-pushed the fix/slinky-empty-host-guard branch from 0cec04b to e164523 Compare July 2, 2026 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant