fix(slinky): consolidate and guard pod-to-SLURM node name resolution #380
faganihajizada wants to merge 2 commits into
Consolidate the duplicated pod-to-SLURM-node-name resolution (the slurm.node.name label with a fallback to pod.Spec.Hostname), previously inlined in both getClusterNodes and listPartitionNodes, into a single resolveSlurmNodeName helper.

Behavior-preserving: each caller keeps its own surrounding logic, so listPartitionNodes still skips empty host names and getClusterNodes is unchanged.

Add a table-driven unit test covering label precedence, the present-but-empty label semantic, hostname fallback, and the empty case.

Signed-off-by: Fagani Hajizada <fhajizada@nvidia.com>
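The resolution rule described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `pod` struct is a hypothetical stand-in for the Kubernetes pod type, and the label key string `slurm.node.name` is assumed from the commit message (the real constant is referred to as `KeySlurmNodeName` in the review flowchart).

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for the Kubernetes pod type.
// Hostname plays the role of pod.Spec.Hostname.
type pod struct {
	Labels   map[string]string
	Hostname string
}

// keySlurmNodeName is an assumed label key, taken from the commit message.
const keySlurmNodeName = "slurm.node.name"

// resolveSlurmNodeName returns the label value whenever the label key is
// present, even if that value is empty. Only when the key is absent does it
// fall back to the pod hostname, which may itself be empty.
func resolveSlurmNodeName(p pod) string {
	if name, ok := p.Labels[keySlurmNodeName]; ok {
		return name
	}
	return p.Hostname
}

func main() {
	cases := []struct {
		desc string
		p    pod
	}{
		{"label wins over hostname", pod{Labels: map[string]string{keySlurmNodeName: "slurm-0"}, Hostname: "host-0"}},
		{"present-but-empty label, no fallback", pod{Labels: map[string]string{keySlurmNodeName: ""}, Hostname: "host-1"}},
		{"hostname fallback", pod{Hostname: "host-2"}},
		{"neither label nor hostname", pod{}},
	}
	for _, c := range cases {
		fmt.Printf("%s: %q\n", c.desc, resolveSlurmNodeName(c.p))
	}
}
```

The comma-ok map lookup is what distinguishes a present-but-empty label from an absent one; a plain `p.Labels[key] != ""` check would silently fall back to the hostname in the first of those two cases.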
Greptile Summary
This PR consolidates the duplicated
Confidence Score: 4/5
Safe to merge; the refactor is behavior-preserving at both call sites and the bug fix closes a real gap where empty SLURM names could propagate to compute-instance mapping.
The logic is straightforward and the new tests cover the primary cases. The only gap is that TestGetClusterNodes does not exercise the label-present-but-empty branch of resolveSlurmNodeName, leaving one skip condition untested at the integration level.
pkg/engines/slinky/engine_test.go: the TestGetClusterNodes helper pod factory only adds the label key when the value is non-empty, so the present-but-empty-label scenario is not exercised there.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod list from k8s API] --> B{IsPodReady?}
B -- No --> C[skip pod]
B -- Yes --> D[resolveSlurmNodeName pod]
D --> E{KeySlurmNodeName label present?}
E -- Yes --> F[return label value - may be empty]
E -- No --> G[return pod.Spec.Hostname - may be empty]
F --> H{host == empty?}
G --> H
H -- Yes --> I[log Warning, skip pod]
H -- No --> J[nodeMap pod.Spec.NodeName = host]
J --> K[getComputeInstances / downstream consumers]
K --> L{nodeMap lookup ok?}
L -- No --> M[skip k8s node]
L -- Yes --> N[build compute instance mapping]
Reviews (1): Last reviewed commit: "fix(slinky): skip pods without a resolva..."
getClusterNodes mapped a pod's Kubernetes node to an empty SLURM node name when the pod had neither the slurm.node.name label nor a hostname, producing a bogus instance->"" entry in the compute-instance mapping. Guard against an empty resolved name (skip + warn), mirroring the existing behavior in listPartitionNodes.

Add TestGetClusterNodes covering the label mapping, hostname fallback, the empty-name skip (both absent-label and present-but-empty-label), and the not-Ready skip.

Signed-off-by: Fagani Hajizada <fhajizada@nvidia.com>
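The guard described in this commit (skip not-Ready pods, resolve the name, then skip and warn on an empty result) might look roughly like the sketch below. Everything here is illustrative: `pod`, `buildNodeMap`, and the `slurm.node.name` key string are hypothetical stand-ins, and the standard `log` package is used only as a placeholder for whatever logger the engine actually uses.

```go
package main

import (
	"fmt"
	"log"
)

// pod is a hypothetical stand-in for the Kubernetes pod type.
// Hostname and NodeName play the roles of pod.Spec.Hostname and
// pod.Spec.NodeName; Ready stands in for the IsPodReady check.
type pod struct {
	Name     string
	Labels   map[string]string
	Hostname string
	NodeName string
	Ready    bool
}

// keySlurmNodeName is an assumed label key, taken from the commit message.
const keySlurmNodeName = "slurm.node.name"

// resolveSlurmNodeName prefers the label when the key is present (even if
// its value is empty) and otherwise falls back to the hostname.
func resolveSlurmNodeName(p pod) string {
	if name, ok := p.Labels[keySlurmNodeName]; ok {
		return name
	}
	return p.Hostname
}

// buildNodeMap is a hypothetical reduction of the getClusterNodes loop: it
// maps Kubernetes node name to SLURM node name and never stores an empty value.
func buildNodeMap(pods []pod) map[string]string {
	nodeMap := make(map[string]string)
	for _, p := range pods {
		if !p.Ready {
			continue // not-Ready pods are ignored
		}
		host := resolveSlurmNodeName(p)
		if host == "" {
			log.Printf("warning: pod %q has no resolvable SLURM node name; skipping", p.Name)
			continue
		}
		nodeMap[p.NodeName] = host
	}
	return nodeMap
}

func main() {
	pods := []pod{
		{Name: "slurmd-0", Labels: map[string]string{keySlurmNodeName: "slurm-0"}, NodeName: "k8s-a", Ready: true},
		{Name: "slurmd-1", Hostname: "host-1", NodeName: "k8s-b", Ready: true},
		{Name: "slurmd-2", NodeName: "k8s-c", Ready: true},                                               // no label, no hostname
		{Name: "slurmd-3", Labels: map[string]string{keySlurmNodeName: ""}, Hostname: "host-3", NodeName: "k8s-d", Ready: true}, // empty label
		{Name: "slurmd-4", Hostname: "host-4", NodeName: "k8s-e", Ready: false},                          // not Ready
	}
	fmt.Println(buildNodeMap(pods))
}
```

Filtering at the point where the map is built means consumers that look up a Kubernetes node in the map do not each need their own empty-string check.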
0cec04b to e164523
Description
Cleans up and hardens how the Slinky engine maps a slurmd pod to its SLURM node name.
The label-then-hostname resolution (the `slurm.node.name` label, falling back to `pod.Spec.Hostname`) was duplicated inline in both `getClusterNodes` and `listPartitionNodes`. This PR consolidates it into a single `resolveSlurmNodeName` helper and fixes a latent bug the duplication hid.

- Refactor (behavior-preserving): extract `resolveSlurmNodeName(pod)` as the single source of truth for the label/hostname resolution; both call sites now use it. Each caller keeps its own surrounding logic, so behavior is unchanged.
- Fix: `getClusterNodes` previously mapped a Ready pod's Kubernetes node to an empty SLURM node name when the pod had neither the `slurm.node.name` label nor a hostname, leaving a `node -> ""` entry in the node map (which surfaced downstream as a bogus `instance -> ""` in the compute-instance mapping). It now skips such pods with a warning, mirroring the existing guard in `listPartitionNodes`. Keeping empty values out of the map at the source means all downstream consumers (compute-instance mapping, GPU-clique domains, node reconciliation) can rely on every mapped node having a real SLURM name.
- Tests:
  - `TestResolveSlurmNodeName`: label precedence, present-but-empty label (returns `""`, no hostname fallback), hostname fallback, and the empty case.
  - `TestGetClusterNodes`: label mapping, hostname fallback, the empty-name skip (regression guard for the fix), and the not-Ready skip.

No user-facing behavior or configuration changes; `scontrol`-based discovery is unchanged.

Checklist
git commit -s).