feat(engine/slinky): report why useGpuCliqueLabel found no matching nodes #379
giuliocalzo wants to merge 1 commit into
…odes

When useGpuCliqueLabel=true produces no block domains, the engine returned a generic "no matching nodes found" error that gave operators no way to tell whether the problem was missing slurmd pods, absent GPU clique labels, or node-data-broker annotations that had not landed yet.

The error and per-node warnings now report how many nodes were scanned and why each was skipped, and list the nodes that carry the nvidia.com/gpu.clique label but are missing the node-data-broker-written topograph.nvidia.com/instance annotation (capped to keep the message bounded on large clusters).

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Greptile Summary
This PR improves the

Confidence Score: 5/5
Safe to merge: changes are additive, isolated to the failure path of withGPUCliqueDomains, and cannot affect successful topology generation.

The only code path touched is the len(domains) == 0 error branch: counters are incremented during the existing loop, and the new helper only runs when the function was already going to return an error. The success path is completely unchanged. Two targeted tests verify the new counter fields and the node-name list in the error message. No files require special attention.

Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[withGPUCliqueDomains called] --> B[Init counters]
B --> C{For each node}
C --> D{SLURM name in nodeMap?}
D -- No --> E[noSlurmName++]
D -- Yes --> F{gpu.clique label non-empty?}
F -- No --> G[noCliqueLabel++]
F -- Yes --> H{instance annotation present?}
H -- No --> I[append to missingAnnotation]
H -- Yes --> J[domains.AddHost]
E --> C
G --> C
I --> C
J --> C
C -- done --> K{len domains == 0?}
K -- No --> L[Return graph with domains]
K -- Yes --> M[formatMissingAnnotationNodes]
M --> N[Return 502 with diagnostic breakdown]
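The decision order in the flowchart can be sketched in Go as follows. This is a minimal, hypothetical illustration rather than the PR's code: the `node` and `skipStats` types, the `classify` function, and the plain map used in place of `domains.AddHost` are assumptions made for the example. Only the label key, the annotation key, and the counter names come from the PR text and flowchart. Whether the real code treats an empty annotation value as missing is not stated, so the sketch checks presence only.

```go
package main

import "fmt"

// Label and annotation keys named in the PR description.
const (
	cliqueLabel        = "nvidia.com/gpu.clique"
	instanceAnnotation = "topograph.nvidia.com/instance"
)

// node is a hypothetical stand-in for a Kubernetes node object.
type node struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// skipStats is a hypothetical container for the counters shown in the flowchart.
type skipStats struct {
	scanned           int
	noSlurmName       int
	noCliqueLabel     int
	missingAnnotation []string
}

// classify applies the flowchart's checks in order and returns a map of
// clique ID to instance IDs (an assumed stand-in for the real domain map)
// together with the reasons nodes were skipped.
func classify(nodes []node, nodeMap map[string]string) (map[string][]string, skipStats) {
	domains := make(map[string][]string)
	var st skipStats
	for _, n := range nodes {
		st.scanned++
		// 1. The node must map to a Slurm node name.
		if _, ok := nodeMap[n.Name]; !ok {
			st.noSlurmName++
			continue
		}
		// 2. The GPU clique label must be non-empty.
		clique := n.Labels[cliqueLabel]
		if clique == "" {
			st.noCliqueLabel++
			continue
		}
		// 3. The instance annotation must be present.
		instance, ok := n.Annotations[instanceAnnotation]
		if !ok {
			st.missingAnnotation = append(st.missingAnnotation, n.Name)
			continue
		}
		domains[clique] = append(domains[clique], instance)
	}
	return domains, st
}

func main() {
	nodes := []node{
		{Name: "n1"},
		{Name: "n2"},
		{Name: "n3", Labels: map[string]string{cliqueLabel: "c1"}},
	}
	nodeMap := map[string]string{"n2": "slurm-2", "n3": "slurm-3"}
	domains, st := classify(nodes, nodeMap)
	fmt.Println(len(domains), st.scanned, st.noSlurmName, st.noCliqueLabel, st.missingAnnotation)
}
```

Because each check short-circuits with `continue`, every scanned node lands in exactly one bucket, so the three skip counts plus the nodes added to domains sum to the scanned total.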
Reviews (1): Last reviewed commit: "feat(engine/slinky): report why useGpuCl..."
Description
When the Slinky engine runs with useGpuCliqueLabel=true and cannot build any block domains, it previously returned a generic "no matching nodes found" error.

This gave operators no way to tell which nodes were examined or why each was skipped. In particular, the topograph.nvidia.com/instance annotation is written per node by the node-data-broker DaemonSet, so a node with the clique label but no annotation points at a broker that hasn't annotated that specific node yet, but the old message hid this.

This PR makes the failure actionable:

The 502 error now reports how many nodes were scanned and a breakdown of why each was skipped: no Slurm node mapping (no Ready slurmd pod), missing nvidia.com/gpu.clique label, or missing the topograph.nvidia.com/instance annotation.

Example new message:
Checklist
git commit -s).