Skip to content

feat(engine/slinky): report why useGpuCliqueLabel found no matching nodes#379

Open
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/slinky-clique-diagnostics
Open

feat(engine/slinky): report why useGpuCliqueLabel found no matching nodes#379
giuliocalzo wants to merge 1 commit into
NVIDIA:mainfrom
giuliocalzo:feat/slinky-clique-diagnostics

Conversation

@giuliocalzo

Copy link
Copy Markdown
Collaborator

Description

When the Slinky engine runs with useGpuCliqueLabel=true and cannot build any block domains, it previously returned a generic error:

useGpuCliqueLabel=true but no matching nodes found; check label "nvidia.com/gpu.clique" and annotation "topograph.nvidia.com/instance"

This gave operators no way to tell which nodes were examined or why each was skipped. In particular, the topograph.nvidia.com/instance annotation is written per-node by the node-data-broker DaemonSet, so a node with the clique label but no annotation points at a broker that hasn't annotated that specific node yet — but the old message hid this.

This PR makes the failure actionable:

  • The 502 error now reports how many nodes were scanned and a breakdown of why each was skipped: no Slurm node mapping (no Ready slurmd pod), missing nvidia.com/gpu.clique label, or missing the topograph.nvidia.com/instance annotation.
  • It lists the node names that carry the clique label but are missing the broker-written annotation (capped to keep the message bounded on large clusters).
  • The per-node warning now includes the node name, its Slurm name, and the clique value.

Example new message:

useGpuCliqueLabel=true but no matching nodes found; check label "nvidia.com/gpu.clique" and annotation "topograph.nvidia.com/instance". Scanned 8 node(s): 2 without a SLURM node mapping (no Ready slurmd pod), 3 missing the "nvidia.com/gpu.clique" label, 3 with the label but missing the "topograph.nvidia.com/instance" annotation (nodes missing annotation: gpu-node-1, gpu-node-4, gpu-node-7)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

…odes

When useGpuCliqueLabel=true produces no block domains, the engine returned a
generic "no matching nodes found" error that gave operators no way to tell
whether the problem was missing slurmd pods, absent GPU clique labels, or
node-data-broker annotations that had not landed yet.

The error and per-node warnings now report how many nodes were scanned and why
each was skipped, and list the nodes that carry the nvidia.com/gpu.clique label
but are missing the node-data-broker-written topograph.nvidia.com/instance
annotation (capped to keep the message bounded on large clusters).

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR improves the useGpuCliqueLabel error path in the Slinky engine so that when no block domains can be built, the 502 response now includes a structured breakdown of why each scanned node was skipped instead of a generic "no matching nodes found" message.

  • withGPUCliqueDomains gains three diagnostic counters (noSlurmName, noCliqueLabel, and a missingAnnotation slice) collected during the node loop; on failure these are formatted into a single actionable error string via the new formatMissingAnnotationNodes helper, which caps the listed names at 10 to keep messages bounded on large clusters.
  • Two new/updated tests cover the no-clique-label scenario (counter assertions added) and the clique-label-present-but-annotation-absent scenario introduced by this PR.

Confidence Score: 5/5

Safe to merge — changes are additive, isolated to the failure path of withGPUCliqueDomains, and cannot affect successful topology generation.

The only code path touched is the len(domains) == 0 error branch: counters are incremented during the existing loop, and the new helper only runs when the function was already going to return an error. The success path is completely unchanged. Two targeted tests verify the new counter fields and the node-name list in the error message.

No files require special attention.

Important Files Changed

Filename Overview
pkg/engines/slinky/engine.go Adds diagnostic counters (noSlurmName, noCliqueLabel, missingAnnotation list) to withGPUCliqueDomains and a new formatMissingAnnotationNodes helper; error message now reports per-skip-reason counts and up to 10 offending node names.
pkg/engines/slinky/engine_test.go Extends the existing no-matching-nodes test to assert the new counter fields, and adds TestWithGPUCliqueDomainsMissingBrokerAnnotation for the clique-label-present-but-annotation-absent case.
CHANGELOG.md Adds a Changed entry for the improved useGpuCliqueLabel diagnostic; placed correctly before the existing Fixed section per Keep-a-Changelog ordering.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[withGPUCliqueDomains called] --> B[Init counters]
    B --> C{For each node}
    C --> D{SLURM name in nodeMap?}
    D -- No --> E[noSlurmName++]
    D -- Yes --> F{gpu.clique label non-empty?}
    F -- No --> G[noCliqueLabel++]
    F -- Yes --> H{instance annotation present?}
    H -- No --> I[append to missingAnnotation]
    H -- Yes --> J[domains.AddHost]
    E --> C
    G --> C
    I --> C
    J --> C
    C -- done --> K{len domains == 0?}
    K -- No --> L[Return graph with domains]
    K -- Yes --> M[formatMissingAnnotationNodes]
    M --> N[Return 502 with diagnostic breakdown]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[withGPUCliqueDomains called] --> B[Init counters]
    B --> C{For each node}
    C --> D{SLURM name in nodeMap?}
    D -- No --> E[noSlurmName++]
    D -- Yes --> F{gpu.clique label non-empty?}
    F -- No --> G[noCliqueLabel++]
    F -- Yes --> H{instance annotation present?}
    H -- No --> I[append to missingAnnotation]
    H -- Yes --> J[domains.AddHost]
    E --> C
    G --> C
    I --> C
    J --> C
    C -- done --> K{len domains == 0?}
    K -- No --> L[Return graph with domains]
    K -- Yes --> M[formatMissingAnnotationNodes]
    M --> N[Return 502 with diagnostic breakdown]
Loading

Reviews (1): Last reviewed commit: "feat(engine/slinky): report why useGpuCl..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant