
NCCL Benchmark Fails: nvidia-smi Not Found in Driver Chroot Environment #968

@dnugmanov

Bug Description

The NCCL benchmark CronJob consistently fails because nvidia-smi cannot be found when the job checks for running GPU processes. The failure occurs in the isRunningProcessOnGPU() function, which executes chroot /run/nvidia/driver nvidia-smi.

Error Details

Error Message:

{"error":"failed to execute nvidia-smi: exit status 127","level":"fatal","msg":"Failed to check running processes on GPU","slurmNode":"worker-0","time":"2025-06-04T14:16:56Z"}

Exit Code: 127 (command not found)
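
For readers unfamiliar with the status code: GNU coreutils chroot documents 125 as "chroot itself failed", 126 as "command found but could not be invoked", and 127 as "command not found"; any other value is the status of the command that ran. The sketch below is only an illustration of that mapping, not code from the operator. The helper name explain_chroot_status and the placeholder command name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): translate an exit status
# returned by GNU coreutils `chroot` into its documented meaning.
explain_chroot_status() {
  case "$1" in
    125) echo "chroot itself failed" ;;
    126) echo "command found but could not be invoked" ;;
    127) echo "command not found" ;;
    *)   echo "exit status of the command" ;;
  esac
}

# Reproduce status 127 without root privileges by running a command name
# that does not exist. The name below is a made-up placeholder.
status=0
sh -c 'nonexistent-command-for-demo' 2>/dev/null || status=$?

echo "status=$status: $(explain_chroot_status "$status")"
```

A status of 127 therefore suggests that chroot itself succeeded and the lookup of nvidia-smi inside the new root is what failed.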

Environment

  • Slurm Operator Version: Based on image cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.19.0-jammy-slurm24.05.5
  • Container Runtime: CRI-O

Investigation Results

root@worker-0:/# ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 3 root root  60 May 27 12:39 .
drwxr-xr-x 6 root root 120 May 27 12:38 ..
drwxr-xr-x 2 root root  40 May 27 12:39 etc

root@worker-0:/# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory

root@worker-0:/# which nvidia-smi
/usr/bin/nvidia-smi
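
The manual checks above can be folded into one diagnostic. This is a minimal sketch, assuming nvidia-smi would live in one of the usual bin directories under the driver root; the helper find_in_root and the directory list are hypothetical and are not taken from the operator's code. It does not follow absolute symlinks the way a real chroot would, so treat its output as a hint only.

```shell
#!/bin/sh
# Hypothetical helper: print the in-root path of BIN if an executable file
# with that name exists in a standard bin directory under ROOT.
find_in_root() {
  root=$1
  bin=$2
  for dir in usr/local/sbin usr/local/bin usr/sbin usr/bin sbin bin; do
    if [ -x "$root/$dir/$bin" ]; then
      echo "/$dir/$bin"
      return 0
    fi
  done
  return 1
}

# Path taken from the issue; adjust if the driver root is mounted elsewhere.
driver_root=/run/nvidia/driver

if path=$(find_in_root "$driver_root" nvidia-smi); then
  echo "nvidia-smi is at $path inside $driver_root"
elif command -v nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi is missing from $driver_root but is on PATH at $(command -v nvidia-smi)"
else
  echo "nvidia-smi was not found in $driver_root or on PATH"
fi
```

On the node shown above, the second branch would be expected to fire, since /run/nvidia/driver holds only etc while which reports /usr/bin/nvidia-smi.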

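In short, /run/nvidia/driver contains only an etc directory and no nvidia-smi binary, while nvidia-smi is available at /usr/bin/nvidia-smi outside the chroot.

Could you please advise on this error?
Is this a misconfiguration on my side, or does the operator require the nvidia-smi binary to be at a specific location for benchmark pods, a requirement my setup does not meet?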
