Install NCCL Inspector and NCCL Inspector PreConf into images#2553

Open. dstaroff wants to merge 7 commits into main from SCHED-1197/0
@dstaroff dstaroff commented May 26, 2026

Problem

The NCCL Inspector and NCCL Inspector PreConf plugins need to be installed into the images so that they are easy to use.

  • NCCL Inspector is an NCCL plugin that can be used right away with any program that uses NCCL under the hood.
  • NCCL Inspector PreConf is a SPANK plugin that simplifies NCCL Inspector pre-configuration for every Slurm job. Although it is installed into the images, it still needs to be activated as a plugin in plugstack.conf (a matter for future PRs).
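
For context, activating a SPANK plugin generally comes down to adding one line to plugstack.conf. A minimal sketch of what that could look like, assuming a hypothetical helper name (`enable_spank_plugin`) and a hypothetical plugin path; neither comes from this PR, and the real location and any plugin arguments may differ:

```shell
# Sketch only: the helper name and any path passed to it are illustrative
# assumptions, not values taken from this PR.

# Append "optional <plugin>" to a plugstack.conf file unless the plugin is
# already listed, so the helper is safe to run more than once.
enable_spank_plugin() {
    local conf="$1" plugin="$2"
    touch "$conf"
    # -F: match as a fixed string, -q: no output, exit status only.
    if ! grep -Fq -- "$plugin" "$conf"; then
        printf 'optional %s\n' "$plugin" >> "$conf"
    fi
}
```

A call might look like `enable_spank_plugin /etc/slurm/plugstack.conf /usr/lib/slurm/spank_nccl_inspector_preconf.so` (both paths hypothetical). The `optional` keyword asks Slurm not to treat a failure to load the plugin as fatal, whereas `required` would.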

Tickets to cover

Solution

This PR installs the mentioned plugins from the base image root FS into the Jail, at the proper locations.
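
The copy step described above can be pictured roughly as follows. This is a sketch under assumptions: `copy_into_jail`, the two root directories, and the relative path are hypothetical names chosen for illustration, since this description does not show the PR's actual mechanism or target locations.

```shell
# Sketch only: function name and all paths are illustrative assumptions.

# Copy one file from a source root into a jail root, keeping its relative path.
copy_into_jail() {
    local src_root="$1" jail_root="$2" rel_path="$3"
    if [ ! -f "$src_root/$rel_path" ]; then
        echo "missing in source root: $src_root/$rel_path" >&2
        return 1
    fi
    # Create the parent directory inside the jail, then copy while
    # preserving mode and timestamps (-p).
    mkdir -p "$(dirname "$jail_root/$rel_path")"
    cp -p "$src_root/$rel_path" "$jail_root/$rel_path"
}
```

Keeping the relative path identical on both sides means anything that refers to the file by absolute path resolves the same way inside the jail.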

In addition to the Login and Worker images, it installs the plugin into the Slurm Check Job image and disables the NCCL Inspector PreConf plugin by default.

The change in base image versions also brings new versions of libnccl2 and libnccl-dev packages for better performance and stability.

Testing

Tested the presence of the mentioned plugins on a dev cluster.
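
A presence check of this kind could look like the sketch below. The function name `check_files_present` and any paths passed to it are assumptions for illustration; the description does not say how the check was actually performed.

```shell
# Sketch only: function name and example paths are illustrative assumptions.

# Return 0 only if every given path exists as a regular file;
# report each missing path on stderr.
check_files_present() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

On a node this might be called with the expected plugin locations as arguments. For the Debian package named in the diff, `dpkg -s spank-nccl-inspector-preconf` is another way to confirm that it is installed.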

Release Notes

Feature: Updated the libnccl2 and libnccl-dev packages to improve performance and stability. The update also brings NCCL P2P metrics collection support to NCCL Inspector.

Feature: Added NCCL Inspector profiler plugin for deep NCCL performance investigation.

Feature: Added NCCL Inspector PreConf SPANK plugin for simple pre-configuration of NCCL Inspector for all Slurm jobs.

@dstaroff dstaroff self-assigned this May 26, 2026
@dstaroff dstaroff added the docker, images, and feature labels May 26, 2026
@dstaroff dstaroff changed the title from "SCHED-1197: Use latest [slurm_]training_diag images with NCCL Inspector (+PreConf) support" to "Install NCCL Inspector and NCCL Inspector PreConf into images" May 28, 2026
@dstaroff dstaroff marked this pull request as ready for review May 28, 2026 11:25
set -e # Exit immediately if any command returns a non-zero error code

apt-get update
apt -y install --reinstall spank-nccl-inspector-preconf
Collaborator

But you have an Ansible role for that. Why are you using bash here?

rm /opt/bin/install_nccld_debug_plugin.sh

# Install NCCL Inspector PreConf SPANK plugin
COPY images/common/scripts/install_spank_nccl_inspector_preconf.sh /opt/bin/
Collaborator

Hm... but this package is already in the base image and installed with Ansible...
