Fix multi-node H100 CI: CUDA compat, deploy improvements #781
Merged
Binyang2014 merged 30 commits into main on Apr 14, 2026
Add vmssName pipeline parameter and generate config, hostfile, and hostfile_mpi dynamically. Update run_tests.sh to derive the head host from hostfile_mpi instead of hardcoding it. Delete the static deploy files that previously hardcoded mscclpp-h100-multinode-ci hostnames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction The multi-nodes-test pipeline was failing on H100 GPUs with CUDA error 803 (cudaErrorSystemDriverMismatch) because it still included the cuda11.8 Docker image in its matrix. All other H100 CI jobs (ut, integration-test, nccl-api-test) already use only cuda12.9. This aligns the multi-node config accordingly. Also adds gpuArch: '90' to the deploy template call for consistent H100 builds, and improves the peer-access-test Makefile to detect GPU compute capability via nvidia-smi instead of relying solely on -arch=native, which silently falls back to an old default architecture inside Docker containers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The host driver on the multi-node H100 VMs is CUDA 13.0 (driver 580.126.16), so the container image must match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace recursive parallel-scp with tar+scp+untar to avoid per-file SSH overhead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor Author
/azp run mscclpp-ut
Azure Pipelines successfully started running 1 pipeline(s).
Tar contents directly (-C ${ROOT_DIR} .) instead of the parent
directory, and extract into ${DST_DIR} explicitly. The previous
approach used dirname/basename which produced wrong directory names
(e.g., 's' from '/__w/1/s/') causing 'No such file or directory'
in the container.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
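The tar-based approach described above can be sketched as follows. This is a minimal local illustration, not the actual deploy.sh: the function name `pack_and_unpack` and the temporary tarball location are assumptions, and the scp step is shown only as a comment because it needs a remote host.

```shell
# Hypothetical helper: archive the *contents* of src_dir (not the directory
# itself) and extract them into dst_dir, so a trailing slash or an odd
# basename such as 's' never ends up in the destination path.
pack_and_unpack() {
  local src_dir="$1" dst_dir="$2" tarball
  tarball="$(mktemp)"
  # -C changes into src_dir first; '.' then archives everything inside it.
  tar -czf "${tarball}" -C "${src_dir}" .
  # In a real deploy the tarball would be copied here, for example:
  #   scp "${tarball}" "${host}:${tarball}"   # host is a placeholder
  mkdir -p "${dst_dir}"
  tar -xzf "${tarball}" -C "${dst_dir}"
  rm -f "${tarball}"
}
```

Because the archive holds paths relative to `src_dir`, the destination layout depends only on `dst_dir`, which is what avoids the stray `s` directory mentioned in the commit message.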
When a RegisteredMemory has both CudaIpc and IB transports, the import path was trying CudaIpc (PosixFd) even for cross-node memory. PosixFd uses unix domain sockets which are node-local, causing 'No such file or directory' crashes.
For cross-node memory:
- If Fabric is available, try it (works with IMEX daemon)
- If Fabric fails and IB is available, fall back to IB
- If neither works, throw a clear error
Same-host behavior is unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor Author
/azp run mscclpp-ut
Azure Pipelines successfully started running 1 pipeline(s).
The static config file was removed. Generate SSH config at runtime from the dynamically created hostfile_mpi. For single-node tests where hostfile_mpi doesn't exist, skip config generation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
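A rough sketch of runtime SSH config generation, under stated assumptions: the function name, the one-host-per-line reading of hostfile_mpi (first whitespace-separated token is the hostname), and the single `StrictHostKeyChecking no` option are all illustrative. The real setup.sh may emit different options.

```shell
# Hypothetical helper: write one "Host" stanza per entry in hostfile_mpi.
# If the hostfile does not exist (single-node tests), do nothing.
generate_ssh_config() {
  local hostfile="$1" out="$2" host rest
  if [ ! -f "${hostfile}" ]; then
    return 0
  fi
  : > "${out}"
  # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
  while read -r host rest || [ -n "${host}" ]; do
    if [ -n "${host}" ]; then
      printf 'Host %s\n  StrictHostKeyChecking no\n' "${host}" >> "${out}"
    fi
  done < "${hostfile}"
}
```

Skipping silently when the hostfile is absent mirrors the single-node behavior described in the commit message.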
Contributor Author
/azp run mscclpp-ut
Azure Pipelines successfully started running 1 pipeline(s).
Resolve HEAD_HOST to its eth0 IP address to ensure TcpBootstrap connects on the correct interface, fixing timeout in ResumeWithIpPortPair test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
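One possible way to do this resolution, sketched under assumptions: the interface name eth0 comes from the commit message, but the helper name `extract_ipv4` and the use of `ip -4 -o addr show` are illustrative choices and may differ from what run_tests.sh actually does.

```shell
# Hypothetical helper: read `ip -4 -o addr show dev <iface>` output on stdin
# and print the first IPv4 address without its prefix length.
extract_ipv4() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "inet") {
        split($(i + 1), parts, "/")
        print parts[1]
        exit
      }
    }
  }'
}

# Example use (not executed here; HEAD_HOST is assumed to be set already):
#   HEAD_IP="$(ssh "${HEAD_HOST}" ip -4 -o addr show dev eth0 | extract_ipv4)"
```

Passing an explicit IP rather than a hostname removes any ambiguity about which interface the bootstrap address resolves to.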
Add continueOnError parameter to run-remote-task template and set it for the perf test step. The step will show as failed but subsequent steps (unit tests, python tests, benchmark) will still run. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Check isSameHost first (the common/simpler path) before handling the cross-node Fabric fallback logic, improving readability. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor Author
/azp run mscclpp-ut
Azure Pipelines successfully started running 1 pipeline(s).
Contributor
Pull request overview
This PR updates the multi-node H100 CI/deploy flow to be less environment-specific and more robust across CUDA driver/toolkit mismatches, while also speeding up deployment.
Changes:
- Add a distinct exit code for CUDA init failure in `peer_access_test`, and retry with CUDA compat libs only when needed during remote setup.
- Remove hardcoded multi-node hostnames from tracked deploy files; generate deploy hostfiles/config dynamically in the pipeline and improve runtime GPU/baseline selection.
- Speed up remote deploy by switching from recursive `parallel-scp` to tar+scp+untar, and tighten cross-node CUDA IPC behavior to avoid non-functional handle types.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/peer-access-test/peer_access_test.cu | Adds exit code (2) for CUDA init failure to enable conditional compat retry. |
| test/deploy/setup.sh | Generates SSH config dynamically and retries peer-access test with compat libs on init failure. |
| test/deploy/run_tests.sh | Uses build/bin paths, resolves head node IP, selects perf baseline by GPU type, centralizes mpirun env/args. |
| test/deploy/perf_ndmv5.jsonl | Adds/extends H100 (NDmv5) perf baseline entries. |
| test/deploy/hostfile_mpi | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/hostfile | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/deploy.sh | Deploys source via tarball to reduce per-file SSH overhead. |
| test/deploy/config | Removes hardcoded SSH config from repo (now generated in pipeline/setup). |
| src/core/registered_memory.cc | Restricts cross-node CUDA IPC to Fabric handles and allows IB fallback behavior. |
| docker/build.sh | Removes CUDA compat LD_LIBRARY_PATH injection from image build. |
| .azure-pipelines/templates/run-remote-task.yml | Adds continueOnError parameter passthrough for remote tasks. |
| .azure-pipelines/multi-nodes-test.yml | Updates H100 multi-node CI settings, generates deploy files at runtime, adjusts pool/subscription/resource group. |
- Gate CUDA compat-lib retry on PLATFORM==cuda to avoid misleading errors on HIP
- Fix hostfile/hostfile_mpi leading whitespace from YAML indentation by using printf instead of echo
- Fix /etc/hosts duplicate check by iterating hostEntries per line instead of matching the entire multi-line string
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
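The two file-handling fixes in this commit can be illustrated roughly as below. Both function names and the argument layout are assumptions made for the example. The pipeline's real variable, hostEntries, is represented here by a plain multi-line string argument.

```shell
# Hypothetical helper: write one host per line. printf emits exactly its
# arguments, so indentation from the surrounding script never leaks in.
write_hostfile() {
  local out="$1"
  shift
  printf '%s\n' "$@" > "${out}"
}

# Hypothetical helper: append each entry to hosts_file only if that exact
# line is not already present, checking line by line rather than matching
# the whole multi-line string at once.
append_missing_hosts() {
  local hosts_file="$1" entries="$2" line
  printf '%s\n' "${entries}" | while IFS= read -r line; do
    if [ -n "${line}" ] && ! grep -qxF -- "${line}" "${hosts_file}" 2>/dev/null; then
      printf '%s\n' "${line}" >> "${hosts_file}"
    fi
  done
}
```

`grep -x -F` compares whole lines as fixed strings, so calling `append_missing_hosts` twice with the same entries leaves the file unchanged the second time.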
Add gdrdrv kernel module installation for CUDA VMs before Docker container launch. Skips if the module is already loaded. Applies to both single-node and multi-node CI pipelines. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chhwang requested changes on Apr 13, 2026
…back logic
- Extract duplicated create/map/get into importCudaIpc lambda
- Add comment explaining MNNVL failure as the caught error
- Document CudaIpc | IB fallback use case in comments
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor Author
/azp run mscclpp-ut
Azure Pipelines successfully started running 1 pipeline(s).
…host failures
- Remove hasFabric pre-check; let GpuIpcMem::create try all handle types
- Remove isSameHost branching for import; always try with IB fallback
- Catch BaseError to cover both Error and CudaError/CuError
- WARN on same-host CudaIpc failure (unexpected), INFO on cross-host (expected)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chhwang approved these changes on Apr 14, 2026
Summary
- `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemDriverMismatch` (error 803) when the host driver is newer than the container's compat libs.
- Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead.
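The conditional compat retry might look roughly like this. It is a sketch under assumptions, not the code in setup.sh: the function name, passing the compat directory as the first argument, and defaulting PLATFORM to cuda when unset are all illustrative. Only the exit code 2 convention and the PLATFORM==cuda gate come from the PR text.

```shell
# Hypothetical helper: run a command; if it exits with 2 (CUDA init failure)
# on a CUDA platform, retry once with the compat libs prepended to
# LD_LIBRARY_PATH. Any other exit code is returned unchanged.
run_with_compat_retry() {
  local compat_dir="$1" status=0
  shift
  "$@" || status=$?
  if [ "${status}" -eq 2 ] && [ "${PLATFORM:-cuda}" = "cuda" ]; then
    status=0
    env LD_LIBRARY_PATH="${compat_dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" \
      "$@" || status=$?
  fi
  return "${status}"
}

# Example use (paths are placeholders, not taken from the PR):
#   run_with_compat_retry /usr/local/cuda/compat ./peer_access_test
```

Keeping the compat path out of the default environment means a container whose libraries already match the host driver never loads them, which is the mismatch the summary describes.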