Fix multi-node H100 CI: CUDA compat, deploy improvements #781
Merged
30 commits
de8114e
update miltinode-pool
Binyang2014 77a30a8
update
Binyang2014 152eb4d
Remove hardcoded VMSS hostnames from deploy files
Binyang2014 7a0cf0d
Fix multi-node H100 CI: drop cuda11.8, add gpuArch, improve arch dete…
Binyang2014 545d367
Use cuda13.0 image for multi-node H100 CI
Binyang2014 6ca257d
testing
Binyang2014 a2ef206
fix CI
Binyang2014 d531e4d
Speed up deploy by archiving before scp
Binyang2014 98af765
Merge branch 'main' into binyli/multinode-ci
Binyang2014 db469b6
debug
Binyang2014 5926535
Fix tar extraction path in deploy.sh
Binyang2014 7f14ca2
update
Binyang2014 f8f8aff
Fix test binary paths: build/test/ -> build/bin/
Binyang2014 8c096b4
Add eth0 MPI TCP interface and deduplicate mpirun args
Binyang2014 22a20a4
Set MSCCLPP_SOCKET_IFNAME=eth0 for multi-node bootstrap
Binyang2014 724f889
Fix cross-node CudaIpc crash when Fabric/IMEX unavailable
Binyang2014 453e0ed
Generate SSH config dynamically from hostfile_mpi
Binyang2014 f138b13
Select perf baseline based on GPU type (H100 -> ndmv5)
Binyang2014 0ddd37f
Add H100 multi-node perf baselines to ndmv5
Binyang2014 5ad154a
Use eth0 IP for mp_unit_tests bootstrap endpoint
Binyang2014 50da168
Revert peer-access-test Makefile to use -arch=native
Binyang2014 feee30f
Allow RunMscclppTest to fail without blocking pipeline
Binyang2014 ff8d4b3
Reorder CudaIpc branch to check same-host before cross-node
Binyang2014 43ba04a
Address PR review comments for multi-node CI
Binyang2014 bf4f099
Merge branch 'main' into binyli/multinode-ci
Binyang2014 6dd4e5b
Install GDRCopy 2.5.2 kernel module on host VMs during deploy
Binyang2014 62b48a5
Refactor CudaIpc import to remove redundant patterns and clarify fall…
Binyang2014 4eac8d8
Merge branch 'main' into binyli/multinode-ci
Binyang2014 a2bcc15
Simplify CudaIpc fallback: remove hasFabric check, add WARN for same-…
Binyang2014 0b6f893
Merge branch 'main' into binyli/multinode-ci
Binyang2014
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,3 +1,10 @@ |
| | | {"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"} |
| | | {"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"} |
| | | {"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"} |
| | | {"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"} |
| | | {"name":"allgather", "kernel":2, "ranks":16,"ranksPerNode":8, "algBw":430.62,"busBw":403.70, "size":3221225472, "time":7480.40, "target":"throughput"} |
| | | {"name":"allreduce", "kernel":2, "ranks":16,"ranksPerNode":8, "algBw":0.54, "busBw":1.01, "size":8192, "time":15.10, "target":"latency"} |
| | | {"name":"allreduce", "kernel":3, "ranks":16,"ranksPerNode":8, "algBw":201.46,"busBw":377.74, "size":3221225472, "time":15989.38,"target":"throughput"} |
| | | {"name":"allreduce", "kernel":4, "ranks":16,"ranksPerNode":8, "algBw":118.49,"busBw":222.17, "size":25165824, "time":212.39, "target":"throughput"} |
| | | {"name":"allreduce", "kernel":4, "ranks":16,"ranksPerNode":8, "algBw":138.48,"busBw":259.65, "size":50331648, "time":363.40, "target":"throughput"} |
| | | {"name":"allreduce", "kernel":4, "ranks":16,"ranksPerNode":8, "algBw":166.72,"busBw":312.60, "size":3221225472, "time":19321.02,"target":"throughput"} |
| | | {"name":"alltoall", "kernel":0, "ranks":16,"ranksPerNode":8, "algBw":96.94, "busBw":90.88, "size":1073741824, "time":11076.24,"target":"throughput"} |
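The baseline entries above appear to be internally consistent: `algBw` matches `size / time`, and `busBw` matches `algBw` scaled by the per-collective factor used by nccl-tests. The sketch below checks that on three of the rows. It assumes `size` is in bytes, `time` in microseconds, and bandwidths in GB/s; those units are inferred from the numbers, not stated in the PR. The helper names `recompute` and `check` are hypothetical and not part of the repository.

```python
import json

# Bus-bandwidth factors following the nccl-tests convention.
# Assumption: the baseline file uses the same convention.
BUS_FACTOR = {
    "allreduce": lambda n: 2 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
}

# Three rows copied from the diff above.
SAMPLE = """\
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allgather", "kernel":2, "ranks":16,"ranksPerNode":8, "algBw":430.62,"busBw":403.70, "size":3221225472, "time":7480.40, "target":"throughput"}
{"name":"alltoall", "kernel":0, "ranks":16,"ranksPerNode":8, "algBw":96.94, "busBw":90.88, "size":1073741824, "time":11076.24,"target":"throughput"}
"""


def recompute(entry):
    """Derive (algBw, busBw) in GB/s from size (bytes) and time (us)."""
    alg_bw = entry["size"] / entry["time"] / 1e3  # bytes/us == MB/s, so /1e3 gives GB/s
    bus_bw = alg_bw * BUS_FACTOR[entry["name"]](entry["ranks"])
    return alg_bw, bus_bw


def check(text, rel_tol=0.01):
    """Return (name, size, ok) per line; ok means both values are within rel_tol."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        alg_bw, bus_bw = recompute(entry)
        ok = (
            abs(alg_bw - entry["algBw"]) <= rel_tol * entry["algBw"]
            and abs(bus_bw - entry["busBw"]) <= rel_tol * entry["busBw"]
        )
        results.append((entry["name"], entry["size"], ok))
    return results


if __name__ == "__main__":
    for name, size, ok in check(SAMPLE):
        print(name, size, "consistent" if ok else "MISMATCH")
```

A check like this could catch a baseline row that was pasted with a wrong `ranks` value or mismatched units, though the 1% tolerance is an arbitrary choice to absorb the two-decimal rounding in the file.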