Cluster Perf tests - EP benchmarking & RDMA Perf and Cluster Env mapping recommendation tool #734

Open
lcskrishna wants to merge 5 commits into AMD-AGI:main from lcskrishna:csrikris-cluster-tests

@lcskrishna

This PR adds the following perf tests and tools to Primus.
Here is a summary of what is added:

  • Large EP performance tests (MoRI-EP) - benchmark/kernel/ep_bench - Microbenchmarks for large-scale expert parallelism using MoRI.
  • RDMA perf (ib_write_bw) tests - benchmark/kernel/rdma_perf - Validate NIC performance between two nodes of a cluster.
  • Cluster RDMA env mapping tool - tools/cluster-rdma-env-recommender - Reports cluster details such as firmware, GID, RDMA -> NETDEV mapping, and NIC vendor, along with a few other recommendations.

Each of these perf tests and tools has its own README.

Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds cluster RDMA discovery + recommendation tooling and introduces Slurm+Docker launchers for RDMA and MoRI EP microbenchmarks.

Changes:

  • Add a CLI tool to map RDMA→PCI→NetDev, detect NIC vendor, and emit recommended Docker + NCCL/rocSHMEM env exports.
  • Add two-node RDMA perf test harness (ib_write_bw) with TCP startup barrier helpers and Slurm launch scripts.
  • Add MoRI EP bench Slurm launcher plus a “slim” Dockerfile to run intra-/inter-node MoRI microbenchmarks.
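
The RDMA→PCI→NetDev mapping in the first bullet can be illustrated with a minimal sketch that walks sysfs. This is not the PR's implementation: the function name `map_rdma_devices` and its `sysfs_root` parameter are hypothetical, and the sketch assumes the usual Linux layout in which `/sys/class/infiniband/<dev>/device` links to the PCI function and any associated netdevs appear under `device/net`.

```python
import os
from pathlib import Path


def map_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return {rdma_dev: {"pci": str or None, "netdevs": [str, ...]}} read from sysfs.

    Hypothetical helper for illustration; not taken from the PR.
    """
    mapping = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return mapping  # no RDMA devices are exposed on this host
    for dev in sorted(root.iterdir()):
        device_link = dev / "device"
        pci_addr = None
        if device_link.exists():
            # 'device' normally links to the PCI function's sysfs directory,
            # so the last path component is the PCI address.
            pci_addr = os.path.basename(os.path.realpath(device_link))
        net_dir = device_link / "net"
        # Some RDMA devices have no netdev; report an empty list in that case.
        netdevs = sorted(p.name for p in net_dir.iterdir()) if net_dir.is_dir() else []
        mapping[dev.name] = {"pci": pci_addr, "netdevs": netdevs}
    return mapping
```

Calling `map_rdma_devices()` on a node and printing each entry gives a quick cross-check against the tool's own report. Vendor detection and env recommendations are out of scope for this sketch.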

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| tools/cluster-rdma-env-recommender/cluster_rdma_env_recommender.py | Implements RDMA inventory + recommendation output (Docker command + env vars). |
| tools/cluster-rdma-env-recommender/README.md | Documents how to run the RDMA env recommender tool. |
| benchmark/kernel/rdma_perf/socket_wait.py | Adds a helper to wait on a remote TCP port state for coordination. |
| benchmark/kernel/rdma_perf/socket_barrier.py | Adds a TCP barrier used to synchronize container readiness across nodes. |
| benchmark/kernel/rdma_perf/run_slurm.sh | Adds Slurm launcher that runs perftest inside Docker on allocated nodes. |
| benchmark/kernel/rdma_perf/run_rdma_tests.sh | Adds in-container script that performs barrier + server/client ib_write_bw. |
| benchmark/kernel/rdma_perf/README.md | Documents how to use the RDMA perf tests and common troubleshooting. |
| benchmark/kernel/ep_bench/run_slurm.sh | Adds Slurm launcher for MoRI EP microbenchmarks inside Docker. |
| benchmark/kernel/ep_bench/run_mori_bench.sh | Adds in-container script to run MoRI intra-/inter-node microbenchmarks. |
| benchmark/kernel/ep_bench/docker/Dockerfile.mori | Adds a MoRI bench image recipe layered on vLLM ROCm base image. |
| benchmark/kernel/ep_bench/README.md | Documents building/running the MoRI EP-bench launcher and image options. |
Comments suppressed due to low confidence (1)

benchmark/kernel/rdma_perf/socket_barrier.py:1

  • Closing server_socket while the daemon thread is blocked in accept() will typically raise an OSError in the thread (and can print a stack trace). Wrap the accept() loop in a try/except for OSError and break cleanly when the socket is closed (or use a shutdown flag + timeout).
###############################################################################
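
The suggestion above (catch OSError in the accept loop, or use a shutdown flag plus timeout) could look roughly like the following. This is a sketch, not the code in socket_barrier.py: `start_barrier_server`, its parameters, and the 0.2 s timeout are assumptions made for illustration.

```python
import socket
import threading


def start_barrier_server(host="127.0.0.1", port=0, expected=1):
    """Accept up to `expected` connections in a daemon thread.

    Returns (server_socket, thread, stop_event, arrivals).
    Hypothetical helper for illustration; not the PR's implementation.
    """
    server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_socket.bind((host, port))
    server_socket.listen()
    server_socket.settimeout(0.2)  # lets the loop re-check the stop flag
    stop = threading.Event()
    arrivals = []

    def accept_loop():
        while not stop.is_set() and len(arrivals) < expected:
            try:
                conn, addr = server_socket.accept()
            except socket.timeout:
                continue  # periodic wake-up; check the stop flag again
            except OSError:
                break  # socket was closed underneath us; exit quietly
            with conn:
                arrivals.append(addr)

    thread = threading.Thread(target=accept_loop, daemon=True)
    thread.start()
    return server_socket, thread, stop, arrivals
```

To shut down, set the stop event, close the socket, then join the thread. The `socket.timeout` handler must come before the `OSError` handler because `socket.timeout` is a subclass of `OSError`.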

Comment on lines +44 to +45
cmd = "ip route show default | awk '{print $5}'"
out = subprocess.check_output(cmd, shell=True, text=True).strip()
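
The review text for these lines is not preserved here. One possible alternative, assuming the goal is only to find the default-route interface, is to run `ip` without a shell and parse its output in Python. The helper names `parse_default_iface` and `default_iface` are hypothetical. Looking for the token after `dev` rather than the fifth field also handles default routes that have no `via` gateway.

```python
import subprocess


def parse_default_iface(route_output):
    """Return the interface named after 'dev' on the first default route, or None."""
    for line in route_output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "default" and "dev" in tokens:
            idx = tokens.index("dev")
            if idx + 1 < len(tokens):
                return tokens[idx + 1]
    return None


def default_iface():
    """Run `ip route show default` without a shell and parse the result."""
    out = subprocess.check_output(["ip", "route", "show", "default"], text=True)
    return parse_default_iface(out)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on captured output from different clusters.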
Comment on lines +60 to +64
pci_updated = pci.replace("0000:", "")
out = subprocess.check_output(
    ["lspci", "-s", pci, "-nn"],
    text=True
).lower()
Comment on lines +227 to +231
print (f"{bnxt_rdma:>5}")
print (f"{rdmacm:>5}")
print (f"{ibverbs:>5}")
print (f"{libnl3:>5}")
print (f"{libnl3_router:>5}")
Comment on lines +310 to +313
print (f"{ionic_rdma:>5}")
print (f"{ionic_driver:>5}")
for so_file in ionic_so:
    print (f"{so_file:>5}")
Comment on lines +389 to +391
if (len(gid_indexes) > 1):
print (" \n WARNING: multiple GID indeces detected, please check detailed report for mapping the env variables.")
nccl_env_variables.append(f"export NCCL_IB_GID_INDEX={max(list(gid_indexes))}")
Comment on lines +433 to +434
parser.add_argument("--html", help="Generate HTML report", action="store_true")
args = parser.parse_args()
Comment on lines +81 to +91
if [[ "${NODE_RANK}" -eq 0 ]]; then
    echo "-------------------------------------------------" | tee -a "${LOG_FILE}"
    echo "[${HOST_NAME}:${HOST_IP}] Running ib_write_bw as SERVER" | tee -a "${LOG_FILE}"
    echo "-------------------------------------------------" | tee -a "${LOG_FILE}"

    ib_write_bw -d "${IBDEVICES}" -q 4 -a --report_gbits -F -p "${IB_WRITE_BW_PORT}" \
        2>&1 | tee -a "${LOG_FILE}"
else
    echo "-------------------------------------------------" | tee -a "${LOG_FILE}"
    echo "[${HOST_NAME}:${HOST_IP}] Running ib_write_bw as CLIENT against ${SERVER_IP}" | tee -a "${LOG_FILE}"
    echo "-------------------------------------------------" | tee -a "${LOG_FILE}"
Comment on lines +93 to +95
echo "[${HOST_NAME}] Waiting for server port to open..." | tee -a "${LOG_FILE}"
sleep 30
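
A fixed `sleep 30` can be replaced by polling with a deadline. The sketch below is a generic TCP readiness probe, not the PR's socket_wait.py; the name `wait_for_port` and its parameters are hypothetical. It assumes the server publishes a separate readiness port, because connecting to the ib_write_bw control port itself would likely be treated as the client connection and disturb the test.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connect to (host, port) succeeds.

    Returns True once a connection is accepted, or False after `timeout` seconds.
    Hypothetical helper for illustration; not the PR's implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is bounded by `interval` so the deadline is respected.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The caller can then fail fast with a clear log message when the function returns False, instead of launching the client against a server that never came up.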

Comment on lines +99 to +106
echo "[Node ${NODE_RANK}] Running MoRI INTERNODE dispatch/combine benchmark (v1, bf16)..."
torchrun --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --nproc_per_node=1 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    "${INTERNODE_SCRIPT}" --cmd bench \
    2>&1 | tee "${LOG_DIR}/mori_internode_v1_rank${NODE_RANK}.log"
Comment on lines +54 to +56
RUN git clone --recursive $(grep '^MORI_REPO:' versions.txt | cut -d' ' -f2) && \
    cd mori && \
    git checkout $(grep '^MORI_BRANCH:' /app/versions.txt | cut -d' ' -f2)
@lcskrishna
Author

cc: @alfuyao-amd
