Tolerate heterogeneous transport handles by rmahidhar · Pull Request #2719 · meta-pytorch/torchcomms

rmahidhar · 2026-05-28T17:06:44Z

Summary: Make Uniflow treat per-segment transport handles as capabilities instead of a batch schema. Segment import now keeps usable handles when one optional transport cannot be imported, preserves the first import error if no handle can be imported, and MultiTransport selects a single common transport across the whole batch. This is production behavior, not a benchmark workaround: for distributed KV-cache transfer, a peer may expose both NVLink and RDMA for most GPU cache segments while one process/topology can only import RDMA; the transfer should fall back to RDMA when it is common to every request and fail cleanly when no common transport exists. The RDMA DMA-BUF fallback comment also documents that DMA-BUF is the preferred GDR path while ibv_reg_mr remains the correctness fallback for valid VRAM allocations that cannot be exported as DMA-BUF.

Differential Revision: D106118519
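The import rule described in the summary can be illustrated with a small Python sketch. This is not the Uniflow implementation: `HandleImportError`, `import_segment`, and the `importers` mapping are hypothetical names introduced here, and the sketch only assumes that a segment carries one handle per advertised transport.

```python
from typing import Callable, Dict, Optional


class HandleImportError(Exception):
    """Hypothetical error: a transport handle cannot be imported locally."""


def import_segment(
    handles: Dict[str, bytes],
    importers: Dict[str, Callable[[bytes], object]],
) -> Dict[str, object]:
    """Import every handle this process can use, keyed by transport name.

    A failed import of one transport is tolerated as long as at least one
    other handle imports. If nothing imports, the first error is raised so
    the root cause is not masked by later failures.
    """
    imported: Dict[str, object] = {}
    first_error: Optional[Exception] = None
    for name, payload in handles.items():
        try:
            if name not in importers:
                raise HandleImportError(f"no local support for transport {name!r}")
            imported[name] = importers[name](payload)
        except HandleImportError as err:
            # Remember only the first failure; keep trying the other handles.
            if first_error is None:
                first_error = err
    if not imported:
        raise first_error or HandleImportError("segment carries no transport handles")
    return imported
```

With an NVLink importer that fails and an RDMA importer that succeeds, the call returns only the RDMA entry. If every importer fails, the caller sees the error from the first handle that was tried.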

Summary: Adds a unified multi-architecture `uniflow_disagg_bench_mast` fbpkg builder target for deploying the disaggregated benchmark to MAST on H100/x86_64 and GB200/aarch64 platforms. The aarch64 variant uses CUDA `13.0`, the x86_64 variant uses CUDA `12.8`, and the target leaves `hpc_comms.use_nccl` at its platform default instead of overriding it. Updates the UniFlow integration test to create agents on the main thread.

Reviewed By: saifhhasan

Differential Revision: D105276382

Summary: Fix UniFlow topology discovery to respect CUDA-visible GPUs and tolerate development RDMA topologies. GPU discovery now enumerates `CudaApi::getDeviceCount()` instead of NVML physical device count, resolves each CUDA-visible GPU to its NVML handle by normalized PCI bus ID, and treats NVML enrichment as best-effort so a CUDA-visible GPU is not dropped just because NVML link metadata is unavailable. The same topology path now also handles virtual RDMA devices such as RXE by recognizing `/sys/devices/virtual/` IB devices, representing them without PCI ancestry, and using the reported RDMA port speed as their CPU-link bandwidth with a conservative 10 Gbps fallback when ibverbs reports zero. This keeps production physical NIC/GPU behavior unchanged while allowing constrained `CUDA_VISIBLE_DEVICES` and software-RDMA/dev-test environments to build a usable topology instead of failing discovery.

Reviewed By: saifhhasan

Differential Revision: D104611386
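The discovery rules in this commit message can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual code: all function names are hypothetical, the bus ID strings stand in for whatever CUDA and NVML report, and the normalization shown (lowercase, four-digit domain) is just one plausible canonical form.

```python
from typing import Dict, List, Optional

FALLBACK_GBPS = 10.0  # conservative default named in the commit message


def normalize_pci_bus_id(bus_id: str) -> str:
    """Canonicalize 'domain:bus:device.function' so IDs compare equal
    regardless of letter case or domain width."""
    domain, bus, dev_fn = bus_id.strip().lower().split(":")
    device, function = dev_fn.split(".")
    return (
        f"{int(domain, 16):04x}:{int(bus, 16):02x}:"
        f"{int(device, 16):02x}.{int(function, 16):x}"
    )


def match_cuda_to_nvml(
    cuda_bus_ids: List[str], nvml_bus_ids: List[str]
) -> Dict[int, Optional[int]]:
    """Map each CUDA-visible device index to an NVML index, or None.

    A None entry models best-effort enrichment: the GPU stays in the
    topology even though no NVML metadata was found for it.
    """
    nvml_index = {normalize_pci_bus_id(b): i for i, b in enumerate(nvml_bus_ids)}
    return {
        i: nvml_index.get(normalize_pci_bus_id(b))
        for i, b in enumerate(cuda_bus_ids)
    }


def is_virtual_ib_device(resolved_sysfs_path: str) -> bool:
    """Virtual devices (for example RXE) have no PCI ancestry in sysfs."""
    return resolved_sysfs_path.startswith("/sys/devices/virtual/")


def cpu_link_bandwidth_gbps(reported_port_gbps: float) -> float:
    """Use the reported port speed, falling back when it is zero."""
    return reported_port_gbps if reported_port_gbps > 0 else FALLBACK_GBPS
```

Iterating over the CUDA-visible list rather than the NVML list is what keeps the result consistent with `CUDA_VISIBLE_DEVICES`: devices hidden from CUDA never enter the mapping, and devices NVML cannot describe are kept with a `None` entry.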

Summary: Make Uniflow treat per-segment transport handles as capabilities instead of a batch schema. Segment import now keeps usable handles when one optional transport cannot be imported, preserves the first import error if no handle can be imported, and `MultiTransport` selects a single common transport across the whole batch. This is production behavior, not a benchmark workaround: for distributed KV-cache transfer, a peer may expose both `NVLink` and `RDMA` for most GPU cache segments while one process/topology can only import `RDMA`; the transfer should fall back to `RDMA` when it is common to every request and fail cleanly when no common transport exists. The RDMA DMA-BUF fallback comment also documents that DMA-BUF is the preferred GDR path while `ibv_reg_mr` remains the correctness fallback for valid VRAM allocations that cannot be exported as DMA-BUF.

Differential Revision: D106118519
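The batch-level selection rule can be sketched as a set intersection. This Python sketch is not `MultiTransport` itself: `select_common_transport`, `NoCommonTransportError`, and the preference order (NVLink before RDMA) are assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple

# Assumed preference order; the real ordering is not stated in the PR text.
PREFERENCE: Tuple[str, ...] = ("nvlink", "rdma")


class NoCommonTransportError(Exception):
    """Hypothetical error: no single transport serves every request."""


def select_common_transport(
    batch: Sequence[Set[str]],
    preference: Sequence[str] = PREFERENCE,
) -> str:
    """Pick one transport usable by every request in the batch.

    Each element of `batch` is the set of transports that imported
    successfully for one request's segments.
    """
    if not batch:
        raise ValueError("batch must contain at least one request")
    common = set.intersection(*(set(request) for request in batch))
    for name in preference:
        if name in common:
            return name
    raise NoCommonTransportError(
        f"no transport is common to all {len(batch)} requests"
    )
```

For a batch where most requests offer both transports but one offers only RDMA, the intersection is `{"rdma"}` and the whole batch uses RDMA. If one request offers only NVLink and another only RDMA, the intersection is empty and the call fails instead of splitting the batch across transports.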

meta-codesync · 2026-05-28T17:06:56Z

@rmahidhar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106118519.

Mahidhar Ramesh Rajala added 3 commits May 28, 2026 10:06

meta-cla Bot added the CLA Signed label May 28, 2026

meta-codesync Bot added the fb-exported and meta-exported labels May 28, 2026

Tolerate heterogeneous transport handles #2719
rmahidhar wants to merge 3 commits into meta-pytorch:main from rmahidhar:export-D106118519
