Tolerate heterogeneous transport handles #2719
Open
rmahidhar wants to merge 3 commits into
added 3 commits
May 28, 2026 10:06
Summary: Adds a unified multi-architecture `uniflow_disagg_bench_mast` fbpkg builder target for deploying the disaggregated benchmark to MAST on H100/x86_64 and GB200/aarch64 platforms. The aarch64 variant uses CUDA `13.0`, the x86_64 variant uses CUDA `12.8`, and the target leaves `hpc_comms.use_nccl` at its platform default instead of overriding it. Updates the UniFlow integration test to create agents on the main thread.
Reviewed By: saifhhasan
Differential Revision: D105276382
Summary: Fix UniFlow topology discovery to respect CUDA-visible GPUs and tolerate development RDMA topologies. GPU discovery now enumerates `CudaApi::getDeviceCount()` instead of NVML physical device count, resolves each CUDA-visible GPU to its NVML handle by normalized PCI bus ID, and treats NVML enrichment as best-effort so a CUDA-visible GPU is not dropped just because NVML link metadata is unavailable. The same topology path now also handles virtual RDMA devices such as RXE by recognizing `/sys/devices/virtual/` IB devices, representing them without PCI ancestry, and using the reported RDMA port speed as their CPU-link bandwidth with a conservative 10 Gbps fallback when ibverbs reports zero. This keeps production physical NIC/GPU behavior unchanged while allowing constrained `CUDA_VISIBLE_DEVICES` and software-RDMA/dev-test environments to build a usable topology instead of failing discovery.
Reviewed By: saifhhasan
Differential Revision: D104611386
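A minimal sketch of the two rules described in this commit, not the code from the diff. All names here (`normalizePciBusId`, `isVirtualIbDevice`, `virtualNicLinkGbps`, `kFallbackGbps`) are hypothetical. It assumes the normalized form is lowercase `dddd:bb:dd.f` with a 4-digit domain, which only holds while the PCI domain fits in 16 bits.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: canonicalize "domain:bus:device.function" so that bus
// IDs reported with different letter case or domain width compare equal.
// Assumption: the domain fits in 4 hex digits, so wider domains are truncated.
std::string normalizePciBusId(std::string busId) {
  std::transform(busId.begin(), busId.end(), busId.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  const auto colons = std::count(busId.begin(), busId.end(), ':');
  if (colons == 1) {
    return "0000:" + busId;  // no domain given; assume domain 0
  }
  if (colons != 2) {
    return busId;  // unrecognized shape; leave it lowercased only
  }
  const auto firstColon = busId.find(':');
  std::string domain = busId.substr(0, firstColon);
  if (domain.size() > 4) {
    domain = domain.substr(domain.size() - 4);  // keep the low 4 digits
  } else {
    domain = std::string(4 - domain.size(), '0') + domain;  // left-pad
  }
  return domain + busId.substr(firstColon);
}

// Hypothetical helper: a device whose resolved sysfs path sits under
// /sys/devices/virtual/ is treated as virtual (no PCI ancestry).
bool isVirtualIbDevice(const std::string& resolvedSysfsPath) {
  return resolvedSysfsPath.rfind("/sys/devices/virtual/", 0) == 0;
}

// Conservative fallback from the commit summary, used when the reported
// port speed is zero.
constexpr double kFallbackGbps = 10.0;

double virtualNicLinkGbps(double reportedGbps) {
  return reportedGbps > 0.0 ? reportedGbps : kFallbackGbps;
}
```

With these helpers, a lookup table keyed by `normalizePciBusId(...)` of each NVML device lets a CUDA-visible GPU find its NVML handle. A missing entry can then be treated as "no enrichment available" rather than as a discovery failure.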
Summary: Make Uniflow treat per-segment transport handles as capabilities instead of a batch schema. Segment import now keeps usable handles when one optional transport cannot be imported, preserves the first import error if no handle can be imported, and `MultiTransport` selects a single common transport across the whole batch. This is production behavior, not a benchmark workaround: for distributed KV-cache transfer, a peer may expose both `NVLink` and `RDMA` for most GPU cache segments while one process/topology can only import `RDMA`; the transfer should fall back to `RDMA` when it is common to every request and fail cleanly when no common transport exists. The RDMA DMA-BUF fallback comment also documents that DMA-BUF is the preferred GDR path while `ibv_reg_mr` remains the correctness fallback for valid VRAM allocations that cannot be exported as DMA-BUF.
Differential Revision: D106118519
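The import and selection rules above can be sketched roughly as follows. This is an illustration under assumptions, not the `MultiTransport` implementation: `TransportKind`, `HandleAttempt`, `ImportOutcome`, `importSegment`, and `selectCommonTransport` are hypothetical names. The preference order (NVLink before RDMA) is inferred from the "fall back to `RDMA`" wording rather than stated in the diff. It requires C++17.

```cpp
#include <algorithm>
#include <initializer_list>
#include <optional>
#include <string>
#include <vector>

enum class TransportKind { NVLink, RDMA };

// Result of trying to import one transport handle of a segment.
// `error` is empty when the import succeeded.
struct HandleAttempt {
  TransportKind kind;
  std::optional<std::string> error;
};

struct ImportOutcome {
  std::vector<TransportKind> usable;      // handles that imported cleanly
  std::optional<std::string> firstError;  // surfaced only if `usable` is empty
};

// Keep every usable handle; remember only the first failure so the caller
// can report it when nothing could be imported.
ImportOutcome importSegment(const std::vector<HandleAttempt>& attempts) {
  ImportOutcome out;
  for (const auto& attempt : attempts) {
    if (!attempt.error) {
      out.usable.push_back(attempt.kind);
    } else if (!out.firstError) {
      out.firstError = attempt.error;
    }
  }
  return out;
}

// Pick one transport that every request in the batch supports, trying
// candidates in an assumed preference order. Returns nullopt when the batch
// is empty or no transport is common to all requests.
std::optional<TransportKind> selectCommonTransport(
    const std::vector<std::vector<TransportKind>>& perRequest) {
  if (perRequest.empty()) {
    return std::nullopt;
  }
  for (TransportKind candidate : {TransportKind::NVLink, TransportKind::RDMA}) {
    const bool everywhere = std::all_of(
        perRequest.begin(), perRequest.end(), [candidate](const auto& caps) {
          return std::find(caps.begin(), caps.end(), candidate) != caps.end();
        });
    if (everywhere) {
      return candidate;
    }
  }
  return std::nullopt;
}
```

In the KV-cache case from the summary, a batch where most segments offer `{NVLink, RDMA}` and one offers only `{RDMA}` resolves to `RDMA`. A batch where one segment offers only `NVLink` and another only `RDMA` yields no selection, and the caller can fail the transfer with a clear error.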
@rmahidhar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106118519.