Fuse self-copy into AG send for Ring AllGather and AllReduce by Scusemua · Pull Request #2706 · meta-pytorch/torchcomms

Scusemua · 2026-05-27T17:18:28Z

Summary:
Fuses the AllGather self-copy into the first RDMA send using a dual-destination MemcpyAndSelfCopy CopyOp. Instead of copying own shard to recvbuf[self] and then re-reading it for RDMA staging (2 reads + 2 writes per byte), the fused path reads the source once and writes to both staging and recvbuf[self] simultaneously (1 read + 2 writes). This optimization was discovered by Subodh and Santosh on GB200, where the extra read pass showed up as a major bottleneck with dual-NIC configurations.
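The read/write accounting above can be checked with a small host-side model. This is only a sketch: the real `MemcpyAndSelfCopy` is a device-side CopyOp whose signature is not shown in this PR description, so the function names, the `AccessCount` struct, and the use of `std::vector<float>` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts element-level memory accesses so the two
// paths can be compared. Not part of the Pipes library.
struct AccessCount {
  std::size_t reads = 0;
  std::size_t writes = 0;
};

// Unfused path: copy the own shard into recvbuf[self], then re-read that
// copy to fill the RDMA staging buffer (2 reads + 2 writes per element).
inline AccessCount self_copy_then_stage(const std::vector<float>& src,
                                        std::vector<float>& recv_self,
                                        std::vector<float>& staging) {
  assert(recv_self.size() >= src.size() && staging.size() >= src.size());
  AccessCount n;
  for (std::size_t i = 0; i < src.size(); ++i) {
    recv_self[i] = src[i];  // read src, write recvbuf[self]
    n.reads += 1;
    n.writes += 1;
  }
  for (std::size_t i = 0; i < src.size(); ++i) {
    staging[i] = recv_self[i];  // re-read the copy, write staging
    n.reads += 1;
    n.writes += 1;
  }
  return n;
}

// Fused path: read each source element once and write it to both
// destinations (1 read + 2 writes per element).
inline AccessCount fused_self_copy_and_stage(const std::vector<float>& src,
                                             std::vector<float>& recv_self,
                                             std::vector<float>& staging) {
  assert(recv_self.size() >= src.size() && staging.size() >= src.size());
  AccessCount n;
  for (std::size_t i = 0; i < src.size(); ++i) {
    const float v = src[i];  // single read of the source
    staging[i] = v;          // destination 1: RDMA staging
    recv_self[i] = v;        // destination 2: recvbuf[self]
    n.reads += 1;
    n.writes += 2;
  }
  return n;
}
```

Both functions leave identical data in the two destination buffers; only the number of reads differs, which is the pass the fused path removes.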

Changes:

- Promotes `MemcpyAndSelfCopy` from `DirectNvl.cu`'s anonymous namespace to `CopyOp.cuh` as a shared, reusable CopyOp
- Applies the fused self-copy to the AG phase of `RingAllReduce.cu` (both unidirectional and bidir paths)
- Applies the same optimization to standalone `RingAllgather.cu`
- Removes the local duplicate from `DirectNvl.cu`

For bidir AG, both forward and reverse sends read from the original source (own_src + off), avoiding a data dependency on the fused write. The ring_group.sync() barrier between RS and AG phases is preserved.

The optimization is invisible on H100 (single NIC) but provides ~7.5% improvement on GB200 at 1GB (D104725448 benchmarks: 115.73 → 124.36 GB/s).

Based on D104725448 (Subodh).

Differential Revision: D104868628

Summary: Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout. Differential Revision: D104701021
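To make the "Option B" layout concrete, here is a host-side sketch under stated assumptions. It models only where data lands, not the ring schedule or the kernels themselves: `reduce_scatter_in_place` and `allgather_in_place` are hypothetical names, and offsets are in elements rather than bytes for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;

// Phase 1 (ReduceScatter, modeled without the ring schedule): rank r ends up
// with the sum of every rank's chunk r, written straight into its own
// recvbuf at offset r * chunk_elems. No scratch buffer is involved.
inline void reduce_scatter_in_place(const std::vector<Buffer>& sendbufs,
                                    std::vector<Buffer>& recvbufs,
                                    std::size_t chunk_elems) {
  const std::size_t world = sendbufs.size();
  assert(recvbufs.size() == world);
  for (std::size_t r = 0; r < world; ++r) {
    const std::size_t base = r * chunk_elems;
    for (std::size_t i = 0; i < chunk_elems; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < world; ++p) {
        acc += sendbufs[p][base + i];
      }
      recvbufs[r][base + i] = acc;
    }
  }
}

// Phase 2 (AllGather): each reduced shard is read from the same position
// the ReduceScatter phase wrote it to, and lands at that offset on every rank.
inline void allgather_in_place(std::vector<Buffer>& recvbufs,
                               std::size_t chunk_elems) {
  const std::size_t world = recvbufs.size();
  for (std::size_t owner = 0; owner < world; ++owner) {
    const std::size_t base = owner * chunk_elems;
    for (std::size_t dst = 0; dst < world; ++dst) {
      if (dst == owner) continue;  // owner already holds its shard in place
      for (std::size_t i = 0; i < chunk_elems; ++i) {
        recvbufs[dst][base + i] = recvbufs[owner][base + i];
      }
    }
  }
}
```

Running the two phases back to back on buffers of `world * chunk_elems` elements leaves every rank's `recvbuf` holding the full reduced result, which is the property the scratch-free layout relies on.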

Summary: Wire Pipes-based Ring AllReduce into the MCCL/Ctran dispatch as a new `pipesring` algorithm. Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol. Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology). Differential Revision: D104701020

Summary: Pull Request resolved: meta-pytorch#2701 Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size). Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value. Differential Revision: D104703792

…h#2702) Summary: Pull Request resolved: meta-pytorch#2702 Replace the two-kernel AllReduce composition (separate RS + AG launches) with a single fused `ring_allreduce_kernel` that executes both ReduceScatter and AllGather phases within one kernel invocation. This eliminates the ~5-15us inter-kernel gap and shares setup code (thread group, ring partition, `TiledBuffer`, `pipeline_window`, stride). Transport step-state counters persist naturally within a single kernel; no barrier needed between phases. Also add Ctran ctring AllReduce as a third comparison arm in the benchmark, showing Pipes Ring vs Ctran Ring (GPE) vs NCCL (auto-select) to measure the speedup over the current MCCL production AllReduce algorithm. Differential Revision: D104709661

Summary: Switch the fused Ring AllReduce kernel's ReduceScatter phase from `TileReduce` to `TileReduceStaged` (introduced in D104600612). `TileReduceStaged` decouples the two memory load streams (staging buffer from IB + local GPU memory) from the accumulation step, giving the compiler more freedom to schedule loads in parallel. Measured +2-3% bandwidth improvement at 32-128MB for standalone `ReduceScatter`. Differential Revision: D104758065

Summary: Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (W-1 steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from W-1 to ceil((W-1)/2) steps. For W=8, this cuts AG from 7 to 4 steps (1.75x), giving ~1.27x overall AllReduce speedup at large message sizes.

The implementation exploits a key property of Pipes' IBGDA transport: each P2pIbgdaTransportDevice has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (prev's send, next's recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes. This is ~30 lines of kernel code vs ~1000 lines in Ctran's equivalent implementation.

**Key design decisions:**

- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- `nRanks > 2` guard in production dispatch: `W=2` is degenerate (`prev==next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` with min-size semantics (`>= threshold`), auto-tune default at 16MB

Differential Revision: D104758061
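The step counts quoted above (W-1 versus ceil((W-1)/2)) can be reproduced with a small host-side simulation. This is a sketch of the general bi-directional ring schedule, not the kernel's code: the function names are hypothetical, and shard ownership is tracked with booleans instead of moving real data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Steps for a one-directional ring AllGather over `world` ranks.
inline std::size_t unidir_ag_steps(std::size_t world) {
  assert(world >= 1);
  return world - 1;
}

// ceil((W-1)/2), which equals W/2 under integer division for W >= 1.
inline std::size_t bidir_ag_steps(std::size_t world) {
  assert(world >= 1);
  return world / 2;
}

// Simulates a ring AllGather that forwards shards in both directions at
// once and returns the number of steps until every rank holds every shard.
inline std::size_t simulate_bidir_allgather(std::size_t world) {
  assert(world >= 1);
  std::vector<std::vector<bool>> has(world, std::vector<bool>(world, false));
  for (std::size_t r = 0; r < world; ++r) has[r][r] = true;

  auto complete = [&has]() {
    for (const auto& row : has)
      for (bool v : row)
        if (!v) return false;
    return true;
  };

  std::size_t steps = 0;
  while (!complete() && steps < world) {
    ++steps;
    const auto before = has;  // snapshot: sends use pre-step state only
    for (std::size_t r = 0; r < world; ++r) {
      const std::size_t prev = (r + world - 1) % world;
      const std::size_t next = (r + 1) % world;
      // Clockwise: shard (r - steps) arrives from prev.
      const std::size_t fwd = (r + world - (steps % world)) % world;
      // Counter-clockwise: shard (r + steps) arrives from next.
      const std::size_t rev = (r + steps) % world;
      if (before[prev][fwd]) has[r][fwd] = true;
      if (before[next][rev]) has[r][rev] = true;
    }
  }
  return steps;
}
```

For `world == 2`, `prev` and `next` are the same rank, so both directions deliver the same shard in the single step; this is the degenerate case the `nRanks > 2` guard in the commit message refers to.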

Summary: Fuses the AllGather self-copy into the first RDMA send using a dual-destination `MemcpyAndSelfCopy` CopyOp. Instead of copying own shard to `recvbuf[self]` and then re-reading it for RDMA staging (2 reads + 2 writes per byte), the fused path reads the source once and writes to both staging and `recvbuf[self]` simultaneously (1 read + 2 writes). This optimization was discovered by Subodh and Santosh on GB200, where the extra read pass showed up as a major bottleneck with dual-NIC configurations.

**Changes:**

- Promotes `MemcpyAndSelfCopy` from `DirectNvl.cu`'s anonymous namespace to `CopyOp.cuh` as a shared, reusable CopyOp
- Applies the fused self-copy to the AG phase of `RingAllReduce.cu` (both unidirectional and bidir paths)
- Applies the same optimization to standalone `RingAllgather.cu`
- Removes the local duplicate from `DirectNvl.cu`

For bidir AG, both forward and reverse sends read from the original source (`own_src + off`), avoiding a data dependency on the fused write. The `ring_group.sync()` barrier between RS and AG phases is preserved.

The optimization is invisible on H100 (single NIC) but provides ~7.5% improvement on GB200 at 1GB (D104725448 benchmarks: 115.73 → 124.36 GB/s).

Based on D104725448 (Subodh).

Differential Revision: D104868628

meta-codesync · 2026-05-27T17:18:51Z

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104868628.

Ben Carver and others added 7 commits May 26, 2026 11:51

meta-cla Bot added the CLA Signed label May 27, 2026

meta-codesync Bot added fb-exported meta-exported labels May 27, 2026

Scusemua wants to merge 7 commits into meta-pytorch:main from Scusemua:export-D104868628
