Bi-directional AllGather in fused Ring AllReduce (#2705) by Scusemua · Pull Request #2705 · meta-pytorch/torchcomms

Scusemua · 2026-05-27T17:05:36Z

Summary:

Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (W-1 steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from W-1 to ceil((W-1)/2) steps. For W=8, this cuts AG from 7 to 4 steps (1.75x).
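The step-count claim can be checked with a small host-side model. This is a sketch, not the kernel: it assumes that at step `s` each rank receives, from its two neighbors, the shards that originated `s` hops away in each direction. All function names here are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Closed-form AllGather step counts for a ring of W ranks.
int ag_steps_unidir(int world_size) {
  assert(world_size >= 1);
  return world_size - 1;
}

int ag_steps_bidir(int world_size) {
  assert(world_size >= 1);
  return world_size / 2;  // ceil((W-1)/2) == floor(W/2) for W >= 1
}

// Host-side model (an assumption, not the kernel): at step s, rank r receives
// the shard owned by rank r-s from one neighbor and the shard owned by rank
// r+s from the other. Returns the steps needed until every rank holds all
// W shards.
int simulate_bidir_ag(int world_size) {
  assert(world_size >= 1);
  const int w = world_size;
  std::vector<std::vector<bool>> have(w, std::vector<bool>(w, false));
  for (int r = 0; r < w; ++r) {
    have[r][r] = true;  // each rank owns its reduced shard after ReduceScatter
  }

  auto all_gathered = [&]() {
    for (const auto& row : have) {
      for (bool b : row) {
        if (!b) return false;
      }
    }
    return true;
  };

  int steps = 0;
  while (!all_gathered()) {
    ++steps;
    for (int r = 0; r < w; ++r) {
      have[r][((r - steps) % w + w) % w] = true;  // arrives in one direction
      have[r][(r + steps) % w] = true;            // arrives in the other
    }
  }
  return steps;
}
```

For `W=8` the model finishes in 4 steps against 7 for the unidirectional ring, matching the figures quoted above.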

The implementation exploits a key property of Pipes' IBGDA transport: each P2pIbgdaTransportDevice has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (prev's send, next's recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes.

Key design decisions:

- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- Launcher silently falls back to unidirectional when `num_ranks <= 2` (degenerate case where `prev == next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` defaults to `-1` (disabled in production). Set to `0` to enable for all sizes, `-2` for auto-tune at >= 16MB, or a positive value for a custom threshold
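The design decisions above can be sketched as a runtime decision that selects a compile-time specialization. This is a minimal illustration, not the actual launcher: `use_bidir_ag`, `ag_phase_steps`, and `dispatch_ag_steps` are hypothetical names, 16MB is assumed to mean 16 MiB, and negative values other than `-1` and `-2` are assumed to mean disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::int64_t kAutoTune = -2;
// Assumption: "16MB" in the description means 16 MiB.
constexpr std::size_t kAutoTuneMinBytes = 16u * 1024u * 1024u;

// Decodes the min-size CVAR value into an enable/disable decision.
bool use_bidir_ag(std::int64_t min_size_cvar, std::size_t msg_bytes,
                  int num_ranks) {
  if (num_ranks <= 2) return false;  // prev == next: fall back to unidirectional
  if (min_size_cvar == kAutoTune) return msg_bytes >= kAutoTuneMinBytes;
  if (min_size_cvar < 0) return false;  // -1 disabled; other negatives assumed disabled
  return msg_bytes >= static_cast<std::size_t>(min_size_cvar);  // 0 enables all sizes
}

// Compile-time switch: the branch not taken is discarded, so the disabled
// specialization carries no bidirectional code.
template <bool EnableBidirAg>
int ag_phase_steps(int num_ranks) {
  assert(num_ranks >= 1);
  if constexpr (EnableBidirAg) {
    return num_ranks / 2;  // ceil((W-1)/2)
  } else {
    return num_ranks - 1;
  }
}

// Runtime decision picks one of the two compiled specializations.
int dispatch_ag_steps(std::int64_t min_size_cvar, std::size_t msg_bytes,
                      int num_ranks) {
  return use_bidir_ag(min_size_cvar, msg_bytes, num_ranks)
             ? ag_phase_steps<true>(num_ranks)
             : ag_phase_steps<false>(num_ranks);
}
```

Keeping the threshold check in the launcher and the direction choice in a template parameter means the kernel itself never branches on the CVAR.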

Current status: BiDir AG is functionally correct but does not yet improve performance on H100 IB (benchmark shows ~3-9% regression vs unidirectional at 256-512MB). It is disabled by default and available for future tuning on different hardware/topologies where the step-count reduction may outweigh the per-step overhead increase.

Tests:

- `Ring1Bidir` / `Ring2Bidir` suites: mirror existing Ring1/Ring2 configs with `enable_bidir_ag = true`
- `MultiInvocationCorrectness` test: verifies step-state accumulation across 5 invocations
- `TailCorrectness` test: verifies non-divisible element counts with remainder fallback
- `ring_allreduce_2gpu_test` BUCK target at ppn:2: exercises W=2 edge case where launcher falls back to unidirectional
- `make_bidir_rings()` unit tests: validates {forward, reverse} ring topology utility
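To illustrate what a {forward, reverse} ring pair looks like, here is a sketch under stated assumptions. The real `make_bidir_rings()` signature is not shown in this PR description, so `make_bidir_rings_sketch` and `next_in_ring` are hypothetical stand-ins, and the reverse ring is assumed to be the forward rank order traversed backwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Ring = std::vector<int>;

// Hypothetical stand-in: returns {forward, reverse} rank orderings.
std::pair<Ring, Ring> make_bidir_rings_sketch(int num_ranks) {
  assert(num_ranks >= 1);
  Ring forward(num_ranks);
  for (int i = 0; i < num_ranks; ++i) {
    forward[i] = i;
  }
  Ring reverse(forward.rbegin(), forward.rend());
  return {forward, reverse};
}

// Successor of `rank` when walking `ring` in order, wrapping at the end.
int next_in_ring(const Ring& ring, int rank) {
  auto it = std::find(ring.begin(), ring.end(), rank);
  assert(it != ring.end());
  const std::size_t pos = static_cast<std::size_t>(it - ring.begin());
  return ring[(pos + 1) % ring.size()];
}
```

Under this assumption, a rank's successor in the reverse ring is its predecessor in the forward ring, which is the property a two-ring `{fwd,rev}` configuration relies on.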

Benchmarks:

- `_bd` configs: interleaved bidir (1 ring, `EnableBidirAg=true`)
- `_rv` configs: reverse ring (2 rings `{fwd,rev}`, `EnableBidirAg=false`)
- Existing `_2R` configs (coprime strides) serve as multi-ring baseline

Differential Revision: D104758061

meta-codesync · 2026-05-27T17:05:45Z

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104758061.

Summary: Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (`W-1` steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from `W-1` to `ceil((W-1)/2)` steps. For `W=8`, this cuts AG from 7 to 4 steps (1.75x), giving ~1.27x overall AllReduce speedup at large message sizes.

The implementation exploits a key property of Pipes' IBGDA transport: each P2pIbgdaTransportDevice has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (prev's send, next's recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes. This is ~30 lines of kernel code vs ~1000 lines in Ctran's equivalent implementation.

**Key design decisions:**

- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- `nRanks > 2` guard in production dispatch: `W=2` is degenerate (`prev==next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` with min-size semantics (`>= threshold`), auto-tune default at 16MB

Differential Revision: D104758061

…orch#2689) Summary: Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout. Reviewed By: siyengar Differential Revision: D104701021
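The "Option B" layout can be pictured as each rank owning one contiguous chunk of `recvbuf`. The sketch below is illustrative only: `ChunkSpan` and `option_b_chunk` are hypothetical names, and it assumes the buffer size divides evenly by the number of ranks.

```cpp
#include <cassert>
#include <cstddef>

// Byte range of one rank's chunk inside recvbuf.
struct ChunkSpan {
  std::size_t offset_bytes;
  std::size_t size_bytes;
};

// Assumption: total_bytes is evenly divisible by num_ranks.
// ReduceScatter writes the reduced shard at this span and AllGather reads it
// from the same span, so no scratch buffer is needed.
ChunkSpan option_b_chunk(std::size_t total_bytes, int num_ranks, int my_rank) {
  assert(num_ranks >= 1 && my_rank >= 0 && my_rank < num_ranks);
  const std::size_t n = static_cast<std::size_t>(num_ranks);
  assert(total_bytes % n == 0);
  const std::size_t chunk_bytes = total_bytes / n;
  return {static_cast<std::size_t>(my_rank) * chunk_bytes, chunk_bytes};
}
```

For example, with a 1024-byte buffer and 8 ranks, rank 3 owns bytes 384 through 511.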

…pytorch#2700) Summary: Wire Pipes-based Ring AllReduce into the Ctran dispatch as a new `pipesring` algorithm. Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol. Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology). Differential Revision: D104701020

Summary: Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size). Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value. Differential Revision: D104703792

…h#2702) Summary: Replace the two-kernel AllReduce composition (separate RS + AG launches) with a single fused `ring_allreduce_kernel` that executes both ReduceScatter and AllGather phases within one kernel invocation. This eliminates the ~5-15us inter-kernel gap and shares setup code (thread group, ring partition, `TiledBuffer`, `pipeline_window`, stride). Transport step-state counters persist naturally within a single kernel; no barrier needed between phases. Also add Ctran ctring AllReduce as a third comparison arm in the benchmark, showing Pipes Ring vs Ctran Ring (GPE) vs NCCL (auto-select) to measure the speedup over the current MCCL production AllReduce algorithm. Differential Revision: D104709661

Summary: Switch the fused Ring AllReduce kernel's ReduceScatter phase from `TileReduce` to `TileReduceStaged` (introduced in D104600612). `TileReduceStaged` decouples the two memory load streams (staging buffer from IB + local GPU memory) from the accumulation step, giving the compiler more freedom to schedule loads in parallel. Measured +2-3% bandwidth improvement at 32-128MB for standalone `ReduceScatter`. Differential Revision: D104758065

meta-cla Bot added the CLA Signed label May 27, 2026

meta-codesync Bot added fb-exported meta-exported labels May 27, 2026

meta-codesync Bot changed the title ~~Bi-directional AllGather in fused Ring AllReduce~~ Bi-directional AllGather in fused Ring AllReduce (#2705) May 27, 2026

Scusemua force-pushed the export-D104758061 branch from 7634652 to eeb6277 Compare May 27, 2026 18:02

Scusemua force-pushed the export-D104758061 branch from eeb6277 to 0975ba1 Compare May 27, 2026 18:25

Scusemua force-pushed the export-D104758061 branch from 0975ba1 to 333dde9 Compare May 27, 2026 18:44

Scusemua force-pushed the export-D104758061 branch from 333dde9 to 91182a4 Compare May 27, 2026 18:53

meta-codesync Bot changed the title ~~Bi-directional AllGather in fused Ring AllReduce (#2705)~~ Bi-directional AllGather in fused Ring AllReduce May 27, 2026

Scusemua force-pushed the export-D104758061 branch from 91182a4 to a16ffce Compare May 27, 2026 20:31

meta-codesync Bot changed the title ~~Bi-directional AllGather in fused Ring AllReduce~~ Bi-directional AllGather in fused Ring AllReduce (#2705) May 27, 2026

Scusemua force-pushed the export-D104758061 branch from a16ffce to d722cc6 Compare May 27, 2026 20:39

Scusemua added 6 commits May 28, 2026 08:43

Scusemua force-pushed the export-D104758061 branch from d722cc6 to 35577d5 Compare May 28, 2026 15:44

Scusemua wants to merge 6 commits into meta-pytorch:main from Scusemua:export-D104758061
