Bi-directional AllGather in fused Ring AllReduce (#2705)

Open: Scusemua wants to merge 6 commits into meta-pytorch:main from Scusemua:export-D104758061

Scusemua commented May 27, 2026

Summary:

Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (W-1 steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from W-1 to ceil((W-1)/2) steps. For W=8, this cuts AG from 7 to 4 steps (1.75x).
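The step counts above can be checked with a small host-side model. This is only an illustrative sketch, not the Pipes kernel: the function names (`unidir_ag_steps`, `bidir_ag_steps`, `simulate_ag_steps`) are hypothetical, and the model tracks which shards each rank holds while ignoring payload size, bandwidth, and per-step overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Closed-form AllGather step counts quoted in the summary.
int unidir_ag_steps(int world) { return world - 1; }
// Integer ceiling of (W-1)/2.
int bidir_ag_steps(int world) { return ((world - 1) + 1) / 2; }

// Counts the steps until every simulated rank holds every shard.
// Each step, a rank forwards the shard it last received (or its own shard on
// the first step) to `next`, and, if `bidir` is set, also to `prev`.
int simulate_ag_steps(int world, bool bidir) {
  assert(world >= 1);
  std::vector<std::vector<bool>> have(world, std::vector<bool>(world, false));
  std::vector<int> cw(world), ccw(world);  // shard to forward in each direction
  for (int r = 0; r < world; ++r) {
    have[r][r] = true;
    cw[r] = ccw[r] = r;
  }
  auto done = [&have] {
    for (const auto& row : have) {
      if (std::find(row.begin(), row.end(), false) != row.end()) return false;
    }
    return true;
  };
  int steps = 0;
  while (!done()) {
    std::vector<int> next_cw(world), next_ccw(world);
    for (int r = 0; r < world; ++r) {
      const int next = (r + 1) % world;
      const int prev = (r + world - 1) % world;
      have[next][cw[r]] = true;  // clockwise hop
      next_cw[next] = cw[r];
      if (bidir) {
        have[prev][ccw[r]] = true;  // counter-clockwise hop
        next_ccw[prev] = ccw[r];
      }
    }
    cw = next_cw;
    if (bidir) ccw = next_ccw;
    ++steps;
  }
  return steps;
}
```

For `W=8` the model agrees with the figures in the summary: 7 steps in one direction and 4 steps when both directions are used.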

The implementation exploits a key property of Pipes' IBGDA transport: each P2pIbgdaTransportDevice has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (prev's send, next's recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes.

Key design decisions:

  • EnableBidirAg compile-time template parameter with if constexpr guard, so zero overhead when disabled
  • All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
  • Launcher silently falls back to unidirectional when num_ranks <= 2 (degenerate case where prev == next)
  • New CVAR NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE defaults to -1 (disabled in production). Set to 0 to enable for all sizes, -2 for auto-tune at >= 16MB, or a positive value for a custom threshold
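The CVAR semantics, the `num_ranks <= 2` fallback, and the compile-time switch can be sketched together as follows. This is a minimal host-side illustration under stated assumptions: `use_bidir_ag`, `allgather_steps`, and `dispatch_allgather_steps` are hypothetical names, 16MB is assumed to mean 16 * 1024 * 1024 bytes, and negative values other than `-1` and `-2` are assumed to behave as disabled. The real launcher and kernel may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants mirroring the CVAR values described above.
constexpr std::int64_t kBidirDisabled = -1;
constexpr std::int64_t kBidirAuto = -2;
constexpr std::int64_t kAutoThresholdBytes = 16LL * 1024 * 1024;  // assumed 16 MiB

// Runtime policy: should this call use bi-directional AllGather?
bool use_bidir_ag(std::int64_t cvar_min_size, std::size_t msg_bytes, int num_ranks) {
  if (num_ranks <= 2) return false;  // degenerate ring: prev == next
  if (cvar_min_size == kBidirDisabled) return false;
  const std::int64_t threshold =
      (cvar_min_size == kBidirAuto) ? kAutoThresholdBytes : cvar_min_size;
  if (threshold < 0) return false;  // assumption: unknown negatives mean disabled
  return static_cast<std::int64_t>(msg_bytes) >= threshold;
}

// Compile-time switch: the branch not taken is discarded by `if constexpr`,
// so the disabled instantiation carries no bi-directional code.
template <bool EnableBidirAg>
int allgather_steps(int num_ranks) {
  if constexpr (EnableBidirAg) {
    return num_ranks / 2;  // equals ceil((W-1)/2)
  } else {
    return num_ranks - 1;
  }
}

// Maps the runtime decision onto one of the two template instantiations.
int dispatch_allgather_steps(std::int64_t cvar_min_size, std::size_t msg_bytes,
                             int num_ranks) {
  return use_bidir_ag(cvar_min_size, msg_bytes, num_ranks)
             ? allgather_steps<true>(num_ranks)
             : allgather_steps<false>(num_ranks);
}
```

Keeping the policy in a plain function and the kernel variant behind a `bool` template parameter lets the default path (`-1`) compile to the same code as before the change.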

Current status: BiDir AG is functionally correct but does not yet improve performance on H100 IB (benchmark shows ~3-9% regression vs unidirectional at 256-512MB). It is disabled by default and available for future tuning on different hardware/topologies where the step-count reduction may outweigh the per-step overhead increase.

Tests:

  • Ring1Bidir / Ring2Bidir suites: mirror existing Ring1/Ring2 configs with enable_bidir_ag = true
  • MultiInvocationCorrectness test: verifies step-state accumulation across 5 invocations
  • TailCorrectness test: verifies non-divisible element counts with remainder fallback
  • ring_allreduce_2gpu_test BUCK target at ppn:2: exercises W=2 edge case where launcher falls back to unidirectional
  • make_bidir_rings() unit tests: validates {forward, reverse} ring topology utility
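To illustrate what a {forward, reverse} ring pair looks like, here is a hypothetical sketch. It is not the signature or implementation of `make_bidir_rings()`: the names `make_bidir_rings_sketch` and `next_in_ring`, the `Ring` alias, and the choice of rank order `0..W-1` for the forward ring are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

using Ring = std::vector<int>;  // rank order around one ring

// Builds a forward ring 0, 1, ..., W-1 and the same ring traversed backwards.
std::pair<Ring, Ring> make_bidir_rings_sketch(int world) {
  assert(world >= 1);
  Ring forward(world);
  std::iota(forward.begin(), forward.end(), 0);
  Ring reverse(forward.rbegin(), forward.rend());
  return {forward, reverse};
}

// Returns the rank that follows `rank` on the given ring (wrapping around).
int next_in_ring(const Ring& ring, int rank) {
  const auto it = std::find(ring.begin(), ring.end(), rank);
  assert(it != ring.end());
  const std::size_t pos = static_cast<std::size_t>(it - ring.begin());
  return ring[(pos + 1) % ring.size()];
}
```

A property a unit test could check is that a rank's successor on the reverse ring is its predecessor on the forward ring.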

Benchmarks:

  • _bd configs: interleaved bidir (1 ring, EnableBidirAg=true)
  • _rv configs: reverse ring (2 rings {fwd,rev}, EnableBidirAg=false)
  • Existing _2R configs (coprime strides) serve as multi-ring baseline

Differential Revision: D104758061

meta-cla Bot added the CLA Signed label May 27, 2026

meta-codesync Bot commented May 27, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104758061.

meta-codesync Bot changed the title from "Bi-directional AllGather in fused Ring AllReduce" to "Bi-directional AllGather in fused Ring AllReduce (#2705)" on May 27, 2026
Scusemua force-pushed the export-D104758061 branch from 7634652 to eeb6277 on May 27, 2026 18:02
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (`W-1` steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from `W-1` to `ceil((W-1)/2)` steps. For `W=8`, this cuts AG from 7 to 4 steps (1.75x), giving ~1.27x overall AllReduce speedup at large message sizes.

The implementation exploits a key property of Pipes' IBGDA transport: each P2pIbgdaTransportDevice has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (prev's send, next's recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes. This is ~30 lines of kernel code vs ~1000 lines in Ctran's equivalent implementation.

**Key design decisions:**
- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- `nRanks > 2` guard in production dispatch: `W=2` is degenerate (`prev==next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` with min-size semantics (`>= threshold`), auto-tune default at 16MB

Differential Revision: D104758061
Scusemua force-pushed the export-D104758061 branch from eeb6277 to 0975ba1 on May 27, 2026 18:25
Scusemua force-pushed the export-D104758061 branch from 0975ba1 to 333dde9 on May 27, 2026 18:44
Scusemua force-pushed the export-D104758061 branch from 333dde9 to 91182a4 on May 27, 2026 18:53
meta-codesync Bot changed the title from "Bi-directional AllGather in fused Ring AllReduce (#2705)" to "Bi-directional AllGather in fused Ring AllReduce" on May 27, 2026
Scusemua force-pushed the export-D104758061 branch from 91182a4 to a16ffce on May 27, 2026 20:31
meta-codesync Bot changed the title from "Bi-directional AllGather in fused Ring AllReduce" to "Bi-directional AllGather in fused Ring AllReduce (#2705)" on May 27, 2026
Scusemua force-pushed the export-D104758061 branch from a16ffce to d722cc6 on May 27, 2026 20:39
Scusemua added 6 commits May 28, 2026 08:43
…orch#2689)

Summary:

Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout.
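The scratch-free layout can be illustrated with a CPU model of a single-ring sum AllReduce. This is a sketch under assumptions, not the launcher or kernels from this commit: `ring_allreduce_sketch` and the `Buf` alias are hypothetical, offsets are in elements instead of bytes, and the `inflight` vectors stand in for whatever buffers carry partial sums between ranks. The point it shows is that ReduceScatter leaves rank `r`'s fully reduced chunk at offset `r * chunk` of its receive buffer, and AllGather starts reading from that same position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Buf = std::vector<float>;

// send[r] is rank r's input (W chunks of `chunk` elements each).
// Returns recv, where recv[r] is rank r's output buffer.
std::vector<Buf> ring_allreduce_sketch(const std::vector<Buf>& send, std::size_t chunk) {
  const int world = static_cast<int>(send.size());
  assert(world >= 1);
  for (const Buf& b : send) assert(b.size() == chunk * static_cast<std::size_t>(world));
  auto mod = [world](int x) { return ((x % world) + world) % world; };
  auto slice = [chunk](const Buf& b, int c) {
    return Buf(b.begin() + c * chunk, b.begin() + (c + 1) * chunk);
  };
  std::vector<Buf> recv(world, Buf(chunk * world, 0.0f));

  // ReduceScatter: partial sums travel rank -> next for W-1 steps.
  std::vector<Buf> inflight(world);
  for (int r = 0; r < world; ++r) inflight[r] = slice(send[r], mod(r - 1));
  for (int s = 0; s + 1 < world; ++s) {
    std::vector<Buf> next_inflight(world);
    for (int r = 0; r < world; ++r) {
      const int c = mod(r - 2 - s);        // chunk arriving from prev this step
      Buf reduced = inflight[mod(r - 1)];  // data sent by prev
      for (std::size_t i = 0; i < chunk; ++i) reduced[i] += send[r][c * chunk + i];
      next_inflight[r] = reduced;
    }
    inflight = next_inflight;
  }
  // Rank r now holds fully reduced chunk r; write it straight into recv[r].
  for (int r = 0; r < world; ++r) {
    std::copy(inflight[r].begin(), inflight[r].end(), recv[r].begin() + r * chunk);
  }

  // AllGather: start from recv[r][r * chunk] and forward what was last received.
  for (int s = 0; s + 1 < world; ++s) {
    std::vector<Buf> wire(world);
    for (int r = 0; r < world; ++r) wire[r] = slice(recv[r], mod(r - s));
    for (int r = 0; r < world; ++r) {
      const Buf& in = wire[mod(r - 1)];
      std::copy(in.begin(), in.end(), recv[r].begin() + mod(r - 1 - s) * chunk);
    }
  }
  return recv;
}
```

The chunk indexing is chosen so that the last ReduceScatter step lands on chunk `r` at rank `r`, which is what allows the two phases to share one buffer without a scratch copy.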

Reviewed By: siyengar

Differential Revision: D104701021
…pytorch#2700)

Summary:

Wire Pipes-based Ring AllReduce into the Ctran dispatch as a new `pipesring` algorithm.

Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol.

Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology).

Differential Revision: D104701020
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
…h#2702)

Summary:

Replace the two-kernel AllReduce composition (separate RS + AG launches) with a single fused `ring_allreduce_kernel` that executes both ReduceScatter and AllGather phases within one kernel invocation. This eliminates the ~5-15us inter-kernel gap and shares setup code (thread group, ring partition, `TiledBuffer`, `pipeline_window`, stride). Transport step-state counters persist naturally within a single kernel; no barrier needed between phases.

Also add Ctran ctring AllReduce as a third comparison arm in the benchmark, showing Pipes Ring vs Ctran Ring (GPE) vs NCCL (auto-select) to measure the speedup over the current MCCL production AllReduce algorithm.

Differential Revision: D104709661
Summary:

Switch the fused Ring AllReduce kernel's ReduceScatter phase from `TileReduce` to `TileReduceStaged` (introduced in D104600612). `TileReduceStaged` decouples the two memory load streams (staging buffer from IB + local GPU memory) from the accumulation step, giving the compiler more freedom to schedule loads in parallel. Measured +2-3% bandwidth improvement at 32-128MB for standalone `ReduceScatter`.
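The idea of separating the loads from the accumulation can be sketched on the host as below. This is not the `TileReduceStaged` implementation, whose internals are not shown in this PR: `reduce_staged_sketch` and the tile width `kTile` are hypothetical, and plain C++ arrays stand in for registers. It only illustrates the structure of issuing both load streams for a tile before any addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // hypothetical tile width

// Computes out[i] = staging[i] + local[i], loading a full tile from each
// input before accumulating, so the two load streams are independent of the adds.
void reduce_staged_sketch(const float* staging, const float* local, float* out,
                          std::size_t n) {
  std::size_t i = 0;
  for (; i + kTile <= n; i += kTile) {
    float a[kTile];
    float b[kTile];
    for (std::size_t j = 0; j < kTile; ++j) a[j] = staging[i + j];  // load stream 1
    for (std::size_t j = 0; j < kTile; ++j) b[j] = local[i + j];    // load stream 2
    for (std::size_t j = 0; j < kTile; ++j) out[i + j] = a[j] + b[j];  // accumulate
  }
  for (; i < n; ++i) out[i] = staging[i] + local[i];  // remainder elements
}
```

Whether a compiler actually overlaps the two load streams depends on the target and surrounding code, so any gain has to be measured as it was for the figures above.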

Differential Revision: D104758065
Summary:

Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (W-1 steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from W-1 to ceil((W-1)/2) steps. For W=8, this cuts AG from 7 to 4 steps (1.75x).

The implementation exploits a key property of Pipes' IBGDA transport: each `P2pIbgdaTransportDevice` has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (`prev`'s send, `next`'s recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes.

**Key design decisions:**
- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- Launcher silently falls back to unidirectional when `num_ranks <= 2` (degenerate case where `prev == next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` defaults to `-1` (disabled in production). Set to `0` to enable for all sizes, `-2` for auto-tune at >= 16MB, or a positive value for a custom threshold

**Current status:** BiDir AG is functionally correct but does not yet improve performance on H100 IB (benchmark shows ~3-9% regression vs unidirectional at 256-512MB). It is disabled by default and available for future tuning on different hardware/topologies where the step-count reduction may outweigh the per-step overhead increase.

**Tests:**
- `Ring1Bidir` / `Ring2Bidir` suites: mirror existing Ring1/Ring2 configs with `enable_bidir_ag = true`
- `MultiInvocationCorrectness` test: verifies step-state accumulation across 5 invocations
- `TailCorrectness` test: verifies non-divisible element counts with remainder fallback
- `ring_allreduce_2gpu_test` BUCK target at ppn:2: exercises W=2 edge case where launcher falls back to unidirectional
- `make_bidir_rings()` unit tests: validates {forward, reverse} ring topology utility

**Benchmarks:**
- `_bd` configs: interleaved bidir (1 ring, `EnableBidirAg=true`)
- `_rv` configs: reverse ring (2 rings `{fwd,rev}`, `EnableBidirAg=false`)
- Existing `_2R` configs (coprime strides) serve as multi-ring baseline

Differential Revision: D104758061
Scusemua force-pushed the export-D104758061 branch from d722cc6 to 35577d5 on May 28, 2026 15:44

Labels

CLA Signed, fb-exported, meta-exported
