
Add NCCL Ring baseline and medium-size block sweep to AllReduce benchmark (#2708) #2708

Open
Scusemua wants to merge 9 commits into meta-pytorch:main from Scusemua:export-D105618089

Conversation

@Scusemua Scusemua commented May 27, 2026

Summary:

Add a forced NCCL Ring/Simple baseline column and medium-size block count sweep configs to the Ring AllReduce benchmark, enabling true ring-vs-ring performance comparison across all message sizes.

NCCL Ring baseline:
Creates a second NCCL communicator with NCCL_ALGO=Ring and NCCL_PROTO=Simple forced via env vars before ncclCommInitRank. This produces a 4-column comparison: NCCL (auto) vs NCCL Ring vs Ctran (ctring) vs Pipes (IBGDA). The NCCL Ring column confirms that NCCL auto uses tree (not ring) at small/medium sizes; the apparent "gap" to NCCL at 4-16MB was mostly ring-vs-tree, not a Pipes deficiency.
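
The "force via env vars before init" pattern described above could be sketched as follows. This is a minimal illustration in Python rather than the benchmark's own code: `forced_nccl_env` and `init_comm` are hypothetical names, `init_comm` only stands in for `ncclCommInitRank`, and the sketch assumes (as the description states) that `NCCL_ALGO` and `NCCL_PROTO` are picked up during communicator initialization.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_nccl_env(algo="Ring", proto="Simple"):
    """Temporarily set NCCL_ALGO / NCCL_PROTO, then restore the old values."""
    overrides = {"NCCL_ALGO": algo, "NCCL_PROTO": proto}
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


def init_comm():
    """Hypothetical stand-in for ncclCommInitRank: records the env it saw."""
    return {name: os.environ.get(name) for name in ("NCCL_ALGO", "NCCL_PROTO")}


auto_comm = init_comm()        # baseline communicator, NCCL auto-selects
with forced_nccl_env():
    ring_comm = init_comm()    # second communicator, Ring/Simple forced
```

Restoring the previous values afterwards keeps the forced settings from leaking into any communicator created later in the same process.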

Medium-size block count sweep:
Tests block counts 2, 4, 8, 16 at 4MB, 16MB, and 32MB to find optimal configurations at medium sizes. At 16MB with 16 blocks, each block only has 128KB of data per ring step — too small to amortize the per-WQE overhead (QP slot reservation, mark_wqes_ready, NIC doorbell). Fewer blocks give more data per WQE.
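
The 128KB figure follows from simple arithmetic, sketched below. The helper name is hypothetical, and the sketch assumes the message is split evenly into one chunk per rank (8 ranks in this run) and each chunk evenly across blocks.

```python
MB = 1 << 20


def bytes_per_block_per_step(message_bytes, num_ranks, num_blocks):
    """Each ring step moves one 1/num_ranks chunk, split evenly across blocks."""
    chunk_bytes = message_bytes // num_ranks
    return chunk_bytes // num_blocks


for blocks in (2, 4, 8, 16):
    per_block = bytes_per_block_per_step(16 * MB, num_ranks=8, num_blocks=blocks)
    print(f"16MB, {blocks:2d} blocks: {per_block // 1024} KB per block per step")
```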

GB200 NCCL Ring results (Run 11, 4n×2g=8 ranks, IB-only):

Size      NCCL(auto)  NCCLRing   Ctran        Pipes   Pipes vs Ring  Pipes vs Ctran
256KB           1.77      0.45    0.63         0.45           1.02x           0.73x
1MB             3.65      1.71    2.53         1.55           0.91x           0.61x
4MB            12.90      5.88    9.01     6.23(8B)           1.06x           0.69x
16MB           35.80     21.70   22.63   18.91(16B)           0.87x           0.83x
32MB           40.90     37.29   29.83   31.19(16B)           0.84x           1.05x
64MB           45.82     45.72   43.73   43.28(16B)           0.95x           0.99x
128MB          49.94     49.83   52.15   49.16(16B)           0.99x           0.94x
256MB          51.27     51.17   53.84   51.95(32B)           1.02x           0.96x
512MB          52.44     52.46   53.91   53.13(16B)           1.01x           0.99x
1GB            53.48     53.49   53.92   53.64(16B)           1.00x           0.99x

Key findings:

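  • NCCL auto is well ahead of forced NCCL Ring below 64MB (2.2x at 4MB, about 3.9x at 256KB), consistent with auto selecting tree rather than ring at those sizes
  • Pipes matches or slightly beats NCCL Ring at 256KB (1.02x), 4MB (1.06x), and 256MB+ (1.00-1.02x)
  • Pipes is 0.84-0.95x of NCCL Ring at 16-64MB (0.91x at 1MB); block count optimization may close this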
  • The "0.53x NCCL" gap at 16MB was mostly ring-vs-tree, not a Pipes problem (vs Ring it's 0.87x)
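
The ratios quoted in these findings can be recomputed from the table. The sketch below copies four medium-size rows from the Run 11 results and derives the ratios; `ROWS` and `ratios` are illustrative names, not part of the benchmark.

```python
# (NCCL auto, NCCL Ring, Pipes) bandwidths copied from the Run 11 table.
ROWS = {
    "4MB": (12.90, 5.88, 6.23),
    "16MB": (35.80, 21.70, 18.91),
    "32MB": (40.90, 37.29, 31.19),
    "64MB": (45.82, 45.72, 43.28),
}


def ratios(size):
    auto, ring, pipes = ROWS[size]
    return {
        "auto_vs_ring": auto / ring,     # how far auto is ahead of forced Ring
        "pipes_vs_auto": pipes / auto,   # the original "gap to NCCL"
        "pipes_vs_ring": pipes / ring,   # the ring-vs-ring comparison
    }


for size in ROWS:
    r = ratios(size)
    print(f"{size:>5}: auto/ring={r['auto_vs_ring']:.2f}  "
          f"pipes/auto={r['pipes_vs_auto']:.2f}  pipes/ring={r['pipes_vs_ring']:.2f}")
```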

Differential Revision: D105618089

@meta-cla meta-cla Bot added the CLA Signed label May 27, 2026
meta-codesync Bot commented May 27, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105618089.

Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

@meta-codesync meta-codesync Bot changed the title from "Add NCCL Ring baseline and medium-size block sweep to AllReduce benchmark" to "Add NCCL Ring baseline and medium-size block sweep to AllReduce benchmark (#2708)" May 27, 2026
@Scusemua Scusemua force-pushed the export-D105618089 branch from 57a6042 to c72432e Compare May 27, 2026 18:03
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

@Scusemua Scusemua force-pushed the export-D105618089 branch 2 times, most recently from 2ae4eae to 81c9363 Compare May 27, 2026 20:24
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

@Scusemua Scusemua force-pushed the export-D105618089 branch 2 times, most recently from b727fc0 to 3fe738a Compare May 27, 2026 20:42
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

@Scusemua Scusemua force-pushed the export-D105618089 branch 2 times, most recently from 86742e7 to 60c5311 Compare May 27, 2026 21:01
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
…mark (meta-pytorch#2708)

Ben Carver and others added 9 commits May 28, 2026 08:40
Summary: Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout.
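
To make the data flow concrete, here is a pure-Python model of a ring ReduceScatter followed by a ring AllGather in which rank `r` ends the first phase holding the fully reduced chunk `r`, matching the "RS writes to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position" layout. It is only a sketch: the send schedule shown is one schedule that produces this layout and is assumed, not taken from the kernels, and staging buffers, blocks, and multiple rings are not modelled.

```python
def ring_allreduce(inputs):
    """Simulate RS then AG on a ring; inputs[r] is rank r's list of W chunks."""
    world = len(inputs)
    # work[r][c] starts as rank r's own contribution to chunk c.
    work = [[list(chunk) for chunk in chunks] for chunks in inputs]

    # ReduceScatter: after W-1 steps rank r holds the full sum of chunk r.
    for step in range(world - 1):
        sends = [(r, (r - 1 - step) % world) for r in range(world)]
        payloads = [list(work[r][c]) for r, c in sends]  # all sends are concurrent
        for (r, c), data in zip(sends, payloads):
            dst = (r + 1) % world
            work[dst][c] = [a + b for a, b in zip(work[dst][c], data)]

    # AllGather: rank r first sends its own slot r, then forwards what it received.
    for step in range(world - 1):
        sends = [(r, (r - step) % world) for r in range(world)]
        payloads = [list(work[r][c]) for r, c in sends]
        for (r, c), data in zip(sends, payloads):
            work[(r + 1) % world][c] = data
    return work


world = 4
inputs = [[[rank + 10 * c] * 2 for c in range(world)] for rank in range(world)]
result = ring_allreduce(inputs)
```

Because the last ReduceScatter hop for chunk `c` lands on rank `c`, the AllGather can start from each rank's own slot without a scratch buffer.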

Differential Revision: D104701021
Summary:
Wire Pipes-based Ring AllReduce into the MCCL/Ctran dispatch as a new `pipesring` algorithm.

Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol.

Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology).

Differential Revision: D104701020
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
…h#2702)

Summary:
Pull Request resolved: meta-pytorch#2702

Replace the two-kernel AllReduce composition (separate RS + AG launches) with a single fused `ring_allreduce_kernel` that executes both ReduceScatter and AllGather phases within one kernel invocation. This eliminates the ~5-15us inter-kernel gap and shares setup code (thread group, ring partition, `TiledBuffer`, `pipeline_window`, stride). Transport step-state counters persist naturally within a single kernel; no barrier needed between phases.

Also add Ctran ctring AllReduce as a third comparison arm in the benchmark, showing Pipes Ring vs Ctran Ring (GPE) vs NCCL (auto-select) to measure the speedup over the current MCCL production AllReduce algorithm.

Differential Revision: D104709661
Summary: Switch the fused Ring AllReduce kernel's ReduceScatter phase from `TileReduce` to `TileReduceStaged` (introduced in D104600612). `TileReduceStaged` decouples the two memory load streams (staging buffer from IB + local GPU memory) from the accumulation step, giving the compiler more freedom to schedule loads in parallel. Measured +2-3% bandwidth improvement at 32-128MB for standalone `ReduceScatter`.

Differential Revision: D104758065
Summary:
Adds bi-directional AllGather to the Pipes fused Ring AllReduce kernel. Instead of propagating reduced shards in one direction (W-1 steps), data is sent both clockwise and counter-clockwise simultaneously, reducing the AG phase from W-1 to ceil((W-1)/2) steps. For W=8, this cuts AG from 7 to 4 steps (1.75x).
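
The step-count reduction can be checked with a few lines. `ag_steps` is a hypothetical helper that only encodes the W-1 and ceil((W-1)/2) formulas from the summary above.

```python
import math


def ag_steps(world_size, bidirectional):
    """AllGather ring steps: W-1 one way, ceil((W-1)/2) when sending both ways."""
    unidirectional = world_size - 1
    return math.ceil(unidirectional / 2) if bidirectional else unidirectional


uni, bidir = ag_steps(8, False), ag_steps(8, True)
print(f"W=8: {uni} -> {bidir} steps ({uni / bidir:.2f}x fewer)")
```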

The implementation exploits a key property of Pipes' IBGDA transport: each `P2pIbgdaTransportDevice` has fully independent send and recv channels (separate stepState indices, staging buffers, signal buffers). After ReduceScatter, the reverse-direction channels (`prev`'s send, `next`'s recv) are dormant, so bi-directional AG simply activates them with zero additional buffers or transport changes.

**Key design decisions:**
- `EnableBidirAg` compile-time template parameter with `if constexpr` guard, so zero overhead when disabled
- All blocks participate in both directions (no block splitting): RDMA puts to different peers overlap on full-duplex IB links
- Launcher silently falls back to unidirectional when `num_ranks <= 2` (degenerate case where `prev == next`)
- New CVAR `NCCL_CTRAN_ALLREDUCE_PIPES_BIDIR_AG_MIN_SIZE` defaults to `-1` (disabled in production). Set to `0` to enable for all sizes, `-2` for auto-tune at >= 16MB, or a positive value for a custom threshold
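
The CVAR semantics and the small-world fallback listed above could be resolved roughly as follows. This is a sketch, not the dispatch code: `use_bidir_ag` is a hypothetical name, the positive threshold is assumed to be compared with `>=`, and negative values other than `-1` and `-2` are treated as disabled, which the description does not specify.

```python
MB = 1 << 20
AUTO_TUNE_MIN_BYTES = 16 * MB  # "-2 for auto-tune at >= 16MB"


def use_bidir_ag(cvar_value, message_bytes, num_ranks):
    """Decide whether bi-directional AG is used for one AllReduce call."""
    if num_ranks <= 2:
        return False  # launcher falls back to unidirectional (prev == next)
    if cvar_value == -2:
        return message_bytes >= AUTO_TUNE_MIN_BYTES
    if cvar_value < 0:
        return False  # -1 = disabled; other negatives: assumed disabled
    # 0 enables every size; a positive value acts as a minimum message size.
    return message_bytes >= cvar_value
```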

**Current status:** BiDir AG is functionally correct but does not yet improve performance on H100 IB (benchmark shows ~3-9% regression vs unidirectional at 256-512MB). It is disabled by default and available for future tuning on different hardware/topologies where the step-count reduction may outweigh the per-step overhead increase.

**Tests:**
- `Ring1Bidir` / `Ring2Bidir` suites: mirror existing Ring1/Ring2 configs with `enable_bidir_ag = true`
- `MultiInvocationCorrectness` test: verifies step-state accumulation across 5 invocations
- `TailCorrectness` test: verifies non-divisible element counts with remainder fallback
- `ring_allreduce_2gpu_test` BUCK target at ppn:2: exercises W=2 edge case where launcher falls back to unidirectional
- `make_bidir_rings()` unit tests: validates {forward, reverse} ring topology utility

**Benchmarks:**
- `_bd` configs: interleaved bidir (1 ring, `EnableBidirAg=true`)
- `_rv` configs: reverse ring (2 rings `{fwd,rev}`, `EnableBidirAg=false`)
- Existing `_2R` configs (coprime strides) serve as multi-ring baseline

Differential Revision: D104758061
Summary:
Fuses the AllGather self-copy into the first RDMA send using a dual-destination `MemcpyAndSelfCopy` CopyOp. Instead of copying own shard to `recvbuf[self]` and then re-reading it for RDMA staging (2 reads + 2 writes per byte), the fused path reads the source once and writes to both staging and `recvbuf[self]` simultaneously (1 read + 2 writes). This optimization was discovered by Subodh and Santosh on GB200, where the extra read pass showed up as a major bottleneck with dual-NIC configurations.
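
The read/write accounting can be illustrated with a toy model that counts element accesses. The functions below are illustrative stand-ins for the two copy strategies, not the `MemcpyAndSelfCopy` CopyOp itself.

```python
def unfused_copy(src, staging, recv_self):
    """Copy own shard to recvbuf[self], then re-read it into staging."""
    reads = writes = 0
    for i in range(len(src)):
        recv_self[i] = src[i]
        reads += 1
        writes += 1
    for i in range(len(src)):
        staging[i] = recv_self[i]
        reads += 1
        writes += 1
    return reads, writes


def fused_copy(src, staging, recv_self):
    """Read each element once and write it to both destinations."""
    reads = writes = 0
    for i in range(len(src)):
        value = src[i]
        reads += 1
        staging[i] = value
        recv_self[i] = value
        writes += 2
    return reads, writes


shard = [1.0, 2.0, 3.0, 4.0]
print(unfused_copy(shard, [0.0] * 4, [0.0] * 4))  # 2 reads + 2 writes per element
print(fused_copy(shard, [0.0] * 4, [0.0] * 4))    # 1 read + 2 writes per element
```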

**Changes:**
- Promotes `MemcpyAndSelfCopy` from `DirectNvl.cu`'s anonymous namespace to `CopyOp.cuh` as a shared, reusable CopyOp
- Applies the fused self-copy to the AG phase of `RingAllReduce.cu` (both unidirectional and bidir paths)
- Applies the same optimization to standalone `RingAllgather.cu`
- Removes the local duplicate from `DirectNvl.cu`

For bidir AG, both forward and reverse sends read from the original source (`own_src + off`), avoiding a data dependency on the fused write. The `ring_group.sync()` barrier between RS and AG phases is preserved.

The optimization is invisible on H100 (single NIC) but provides ~7.5% improvement on GB200 at 1GB (D104725448 benchmarks: 115.73 → 124.36 GB/s).

Based on D104725448 (Subodh).

Differential Revision: D104868628
…rmance analysis

Summary:
Add a compile-time toggle (`PIPES_USE_DEVICE_SCOPE_RELEASE_FENCE`) for the two hot-path release fences in the IBGDA send/forward path, plus infrastructure for differential performance debugging of Ring AllReduce on GB200.

**Benchmark infrastructure additions:**
- `skip_reduction` flag: bypasses `TileReduceStaged` in RS phase, using plain `Memcpy` forward/recv instead. Isolates reduction compute cost. Benchmark-only (produces incorrect output).
- `ib_window_bytes` override: controls the kernel's pipeline window size, clamped to `min(override, transport_window)` to prevent staging ring deadlocks.
- Registered `ring_allgather` and `ring_reduce_scatter` in MAST launcher for standalone phase benchmarking.
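
The clamping rule for `ib_window_bytes` amounts to the following. `effective_window_bytes` is a hypothetical helper, and falling back to the transport window when no override is given is an assumption.

```python
def effective_window_bytes(transport_window, override=None):
    """Clamp a benchmark override so it never exceeds the transport's window."""
    if override is None:
        return transport_window  # assumed default: no override requested
    return min(override, transport_window)
```

An override larger than the transport window is ignored, which is what keeps the staging ring from being over-committed.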

**Fence toggle details:**
Two `__threadfence_system()` sites replaced with `PIPES_RELEASE_FENCE()`:
- `send()` at P2pIbgdaTransportDevice.cuh:1241
- `forward()` at P2pIbgdaTransportDevice.cuh:1668
Acquire fences remain `__threadfence_system()` unconditionally.

 ---

**Performance analysis from nine GB200 MAST benchmark runs** (4 nodes x 2 GPUs = 8 ranks, IB-only):

Best AllReduce results (GB200, 1-ring):

| Size | Blocks | NCCL (GB/s) | Ctran (GB/s) | Pipes (GB/s) | vs NCCL | vs Ctran |
|------|--------|------------|-------------|-------------|---------|----------|
| 256MB | 32 | 51.13 | 53.83 | 51.84 | 1.01x | 0.96x |
| 512MB | 16 | 52.48 | 53.88 | **53.13** | **1.01x** | **0.99x** |
| 1GB | 16 | 53.41 | 53.92 | **53.64** | **1.00x** | **0.99x** |

At 1GB with 16 blocks, Pipes reaches 53.64 GB/s — only **0.5% behind Ctran**.

**Standalone phase benchmarks (Run 9, first-ever GB200 isolated AG and RS):**

| Phase | Size (total) | NCCL (GB/s) | Pipes (GB/s) | vs NCCL |
|-------|-------------|------------|-------------|---------|
| AG | 512MB | 98.82 | 103.51 | **1.05x** |
| AG | 1GB | 101.66 | 106.00 | **1.04x** |
| AG | 2GB | 104.03 | 106.78 | **1.03x** |
| RS | 512MB | 100.80 | 103.39 | **1.03x** |
| RS | 1GB | 103.25 | 105.78 | **1.02x** |
| RS | 2GB | 105.65 | 106.94 | **1.01x** |

Pipes beats NCCL in both standalone phases (AG by 2-5%, RS by 1-3%).

**Phase decomposition — fused kernel has NO overhead:**

| Total data | AG latency | RS latency | AG+RS sum | Fused AR latency | Fusion saves |
|-----------|-----------|-----------|----------|-----------------|-------------|
| 512MB | 5186.8 us | 5192.8 us | 10379.6 us | 10164.9 us | **214.7 us (2.1%)** |
| 1GB | 10130.0 us | 10150.3 us | 20280.3 us | 20048.1 us | **232.2 us (1.1%)** |

The fused AllReduce is faster than separate AG+RS — the kernel fusion provides net benefit by avoiding inter-kernel launch overhead and leveraging step-state continuity.
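
The "Fusion saves" column can be reproduced from the latencies in the table. `fusion_savings` is an illustrative helper.

```python
# (AG latency, RS latency, fused AR latency) in microseconds, from the table.
LATENCIES_US = {
    "512MB": (5186.8, 5192.8, 10164.9),
    "1GB": (10130.0, 10150.3, 20048.1),
}


def fusion_savings(ag_us, rs_us, fused_us):
    """Return (microseconds saved, percent saved) versus separate AG + RS."""
    separate_us = ag_us + rs_us
    saved_us = separate_us - fused_us
    return saved_us, 100.0 * saved_us / separate_us


for size, row in LATENCIES_US.items():
    saved_us, pct = fusion_savings(*row)
    print(f"{size}: saves {saved_us:.1f} us ({pct:.1f}%)")
```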

**Differential debugging findings:**

1. **Reduction compute is NOT the bottleneck.** `skip_reduction` = baseline (+0.06% at 512MB).
2. **16 blocks is optimal on GB200 at 512MB+.** 16B=53.13 (+0.7%), 32B=52.76 (default), 64B=51.14 (-3.2%).
3. **Window size doesn't matter.** w512k and w1m match baseline.
4. **QP count: 4 is optimal.** 2q=-18% (catastrophic), 4q=baseline, 8q=neutral.
5. **Phase transition is a net positive.** Fused kernel saves 1-2% vs separate AG+RS launches.
6. **Remaining ~0.5% gap to Ctran** is in DOCA GPUNetIO vs Ctran's CPU-side `ibv_post_send` WQE posting efficiency.

**Previously established findings:**
- Pipeline depth 4: no improvement
- Staging buffer 64/128MB: no impact
- IB signaling granularity: finer = worse
- Device-scope fence: +0.3% at large sizes, +2.5% at 32MB
- Bidir AG: -8% to -10% at large sizes, +6% at 32MB
- GB200 bandwidth ~2x H100 (53 vs 27 GB/s at 512MB)

Differential Revision: D105036747
…mark (meta-pytorch#2708)

Summary:
Pull Request resolved: meta-pytorch#2708

Add a forced NCCL Ring/Simple baseline column and medium-size block count sweep configs to the Ring AllReduce benchmark, enabling true ring-vs-ring performance comparison across all message sizes.

**NCCL Ring baseline:**
Creates a second NCCL communicator with `NCCL_ALGO=Ring` and `NCCL_PROTO=Simple` forced via env vars before `ncclCommInitRank`. This produces a 4-column comparison: NCCL (auto) vs NCCL Ring vs Ctran (ctring) vs Pipes (IBGDA). The NCCL Ring column confirms that NCCL auto uses tree (not ring) at small/medium sizes; the apparent "gap" to NCCL at 4-16MB was mostly ring-vs-tree, not a Pipes deficiency.

**Medium-size block count sweep:**
Tests block counts 2, 4, 8, 16 at 4MB, 16MB, and 32MB to find optimal configurations at medium sizes. At 16MB with 16 blocks, each block only has 128KB of data per ring step — too small to amortize the per-WQE overhead (QP slot reservation, mark_wqes_ready, NIC doorbell). Fewer blocks give more data per WQE.

**GB200 NCCL Ring results (Run 11, 4n×2g=8 ranks, IB-only):**

```
Size      NCCL(auto)  NCCLRing   Ctran        Pipes   Pipes vs Ring  Pipes vs Ctran
256KB           1.77      0.45    0.63         0.45           1.02x           0.73x
1MB             3.65      1.71    2.53         1.55           0.91x           0.61x
4MB            12.90      5.88    9.01     6.23(8B)           1.06x           0.69x
16MB           35.80     21.70   22.63   18.91(16B)           0.87x           0.83x
32MB           40.90     37.29   29.83   31.19(16B)           0.84x           1.05x
64MB           45.82     45.72   43.73   43.28(16B)           0.95x           0.99x
128MB          49.94     49.83   52.15   49.16(16B)           0.99x           0.94x
256MB          51.27     51.17   53.84   51.95(32B)           1.02x           0.96x
512MB          52.44     52.46   53.91   53.13(16B)           1.01x           0.99x
1GB            53.48     53.49   53.92   53.64(16B)           1.00x           0.99x
```

Key findings:
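- NCCL auto is well ahead of forced NCCL Ring below 64MB (2.2x at 4MB, about 3.9x at 256KB), consistent with auto selecting tree rather than ring at those sizes
- Pipes matches or slightly beats NCCL Ring at 256KB (1.02x), 4MB (1.06x), and 256MB+ (1.00-1.02x)
- Pipes is 0.84-0.95x of NCCL Ring at 16-64MB (0.91x at 1MB); block count optimization may close this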
- The "0.53x NCCL" gap at 16MB was mostly ring-vs-tree, not a Pipes problem (vs Ring it's 0.87x)

Differential Revision: D105618089
@Scusemua Scusemua force-pushed the export-D105618089 branch from 60c5311 to e90c9b9 Compare May 28, 2026 15:46

Labels

CLA Signed, fb-exported, meta-exported
