Skip to content

Auto-tuning CVARs and AllReduce benchmark harness (#2701)#2701

Open
Scusemua wants to merge 3 commits into
meta-pytorch:mainfrom
Scusemua:export-D104703792
Open

Auto-tuning CVARs and AllReduce benchmark harness (#2701)#2701
Scusemua wants to merge 3 commits into
meta-pytorch:mainfrom
Scusemua:export-D104703792

Conversation

@Scusemua
Copy link
Copy Markdown
Contributor

@Scusemua Scusemua commented May 27, 2026

Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS, _NUM_BLOCKS) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx AlgoConfig.cc switch exhaustiveness for the new pipesring enum value.

Differential Revision: D104703792

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 27, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 27, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104703792.

Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@meta-codesync meta-codesync Bot changed the title Auto-tuning CVARs and AllReduce benchmark harness Auto-tuning CVARs and AllReduce benchmark harness (#2701) May 27, 2026
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from 1ed57a0 to df1a677 Compare May 27, 2026 16:07
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from df1a677 to 473725a Compare May 27, 2026 18:01
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from 473725a to 3c7a08f Compare May 27, 2026 18:43
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from 3c7a08f to 217ed1c Compare May 27, 2026 18:53
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from 217ed1c to b5f949a Compare May 27, 2026 20:39
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 27, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 28, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 28, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua added 3 commits May 28, 2026 08:42
…orch#2689)

Summary:

Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout.

Reviewed By: siyengar

Differential Revision: D104701021
…pytorch#2700)

Summary:

Wire Pipes-based Ring AllReduce into the Ctran dispatch as a new `pipesring` algorithm.

Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol.

Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology).

Differential Revision: D104701020
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
@Scusemua Scusemua force-pushed the export-D104703792 branch from b5f949a to 0b470ea Compare May 28, 2026 15:43
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request May 28, 2026
Summary:

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request May 28, 2026
Summary:
Pull Request resolved: meta-pytorch#2701

Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size).

Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value.

Differential Revision: D104703792
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant