Fused Ring AllReduce kernel + Ctran benchmark comparison (#2702) by Scusemua · Pull Request #2702 · meta-pytorch/torchcomms

Scusemua · 2026-05-27T15:49:46Z

Summary:

Replace the two-kernel AllReduce composition (separate ReduceScatter (RS) and AllGather (AG) launches) with a single fused `ring_allreduce_kernel` that executes both the ReduceScatter and AllGather phases within one kernel invocation. This eliminates the ~5-15 µs inter-kernel gap and lets both phases share setup code (thread group, ring partition, `TiledBuffer`, `pipeline_window`, stride). Transport step-state counters persist naturally within a single kernel, so no barrier is needed between phases.
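
For readers unfamiliar with the two phases, here is a minimal host-side simulation of a fused ring AllReduce (sum). It is a sketch under stated assumptions, not the code in this PR: the function name `fused_ring_allreduce`, the chunk schedule, and the single `step` counter are illustrative only, and the real kernel runs on the GPU over transports. It shows both phases running back to back in one call, with one step counter that carries over from ReduceScatter into AllGather.

```python
def fused_ring_allreduce(inputs):
    """Simulate a fused ring AllReduce (sum) across n ranks on the host.

    inputs[r] is rank r's vector; its length must be divisible by n.
    Returns (per-rank output buffers, total number of ring steps).
    Illustrative only: names and schedule are assumptions, not the PR's API.
    """
    n = len(inputs)
    if len(inputs[0]) % n != 0:
        raise ValueError("vector length must be divisible by the rank count")
    chunk = len(inputs[0]) // n
    bufs = [list(v) for v in inputs]
    step = 0  # one counter shared by both phases, never reset in between

    def span(c):
        # Element indices covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: ReduceScatter. After n-1 steps rank r holds the full sum of chunk r.
    for s in range(n - 1):
        # Snapshot outgoing chunks so every rank "sends" at the same time.
        sent = [[bufs[r][i] for i in span((r - s - 1) % n)] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n          # left neighbour in the ring
            c = (src - s - 1) % n      # chunk the neighbour sent this step
            for k, i in enumerate(span(c)):
                bufs[r][i] += sent[src][k]
        step += 1

    # Phase 2: AllGather. Starts immediately, reusing the same buffers and counter.
    for s in range(n - 1):
        sent = [[bufs[r][i] for i in span((r - s) % n)] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            c = (src - s) % n
            for k, i in enumerate(span(c)):
                bufs[r][i] = sent[src][k]
        step += 1

    return bufs, step
```

With `n` ranks the simulation takes `2 * (n - 1)` ring steps in total, and every rank ends with the elementwise sum of all inputs.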

Also add Ctran ctring AllReduce as a third comparison arm in the benchmark, showing Pipes Ring vs Ctran Ring (GPE) vs NCCL (auto-select) to measure the speedup over the current MCCL production AllReduce algorithm.
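
When comparing the three arms, one common way to normalize timings is the algorithm/bus bandwidth convention used by nccl-tests, where AllReduce bus bandwidth is the algorithm bandwidth scaled by `2 * (n - 1) / n`. The sketch below assumes that convention; whether this PR's benchmark reports the same metric is not stated, and the helper names are hypothetical.

```python
def allreduce_bandwidth(nbytes, seconds, nranks):
    """Return (algbw, busbw) in GB/s for one AllReduce measurement.

    algbw = bytes / time; busbw = algbw * 2 * (n - 1) / n (nccl-tests convention).
    """
    algbw = nbytes / seconds / 1e9
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

For example, `speedup(t_ctran_ring, t_pipes_ring)` would express the fused Pipes Ring gain over the Ctran ring arm for a given message size, where both timings come from your own runs.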

Differential Revision: D104709661

meta-codesync · 2026-05-27T15:49:54Z

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104709661.

…orch#2689) Summary: Add a composed Ring AllReduce launcher to the Pipes collectives library. The launcher sequentially launches the existing `ring_reduce_scatter_kernel` and `ring_allgather_kernel` on a user-specified CUDA stream using the "Option B" buffer layout (no scratch buffer; RS writes directly to `recvbuf[my_rank * chunk_bytes]`, AG reads from the same position). Supports 1, 2, or 4 rings with configurable block count and timeout. Reviewed By: siyengar Differential Revision: D104701021
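
The "Option B" layout described above can be pictured as each rank owning one fixed slice of `recvbuf`. The following is a minimal sketch, assuming the buffer divides evenly among ranks; the helper name `option_b_chunk_range` is hypothetical, and how the real launcher further splits a chunk across multiple rings is not covered here.

```python
def option_b_chunk_range(rank, nranks, total_bytes):
    """Byte range [start, end) of recvbuf that `rank` owns under "Option B".

    ReduceScatter writes the reduced chunk at recvbuf[rank * chunk_bytes],
    and AllGather later reads from that same position, so no scratch
    buffer is required. Assumes total_bytes divides evenly by nranks.
    """
    if total_bytes % nranks != 0:
        raise ValueError("total_bytes must be divisible by nranks")
    chunk_bytes = total_bytes // nranks
    start = rank * chunk_bytes
    return start, start + chunk_bytes
```

Because the per-rank ranges tile `recvbuf` without overlap, the reduced chunk can stay in place between the two phases.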

…pytorch#2700) Summary: Wire Pipes-based Ring AllReduce into the Ctran dispatch as a new `pipesring` algorithm. Add three new CVARs (`NCCL_CTRAN_PIPES_SENDRECV_ENABLE`, `_MAX_GROUPS`, `_PIPELINE_DEPTH`) to enable pipelined send/recv staging buffers on the IBGDA transport, as required by ring collectives' `send()`/`recv()`/`forward()` protocol. Add `AllReducePipesRing.cc` implementing the Ctran-side dispatch that builds ring topology, gets per-peer IBGDA transport handles, and launches the Pipes AllReduce. Uses the scratch-free "Option B" buffer layout. Currently supports float32 + Sum only, `nLocalRanks=1` (IB-only, matching PAFT inter-replica topology). Differential Revision: D104701020

Summary: Add auto-tuning support for the pipesring AllReduce algorithm with two new CVARs (`NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS`, `_NUM_BLOCKS`) that default to -1 (auto-select based on message size). Add a comprehensive benchmark comparing Pipes Ring AllReduce against NCCL AllReduce across message sizes from 256KB to 512MB with 1 and 2 rings. Also fix the ncclx `AlgoConfig.cc` switch exhaustiveness for the new `pipesring` enum value. Differential Revision: D104703792
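
A minimal sketch of the `-1` means auto-select convention and of a benchmark size sweep follows. Everything beyond the CVAR name is an assumption: the 64 MB threshold, the 1-versus-2 ring choice, the power-of-two sweep, and reading the value from a Python mapping are all hypothetical stand-ins for the real C++ tuning logic.

```python
HYPOTHETICAL_RING_THRESHOLD_BYTES = 64 * 1024 * 1024  # made-up cutoff, not from the PR


def resolve_num_rings(message_bytes, env):
    """Resolve the ring count: an explicit value wins, -1 means auto-select."""
    raw = int(env.get("NCCL_CTRAN_ALLREDUCE_PIPES_NUM_RINGS", "-1"))
    if raw != -1:
        return raw
    # Hypothetical policy: one ring for small messages, two for large ones.
    return 1 if message_bytes < HYPOTHETICAL_RING_THRESHOLD_BYTES else 2


def message_sizes(lo=256 * 1024, hi=512 * 1024 * 1024):
    """Message sizes from 256KB to 512MB, doubling each step (assumed spacing)."""
    sizes = []
    size = lo
    while size <= hi:
        sizes.append(size)
        size *= 2
    return sizes
```

Passing the environment as a plain mapping keeps the sketch easy to exercise, for example `resolve_num_rings(size, {})` for the auto path.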

meta-cla Bot added the CLA Signed label May 27, 2026

meta-codesync Bot added the fb-exported and meta-exported labels May 27, 2026

meta-codesync Bot changed the title ~~Fused Ring AllReduce kernel + Ctran benchmark comparison~~ Fused Ring AllReduce kernel + Ctran benchmark comparison (#2702) May 27, 2026

Scusemua force-pushed the export-D104709661 branch 2 times, most recently from 1361b51 to cd49f0a May 27, 2026 16:07

Scusemua force-pushed the export-D104709661 branch from cd49f0a to c16e70d May 27, 2026 18:01

Scusemua force-pushed the export-D104709661 branch from c16e70d to 1984e5c May 27, 2026 18:44

Scusemua force-pushed the export-D104709661 branch from 1984e5c to 4dbc69b May 27, 2026 18:53

Scusemua force-pushed the export-D104709661 branch from 4dbc69b to e434a70 May 27, 2026 20:33

Scusemua force-pushed the export-D104709661 branch from e434a70 to e6dc447 May 27, 2026 20:39

Scusemua added 4 commits May 28, 2026 08:41

Scusemua force-pushed the export-D104709661 branch from e6dc447 to 75414fc May 28, 2026 15:42

Scusemua wants to merge 4 commits into meta-pytorch:main from Scusemua:export-D104709661
