Skip to content

gpudirect-tcpx: Update NCCL config manifest for GKE 1.34+ recommendat…#610

Open
jkru3 wants to merge 3 commits into
GoogleCloudPlatform:masterfrom
jkru3:nccl-config-deprecation
Open

gpudirect-tcpx: Update NCCL config manifest for GKE 1.34+ recommendat…#610
jkru3 wants to merge 3 commits into
GoogleCloudPlatform:masterfrom
jkru3:nccl-config-deprecation

Conversation

@jkru3
Copy link
Copy Markdown

@jkru3 jkru3 commented May 19, 2026

Update NCCL config manifest for GKE 1.34+ recommendations

jkru3 added 3 commits May 19, 2026 22:37
…ions

This change updates the `nccl-config.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Reason: These manual packet tuning variables are deprecated and completely ignored by the updated TCPX daemon (v2.0.15+) used in GKE 1.34. With the migration to COS 125 (Linux kernel 6.12+), the stack natively utilizes upstream Device Memory TCP (devmem TCP) for zero-copy transfers, making these custom daemon-level workarounds obsolete.
   - Proof: These variables have been removed from the recommended configuration in the official Google Cloud GPUDirect-TCPX documentation:
     https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Reason: Forcing the system to use exactly 8 channels is no longer recommended for H100 workloads running NCCL core 3.1.12+ (standard in GKE 1.34). Restricting the channel count prevents NCCL from dynamically selecting the optimal number of channels based on topology, which can artificially limit GPU network bandwidth.
   - Proof: The official configuration guide no longer lists channel count limits, allowing NCCL to dynamically optimize itself:
     https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
…endations

This change creates `nccl-config-latest.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning:
     https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
…endations

This change creates `nccl-config-latest.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning:
     https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant