Fix int32 overflow in Triton _padded_copy pointer arithmetic #186
Open
hangg7 wants to merge 1 commit into databricks:main from
Conversation
The `_padded_copy` and `_binned_copy` Triton kernels compute pointer offsets as `offset * NUM_COLUMNS` using int32 arithmetic. In Triton, `int32 * int32` stays int32 without promotion to int64. When the product exceeds 2^31, the result wraps negative, creating a backward pointer that accesses memory before the tensor start, triggering `CUDA error: an illegal memory access was encountered`.

This triggers with expert parallelism at high token counts: the all-to-all dispatch can concentrate tokens on one rank due to routing imbalance. For `hidden_size=4096`, the overflow threshold is `offset >= 524,288` tokens on a single rank.

Fix: cast `offset` and `index_b` to `tl.int64` before the multiplication in all 4 Triton kernels. The `.to(tl.int64)` adds one instruction per thread block, so the performance impact is negligible.

This is the same class of bug as triton-lang/triton#832.
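The wraparound can be reproduced outside Triton with NumPy int32 arithmetic (a minimal sketch; `hidden_size=4096` and the 524,288-token threshold are the values quoted above, and the variable names here are illustrative):

```python
import numpy as np

HIDDEN_SIZE = 4096   # plays the role of NUM_COLUMNS in the kernels
offset = 524_288     # first token offset whose product reaches 2**31

# int32 * int32 wraps silently, mirroring Triton's unpromoted arithmetic
wrapped = np.array([offset], dtype=np.int32) * np.int32(HIDDEN_SIZE)
print(int(wrapped[0]))   # -2147483648: a negative offset, i.e. a backward pointer

# casting to int64 before the multiply (the fix) keeps the product exact
exact = np.array([offset], dtype=np.int64) * np.int64(HIDDEN_SIZE)
print(int(exact[0]))     # 2147483648
```

Any offset at or above the threshold produces a negative (or aliased) pointer offset, which is why the failure appears only once enough tokens land on one rank.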
Summary
Cast pointer offsets to `tl.int64` before multiplying by `NUM_COLUMNS` in all 4 Triton kernels (`_padded_copy`, `_padded_copy_wgrad`, `_binned_copy`, `_binned_copy_wgrad`).

Problem
The Triton kernels compute pointer offsets as `offset * NUM_COLUMNS` using int32 arithmetic. In Triton, `int32 * int32` stays int32 without promotion. When the product exceeds 2^31, the result wraps negative, creating a backward pointer that accesses memory before the tensor start, triggering `CUDA error: an illegal memory access was encountered`. This is the same class of bug as triton-lang/triton#832.
When does it trigger?
With expert parallelism, the all-to-all dispatch can concentrate tokens on one rank due to routing imbalance. The overflow threshold for `hidden_size=4096` is `offset >= 524,288`. At 20-30k tokens/GPU with EP, moderate routing imbalance (1.5-2x) is sufficient.

Symptoms
Fix
Applied to all 4 kernels (8 lines). Negligible performance impact.
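For other model widths, the single-rank token count at which `offset * NUM_COLUMNS` first crosses 2^31 can be computed directly (a sketch assuming, as in the description, that the offset counts tokens and the column count equals the hidden size):

```python
def overflow_threshold(hidden_size: int) -> int:
    """Smallest token offset whose int32 product offset * hidden_size wraps.

    Exact when hidden_size is a power of two; 2**31 // hidden_size rounds
    down, and for power-of-two sizes the quotient is already the first
    offset that reaches 2**31.
    """
    return 2**31 // hidden_size

for h in (2048, 4096, 8192):
    print(h, overflow_threshold(h))
# 4096 -> 524288, matching the threshold quoted above
```

Wider hidden sizes halve the threshold, so larger models hit the bug at proportionally lower per-rank token counts.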
Testing