[Common/PyTorch] bugfix: Token-linear fused RoPE impl. for THD tensors. by plugyawn · Pull Request #3057 · NVIDIA/TransformerEngine

plugyawn · 2026-05-28T20:51:37Z

Description

Adds a token-linear implementation of the existing THD fused RoPE path to remove a launch-scaling bug.

Addresses #2866, which reports that the THD fused RoPE launch scales with freqs_len × n_spans, which is pathological; it should scale with the total token count. I reproduced the issue and found that it causes a noticeable performance drop even on plausibly routine shapes, e.g. the [128/512] and [512/128] cases in the table below.

The new kernel reuses the existing fused_rope_block_forward and fused_rope_block_backward device helpers, so the math doesn't change. All we need to do is add a THD-only path that launches one block per packed token.
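To make the scaling difference concrete, here is a small host-side sketch in Python that counts launched blocks for both grids. It assumes only what this PR states: the old path uses a `freqs_len × nseq` grid and the new path uses one block per packed token. The function names are hypothetical and do not exist in TransformerEngine.

```python
def old_grid_blocks(freqs_len: int, nseq: int) -> int:
    """Blocks in the old THD launch: one per (position, sequence) pair."""
    return freqs_len * nseq


def token_linear_blocks(total_tokens: int) -> int:
    """Blocks in the token-linear launch: one per packed local token."""
    return total_tokens


# Microbenchmark shape from the description: freqs_len = T_local = 65536.
freqs_len = total_tokens = 65536
for nseq in (1, 8, 128, 2401):
    old = old_grid_blocks(freqs_len, nseq)
    new = token_linear_blocks(total_tokens)
    # With total tokens held fixed, the old grid grows linearly in nseq.
    print(f"nseq={nseq:5d}  old={old:>10d}  new={new}  ratio={old // new}x")
```

With the token count held fixed, the old grid grows linearly with the number of packed sequences while the token-linear grid stays constant.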

n_seqs	max span	old layer fwd+bwd (ms)	new layer fwd+bwd (ms)	layer speedup	old paired-RoPE share	new paired-RoPE share
128	512	41.8151	23.0284	1.816x	49.12%	6.14%
512	128	102.1047	23.0167	4.436x	79.38%	6.59%
1024	64	182.9933	23.3783	7.827x	88.36%	6.77%
2401	28	401.0516	24.5668	16.325x	94.40%	6.41%

The slow case is mostly pathological, however, so I've added a condition on the dispatch that avoids the binary search overhead on ordinary shapes, although that overhead appears to be minor. The condition is: use token-linear only when b >= 64 and the old launch would issue ≥ 8× as many blocks as there are tokens. I'm not sure whether this is the usual shape of TE updates, so I could remove it!
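The dispatch condition can be sketched as a plain predicate. This is a Python model, not the actual host code: the function name is hypothetical, and the way the `NVTE_FUSED_ROPE_THD_TOKEN_LINEAR` override (listed under Changes) is parsed is an assumption.

```python
import os


def use_token_linear(freqs_len, nseq, total_tokens, env=None):
    """Model of the dispatch rule: force via env var, else apply the heuristic."""
    env = os.environ if env is None else env
    override = env.get("NVTE_FUSED_ROPE_THD_TOKEN_LINEAR")
    # Assumption: "1" forces the new path, "0" forces the old one.
    if override == "1":
        return True
    if override == "0":
        return False
    # Heuristic: enough sequences AND the old grid is >= 8x the token count.
    return nseq >= 64 and freqs_len * nseq >= 8 * total_tokens


# Microbenchmark shape: freqs_len = total_tokens = 65536.
for nseq in (8, 32, 128, 2401):
    print(nseq, use_token_linear(65536, nseq, 65536, env={}))
```

When `freqs_len` equals the token count, the 8× test reduces to `nseq >= 8`, so in this model the `nseq >= 64` floor is the binding condition.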

Some more relevant tests:
Microbenchmark on H100 (bf16, h=32, d=d2=128, freqs_len=T_local=65536, single GPU):

n_seqs	old fwd+bwd (ms)	new fwd+bwd (ms)	speedup
1	1.2746	1.2734	1.001x
8	1.8860	1.3827	1.364x
32	3.9359	1.4462	2.722x
128	12.1849	1.5024	8.110x
512	44.9411	1.5600	28.808x
1024	89.1110	1.5919	55.977x
2401	208.4182	1.6373	127.296x

Fixes: #2866.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Add token-linear THD fused RoPE forward/backward kernels that launch one CUDA block per packed local token row.
Add NVTE_FUSED_ROPE_THD_TOKEN_LINEAR=0|1.
Reuses existing fused_rope_block_forward and fused_rope_block_backward device helpers.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation <<(none?)>>
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-28T21:02:01Z

Greptile Summary

This PR fixes a launch-scaling bug in the THD fused RoPE path where the old kernel launched freqs_len × nseq blocks, causing pathological over-dispatch when many short sequences are packed together. It adds token-linear forward and backward kernels that launch exactly one CUDA block per packed local token row, using a device-side binary search to locate each block's owning sequence.

Adds fused_rope_thd_token_forward_kernel and fused_rope_thd_token_backward_kernel, each reusing the existing fused_rope_block_forward/fused_rope_block_backward device helpers so the per-token math is bitwise-identical to the old path.
Adds a host-side heuristic fused_rope_thd_use_token_linear (overridable via NVTE_FUSED_ROPE_THD_TOKEN_LINEAR=0|1) that selects the new kernel when freqs_len × nseq ≥ 8 × total_tokens and nseq ≥ 64.
Adds a parity test asserting bitwise equality between old and new paths across diverse cu_seqlens shapes, including zero-length spans and cp_size=2, and two benchmarks that reproduce the speedup numbers from the PR description.
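The sequence lookup described above can be modelled on the host. The sketch below is an assumed Python equivalent of what `fused_rope_thd_find_seq_id` has to compute, namely the last sequence whose local start is at or before the token index; the real device code may be structured differently.

```python
def find_seq_id(cu_seqlens, nseq, t_id, cp_size):
    """Return the largest b in [0, nseq) with cu_seqlens[b] // cp_size <= t_id."""
    lo, hi = 0, nseq - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if cu_seqlens[mid] // cp_size <= t_id:
            lo = mid
        else:
            hi = mid - 1
    return lo


def locate(cu_seqlens, nseq, t_id, cp_size):
    """Recover (b_id, s_id, cur_seqlens) for one packed local token."""
    b_id = find_seq_id(cu_seqlens, nseq, t_id, cp_size)
    start = cu_seqlens[b_id] // cp_size
    end = cu_seqlens[b_id + 1] // cp_size
    return b_id, t_id - start, end - start


# Three sequences, the middle one zero-length, with cp_size = 2.
cu = [0, 8, 8, 20]  # local boundaries are [0, 4, 4, 10]
owners = [find_seq_id(cu, 3, t, 2) for t in range(10)]
print(owners)
```

Taking the last matching start means a zero-length span is never selected as the owner of a token, which is the case the parity test exercises.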

Confidence Score: 4/5

Safe to merge for the common case; two previously-flagged open concerns in the CUDA kernel remain unaddressed and could surface under adversarial or mismatched inputs.

The new token-linear kernels correctly reproduce the original kernel's per-token math for all valid inputs — the binary search is sound, the CP-rank offset formula is identical to the existing path, and the parity test covers key edge cases. The two open issues from prior review rounds (redundant per-thread binary search, missing out-of-range guard) are the main reason not to score higher.

transformer_engine/common/fused_rope/fused_rope.cu — the two new token-linear kernels and the heuristic dispatcher are the only code paths that need a second look.

Important Files Changed

Filename	Overview
transformer_engine/common/fused_rope/fused_rope.cu	Adds two new CUDA kernels (forward + backward) that launch one block per packed local token, a binary-search device helper `fused_rope_thd_find_seq_id`, and a host-side heuristic `fused_rope_thd_use_token_linear`. The math is correct and mirrors the original kernel exactly for all valid blocks; previously-flagged concerns (per-thread redundant binary search, missing t_id guard) are still open.
tests/pytorch/test_fused_rope.py	Adds `test_fused_rope_thd_token_linear_parity` that forces old and new paths back-to-back and asserts bitwise equality on both output and gradient; covers several sequence configs including zero-length spans, cp_size=2, start_positions, and multiple dtypes.
benchmarks/attention/benchmark_rope_thd_token_linear.py	New microbenchmark that sweeps n_seqs while holding total_tokens fixed under three env-var regimes (old/new/heuristic); produces CSV and optional matplotlib plot.
benchmarks/attention/benchmark_rope_thd_full_layer.py	New full TransformerLayer benchmark measuring end-to-end fwd+bwd time and RoPE share across the three dispatch regimes; correctly controls env var inside each timing loop.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fused_rope_forward / fused_rope_backward\n(C++ binding layer)"] --> B["Read total_tokens = input.shape[0]\n(THD only)"]
    B --> C["fused_rope_thd_use_token_linear\n(host heuristic)"]
    C -->|"env=0 OR nseq<64 OR\nfreqs_len×nseq < 8×tokens"| E["Old kernel\ndim3 blocks(freqs_len, nseq)\nblockIdx.x = s_id, blockIdx.y = b_id\nmany dead blocks filtered at runtime"]
    C -->|"env=1 OR\n(nseq≥64 AND freqs_len×nseq ≥ 8×tokens)"| D["Token-linear kernel\ndim3 blocks(total_tokens)\nblockIdx.x = t_id (linear token index)"]
    D --> F["fused_rope_thd_find_seq_id\nbinary search on cu_seqlens\nto recover b_id from t_id"]
    F --> G["Compute s_id = t_id - start\ncur_seqlens, begin_offset\nCP-rank freq offset"]
    G --> H["fused_rope_block_forward/backward\n(same device helper as old path)"]
    E --> H

Reviews (3): Last reviewed commit: "Merge branch 'main' into rope-thd-token-..."

greptile-apps · 2026-05-28T21:02:04Z

+  int t_id = blockIdx.x;
+  int b_id = fused_rope_thd_find_seq_id(cu_seqlens, nseq, t_id, cp_size);


Redundant binary search across all threads in the block

Every thread in the block calls fused_rope_thd_find_seq_id with the same arguments (t_id = blockIdx.x, nseq, cp_size) and produces an identical result. With warps_per_block = 8, that's 256 threads each doing O(log nseq) global-memory reads of cu_seqlens that could be performed once. For nseq=2401 (~12 iterations x 256 threads), each block reads ~3,072 redundant entries from cu_seqlens. Performing the search once in thread 0 and broadcasting the result via shared memory would eliminate that overhead.
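The read-count estimate in this comment can be reproduced with a few lines of Python. This is only the arithmetic, assuming roughly `ceil(log2(nseq))` probes per search and 256 threads per block (8 warps of 32); the helper names are hypothetical.

```python
import math


def search_iterations(nseq: int) -> int:
    """Approximate probe count for a binary search over nseq sequences."""
    return max(1, math.ceil(math.log2(nseq)))


def reads_per_block(nseq: int, searching_threads: int) -> int:
    """cu_seqlens reads per block if this many threads each run the search."""
    return search_iterations(nseq) * searching_threads


every_thread = reads_per_block(2401, 8 * 32)  # all 256 threads search
thread0_only = reads_per_block(2401, 1)       # search once, then broadcast
print(every_thread, thread0_only)
```

Under these assumptions the per-block count drops from a few thousand reads to about a dozen. Whether that is measurable in practice depends on caching, which this sketch does not model.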


greptile-apps · 2026-05-28T21:02:05Z

+  int t_id = blockIdx.x;
+  int b_id = fused_rope_thd_find_seq_id(cu_seqlens, nseq, t_id, cp_size);
+  int start = cu_seqlens[b_id] / cp_size;
+  int end = cu_seqlens[b_id + 1] / cp_size;
+  int s_id = t_id - start;
+  int cur_seqlens = end - start;


No guard for t_id exceeding valid cu_seqlens range

The old kernel explicitly filters dead blocks with if (t_id >= end) return; before any computation. The new kernel does not: it trusts that blockIdx.x < cu_seqlens[nseq]/cp_size because total_tokens is read from input.data.shape[0]. If a caller passes a tensor with shape[0] larger than cu_seqlens[-1]/cp_size, the binary search lands on b_id = nseq-1, computes s_id = t_id - start >= cur_seqlens, and fused_rope_block_forward indexes freqs at an out-of-range s_id_for_freqs. Adding if (t_id >= (int)(cu_seqlens[nseq] / cp_size)) return; after the binary search would restore the safety property the old kernel had.
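The failure mode and the proposed guard can be illustrated with a host-side model. This is a sketch in Python: `bisect_right` stands in for the device-side binary search, and the `guarded` flag is a hypothetical switch added only for the comparison.

```python
from bisect import bisect_right


def locate(cu_seqlens, nseq, t_id, cp_size, guarded):
    """Model of the index math; returns None when the guard rejects the block."""
    starts = [c // cp_size for c in cu_seqlens[:nseq]]
    b_id = bisect_right(starts, t_id) - 1  # last sequence starting at or before t_id
    if guarded and t_id >= cu_seqlens[nseq] // cp_size:
        return None  # dead block: past the last packed token
    start = cu_seqlens[b_id] // cp_size
    end = cu_seqlens[b_id + 1] // cp_size
    return b_id, t_id - start, end - start  # (b_id, s_id, cur_seqlens)


cu = [0, 4, 4, 10]  # 10 valid tokens, nseq = 3, cp_size = 1
# Suppose input.shape[0] were 13, so a block with t_id = 12 gets launched.
b_id, s_id, cur_seqlens = locate(cu, 3, 12, 1, guarded=False)
print(b_id, s_id, cur_seqlens)  # s_id is past the end of the last sequence
print(locate(cu, 3, 12, 1, guarded=True))
```

Without the guard the model lands on the last sequence with `s_id >= cur_seqlens`, which matches the out-of-range access described in the comment. With the guard the block exits before any indexing.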

ptrendx · 2026-05-28T21:40:06Z

@plugyawn Hi, could you sign your commits? See https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst#sign-your-work
Nice improvement :-).

@sudhakarsingh27 Could you take a look?


plugyawn · 2026-05-28T22:18:38Z

Thanks! Signed!

fwiw I think the binary search overhead on normal cases can be reduced also, I'll probably add some improvements.

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 28, 2026

greptile-apps Bot reviewed May 28, 2026


ptrendx assigned sudhakarsingh27 May 28, 2026

plugyawn and others added 3 commits May 29, 2026 03:23

Add token-linear THD fused RoPE path

8c81119

Signed-off-by: plugyawn <progyan.das@iitgn.ac.in>

Add THD RoPE full-layer benchmark

059a2e2

Signed-off-by: plugyawn <progyan.das@iitgn.ac.in>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6c46696

for more information, see https://pre-commit.ci Signed-off-by: plugyawn <progyan.das@iitgn.ac.in>

plugyawn force-pushed the rope-thd-token-linear branch from 331a3a0 to 6c46696 Compare May 28, 2026 21:55

Merge branch 'main' into rope-thd-token-linear

88d56c4

plugyawn wants to merge 4 commits into NVIDIA:main from plugyawn:rope-thd-token-linear
