Skip to content

IFU v2.14.dev0#557

Open
ipanfilo wants to merge 125 commits into
devfrom
IFU-dev-20260315-v2.14
Open

IFU v2.14.dev0#557
ipanfilo wants to merge 125 commits into
devfrom
IFU-dev-20260315-v2.14

Conversation

@ipanfilo
Copy link
Copy Markdown
Collaborator

Description

IFU from 2026-03-15 upstream commit 708d7c1
https://github.com/ROCm/frameworks-internal/issues/16312

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits January 20, 2026 09:14
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* update FE to 1.17

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/jax/test_fused_attn.py

fix typo

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the AI fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax extension call

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix version check in Jax test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix non-/determinism logic and CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix formatting

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce Jax CI tests for determinism

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Implemented persistent nvfp4 kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix FP4 guard in ptx

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per PR review. Added parameter to turn off the persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use the tuned kernel also for the rowwise only quantization

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Macros renaming

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* PoC of the changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Simpler solution and fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Speculative moving the validate_name to the constructor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in observable
way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix cb.CUDAOptions usage for Triton 3.6.0

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
return tokens_per_experts always

Signed-off-by: tdophung <tdophung@nvidia.com>
* Update THD sink attention logic for newer cudnn versions

THD Sink attention is supported in 9.18.0

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update thd sink attention logic for cp>1

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test for thd + sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address comments

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not skip thd cp sink attention test

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable deterministic mode for sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* SWA (left, right) with FusedAttention changes cherry-picked from NVIDIA/TransformerEngine#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test_kv_cache failures

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove dead code and make conditions better

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feedback form Charlene

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing fields

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Use correct block size for workspace in row id map creation, also shard workspace correctly based on 2nd dim of routing_map/row_id map

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

* reduce size of largest test case on single_GPU scenario to fit on L40 and A100 in CI line up

Signed-off-by: tdophung <hanhdp99@gmail.com>

---------

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
Signed-off-by: tdophung <hanhdp99@gmail.com>
Co-authored-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
* Disabled the tuned NVFP4 kernels

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math in cpp tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Expose option for custom op fusions

Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tests for custom ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings and numerical test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak pattern matching logic with fixed window sizes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use TF32 tols in fused op tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Backpropagate fixes from #2622

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(examples): te_llama compatibility with HuggingFace transformers >= 4.57

The te_llama.py example was failing with HuggingFace transformers 4.57+
due to API changes in how decoder layer outputs are handled.

Changes:
- Handle case where hidden_states is passed as a tuple (older HF versions)
- Return tensor directly instead of wrapped in tuple (HF 4.57+ expects this)
- Fix regex pattern to use raw string (fixes SyntaxWarning)

Error fixed:
  AttributeError: 'tuple' object has no attribute 'contiguous'

Tested with:
- transformer_engine 2.5.0
- transformers 4.57.3
- PyTorch container nvcr.io/nvidia/pytorch:25.08-py3

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* docs(te_llama): add requirements.txt

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* fix(docs): add missing notebook output names

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

---------

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
…x 404 error (#2625)

* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename mnist dataset path

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add dataset manifest

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* jjit bug fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix'

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FP8 scale support and fix alignment for grouped GEMM

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: code cleanup and NULL C support

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: per-matrix alpha/beta support

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix alpha/beta numel - use SimpleTensor::numel()

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor: move grouped GEMM to separate file and cleanup API

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/config.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changed

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactored hopper tensor selection

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
fix wheel

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* version change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ifx

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Code drop: Update recipes documentation and remove custom recipes from low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix SVG css import path for diagrams

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix JAX memory usage .out files with correct output

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* responded to comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* applied suggestions form greptile

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* year change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* jax compute capability fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Support building with headers from nvidia wheels

There are two changes:
1. `import nvidia` returns a namespace package with `__file__` equal to `None`
2. Add the way to force headers from nvidia wheels. Without that envvar, it's practically impossible with CUDA installed system-wide.

I successfully built the package with torch using the following `uv` configuration:
```
[tool.uv.extra-build-dependencies]
"transformer-engine-torch" = [
    "ninja",
    "nvidia-cuda-crt==13.0.88",
    "nvidia-cuda-cccl==13.0.85",
    { requirement = "torch", match-runtime = true },
    { requirement = "pytorch-triton", match-runtime = true },
    { requirement = "nvidia-cusolver", match-runtime = true },
    { requirement = "nvidia-curand", match-runtime = true },
    { requirement = "nvidia-cublas", match-runtime = true },
    { requirement = "nvidia-cusparse", match-runtime = true },
    { requirement = "nvidia-cudnn-cu13", match-runtime = true },
    { requirement = "nvidia-nvtx", match-runtime = true },
    { requirement = "nvidia-cuda-nvrtc", match-runtime = true },
    { requirement = "nvidia-cuda-runtime", match-runtime = true },
]
```

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>

* Apply suggestion from @ksivaman

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed scaling-factor computation for FP32 to match the reference implementation.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Uncommented the tuned kernel path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* year update in license

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* Rebased to main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the year to 2026

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added compilation guards

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added BWD pass

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added dbias and dact tests. Refactoring.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added grouped MXFP8 DACT and ACT API and tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed a typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes from the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed requirement for last dim from mod128 to mod32

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added alignment checks when tensor descriptors are modified

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
bucket max_b with more granularity when >512

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* expand troubleshooting docs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* add grad reduce api for cuda graph hook

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* fix code consistency

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Fix the compilation warnings for the PyTorch extension

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
pstjohn and others added 4 commits March 15, 2026 09:13
* fix for async dcp checkpointing

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* Apply suggestions from code review

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>

* Update transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address Greptile review feedback: defensive guards for edge cases

- Add _quantizer None guard in new_empty dispatch
- Replace self.is_cpu with explicit _data/_transpose checks in __reduce_ex__
- Make get_metadata() safe for cleared tensors (both _data and _transpose None)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ped Tensor Swizzling (#2669)

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove changes not needed for bf16

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* keep only pytorch binding for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* linting error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add fast accumulator support

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix the test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Random padding condition shouldnt be done for mxfp8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* all changes for grouped gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Accidental line removal added back. Plus need changes ci t trigger

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* address reviee comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed merged conflict handling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor change

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* forgot adding a or

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve merge conflicts

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address minor review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove unecessary code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* one line that broke everything :(

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unecessary

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert caching changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix minor bug from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review commentsgp

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@ipanfilo ipanfilo requested review from Micky774 and aris134 April 21, 2026 21:17
@ipanfilo ipanfilo added the ci-level 3 CI test level 3 label May 11, 2026
@ipanfilo ipanfilo marked this pull request as ready for review May 14, 2026 14:38
Copy link
Copy Markdown
Contributor

@alextmagro alextmagro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! No comments from me, looks clean.

Comment thread tests/pytorch/distributed/run_fsdp2_fused_adam.py
Comment thread tests/pytorch/test_numerics.py Outdated
)


@pytest.mark.skipif(IS_HIP_EXTENSION, reason="Grouped GEMM is not yet supported in ROCm TE")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: grouped gemm with grouped tensor not yet supported?

Comment thread transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
Comment thread transformer_engine/common/recipe/nvfp4.cu
Comment thread transformer_engine/common/recipe/nvfp4.cu

// first check whether ck can be used, then check aotriton
// TODO: deterministic support on CK
if(nvte_fused_attn_ck && !deterministic && fused_attn_rocm::is_ck_backend_supported(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we recently enable deterministic support for CK? cc: @wangye805 @AllenFarcas

@ipanfilo ipanfilo requested review from Micky774 and wangye805 June 1, 2026 17:28
Copy link
Copy Markdown
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-level 3 CI test level 3

Projects

None yet

Development

Successfully merging this pull request may close these issues.