CUTLASS 4.4.2

Latest

hwu36 released this 17 Mar 14:55

da5e086

CuTe DSL

New features
- CuTe DSL now supports Python 3.14 for both x86_64 and aarch64
- Runtime Pointer/Tensor/FakeTensor now supports cache_key, providing a stable, hashable representation that simplifies and improves compiled function caching.
Bug fixing and improvements
- Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by
  optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
- Fixed a kernel loading race condition in JAX when multiple GPUs are present in the same process.

CUTLASS C++

Enabled Blackwell SM120f compilation of examples and exposed NVFP4/MX Grouped GEMM in the CUTLASS Profiler.
