CuTe DSL
New features
CuTe DSL now supports Python 3.14 on both x86_64 and aarch64.
Runtime Pointer/Tensor/FakeTensor now support cache_key, a stable, hashable representation that simplifies and improves compiled function caching.
Bug fixes and improvements
Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
Fixed a kernel loading race condition when multiple GPUs are present in the same process in JAX.
CUTLASS C++
Enabled Blackwell SM120f compilation of examples and exposed NVFP4/MX Grouped GEMM in the CUTLASS Profiler.