CK JIT integration#582
Conversation
Claude WalkthroughIntent. Wire up an on-demand JIT path for the AITER/CK fused-attention kernels: rather than always building the full kernel matrix from source (or downloading a monolithic prebuilt), the build now produces slim MHA libraries paired with per-kernel blob sources that get compiled lazily at runtime. A new Key changes.
Walkthrough.
Testing. No new tests. The only PR-side test-surface change is the CI shell wrappers ( Notes for reviewers.
Generated by Claude. To request a code review, comment |
| url = https://github.com/Micky774/QoLA.git | ||
| [submodule "3rdparty/ck_jit"] | ||
| path = 3rdparty/ck_jit | ||
| url = https://github.com/ipanfilo/ck_jit.git |
There was a problem hiding this comment.
The ck_jit submodule points at a personal repo (github.com/ipanfilo/ck_jit.git). With CK_JIT being enabled by default (see transformer_engine/common/ck_fused_attn/CMakeLists.txt:73), this becomes a hard dependency on ipanfilo/ck_jit for every default ROCm build — including wheels published from rocm-wheels-build.yml. Recommend moving the repo under the ROCm org (or another organizational account) before this lands, so availability/ownership isn't tied to a single contributor account. The existing personal-repo submodules in this file (HaiShaw/minGPT, floraamd/nanoGPTwTE) are example-only and don't gate the build.
| fmha_bwd_convert_dq_d32_fp16_b64x0_batch_o2_npad_ndeterministic_gfx9.so.0PNcSK | ||
| fmha_bwd_convert_dq_d32_fp16_b64x0_batch_o2_npad_ndeterministic_gfx9.so.HqtomN | ||
| fmha_bwd_convert_dq_d32_fp16_b64x0_batch_o2_npad_ndeterministic_gfx9.so.PHbI9W |
There was a problem hiding this comment.
These three entries (and several more in the file) are clearly transient build-tool tempfiles, not real kernel libraries — the .so.<6-random-chars> suffix is what tools like mv -T / atomic-rename writers leave behind. They got captured into the blob list by accident and will never match an actual prebuild target.
All offending lines I see: 47, 48, 49, 75, 77, 323, 324, 325, 355, 358, 506. Regex \.so\.[A-Za-z0-9]{6}$ flags them all. Suggest regenerating the list with the temp files filtered out (e.g. only accept names matching \.so$).
| if(NOT "$ENV{NVTE_CK_JIT}" STREQUAL "0") | ||
| set(__USE_CK_JIT TRUE) | ||
| else() | ||
| set(__AITER_MHA_PATH "") | ||
| set(__USE_CK_JIT FALSE) | ||
| endif() | ||
| if(DEFINED ENV{AITER_MHA_PATH}) | ||
| message(STATUS "[AITER-BUILD] Using AITER_MHA_PATH=$ENV{AITER_MHA_PATH}") | ||
| # use pre-built libraries and includes from a location specified by the user | ||
| set(__AITER_CACHE_DIR $ENV{AITER_MHA_PATH}) | ||
| elseif(NOT __USE_CK_JIT) #disable for CK_JIT for now | ||
| # use pre-built cache | ||
| include("${CMAKE_CURRENT_LIST_DIR}/aiter_prebuilt.cmake") | ||
| get_prebuilt_aiter(__AITER_MHA_PATH) | ||
| get_prebuilt_aiter(__AITER_CACHE_DIR) | ||
| endif() |
There was a problem hiding this comment.
Two concerns about the default-on behavior:
-
Default-on with a fail-closed signal.
__USE_CK_JITisTRUEwheneverNVTE_CK_JITis anything except literal"0"— unset, empty,"off","false", etc. all enable CK_JIT. If you want default-on, fine, but the checkNOT "$ENV{NVTE_CK_JIT}" STREQUAL "0"makes "opt out" only work forNVTE_CK_JIT=0. A clearer pattern is to use a CMakeoption()plus an env-var override so the choice surfaces in the configure summary and is documented. -
Default-on skips the prebuilt-cache path entirely. The
elseif(NOT __USE_CK_JIT)on line 82 means that with the default settings (no env vars set) we never callget_prebuilt_aiter(), so the existing on-disk cache andNVTE_AITER_PREBUILT_BASE_URLdownload path are bypassed and we always rebuild via CK_JIT. That's a non-trivial CI/dev-loop regression vs. the previous default. If CK_JIT is meant to also honor that cache, this branch should still call intoaiter_prebuilt.cmakefirst; if not, the PR description should call this out as an intentional behavior change. Also:NVTE_CK_JIT/NVTE_CK_JIT_DIRaren't documented anywhere — please add a short note (README or env-var table) since this is the new default.
There was a problem hiding this comment.
Ditto on point 2, I think we should still use the pre-built cache if available as first priority.
|
|
||
| if rocm_build(): | ||
| cmake_flags.append("-DUSE_ROCM=ON") |
There was a problem hiding this comment.
This hunk silently drops the NVTE_AOTRITON_PATH -> -DAOTRITON_PATH=… and NVTE_CK_FUSED_ATTN_PATH -> -DAITER_MHA_PATH=… translations.
Downstream the CMake files now read $ENV{AOTRITON_PATH} and $ENV{AITER_MHA_PATH} directly (see aotriton/CMakeLists.txt:11 and ck_fused_attn/CMakeLists.txt:78), so users have to rename their env vars from NVTE_AOTRITON_PATH / NVTE_CK_FUSED_ATTN_PATH to the bare names. Anyone with existing scripts setting the NVTE_* form will get a silent "ignored" failure — the build will quietly fall back to the default behavior instead of using their prebuilt path.
PR is marked as breaking, but the description doesn't call this rename out. Two reasonable options:
- keep the
NVTE_*aliases here as a thin shim (a couple of lines) so old usage still works; or - explicitly document the env-var rename in the PR description / release notes.
There was a problem hiding this comment.
Is this an intentional part of this change? Seems unrelated?
| pip list | egrep "flax|fidle|jax|ml_dtypes|numpy|transformer_e|typing_ext" | ||
| #check_test_jobs_requested | ||
| #test $? -eq 0 && init_test_jobs `python -c "import jax; print(len([d for d in jax.devices() if 'rocm' in d.client.platform_version]))"` | ||
| ck_jit_prebuild build |
There was a problem hiding this comment.
ck_jit_prebuild can return non-zero from several paths (missing blob list, missing prebuild script, can't resolve TE install dir — see ci/_utils.sh:280-300), but neither this caller nor the matching one in ci/pytorch.sh:151 checks the return. A silent prebuild failure here means every test config that follows still runs, but JIT-compiles each kernel on the first call — looks like the CI succeeded while actually undoing the whole point of the prebuild step.
Suggest making this fail-fast, e.g. ck_jit_prebuild build || return 1 (or || exit 1 depending on shell context), so a broken prebuild surfaces as a CI failure rather than as a quiet slowdown.
|
|
||
|
|
||
| set(AITER_MHA_INSTALL_PREFIX "transformer_engine" CACHE STRING "aiter mha shared lib install prefix in TE") | ||
| set(AITER_MHA_INSTALL_DIR "${CMAKE_INSTALL_PREFIX}/transformer_engine/lib") |
There was a problem hiding this comment.
Minor: this replaces the previous AITER_MHA_INSTALL_PREFIX CACHE STRING with a hardcoded path. The old form let a downstream consumer override the install layout with -DAITER_MHA_INSTALL_PREFIX=…; the new form bakes transformer_engine/lib in. Probably no one was actually using the override, but if you don't need configurability the CACHE STRING doc-string was worth keeping for grep-ability. Not blocking.
Review summaryReviewed the CK JIT integration changes (10 files, ~666/-36 LoC; merge-base 6855218). Scope: new Verdict: changes look generally on-track for adding CK_JIT as an additional build path, but a few items should be addressed before merge — see inline comments. Highlights:
Copyright headers: OK — all modified/added files carry an AMD copyright line with year ending in 2026; no NVIDIA copyright was altered. This is the first Claude review on this PR; no prior Claude or human review comments to dedupe against. |
wangye805
left a comment
There was a problem hiding this comment.
LGTM. Actually this is a very clever and non-trivial design. Do you think we can merge this into CK repo?
Two small issues:
1). For the multi-process scenario like pytorch, your mktemp + mv -n cache write pattern relies on rename(2) being atomic, which is only guaranteed on a single local filesystem — if $CK_JIT_CACHE_DIR ever lands on NFS or another shared FS, readers can dlopen a partially-written .so. Worth documenting that the cache dir must be local.
2). cache invalidation note: stems are content-agnostic, so a CK/aiter source change without a template-signature change silently reuses the stale .so — worth a README line telling users to run ck_jit_prebuild.py clean --all after any CK/aiter version bump. Or run ck_jit_prebuild.py clean during each pip install?
Micky774
left a comment
There was a problem hiding this comment.
Overall looks good, just a couple points
|
|
||
| if rocm_build(): | ||
| cmake_flags.append("-DUSE_ROCM=ON") |
There was a problem hiding this comment.
Is this an intentional part of this change? Seems unrelated?
| if(NOT "$ENV{NVTE_CK_JIT}" STREQUAL "0") | ||
| set(__USE_CK_JIT TRUE) | ||
| else() | ||
| set(__AITER_MHA_PATH "") | ||
| set(__USE_CK_JIT FALSE) | ||
| endif() | ||
| if(DEFINED ENV{AITER_MHA_PATH}) | ||
| message(STATUS "[AITER-BUILD] Using AITER_MHA_PATH=$ENV{AITER_MHA_PATH}") | ||
| # use pre-built libraries and includes from a location specified by the user | ||
| set(__AITER_CACHE_DIR $ENV{AITER_MHA_PATH}) | ||
| elseif(NOT __USE_CK_JIT) #disable for CK_JIT for now | ||
| # use pre-built cache | ||
| include("${CMAKE_CURRENT_LIST_DIR}/aiter_prebuilt.cmake") | ||
| get_prebuilt_aiter(__AITER_MHA_PATH) | ||
| get_prebuilt_aiter(__AITER_CACHE_DIR) | ||
| endif() |
There was a problem hiding this comment.
Ditto on point 2, I think we should still use the pre-built cache if available as first priority.
Description
Add CK JIT demo integration that builds slim MHA libraries accompanied by blob sources to compile on demand
Fixes https://github.com/ROCm/frameworks-internal/issues/16406
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: