Update QoLA/AITER #599
COMMAND sh -c
"tmp=$(mktemp /tmp/gitconfig.XXXXXX) || exit 1; \
GIT_CONFIG_GLOBAL=$tmp git config --global --add safe.directory '*' >/dev/null 2>&1; \
GIT_CONFIG_GLOBAL=$tmp PYTHONPATH=\"${__QOLA_DIR}:$PYTHONPATH\" '${Python_EXECUTABLE}' -m qola.cli checkout \
Wouldn't it make sense to integrate the safe.directory override into qola? The dubious-ownership pattern is probably not TE-specific.
I worry that such behavior is a bit too authoritative for qola, if that makes sense. My reasoning is that the permission scope here lies outside qola, so qola should not be the one in charge of it. I'm open to reconsidering, but that is my initial position.
That is a valid concern, bearing in mind that QoLA is intended to be reused by different components. Maybe make this behavior controllable, then. It is OK to keep this in TE; it will just require doing the override twice: here and when the build is called. BTW, there are comments there but no actual code change.
# commit QoLA will actually check out and build, not whatever happens to be
# the submodule's current HEAD at configure time.
set(__QOLA_MANIFEST "${CMAKE_CURRENT_LIST_DIR}/qola_manifest.toml")
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${__QOLA_MANIFEST}")
AITER_SHA is not a cached variable. Why is CMAKE_CONFIGURE_DEPENDS needed?
if(NOT AITER_CHECKOUT_RESULT EQUAL 0)
message(FATAL_ERROR
"Failed to sync AITER source tree at ${__AITER_SOURCE_DIR} to "
"manifest-pinned commit ${AITER_SHA}.\n"
Should it also validate that the actual commit matches the one detected by prebuilt.cmake? If QoLA checks out AITER unconditionally, maybe keep prebuilt.cmake as-is and where it is now? I.e., QoLA fetches AITER, and prebuilt.cmake checks the git commit as before. It would only lose the AITER_SHA value in this error message.
return nullptr;
}
void* ptr = nullptr;
if(hipMallocAsync(&ptr, bytes, stream) != hipSuccess){
Hmm, if we let the HIP runtime allocate and manage our buffers, this will create a series of issues. For example, if PyTorch or JAX users pre-allocate 97% of HBM, our hipMallocAsync will return out of memory.
If what the new aiter needs for fmha_args.workspace_alloc is a lambda, can we fake it to return a JAX/PyTorch-generated workspace buffer pointer?
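To make the suggestion concrete, here is a minimal sketch of such a "fake" allocator. It assumes the callback has the shape `void*(size_t)`; the real signature of `fmha_args.workspace_alloc` is not shown in this thread and may differ (for example, it may also take a stream). `WorkspaceAlloc` and `make_workspace_alloc` are hypothetical names, and the host-side pointer arithmetic stands in for a device workspace pointer handed over by PyTorch or JAX.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical callback type; the actual aiter signature is an assumption.
using WorkspaceAlloc = std::function<void*(std::size_t bytes)>;

// Returns an allocator that carves aligned sub-buffers out of a workspace the
// framework already allocated. It never calls the HIP runtime, and it returns
// nullptr when a request does not fit in the remaining space.
inline WorkspaceAlloc make_workspace_alloc(void* base, std::size_t capacity,
                                           std::size_t alignment = 256) {
  // The alignment must be a non-zero power of two for the mask trick below.
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  // Shared so that copies of the std::function advance the same cursor.
  auto offset = std::make_shared<std::size_t>(0);
  return [base, capacity, alignment, offset](std::size_t bytes) -> void* {
    const auto start = reinterpret_cast<std::uintptr_t>(base) + *offset;
    const auto mask = static_cast<std::uintptr_t>(alignment) - 1;
    const auto aligned = (start + mask) & ~mask;
    const std::size_t padding = static_cast<std::size_t>(aligned - start);
    const std::size_t remaining = capacity - *offset;
    if (padding > remaining || bytes > remaining - padding) {
      return nullptr;  // workspace exhausted; the caller must handle this
    }
    *offset += padding + bytes;
    return reinterpret_cast<void*>(aligned);
  };
}
```

With this shape, TE would size the workspace up front, let the framework allocate it, and pass `make_workspace_alloc(ptr, size)` in place of the `hipMallocAsync` lambda. Whether aiter exposes the required size ahead of the call is an open question for this PR.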
// callback thread, which holds runtime locks — calling any HIP API from it
// (including hipHostFree) deadlocks against concurrent main-thread HIP
// calls. Defer the free to ck_tile::pinned_host_releaser's worker thread.
fmha_args.pinned_host_alloc = [](size_t bytes) -> std::shared_ptr<void> {
Hmm, why do they need host memory allocated? Is it for inference?
Again, can we fake the lambda to use a PyTorch/JAX-generated workspace?
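For the host-side callback, a rough sketch of the same idea follows. The signature `(size_t bytes) -> std::shared_ptr<void>` is taken from the diff; `PinnedHostAlloc` and `make_borrowed_host_alloc` are hypothetical names, and the sketch assumes the framework can supply a pinned host buffer that outlives the kernel launch. Because the deleter is a no-op, releasing the handle never touches a HIP API, which would also sidestep the callback-thread deadlock described in the comment above.

```cpp
#include <cstddef>
#include <functional>
#include <memory>

// Signature as it appears in the diff: (size_t bytes) -> std::shared_ptr<void>.
using PinnedHostAlloc = std::function<std::shared_ptr<void>(std::size_t)>;

// Hypothetical helper: lends out a caller-owned (framework-owned) host buffer
// instead of allocating a new one. Assumes at most one allocation is live at
// a time, since every call returns the same buffer.
inline PinnedHostAlloc make_borrowed_host_alloc(void* host_buf,
                                                std::size_t capacity) {
  return [host_buf, capacity](std::size_t bytes) -> std::shared_ptr<void> {
    if (host_buf == nullptr || bytes > capacity) {
      return nullptr;  // request does not fit in the borrowed buffer
    }
    // No-op deleter: ownership stays with the framework, so dropping the
    // last reference frees nothing and calls no runtime API.
    return std::shared_ptr<void>(host_buf, [](void*) {});
  };
}
```

The single-live-allocation assumption is the main limitation: if aiter requests several pinned buffers per call, this would need the same offset-tracking approach as a sub-allocating workspace.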
Description
Updates QoLA and moves the pinned AITER commit forward.