[UR][L0v2] Fix double zeCommandListClose() in batched queue flush #21660
ldorau wants to merge 2 commits into intel:sycl
Conversation
Please review @pbalcer @EuphoricThinking
Pull request overview
Fixes a Level Zero v2 batched-queue flush/renew corner case where reaching the submitted-batch slot limit could re-enter submission and call zeCommandListClose()/submit twice on the same command list, potentially hanging in zeCommandListHostSynchronize() on some driver versions.
Changes:
- Update `ur_queue_batched_t::renewBatchUnlocked()` to avoid delegating to `queueFinishUnlocked()` when the batch-slot limit is reached; instead directly synchronize, clean pools, and `batchFinish()` to reset state without re-submitting.
- Clarify `batch_manager::batchFinish()` comments to reflect that the active batch may already have been submitted via `queueFlushUnlocked()`.
- Unskip batched-queue execution in the multi-queue USM alloc test to allow exercising this path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| unified-runtime/source/adapters/level_zero/v2/queue_batched.cpp | Prevents double close/double submit on slot-limit renew by changing the synchronization/reset sequence. |
| unified-runtime/test/adapters/level_zero/enqueue_alloc.cpp | Removes batched-queue skips so the multi-queue USM alloc test can run under batched submission. |
Force-pushed ce528a7 to 111db9f
The "E2E (Preview Mode, ["Linux", "gen12"])" CI job failed because of #21023
Good catch. I would also say that we could replace the content of … This way, we don't scatter the implementation of …
Force-pushed 111db9f to 27761c6
Thanks! Fixed.
Please review @EuphoricThinking @pbalcer |
Please review @pbalcer @EuphoricThinking @intel/unified-runtime-reviewers-level-zero |
Force-pushed 27761c6 to 1e4274b
My bad, I was thinking of replacing … I mean: …
Force-pushed 1e4274b to 6d299f3
@EuphoricThinking Fixed. Re-review please. |
Root cause
----------
ur_queue_batched_t::queueFlushUnlocked is called after every
enqueueUSMHostAllocExp / enqueueUSMSharedAllocExp / enqueueUSMDeviceAllocExp
/ enqueueUSMFreeExp to eagerly submit the current batch. The call sequence
is:
queueFlushUnlocked
    enqueueCurrentBatchUnlocked()       <- (1) zeCommandListClose
                                           + zeCommandListImmediateAppendCommandListsExp
    renewBatchUnlocked()
        if runBatches.size() >= initialSlotsForBatches (10):
            queueFinishUnlocked()
                if !isActiveBatchEmpty():           <- true: enqueuedOperationsCounter > 0
                    enqueueCurrentBatchUnlocked()   <- (2) BUG: double zeCommandListClose
                                                       + double submit
                hostSynchronize()                   <- may hang on some driver versions
enqueuedOperationsCounter is incremented by markIssuedCommandInBatch before
queueFlushUnlocked is called, but it is not cleared by
enqueueCurrentBatchUnlocked, so queueFinishUnlocked's !isActiveBatchEmpty()
guard does not protect against the re-entry.
With 2 queues and 256 iterations the bug fires ~92 times. On certain GPU
driver versions the immediate command list enters a state from which
zeCommandListHostSynchronize never returns, hanging the test:
UR_ADAPTERS_FORCE_LOAD=lib/libur_adapter_level_zero_v2.so \
UR_L0_V2_FORCE_BATCHED=1 \
./test/adapters/level_zero/enqueue_alloc-test \
--gtest_filter=*urL0EnqueueAllocMultiQueueSameDeviceTest.SuccessMt*
Fix
---
Remove the pre-close of the active batch from queueFlushUnlocked and move
enqueueCurrentBatchUnlocked() into renewBatchUnlocked's else branch.
This ensures that when the batch-slot limit is reached the active batch is
still open, so the existing delegation to queueFinishUnlocked closes and
submits it exactly once via its !isActiveBatchEmpty() guard:
renewBatchUnlocked()
    if runBatches.size() >= initialSlotsForBatches (10):
        queueFinishUnlocked()
            if !isActiveBatchEmpty():           <- closes + submits exactly once
                enqueueCurrentBatchUnlocked()
            hostSynchronize()
            queueFinishPoolsUnlocked()
            batchFinish()
    else:
        enqueueCurrentBatchUnlocked()           <- normal path
        renewRegularUnlocked()
This keeps all finish logic inside queueFinishUnlocked, making the code
easier to maintain and less prone to bugs if queueFinishUnlocked changes.
With queueFlushUnlocked reduced to a single call to renewBatchUnlocked,
the wrapper is no longer needed. Remove it and call renewBatchUnlocked
directly at all former call sites.
Tested-by:
UR_ADAPTERS_FORCE_LOAD=lib/libur_adapter_level_zero_v2.so \
UR_L0_V2_FORCE_BATCHED=1 \
./test/adapters/level_zero/enqueue_alloc-test
(81 passed, 6 skipped, 0 failed)
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Force-pushed 6d299f3 to d421dc7
Force-pushed d421dc7 to 7507428
Force-pushed 7507428 to c7c9a6f
… all queue types

Enable the urL0EnqueueAllocMultiQueueSameDeviceTest and parameterize it over all queue submission modes (UR_QUEUE_FLAG_SUBMISSION_BATCHED and UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE) by:
- Removing SKIP_IF_BATCHED_QUEUE to enable the test for batched queues.
- Changing the base class template parameter from EnqueueAllocMultiQueueTestParam to uur::MultiQueueParam<EnqueueAllocMultiQueueTestParam> so that the queue mode becomes part of the test parameter.
- Adding getMultiQueueParam(), getAllocParam() and getQueueFlags() helpers to the fixture for clean access to the two parts of the parameter tuple. getMultiQueueParam() calls the base class getter using the fully qualified name uur::urContextTestWithParam<...>::getParam() to avoid ambiguity. The names are specific enough not to be confused with GoogleTest's conventional GetParam().
- Creating queues with the parameterized flag via ur_queue_properties_t instead of a hardcoded UR_QUEUE_FLAG_SUBMISSION_BATCHED.
- Switching the test suite macro from UUR_DEVICE_TEST_SUITE_WITH_PARAM to UUR_MULTI_QUEUE_TYPE_TEST_SUITE_WITH_PARAM and the printer to deviceTestWithParamPrinterMulti, which expands the suite to cover both queue modes automatically.
- Updating all three test bodies (SuccessMt, SuccessReuse, SuccessDependantMt) to use getAllocParam() instead of std::get<1>(this->GetParam()), and restoring the numQueues parameter in SuccessMt to getAllocParam().numQueues.

This ensures both batched and immediate queues are covered by default test runs without requiring UR_L0_V2_FORCE_BATCHED=1.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
  auto batchLocked = currentCmdLists.lock();
  if (batchLocked->isCurrentGeneration(batch_generation)) {
-   return queueFlushUnlocked(batchLocked);
+   return renewBatchUnlocked(batchLocked);
Why would we like to rename queueFlushUnlocked? I think it is consistent with the Func() and FuncUnlocked() naming convention.
We do not rename it. We replace all calls to queueFlushUnlocked() with a call to renewBatchUnlocked().