[CUDA Plugin EP] Add provider options: user_compute_stream, do_copy_in_default_stream, use_ep_level_unified_stream, external allocator #28603
Open
tianleiwu wants to merge 2 commits into
Pull request overview
Brings the CUDA plugin execution provider closer to parity with the bundled CUDA EP by wiring additional OrtCUDAProviderOptionsV2-style stream/copy/unified-stream and external allocator options through the plugin EP’s session config parsing, stream creation, and kernel-adapter runtime config.
Changes:
- Add new CUDA plugin EP config fields and plumb `do_copy_in_default_stream` into the kernel adapter runtime config (plus an accessor on `CUDAExecutionProvider`).
- Support wrapping a user-provided CUDA compute stream by creating per-stream cuBLAS/cuDNN/cuBLASLt handles (`InitHandlesWithUserStream`).
- Add an external device allocator implementation and factory plumbing, plus a new unit test covering `do_copy_in_default_stream`.
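The second change above wraps a caller-owned stream while the wrapper owns only the library handles it creates. A minimal sketch of that ownership split follows. It assumes hypothetical stand-in types (`FakeStream`, `FakeHandle`, `SyncStreamSketch`) instead of `cudaStream_t` and the real cuBLAS/cuDNN/cuBLASLt handles, so it illustrates the pattern rather than the PR's implementation.

```cpp
// Hypothetical stand-ins: a real build would use cudaStream_t and the
// cuBLAS/cuDNN/cuBLASLt handle types with their create/destroy calls.
struct FakeStream {
  bool destroyed = false;
};
struct FakeHandle {
  FakeStream* bound_stream;
};

static int g_live_handles = 0;  // number of handles currently alive

FakeHandle* CreateHandle(FakeStream* stream) {
  ++g_live_handles;
  return new FakeHandle{stream};
}
void DestroyHandle(FakeHandle* handle) {
  --g_live_handles;
  delete handle;
}

// Owns its handles; only borrows a stream supplied by the user.
class SyncStreamSketch {
 public:
  SyncStreamSketch() = default;
  SyncStreamSketch(const SyncStreamSketch&) = delete;
  SyncStreamSketch& operator=(const SyncStreamSketch&) = delete;

  // Binds new handles to a caller-owned stream. Assumed to be called at most once.
  void InitHandlesWithUserStream(FakeStream* user_stream) {
    stream_ = user_stream;
    owns_stream_ = false;  // the caller keeps responsibility for the stream
    blas_ = CreateHandle(stream_);
    dnn_ = CreateHandle(stream_);
  }

  ~SyncStreamSketch() {
    if (blas_ != nullptr) DestroyHandle(blas_);
    if (dnn_ != nullptr) DestroyHandle(dnn_);
    // An owning init path (not shown) would set owns_stream_ and destroy the stream here.
    if (owns_stream_ && stream_ != nullptr) stream_->destroyed = true;
  }

  FakeStream* stream() const { return stream_; }

 private:
  FakeStream* stream_ = nullptr;
  bool owns_stream_ = false;
  FakeHandle* blas_ = nullptr;
  FakeHandle* dnn_ = nullptr;
};
```

Destroying the wrapper releases both handles and leaves the user's stream untouched, which is the property the wrapper needs for interop with a framework that manages its own streams.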
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/framework/dynamic_plugin_ep_test.cc | Updates adapter-config access pattern and adds a test for do_copy_in_default_stream exposure. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h | Declares CudaSyncStream::InitHandlesWithUserStream for user stream scenarios. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc | Implements handle creation/binding for a user-provided CUDA stream. |
| onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h | Adds do_copy_in_default_stream to runtime config and exposes it via CUDAExecutionProvider. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.h | Extends CudaEp::Config with stream/unified-stream and external allocator fields. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | Passes do_copy_in_default_stream into adapter config; uses user stream wrapper; disables concurrent runs when unified stream is active. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h | Extends DeviceCacheEntry with external allocator state/refcount and helper. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | Parses new options (incl. pointer parsing), validates combinations, and creates/releases external allocator from the device cache. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h | Declares CudaExternalDeviceAllocator using user-provided function pointers. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc | Implements external allocator alloc/free/reserve behavior (currently without empty_cache semantics). |
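The allocator rows above describe an allocator that forwards to user-provided function pointers. The sketch below shows one way such delegation can look. The callback signatures, the class name `ExternalAllocatorSketch`, and the host-memory callbacks are assumptions for illustration; the PR's actual signatures are not shown in this summary. As in the last row, `empty_cache` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Assumed callback shapes (hypothetical, for illustration only).
using ExternalAllocFn = void* (*)(std::size_t);
using ExternalFreeFn = void (*)(void*);

// Forwards every request to the user's callbacks; no arena sits in front of them.
class ExternalAllocatorSketch {
 public:
  ExternalAllocatorSketch(ExternalAllocFn alloc_fn, ExternalFreeFn free_fn)
      : alloc_fn_(alloc_fn), free_fn_(free_fn) {
    assert(alloc_fn_ != nullptr && free_fn_ != nullptr);  // both callbacks are required
  }

  void* Alloc(std::size_t size) { return size == 0 ? nullptr : alloc_fn_(size); }

  void Free(void* p) {
    if (p != nullptr) free_fn_(p);
  }

  // Without an arena, a reservation is just another allocation.
  void* Reserve(std::size_t size) { return Alloc(size); }

 private:
  ExternalAllocFn alloc_fn_;
  ExternalFreeFn free_fn_;
};

// Host-memory callbacks standing in for a framework's GPU caching allocator.
static int g_outstanding = 0;
void* HostAlloc(std::size_t n) {
  ++g_outstanding;
  return std::malloc(n);
}
void HostFree(void* p) {
  --g_outstanding;
  std::free(p);
}
```

Constructing `ExternalAllocatorSketch(&HostAlloc, &HostFree)` routes every `Alloc`, `Reserve`, and `Free` call through the two callbacks, so the external owner sees all memory traffic.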
Comments suppressed due to low confidence (3)
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:577
`has_user_compute_stream` can remain true even when `user_compute_stream` is null (e.g., the user sets the flag but omits the pointer, or the pointer fails to parse). This diverges from the bundled CUDA EP behavior (which derives `has_user_compute_stream` from whether the pointer is non-null) and can incorrectly force unified-stream mode and trigger the mutual-exclusion errors with CUDA graph/external allocator. Suggest setting `config.has_user_compute_stream = (config.user_compute_stream != nullptr)` after parsing, and/or returning a clear error when the flag is set but the pointer is missing.
// If user_compute_stream is provided, force has_user_compute_stream to true.
if (config.user_compute_stream != nullptr) {
config.has_user_compute_stream = true;
}
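One way to combine both suggestions is sketched below: reject a flag that has no pointer behind it, then derive the flag from the pointer. `StreamOptions` and `NormalizeUserStream` are hypothetical names, and returning an error string stands in for whatever status type the factory actually uses.

```cpp
#include <string>

// Hypothetical stand-in for the stream-related fields of the plugin EP config.
struct StreamOptions {
  bool has_user_compute_stream = false;
  void* user_compute_stream = nullptr;
};

// Returns an empty string on success, or an error message when the flag
// was set without a usable stream pointer.
std::string NormalizeUserStream(StreamOptions& options) {
  if (options.has_user_compute_stream && options.user_compute_stream == nullptr) {
    return "has_user_compute_stream is set but user_compute_stream is null";
  }
  // Derive the flag from the pointer so the two can never disagree.
  options.has_user_compute_stream = (options.user_compute_stream != nullptr);
  return std::string();
}
```

After this step, later checks such as unified-stream forcing or the CUDA graph exclusion can rely on the flag alone.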
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:633
External allocator callbacks are persisted in `DeviceCacheEntry` and never cleared. This means one session that sets `gpu_external_alloc/free` can cause subsequent sessions on the same device (that do not configure an external allocator) to still use the external allocator, since `CreateAllocatorImpl` only checks `entry->UseExternalAllocator()`. It also makes the callbacks effectively global across all sessions and can break per-session semantics and concurrency assumptions. Consider scoping external allocator config to the EP/session (e.g., include the callback tuple in the cache key, or store per-session state and ensure non-configured sessions don’t inherit prior callbacks).
// Store external allocator info in the device cache entry so CreateAllocatorImpl can use it.
if (config.external_alloc != nullptr && config.external_free != nullptr) {
std::lock_guard<std::mutex> lock(factory->device_cache_mutex_);
auto* entry = factory->FindDeviceCacheEntryByOrdinalLocked(config.device_id);
if (entry) {
std::lock_guard<std::mutex> arena_lock(entry->arena_mutex);
entry->external_alloc = config.external_alloc;
entry->external_free = config.external_free;
entry->external_empty_cache = config.external_empty_cache;
}
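A rough sketch of the "callback tuple as part of the cache key" option follows. The allocator is chosen from the requesting session's own callbacks, so a session that configures nothing always gets the default path. All names here (`ExternalCallbacks`, `AllocatorSlot`, `DeviceEntrySketch`) are hypothetical, and locking is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);

// Hypothetical per-session callback pair.
struct ExternalCallbacks {
  AllocFn alloc_fn = nullptr;
  FreeFn free_fn = nullptr;
  bool Configured() const { return alloc_fn != nullptr && free_fn != nullptr; }
  bool operator==(const ExternalCallbacks& other) const {
    return alloc_fn == other.alloc_fn && free_fn == other.free_fn;
  }
};

// Placeholder for an allocator object; records which callbacks it was built with.
struct AllocatorSlot {
  ExternalCallbacks callbacks;
  bool external;
};

class DeviceEntrySketch {
 public:
  DeviceEntrySketch() : default_slot_{ExternalCallbacks(), false} {}

  // Looks up by the session's callbacks instead of by state left in the entry.
  AllocatorSlot* GetAllocator(const ExternalCallbacks& session_callbacks) {
    if (!session_callbacks.Configured()) return &default_slot_;
    for (auto& slot : external_slots_) {
      if (slot->callbacks == session_callbacks) return slot.get();
    }
    external_slots_.push_back(
        std::unique_ptr<AllocatorSlot>(new AllocatorSlot{session_callbacks, true}));
    return external_slots_.back().get();
  }

 private:
  AllocatorSlot default_slot_;
  std::vector<std::unique_ptr<AllocatorSlot>> external_slots_;
};

// Sample callbacks used only to exercise the sketch.
void* SampleAlloc(std::size_t n) { return std::malloc(n); }
void SampleFree(void* p) { std::free(p); }
```

Two sessions that pass the same callback pair share one slot, while a session with default-constructed `ExternalCallbacks` never sees another session's callbacks.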
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:704
`external_device_allocator` is created once and then reused without validating that `entry->external_alloc/free/empty_cache` still match the allocator’s stored callbacks. If a later session updates the callback pointers, `CreateAllocatorImpl` will keep returning the allocator constructed with the old function pointers, leading to calling stale/wrong callbacks. If callbacks are meant to be configurable, recreate the allocator when the callback tuple changes (or reject changes once an allocator exists).
// If external allocator function pointers are configured, use those directly
// (no arena, no mempool — the external allocator manages its own caching).
if (entry->UseExternalAllocator()) {
if (!entry->external_device_allocator) {
entry->external_device_allocator = std::make_unique<CudaExternalDeviceAllocator>(
memory_info, req_device_id,
entry->external_alloc, entry->external_free, entry->external_empty_cache);
}
++entry->num_external_allocator_users;
*allocator = entry->external_device_allocator.get();
return nullptr;
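The two remedies in this comment can be combined: rebuild the cached allocator when no one is using it, and reject a mismatched request while it is in use. The sketch below assumes hypothetical types (`CallbackSet`, `CachedAllocator`, `CacheEntrySketch`) and an error string in place of the real status type; locking is omitted.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <string>

using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);

struct CallbackSet {
  AllocFn alloc_fn;
  FreeFn free_fn;
  bool operator==(const CallbackSet& other) const {
    return alloc_fn == other.alloc_fn && free_fn == other.free_fn;
  }
};

// Placeholder allocator that remembers the callbacks it was constructed with.
struct CachedAllocator {
  CallbackSet callbacks;
};

struct CacheEntrySketch {
  std::unique_ptr<CachedAllocator> allocator;
  int num_users = 0;
};

// Returns an empty string and sets *out on success, or an error message on conflict.
std::string AcquireExternalAllocator(CacheEntrySketch& entry, const CallbackSet& requested,
                                     CachedAllocator** out) {
  if (entry.allocator && !(entry.allocator->callbacks == requested)) {
    if (entry.num_users > 0) {
      return "external allocator callbacks differ from those already in use on this device";
    }
    entry.allocator.reset();  // no users left, so rebuilding is safe
  }
  if (!entry.allocator) {
    entry.allocator.reset(new CachedAllocator{requested});
  }
  ++entry.num_users;
  *out = entry.allocator.get();
  return std::string();
}

void ReleaseExternalAllocator(CacheEntrySketch& entry) {
  if (entry.num_users > 0) --entry.num_users;
}

// Sample callbacks with distinct bodies, used only to exercise the sketch.
void* AllocA(std::size_t n) { return std::malloc(n); }
void* AllocB(std::size_t n) { return std::calloc(1, n); }
void FreeAny(void* p) { std::free(p); }
```

With this shape, a second session that passes different callbacks gets an explicit error while the first allocator is live, instead of silently receiving an allocator bound to stale pointers.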
Description
Wires up the remaining provider options from `OrtCUDAProviderOptionsV2` that were previously missing in the CUDA plugin EP (`CudaEpFactory::CreateEpImpl`). This closes the "Provider Options Gaps" section (§2) of the plugin EP tracking doc for the stream/allocator options that are not blocked on the tunable ops framework.
The bundled CUDA EP already supports these through `OrtCUDAProviderOptionsV2`. This PR brings parity to the plugin EP so that users can configure external streams, copy behavior, unified stream mode, and external GPU allocators via session options.
Summary of Changes
Stream and copy options
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/plugin/cuda_ep.h | Add `has_user_compute_stream`, `user_compute_stream`, `do_copy_in_default_stream`, `use_ep_level_unified_stream`, `external_alloc`, `external_free`, `external_empty_cache` to `Config`. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | Parse the new options; validate mutual exclusion of `user_compute_stream` + `external_allocator` and of `user_compute_stream` + `enable_cuda_graph`. Auto-force unified stream when a user stream or external allocator is configured. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | Wrap `user_compute_stream` in `CreateSyncStreamForDeviceImpl` via the new `InitHandlesWithUserStream`. Disable concurrent runs when unified stream is active. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc/h | Add `CudaSyncStream::InitHandlesWithUserStream()`: wraps the user CUDA stream with owned cuBLAS/cuDNN/cuBLASLt handles without taking ownership of the stream itself. |
External allocator
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc/h | Add `CudaExternalDeviceAllocator` class delegating alloc/free/empty_cache to user-provided function pointers. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h | Extend `DeviceCacheEntry` with external allocator state, refcount, and `UseExternalAllocator()` helper. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | `CreateAllocatorImpl` creates `CudaExternalDeviceAllocator` when external pointers are configured; `ReleaseAllocatorImpl` handles its refcount and cleanup. |
Kernel adapter plumbing
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h | Add `do_copy_in_default_stream` to `CudaKernelAdapterRuntimeConfig`; add `CUDAExecutionProvider::DoCopyOnDefaultStream()`. |
Tests
| File | Description |
|---|---|
| onnxruntime/test/framework/dynamic_plugin_ep_test.cc | Add `CudaKernelAdapterRuntimeConfigExposesDoCopyInDefaultStream`; fix shared_ptr access pattern in existing tests (`auto&` → `auto`, `.` → `->`). |
Testing
- `*Plugin*` gtest cases pass (4 `DynamicPluginEpInfraTest` + 29 `PluginExecutionProviderTest`).
- The plugin `.so` builds clean with no warnings.
- Tests under the `#if defined(ORT_USE_EP_API_ADAPTERS)` guard validate the kernel adapter plumbing when built as a plugin library; they are intentionally excluded from the monolithic test binary (known pre-existing limitation due to the `ORT_API_MANUAL_INIT` header requirement).
Motivation and Context
The CUDA plugin EP is a standalone shared library EP using the `OrtEpFactory` API. It was missing support for several `OrtCUDAProviderOptionsV2` options that users rely on in production: external CUDA streams (for framework interop with PyTorch/TensorFlow), copy-stream behavior, unified stream mode, and external GPU memory allocators (for memory pool sharing across frameworks). This PR brings the plugin EP to parity with the bundled EP for these options.
Checklist