[CUDA Plugin EP] Add provider options: user_compute_stream, do_copy_in_default_stream, use_ep_level_unified_stream, external allocator #28603

Open: tianleiwu wants to merge 2 commits into main from tlwu/20260520/cuda_plugin_options

@tianleiwu (Contributor)

Description

Wires up the remaining provider options from OrtCUDAProviderOptionsV2 that were previously missing in the CUDA plugin EP (CudaEpFactory::CreateEpImpl). This closes the "Provider Options Gaps" section (§2) of the plugin EP tracking doc for the stream/allocator options that are not blocked on the tunable ops framework.

The bundled CUDA EP already supports these through OrtCUDAProviderOptionsV2. This PR brings parity to the plugin EP so that users can configure external streams, copy behavior, unified stream mode, and external GPU allocators via session options.

Summary of Changes

Stream and copy options

| File | Change |
| --- | --- |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.h | Add has_user_compute_stream, user_compute_stream, do_copy_in_default_stream, use_ep_level_unified_stream, external_alloc, external_free, external_empty_cache to Config. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | Parse all new options from session config with hex-safe pointer parsing (base 0 + full-string validation). Add validations: incomplete external allocator warning, mutual exclusion of user_compute_stream + external_allocator, mutual exclusion of user_compute_stream + enable_cuda_graph. Auto-force unified stream when user stream or external allocator is configured. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | Use user_compute_stream in CreateSyncStreamForDeviceImpl via new InitHandlesWithUserStream. Disable concurrent runs when unified stream is active. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc/h | New CudaSyncStream::InitHandlesWithUserStream(): wraps the user CUDA stream with owned cuBLAS/cuDNN/cuBLASLt handles without taking ownership of the stream itself. |
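
The "hex-safe pointer parsing (base 0 + full-string validation)" mentioned for cuda_ep_factory.cc could look roughly like the sketch below. This is an illustration only, not the code in this PR: the helper name ParsePointerOption is hypothetical, and the extra rejection of a leading sign or whitespace is an assumption about what "full-string validation" should cover.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper (not the PR's actual function): parse a pointer value
// that arrives as a string option. Base 0 lets strtoull accept decimal and
// "0x..." hex input; checking the end pointer rejects trailing garbage.
inline std::optional<void*> ParsePointerOption(const std::string& text) {
  // strtoull would silently accept leading whitespace or a '-' sign,
  // so require the string to start with a digit.
  if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0]))) {
    return std::nullopt;
  }
  errno = 0;
  char* end = nullptr;
  const unsigned long long value = std::strtoull(text.c_str(), &end, 0);
  // Reject overflow, no digits consumed, or characters left after the number.
  if (errno == ERANGE || end == text.c_str() || *end != '\0') {
    return std::nullopt;
  }
  if (value > UINTPTR_MAX) {
    return std::nullopt;
  }
  return reinterpret_cast<void*>(static_cast<std::uintptr_t>(value));
}
```

With this shape, "0x7f00deadbeef" and "139637976727279" both parse, while "0x10zz", "-1", and an empty string are rejected rather than silently truncated.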

External allocator

| File | Change |
| --- | --- |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc/h | New CudaExternalDeviceAllocator class delegating alloc/free/empty_cache to user-provided function pointers. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h | Extend DeviceCacheEntry with external allocator state, refcount, and UseExternalAllocator() helper. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | CreateAllocatorImpl creates CudaExternalDeviceAllocator when external pointers are configured; ReleaseAllocatorImpl handles its refcount and cleanup. |
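
To make the delegation and refcounting pattern concrete, here is a minimal host-only sketch. Everything in it is illustrative: the class and function names are hypothetical, the callback signatures are assumed, malloc/free stand in for GPU callbacks, and destroying the allocator when the last user releases it is an assumption about what "cleanup" means here.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Assumed callback signatures for the three user-provided pointers.
using ExternalAllocFn = void* (*)(std::size_t);
using ExternalFreeFn = void (*)(void*);
using ExternalEmptyCacheFn = void (*)();

// Hypothetical stand-in for CudaExternalDeviceAllocator: it owns no memory
// pool and simply forwards every request to the user's callbacks.
class ExternalAllocatorSketch {
 public:
  ExternalAllocatorSketch(ExternalAllocFn alloc, ExternalFreeFn free_fn,
                          ExternalEmptyCacheFn empty_cache)
      : alloc_(alloc), free_(free_fn), empty_cache_(empty_cache) {}

  void* Alloc(std::size_t size) const { return size == 0 ? nullptr : alloc_(size); }
  void Free(void* p) const { if (p != nullptr) free_(p); }
  // empty_cache is optional, so a null callback is a no-op.
  void EmptyCache() const { if (empty_cache_ != nullptr) empty_cache_(); }

 private:
  ExternalAllocFn alloc_;
  ExternalFreeFn free_;
  ExternalEmptyCacheFn empty_cache_;
};

// Hypothetical stand-in for the external-allocator part of DeviceCacheEntry.
struct CacheEntrySketch {
  std::unique_ptr<ExternalAllocatorSketch> allocator;
  int num_users = 0;
};

// Create on first use, then hand out the same instance and count users.
inline ExternalAllocatorSketch* AcquireExternalAllocator(
    CacheEntrySketch& entry, ExternalAllocFn alloc, ExternalFreeFn free_fn,
    ExternalEmptyCacheFn empty_cache) {
  if (!entry.allocator) {
    entry.allocator = std::make_unique<ExternalAllocatorSketch>(alloc, free_fn, empty_cache);
  }
  ++entry.num_users;
  return entry.allocator.get();
}

// Drop one user; destroy the allocator when the last user is gone.
inline void ReleaseExternalAllocator(CacheEntrySketch& entry) {
  if (entry.num_users > 0 && --entry.num_users == 0) {
    entry.allocator.reset();
  }
}

// Host-memory stand-ins for user callbacks, only for trying the sketch.
inline void* HostAlloc(std::size_t n) { return std::malloc(n); }
inline void HostFree(void* p) { std::free(p); }
```

A real implementation would also need to take the entry's mutex around acquire and release, which the sketch leaves out.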

Kernel adapter plumbing

| File | Change |
| --- | --- |
| onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h | Add do_copy_in_default_stream to CudaKernelAdapterRuntimeConfig; add CUDAExecutionProvider::DoCopyOnDefaultStream(). |

Tests

| File | Change |
| --- | --- |
| onnxruntime/test/framework/dynamic_plugin_ep_test.cc | New test CudaKernelAdapterRuntimeConfigExposesDoCopyInDefaultStream; fix shared_ptr access pattern in existing tests (auto& → auto, . → ->). |

Testing

  • All 33 *Plugin* gtest cases pass (4 DynamicPluginEpInfraTest + 29 PluginExecutionProviderTest).
  • Plugin .so builds clean with no warnings.
  • Tests under #if defined(ORT_USE_EP_API_ADAPTERS) guard validate the kernel adapter plumbing when built as a plugin library; they are intentionally excluded from the monolithic test binary (known pre-existing limitation due to ORT_API_MANUAL_INIT header requirement).

Motivation and Context

The CUDA plugin EP is a standalone shared library EP using the OrtEpFactory API. It was missing support for several OrtCUDAProviderOptionsV2 options that users rely on in production: external CUDA streams (for framework interop with PyTorch/TensorFlow), copy-stream behavior, unified stream mode, and external GPU memory allocators (for memory pool sharing across frameworks). This PR brings the plugin EP to parity with the bundled EP for these options.

Checklist

  • Tests added/updated
  • No breaking changes
  • CI passes (local build + test verified)

Copilot AI (Contributor) left a comment

Pull request overview

Brings the CUDA plugin execution provider closer to parity with the bundled CUDA EP by wiring additional OrtCUDAProviderOptionsV2-style stream/copy/unified-stream and external allocator options through the plugin EP’s session config parsing, stream creation, and kernel-adapter runtime config.

Changes:

  • Add new CUDA plugin EP config fields and plumb do_copy_in_default_stream into the kernel adapter runtime config (plus accessor on CUDAExecutionProvider).
  • Support wrapping a user-provided CUDA compute stream by creating per-stream cuBLAS/cuDNN/cuBLASLt handles (InitHandlesWithUserStream).
  • Add an external device allocator implementation and factory plumbing, plus a new unit test covering do_copy_in_default_stream.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/framework/dynamic_plugin_ep_test.cc | Updates adapter-config access pattern and adds a test for do_copy_in_default_stream exposure. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h | Declares CudaSyncStream::InitHandlesWithUserStream for user stream scenarios. |
| onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc | Implements handle creation/binding for a user-provided CUDA stream. |
| onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h | Adds do_copy_in_default_stream to runtime config and exposes it via CUDAExecutionProvider. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.h | Extends CudaEp::Config with stream/unified-stream and external allocator fields. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | Passes do_copy_in_default_stream into adapter config; uses user stream wrapper; disables concurrent runs when unified stream is active. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h | Extends DeviceCacheEntry with external allocator state/refcount and helper. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc | Parses new options (incl. pointer parsing), validates combinations, and creates/releases external allocator from the device cache. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h | Declares CudaExternalDeviceAllocator using user-provided function pointers. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc | Implements external allocator alloc/free/reserve behavior (currently without empty_cache semantics). |
Comments suppressed due to low confidence (3)

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:577

  • has_user_compute_stream can remain true even when user_compute_stream is null (e.g., the user sets the flag but omits the pointer, or the pointer string fails to parse). This diverges from the bundled CUDA EP, which derives has_user_compute_stream from whether the pointer is non-null, and it can incorrectly force unified-stream mode and trigger the mutual-exclusion errors with CUDA graph/external allocator. Suggest setting config.has_user_compute_stream = (config.user_compute_stream != nullptr) after parsing, and/or returning a clear error when the flag is set but the pointer is missing.
  // If user_compute_stream is provided, force has_user_compute_stream to true.
  if (config.user_compute_stream != nullptr) {
    config.has_user_compute_stream = true;
  }
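
One way to read the suggestion above, combined with the validation rules listed in the PR description, is the sketch below. It is an assumption-laden illustration: ConfigSketch and NormalizeAndValidate are hypothetical names, the fields are reduced to what the rules need, and the error strings are invented.

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for CudaEp::Config.
struct ConfigSketch {
  bool has_user_compute_stream = false;
  void* user_compute_stream = nullptr;
  bool use_ep_level_unified_stream = false;
  bool enable_cuda_graph = false;
  void* external_alloc = nullptr;
  void* external_free = nullptr;
};

// Returns an empty string on success, otherwise an error message.
inline std::string NormalizeAndValidate(ConfigSketch& config) {
  // Derive the flag from the pointer so a stale or mis-set flag cannot
  // force unified-stream mode on its own.
  config.has_user_compute_stream = (config.user_compute_stream != nullptr);

  const bool has_external_allocator =
      config.external_alloc != nullptr && config.external_free != nullptr;

  if (config.has_user_compute_stream && has_external_allocator) {
    return "user_compute_stream cannot be combined with an external allocator";
  }
  if (config.has_user_compute_stream && config.enable_cuda_graph) {
    return "user_compute_stream cannot be combined with enable_cuda_graph";
  }
  // Either feature implies a single EP-level stream.
  if (config.has_user_compute_stream || has_external_allocator) {
    config.use_ep_level_unified_stream = true;
  }
  return {};
}
```

Deriving the flag first means the mutual-exclusion checks only fire when a usable stream pointer is actually present.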

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:633

  • External allocator callbacks are persisted in DeviceCacheEntry and never cleared. This means one session that sets gpu_external_alloc/free can cause subsequent sessions on the same device (that do not configure an external allocator) to still use the external allocator, since CreateAllocatorImpl only checks entry->UseExternalAllocator(). It also makes the callbacks effectively global across all sessions and can break per-session semantics and concurrency assumptions. Consider scoping external allocator config to the EP/session (e.g., include callback tuple in the cache key, or store per-session state and ensure non-configured sessions don’t inherit prior callbacks).
  // Store external allocator info in the device cache entry so CreateAllocatorImpl can use it.
  if (config.external_alloc != nullptr && config.external_free != nullptr) {
    std::lock_guard<std::mutex> lock(factory->device_cache_mutex_);
    auto* entry = factory->FindDeviceCacheEntryByOrdinalLocked(config.device_id);
    if (entry) {
      std::lock_guard<std::mutex> arena_lock(entry->arena_mutex);
      entry->external_alloc = config.external_alloc;
      entry->external_free = config.external_free;
      entry->external_empty_cache = config.external_empty_cache;
    }
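
The "include callback tuple in the cache key" option from the comment above could be sketched as follows. The types are hypothetical and the callbacks are represented as integer addresses (std::uintptr_t) purely so they can serve as ordered map keys; this is not how the PR stores them.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

// Key: device ordinal plus the addresses of alloc, free, and empty_cache.
// A session with no external allocator uses zeros for the three addresses.
using AllocatorKey = std::tuple<int, std::uintptr_t, std::uintptr_t, std::uintptr_t>;

// Hypothetical per-key allocator state; only the user count is modeled.
struct AllocatorSlotSketch {
  int num_users = 0;
};

class KeyedAllocatorCacheSketch {
 public:
  // Sessions share a slot only when device and callbacks all match, so a
  // session without callbacks never inherits another session's allocator.
  AllocatorSlotSketch& Acquire(int device_id, std::uintptr_t alloc,
                               std::uintptr_t free_fn, std::uintptr_t empty_cache) {
    AllocatorSlotSketch& slot = slots_[AllocatorKey{device_id, alloc, free_fn, empty_cache}];
    ++slot.num_users;
    return slot;
  }

  std::size_t SlotCount() const { return slots_.size(); }

 private:
  std::map<AllocatorKey, AllocatorSlotSketch> slots_;
};
```

The trade-off is that two sessions on the same device with different callbacks now get separate allocators, which may or may not be the intended sharing model.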

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:704

  • external_device_allocator is created once and then reused without validating that entry->external_alloc/free/empty_cache still match the allocator’s stored callbacks. If a later session updates the callback pointers, CreateAllocatorImpl will keep returning the allocator constructed with the old function pointers, leading to calling stale/wrong callbacks. If callbacks are meant to be configurable, recreate the allocator when the callback tuple changes (or reject changes once an allocator exists).
    // If external allocator function pointers are configured, use those directly
    // (no arena, no mempool — the external allocator manages its own caching).
    if (entry->UseExternalAllocator()) {
      if (!entry->external_device_allocator) {
        entry->external_device_allocator = std::make_unique<CudaExternalDeviceAllocator>(
            memory_info, req_device_id,
            entry->external_alloc, entry->external_free, entry->external_empty_cache);
      }
      ++entry->num_external_allocator_users;
      *allocator = entry->external_device_allocator.get();
      return nullptr;
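
A minimal sketch of the "recreate when the callback tuple changes, or reject changes once an allocator exists" idea follows. All names are hypothetical, and the policy shown (reject while users remain, recreate once idle) is only one possible combination of the two options the comment mentions.

```cpp
#include <cstddef>
#include <memory>
#include <string>

using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);
using EmptyCacheFn = void (*)();

// The three callbacks compared as a unit.
struct CallbackSet {
  AllocFn alloc = nullptr;
  FreeFn free_fn = nullptr;
  EmptyCacheFn empty_cache = nullptr;
  bool operator==(const CallbackSet& other) const {
    return alloc == other.alloc && free_fn == other.free_fn &&
           empty_cache == other.empty_cache;
  }
};

// Hypothetical allocator that remembers which callbacks it was built with.
struct AllocatorSketch {
  CallbackSet built_with;
};

// Hypothetical stand-in for the relevant DeviceCacheEntry fields.
struct EntrySketch {
  CallbackSet configured;                      // latest callbacks from a session
  std::unique_ptr<AllocatorSketch> allocator;  // cached instance, if any
  int num_users = 0;
};

// Returns nullptr and fills `error` if the callbacks changed while in use.
inline AllocatorSketch* GetOrCreateAllocator(EntrySketch& entry, std::string& error) {
  if (entry.allocator && !(entry.allocator->built_with == entry.configured)) {
    if (entry.num_users > 0) {
      error = "external allocator callbacks changed while the allocator is in use";
      return nullptr;
    }
    entry.allocator.reset();  // idle: safe to rebuild with the new callbacks
  }
  if (!entry.allocator) {
    entry.allocator = std::make_unique<AllocatorSketch>(AllocatorSketch{entry.configured});
  }
  ++entry.num_users;
  return entry.allocator.get();
}
```

Either branch avoids the stale-callback case: the cached allocator is never returned when its stored callbacks differ from the configured ones.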
