[CUDA Plugin EP] Add provider options: user_compute_stream, do_copy_in_default_stream, use_ep_level_unified_stream, external allocator by tianleiwu · Pull Request #28603 · microsoft/onnxruntime

tianleiwu · 2026-05-20T22:07:40Z

Description

Wires up the remaining provider options from OrtCUDAProviderOptionsV2 that were previously missing in the CUDA plugin EP (CudaEpFactory::CreateEpImpl). This closes the "Provider Options Gaps" section (§2) of the plugin EP tracking doc for the stream/allocator options that are not blocked on the tunable ops framework.

The bundled CUDA EP already supports these through OrtCUDAProviderOptionsV2. This PR brings parity to the plugin EP so that users can configure external streams, copy behavior, unified stream mode, and external GPU allocators via session options.

Summary of Changes

Stream and copy options

| File | Change |
| --- | --- |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Add `has_user_compute_stream`, `user_compute_stream`, `do_copy_in_default_stream`, `use_ep_level_unified_stream`, `external_alloc`, `external_free`, `external_empty_cache` to `Config`. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parse all new options from session config with hex-safe pointer parsing (base 0 + full-string validation). Add validations: incomplete external allocator warning, mutual exclusion of `user_compute_stream` + `external_allocator`, mutual exclusion of `user_compute_stream` + `enable_cuda_graph`. Auto-force unified stream when user stream or external allocator is configured. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Use `user_compute_stream` in `CreateSyncStreamForDeviceImpl` via new `InitHandlesWithUserStream`. Disable concurrent runs when unified stream is active. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc/h` | New `CudaSyncStream::InitHandlesWithUserStream()`: wraps the user CUDA stream with owned cuBLAS/cuDNN/cuBLASLt handles without taking ownership of the stream itself. |
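
The "base 0 + full-string validation" pointer parsing mentioned for `cuda_ep_factory.cc` can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper name `ParsePointerOption` and its bool/out-parameter shape are assumptions made for the example.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical helper (not the PR's actual function): parse a pointer-valued
// option string such as "0x7f3a5c000000" or "139888888888888" into a void*.
// Returns false, leaving *out untouched, when the string is not a clean number.
bool ParsePointerOption(const std::string& text, void** out) {
  // Require a leading digit so that whitespace and sign prefixes, which
  // strtoull would otherwise accept, are rejected.
  if (out == nullptr || text.empty() ||
      !std::isdigit(static_cast<unsigned char>(text[0]))) {
    return false;
  }
  errno = 0;
  char* end = nullptr;
  // Base 0 accepts "0x..." hex, leading-zero octal, and plain decimal.
  const unsigned long long value = std::strtoull(text.c_str(), &end, 0);
  if (errno == ERANGE) return false;
  // Full-string validation: every character must have been consumed.
  if (end != text.c_str() + text.size()) return false;
  if (value > std::numeric_limits<std::uintptr_t>::max()) return false;
  *out = reinterpret_cast<void*>(static_cast<std::uintptr_t>(value));
  return true;
}
```

With this shape, `"0x10"` and `"16"` yield the same address, while `"0x10zz"`, `"0x"`, `"-1"`, and the empty string are rejected instead of being silently truncated. One side effect of base 0 is that a leading zero selects octal, so `"010"` parses as 8.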

External allocator

| File | Change |
| --- | --- |
| `onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc/h` | New `CudaExternalDeviceAllocator` class delegating alloc/free/empty_cache to user-provided function pointers. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h` | Extend `DeviceCacheEntry` with external allocator state, refcount, and `UseExternalAllocator()` helper. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | `CreateAllocatorImpl` creates `CudaExternalDeviceAllocator` when external pointers are configured; `ReleaseAllocatorImpl` handles its refcount and cleanup. |
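
As a rough illustration of what "delegating alloc/free/empty_cache to user-provided function pointers" means, here is a minimal sketch. It is not the PR's `CudaExternalDeviceAllocator`: the class name, the callback signatures, and the host-memory stand-in callbacks (`HostAlloc`, `HostFree`) are all assumptions chosen so the example runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>

// Callback shapes assumed for illustration; the PR's actual typedefs may differ.
using ExternalAllocFn = void* (*)(std::size_t);
using ExternalFreeFn = void (*)(void*);
using ExternalEmptyCacheFn = void (*)();

// Hypothetical allocator that owns no memory itself and forwards every
// request to the callbacks it was constructed with.
class ExternalDeviceAllocatorSketch {
 public:
  ExternalDeviceAllocatorSketch(ExternalAllocFn alloc, ExternalFreeFn free_fn,
                                ExternalEmptyCacheFn empty_cache)
      : alloc_(alloc), free_(free_fn), empty_cache_(empty_cache) {}

  void* Alloc(std::size_t size) { return size == 0 ? nullptr : alloc_(size); }

  void Free(void* p) {
    if (p != nullptr) free_(p);
  }

  // empty_cache is treated as optional here; a null callback is a no-op.
  void EmptyCache() {
    if (empty_cache_ != nullptr) empty_cache_();
  }

 private:
  ExternalAllocFn alloc_;
  ExternalFreeFn free_;
  ExternalEmptyCacheFn empty_cache_;
};

// Host-memory stand-ins for a framework's GPU caching allocator.
int g_live_allocations = 0;

void* HostAlloc(std::size_t size) {
  ++g_live_allocations;
  return std::malloc(size);
}

void HostFree(void* p) {
  --g_live_allocations;
  std::free(p);
}
```

Constructing `ExternalDeviceAllocatorSketch(HostAlloc, HostFree, nullptr)` and calling `Alloc`/`Free` routes every request through the two callbacks, which is the property that lets a framework such as PyTorch keep a single memory pool.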

Kernel adapter plumbing

| File | Change |
| --- | --- |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Add `do_copy_in_default_stream` to `CudaKernelAdapterRuntimeConfig`; add `CUDAExecutionProvider::DoCopyOnDefaultStream()`. |

Tests

| File | Change |
| --- | --- |
| `onnxruntime/test/framework/dynamic_plugin_ep_test.cc` | New test `CudaKernelAdapterRuntimeConfigExposesDoCopyInDefaultStream`; fix shared_ptr access pattern in existing tests (`auto&` → `auto`, `.` → `->`). |

Testing

- All 33 `*Plugin*` gtest cases pass (4 DynamicPluginEpInfraTest + 29 PluginExecutionProviderTest).
- Plugin `.so` builds clean with no warnings.
- Tests under the `#if defined(ORT_USE_EP_API_ADAPTERS)` guard validate the kernel adapter plumbing when built as a plugin library; they are intentionally excluded from the monolithic test binary (known pre-existing limitation due to the `ORT_API_MANUAL_INIT` header requirement).

Motivation and Context

The CUDA plugin EP is a standalone shared library EP using the OrtEpFactory API. It was missing support for several OrtCUDAProviderOptionsV2 options that users rely on in production: external CUDA streams (for framework interop with PyTorch/TensorFlow), copy-stream behavior, unified stream mode, and external GPU memory allocators (for memory pool sharing across frameworks). This PR brings the plugin EP to parity with the bundled EP for these options.

Checklist

- Tests added/updated
- No breaking changes
- CI passes (local build + test verified)

Copilot

Pull request overview

Brings the CUDA plugin execution provider closer to parity with the bundled CUDA EP by wiring additional OrtCUDAProviderOptionsV2-style stream/copy/unified-stream and external allocator options through the plugin EP’s session config parsing, stream creation, and kernel-adapter runtime config.

Changes:

- Add new CUDA plugin EP config fields and plumb `do_copy_in_default_stream` into the kernel adapter runtime config (plus accessor on `CUDAExecutionProvider`).
- Support wrapping a user-provided CUDA compute stream by creating per-stream cuBLAS/cuDNN/cuBLASLt handles (`InitHandlesWithUserStream`).
- Add an external device allocator implementation and factory plumbing, plus a new unit test covering `do_copy_in_default_stream`.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `onnxruntime/test/framework/dynamic_plugin_ep_test.cc` | Updates adapter-config access pattern and adds a test for `do_copy_in_default_stream` exposure. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares `CudaSyncStream::InitHandlesWithUserStream` for user stream scenarios. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Implements handle creation/binding for a user-provided CUDA stream. |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adds `do_copy_in_default_stream` to runtime config and exposes it via `CUDAExecutionProvider`. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Extends `CudaEp::Config` with stream/unified-stream and external allocator fields. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Passes `do_copy_in_default_stream` into adapter config; uses user stream wrapper; disables concurrent runs when unified stream is active. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.h` | Extends `DeviceCacheEntry` with external allocator state/refcount and helper. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses new options (incl. pointer parsing), validates combinations, and creates/releases external allocator from the device cache. |
| `onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h` | Declares `CudaExternalDeviceAllocator` using user-provided function pointers. |
| `onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc` | Implements external allocator alloc/free/reserve behavior (currently without `empty_cache` semantics). |

Comments suppressed due to low confidence (3)

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:577

`has_user_compute_stream` can remain true even when `user_compute_stream` is null (e.g., the user sets the flag but omits the pointer, or the pointer string fails to parse). This diverges from the bundled CUDA EP, which derives `has_user_compute_stream` from whether the pointer is non-null, and it can incorrectly force unified-stream mode and trigger the mutual-exclusion errors with CUDA graph/external allocator. Suggest setting `config.has_user_compute_stream = (config.user_compute_stream != nullptr)` after parsing, and/or returning a clear error when the flag is set but the pointer is missing.

  // If user_compute_stream is provided, force has_user_compute_stream to true.
  if (config.user_compute_stream != nullptr) {
    config.has_user_compute_stream = true;
  }
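
The suggestion above, combined with the validations listed in the PR description, could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: `ConfigSketch` and `NormalizeAndValidate` are hypothetical names, the callbacks are modeled as opaque `void*` handles, and the error strings are invented for illustration.

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for the plugin EP's Config.
struct ConfigSketch {
  bool has_user_compute_stream = false;
  void* user_compute_stream = nullptr;
  bool use_ep_level_unified_stream = false;
  bool enable_cuda_graph = false;
  void* external_alloc = nullptr;  // opaque stand-in for the alloc callback
  void* external_free = nullptr;   // opaque stand-in for the free callback
};

// Returns an empty string on success, otherwise an error message.
std::string NormalizeAndValidate(ConfigSketch& config) {
  // Derive the flag from the pointer so a stale or unparsed flag cannot
  // trigger the checks below.
  config.has_user_compute_stream = (config.user_compute_stream != nullptr);

  const bool has_external_allocator =
      config.external_alloc != nullptr && config.external_free != nullptr;

  if (config.has_user_compute_stream && has_external_allocator) {
    return "user_compute_stream cannot be combined with an external allocator";
  }
  if (config.has_user_compute_stream && config.enable_cuda_graph) {
    return "user_compute_stream cannot be combined with enable_cuda_graph";
  }
  // Either feature forces unified-stream mode, as the description states.
  if (config.has_user_compute_stream || has_external_allocator) {
    config.use_ep_level_unified_stream = true;
  }
  return std::string();
}
```

With the flag derived from the pointer, a config that sets only `has_user_compute_stream = true` is normalized back to "no user stream" and neither forces unified-stream mode nor hits the mutual-exclusion errors.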

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:633

External allocator callbacks are persisted in DeviceCacheEntry and never cleared. This means one session that sets gpu_external_alloc/free can cause subsequent sessions on the same device (that do not configure an external allocator) to still use the external allocator, since CreateAllocatorImpl only checks entry->UseExternalAllocator(). It also makes the callbacks effectively global across all sessions and can break per-session semantics and concurrency assumptions. Consider scoping external allocator config to the EP/session (e.g., include callback tuple in the cache key, or store per-session state and ensure non-configured sessions don’t inherit prior callbacks).

  // Store external allocator info in the device cache entry so CreateAllocatorImpl can use it.
  if (config.external_alloc != nullptr && config.external_free != nullptr) {
    std::lock_guard<std::mutex> lock(factory->device_cache_mutex_);
    auto* entry = factory->FindDeviceCacheEntryByOrdinalLocked(config.device_id);
    if (entry) {
      std::lock_guard<std::mutex> arena_lock(entry->arena_mutex);
      entry->external_alloc = config.external_alloc;
      entry->external_free = config.external_free;
      entry->external_empty_cache = config.external_empty_cache;
    }

onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc:704

external_device_allocator is created once and then reused without validating that entry->external_alloc/free/empty_cache still match the allocator’s stored callbacks. If a later session updates the callback pointers, CreateAllocatorImpl will keep returning the allocator constructed with the old function pointers, leading to calling stale/wrong callbacks. If callbacks are meant to be configurable, recreate the allocator when the callback tuple changes (or reject changes once an allocator exists).

    // If external allocator function pointers are configured, use those directly
    // (no arena, no mempool — the external allocator manages its own caching).
    if (entry->UseExternalAllocator()) {
      if (!entry->external_device_allocator) {
        entry->external_device_allocator = std::make_unique<CudaExternalDeviceAllocator>(
            memory_info, req_device_id,
            entry->external_alloc, entry->external_free, entry->external_empty_cache);
      }
      ++entry->num_external_allocator_users;
      *allocator = entry->external_device_allocator.get();
      return nullptr;
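
One of the two remedies suggested above, rejecting a changed callback tuple while an allocator is still live, might be sketched as follows. Everything here is hypothetical (`CallbackTuple`, `CacheEntrySketch`, `Acquire`, `Release`), and the locking the real device cache uses is omitted for brevity.

```cpp
#include <cstddef>
#include <memory>

// Callback shapes assumed for illustration only.
using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);
using EmptyCacheFn = void (*)();

struct CallbackTuple {
  AllocFn alloc;
  FreeFn free_fn;
  EmptyCacheFn empty_cache;
  bool operator==(const CallbackTuple& other) const {
    return alloc == other.alloc && free_fn == other.free_fn &&
           empty_cache == other.empty_cache;
  }
};

// Stand-in for the external allocator; it remembers its callbacks.
struct AllocatorSketch {
  CallbackTuple callbacks;
};

// Stand-in for the relevant part of a per-device cache entry.
struct CacheEntrySketch {
  std::unique_ptr<AllocatorSketch> allocator;
  int num_users = 0;
};

// Returns the shared allocator, or nullptr when a live allocator was built
// with different callbacks than the ones now requested.
AllocatorSketch* Acquire(CacheEntrySketch& entry, const CallbackTuple& requested) {
  if (entry.allocator && !(entry.allocator->callbacks == requested)) {
    return nullptr;  // reject rather than silently call stale callbacks
  }
  if (!entry.allocator) {
    entry.allocator.reset(new AllocatorSketch{requested});
  }
  ++entry.num_users;
  return entry.allocator.get();
}

// Drops one reference and destroys the allocator when the last user leaves,
// so a later session can install a different callback tuple.
void Release(CacheEntrySketch& entry) {
  if (entry.num_users > 0 && --entry.num_users == 0) {
    entry.allocator.reset();
  }
}
```

The alternative the comment mentions, recreating the allocator when the tuple changes, would need care that no outstanding allocation still expects the old `free` callback, which is why this sketch rejects instead.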


add provider options

cfa5126

tianleiwu requested review from Copilot and yuslepukhin May 20, 2026 22:15

Copilot started reviewing on behalf of tianleiwu May 20, 2026 22:16

Copilot AI reviewed May 20, 2026


Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.cc

address feedbacks

798af85

tianleiwu wants to merge 2 commits into main from tlwu/20260520/cuda_plugin_options


2 participants
