Fix CUDA build with contrib ops disabled #28554
Open
Copilot wants to merge 6 commits into
The CUDA Attention kernel implementation (core/providers/cuda/llm/attention.cc) depends on contrib ops (flash attention, memory efficient attention, unfused attention helpers from contrib_ops/cuda/bert/). When DISABLE_CONTRIB_OPS is defined, these dependencies are unavailable, causing compilation failures.

Fix by:
1. Excluding attention.h/attention.cc from the CUDA provider build when contrib ops are disabled (cmake change).
2. Guarding the Attention kernel class declarations and registrations in cuda_execution_provider.cc with #ifndef DISABLE_CONTRIB_OPS.

The CPU EP still provides the standard ONNX domain Attention kernel as fallback when the CUDA implementation is unavailable.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/4bbef367-4e58-49e5-9bca-8d5a2c8ee872
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix onnxruntime build with CUDA enabled and contrib ops disabled
Fix CUDA build with contrib ops disabled
May 19, 2026
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
tianleiwu
reviewed
May 20, 2026
tianleiwu
requested changes
May 20, 2026
Instead of removing the CUDA Attention kernel when contrib ops are disabled, include the necessary attention infrastructure files from contrib_ops/cuda/bert/ (flash attention, memory efficient attention, unfused attention helpers, etc.) so the ONNX Attention kernel can compile and link.

Also removes #ifndef DISABLE_CONTRIB_OPS guards from GetAttentionKernelOptions() in cuda_kernel.h and cuda_execution_provider.h since AttentionKernelOptions is now always available.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/4e3d0a4a-58f7-4e16-81d1-54e4f2f282e9
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
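For readers unfamiliar with why the guard mattered, the following is a minimal sketch of the pattern this commit describes. The class and function names other than GetAttentionKernelOptions, AttentionKernelOptions, and attention_kernel_options_ are hypothetical stand-ins, the struct field is invented for illustration, and none of this is the actual onnxruntime code.

```cpp
// Hypothetical stand-in for the real options type; the field is illustrative only.
struct AttentionKernelOptions {
  bool use_flash_attention = true;
};

// Hypothetical stand-in for the CUDA execution provider.
class ExecutionProviderSketch {
 public:
  // In the guarded version, this accessor and the member below sat inside
  // "#ifndef DISABLE_CONTRIB_OPS ... #endif", so both disappeared from the
  // class whenever the macro was defined. With the guard removed they are
  // declared in every configuration.
  const AttentionKernelOptions* GetAttentionKernelOptions() const {
    return &attention_kernel_options_;
  }

 private:
  AttentionKernelOptions attention_kernel_options_;
};

// Stand-in for a kernel source file that is compiled unconditionally.
// It only compiles if the accessor above is always declared.
bool AttentionKernelUsesFlash(const ExecutionProviderSketch& ep) {
  return ep.GetAttentionKernelOptions()->use_flash_attention;
}
```

Under these assumptions, a caller such as AttentionKernelUsesFlash builds whether or not DISABLE_CONTRIB_OPS is defined, which is the property the guard removal is meant to restore.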
tianleiwu
reviewed
May 21, 2026
tianleiwu
requested changes
May 21, 2026
The #ifndef DISABLE_CONTRIB_OPS guards around Attention kernel registrations are unnecessary since the necessary contrib_ops/cuda/bert/ files are now always included in the build.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/fac2d4d5-add5-4431-95f1-c4f38205994e
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
tianleiwu
requested changes
May 21, 2026
Contributor
There is a CI build failure in Build Linux TensorRT CUDA Minimal x64 Release.
@copilot, please update cmake/onnxruntime_providers_cuda.cmake so that the CUDA minimal build is handled correctly.
…and not CUDA minimal

The CUDA minimal build doesn't compile llm/attention.cc (non-recursive glob), so the bert/ infrastructure isn't needed. The cudnn_fmha files require cudnn_frontend.h, which isn't available in minimal builds.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/508d31a5-92d6-4418-94ad-ac0ca6d78b68
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Contributor
Author
Fixed in 8b79ac7. The issue was that the |
tianleiwu
approved these changes
May 21, 2026
Description
The CUDA Attention kernel (core/providers/cuda/llm/attention.cc) depends on contrib_ops internals (flash attention, memory efficient attention, unfused attention helpers) but was compiled unconditionally. When building with --disable_contrib_ops, GetAttentionKernelOptions() is unavailable (guarded by #ifndef DISABLE_CONTRIB_OPS in cuda_kernel.h), causing a compile error.

Changes:
- cmake/onnxruntime_providers_cuda.cmake: When contrib ops are disabled (and not in CUDA minimal mode), include the contrib_ops/cuda/bert/ attention infrastructure files (flash attention, memory efficient attention, unfused attention helpers, etc.) so the ONNX domain Attention kernel can compile and link. Uses elseif(onnxruntime_DISABLE_CONTRIB_OPS AND NOT onnxruntime_CUDA_MINIMAL) to avoid including these files in CUDA minimal builds, where llm/attention.cc isn't compiled and cudnn_frontend.h isn't available.
- onnxruntime/core/providers/cuda/cuda_execution_provider.h: Remove the #ifndef DISABLE_CONTRIB_OPS guards from the AttentionKernelOptions include, the GetAttentionKernelOptions() method, and the attention_kernel_options_ member variable.
- onnxruntime/core/providers/cuda/cuda_kernel.h: Remove the #ifndef DISABLE_CONTRIB_OPS guard from GetAttentionKernelOptions().

The CUDA Attention kernel and its underlying attention backends (flash, memory efficient, unfused) are now always available in full CUDA builds regardless of whether contrib ops are enabled. No changes are needed in cuda_execution_provider.cc since the Attention kernel registrations remain unconditional.

Motivation and Context
Building onnxruntime with CUDA enabled and --disable_contrib_ops fails to compile. This is a valid build configuration (useful for reducing compile time) that should be supported. Rather than excluding the CUDA Attention kernel when contrib ops are disabled, the necessary attention infrastructure from contrib_ops/cuda/bert/ is included in the build so the ONNX domain Attention op retains full CUDA acceleration. The fix is scoped to non-minimal CUDA builds only, since CUDA minimal builds use a non-recursive glob that doesn't include llm/attention.cc and don't have cudnn_frontend available.
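The scoping rule can be summarized as a small truth table. The sketch below models only the elseif condition quoted in the description, written in C++ so it can be checked at compile time. BuildConfig and AddsBertAttentionInfra are hypothetical names introduced for illustration, and the preceding if branch in the real cmake file is not modeled because its condition is not shown here.

```cpp
// Hypothetical model of two build switches; the real logic lives in
// cmake/onnxruntime_providers_cuda.cmake.
struct BuildConfig {
  bool disable_contrib_ops;  // mirrors onnxruntime_DISABLE_CONTRIB_OPS
  bool cuda_minimal;         // mirrors onnxruntime_CUDA_MINIMAL
};

// True when the condition
//   onnxruntime_DISABLE_CONTRIB_OPS AND NOT onnxruntime_CUDA_MINIMAL
// holds, i.e. when the elseif branch would add the contrib_ops/cuda/bert/
// attention infrastructure files to the build.
constexpr bool AddsBertAttentionInfra(BuildConfig c) {
  return c.disable_contrib_ops && !c.cuda_minimal;
}

// Contrib ops disabled, full CUDA build: the branch applies.
static_assert(AddsBertAttentionInfra(BuildConfig{true, false}),
              "full CUDA build without contrib ops should add bert/ files");
// Contrib ops disabled, CUDA minimal build: the branch is skipped.
static_assert(!AddsBertAttentionInfra(BuildConfig{true, true}),
              "CUDA minimal build should not add bert/ files");
// Contrib ops enabled: the condition is false in both cases.
static_assert(!AddsBertAttentionInfra(BuildConfig{false, false}),
              "condition is false when contrib ops are enabled");
static_assert(!AddsBertAttentionInfra(BuildConfig{false, true}),
              "condition is false when contrib ops are enabled");
```

Only the first of the four combinations takes the new branch, which matches the statement that the fix is limited to non-minimal CUDA builds with contrib ops disabled.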