[MIGraphX EP] Fix stream_ nullptr bug and enable end-to-end HIP graph capture. Identical diff and commit message as PR 2a. The AMD ROCm fork by zhihuidu-amd · Pull Request #241 · ROCm/onnxruntime

zhihuidu-amd · 2026-05-29T03:22:22Z

Problem

Three related bugs in the MIGraphX EP cause incorrect stream synchronization
and prevent end-to-end HIP graph capture when an external stream is provided
via the user_compute_stream ProviderOptions key.

Bug 1: `stream_` is never assigned — every sync hits the null stream

hipStream_t stream_ = nullptr is declared in the header and never assigned
in the constructor, even when user_compute_stream is present in
ProviderOptions. As a result:

Sync() calls hipStreamSynchronize(nullptr) — the HIP legacy default
stream, which has implicit synchronization with all other streams on
the device. Under concurrent request serving this serialises every inference
through a single global sync point.
OnRunEnd() calls hipStreamQuery(stream_) and
hipStreamSynchronize(stream_) on the null stream — always sees it complete
immediately (the null stream is never "busy" from the EP's perspective),
masking the case where the actual compute is still in flight.

Bug 2: `RegisterStreamHandlers` ignores the external stream — HIP graph capture is incomplete

RegisterStreamHandlers hardcodes use_existing_stream=false (acknowledged
by the /*TODO:external_stream_*/ comment). ORT therefore creates a fresh
hipStreamNonBlocking stream for H2D/D2H memory copies, separate from the
MIGraphX compute stream. When a HIP graph capture is in progress, memory
copies on this separate stream are not recorded, producing an incomplete
graph that replays compute but re-issues host-device transfers on every call.

Bug 3 (consequence of Bug 1+2): `user_compute_stream` is silently ignored

The ProviderOptions parsing infrastructure (migraphx_execution_provider_info)
has no user_compute_stream key, so the pointer passed by the caller is never
stored. Bugs 1 and 2 would exist even if it were.

Fix

migraphx_execution_provider_info.h

Add constexpr auto kUserComputeStream = "user_compute_stream"sv;
Add void* user_compute_stream{nullptr}; field to MIGraphXExecutionProviderInfo

migraphx_execution_provider_info.cc

Add AddValueParser for kUserComputeStream, parsing the address string
and storing it in user_compute_stream. Follows the exact pattern used for
kGpuExternalAlloc / kGpuExternalFree.

migraphx_execution_provider.cc

Constructor: if (info.user_compute_stream) stream_ = static_cast<hipStream_t>(info.user_compute_stream);
RegisterStreamHandlers: const bool use_existing_stream = (stream_ != nullptr); — resolves the TODO.
Sync(): sync stream_ (or null if unset); remove the unconditional null-stream sync that caused the global stall.
OnRunEnd(): query/sync check_stream = stream_ ? stream_ : nullptr — same fix.

Backward compatibility

When user_compute_stream is not provided (the common case), stream_
remains nullptr. RegisterStreamHandlers creates a new
hipStreamNonBlocking stream as before. Sync() falls back to syncing
that session stream (via nullptr), which is at worst equivalent to the
old behaviour and strictly more correct for single-session use.

Measured impact

Tested on AMD MI300X (gfx942), ROCm 7.x, ORT 1.23.2, via Triton Inference
Server 25.x, production ONNX ranking model (843 nodes, dynamic batch 1–256):

Eliminates the global device stall on every session.Run().
Enables end-to-end HIP graph capture: H2D copy → MIGraphX compute → D2H
copy all captured together, allowing zero-overhead graph replay.
Quantified latency improvement: results forthcoming from an in-progress A/B
experiment (will update this PR).

Files changed

onnxruntime/core/providers/migraphx/migraphx_execution_provider_info.h (+2)
onnxruntime/core/providers/migraphx/migraphx_execution_provider_info.cc (+11)
onnxruntime/core/providers/migraphx/migraphx_execution_provider.cc (+21/-10)

This is a cherry-pick of the same fix submitted to microsoft/onnxruntime
as PR microsoft#28715. Filing separately because rocm/vllm ships the AMD fork and
the fix is needed there to land in production before the upstream merge.

…soft#28189) This pull request addresses a bug where in-memory-only initializer markers (such as `_ORT_MEM_ADDR_`) could be incorrectly serialized into ONNX models when using custom initializer handling. The changes ensure that any pre-existing initializers are cleared before re-adding them via the custom handler, preventing duplicate or stale data. Additionally, a regression test is added to verify correct behavior after large initializers are externalized. **Bug Fixes:** * Clear pre-existing initializers (and sparse initializers, if enabled) from `GraphProto` and subgraph protos before re-adding them with custom initializer handling, preventing duplicate or stale initializers (including `_ORT_MEM_ADDR_` markers) from being serialized. (`onnxruntime/core/graph/graph.cc`) [[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR5107-R5116) [[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR5208-R5217) **Testing Improvements:** * Add a regression test to verify that after converting large initializers into OrtValues (which creates in-memory markers), custom initializer handling does not serialize `_ORT_MEM_ADDR_` markers into the output model, and that all initializers are correctly inlined. (`onnxruntime/test/ir/graph_test.cc`) Closing microsoft#28158

…n fusion (microsoft#28168) ### Description Add correctness tests for fusions introduced in microsoft#27883 and microsoft#27747. The tests introduced in those PRs only check if fusion went through but not if the fused nodes produced semantically right results as the unfused subgraphs. Adding those tests to prevent accidental breakage in case something changed in the fused node's backing kernel implementation. ### Motivation and Context Adress test coverage gap

…8096) # Add exponential-backoff option for thread pool spin loop ## Description This PR adds an opt-in exponential-backoff mode to the thread pool's idle spin loop, complementing the configurable `spin_duration_us` introduced in microsoft#27916. When enabled, each spin iteration emits a geometrically increasing number of `SpinPause()` calls (1, 2, 4, … capped at `spin_backoff_max`), which reduces pause-instruction density and lowers CPU/power usage during the spin window—particularly on hybrid (P/E core) and mobile platforms. The iteration count is automatically scaled so the wall-clock spin budget targeted by `spin_duration_us` is preserved. The idea is adapted from microsoft#23278 (microsoft#21545 and microsoft#22315) which showed measurable power and latency improvements on Intel Meteor Lake by reducing busy-wait density. This PR makes the technique opt-in and composable with the time-bounded spin knob from microsoft#27916, so users can independently control *how long* to spin and *how densely* to spin. ## Summary of Changes ### Core thread pool (`EigenNonBlockingThreadPool.h`) | File | Change | |------|--------| | `include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h` | Add `ThreadPoolWaiter` inner class implementing exponential backoff; add `NormalizeBackoff()` and `ScaleSpinCountForBackoff()` helpers; replace bare `SpinPause()` in `WorkerLoop` with `waiter.wait()`; store `spin_backoff_max_` member and accept it in constructor | ### Configuration plumbing | File | Change | |------|--------| | `include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h` | New config keys `session.intra_op.spin_backoff_max` and `session.inter_op.spin_backoff_max` | | `include/onnxruntime/core/platform/threadpool.h` | Add `spin_backoff_max` parameter to `ThreadPool` constructor (default `1`, backward-compatible) | | `onnxruntime/core/common/threadpool.cc` | Forward `spin_backoff_max` to `ThreadPoolTempl` | | `onnxruntime/core/util/thread_utils.h` | Add `spin_backoff_max` field to `OrtThreadPoolParams` | | `onnxruntime/core/util/thread_utils.cc` | Pass `spin_backoff_max` into `ThreadPool`; log it in `operator<<(OrtThreadPoolParams)` | | `onnxruntime/core/session/inference_session.cc` | Add `ParseSpinBackoffMax()` helper; parse & apply both intra-op and inter-op config keys | ### Perf test CLI & benchmark script | File | Change | |------|--------| | `onnxruntime/test/perftest/command_args_parser.cc` | New `--spin_backoff_max` flag | | `onnxruntime/test/perftest/ort_test_session.cc` | Apply flag to session options | | `onnxruntime/test/perftest/test_configuration.h` | New `spin_backoff_max` field in `RunConfig` | | `tools/perftest/benchmark_spin_settings.py` | New benchmark script that runs `onnxruntime_perf_test` across a matrix of spin settings (duration × backoff) and reports latency, throughput, CPU% | ## Key Design Decisions 1. **Default preserves existing behavior.** `spin_backoff_max = 1` means one `SpinPause()` per iteration—identical to today. No performance change for users who don't opt in. 2. **Wall-clock budget preservation.** When backoff is enabled, the iteration count is divided by `spin_backoff_max` so the total number of `SpinPause()` calls—and therefore the approximate spin duration—stays the same as the non-backoff path. 3. **Composable with `spin_duration_us`.** Backoff and time-bounded spinning are orthogonal knobs. Users can use either independently or combine them (e.g., `spin_duration_us=1000` + `spin_backoff_max=8`). 4. **Subordinate to `allow_spinning`.** When spinning is disabled, `spin_backoff_max` is ignored—same as `spin_duration_us`. ## Session option usage ```cpp // Enable exponential backoff with cap 8, combined with 1ms time-bounded spinning session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000"); session_options.AddConfigEntry("session.intra_op.spin_backoff_max", "8"); ``` ## Benchmark Results Benchmarks run on an Intel i9-13900KF (6P cores / 12 threads under WSL2), 32 GB RAM, Release build with CPU EP, using `tools/perftest/benchmark_spin_settings.py`. Each configuration was repeated 3–5 times (median latency, mean throughput/CPU reported). Duration: 10 seconds per run. ### SqueezeNet (5 MB CNN) — 16 intra-op threads, 5 repeats High thread count amplifies spin contention, making this the most illustrative test: | Config | `spin_duration_us` | `spin_backoff_max` | Avg Latency (ms) | Throughput (IPS) | CPU % | |--------|---|---|:-:|:-:|:-:| | `default` | (legacy) | (legacy) | 3.243 | 303.0 | 1245.8 | | `no_spin` | — | — | 5.489 | 176.3 | 332.7 | | `spin_1000` | 1000 | — | 1.870 | 514.5 | 1214.6 | | `spin_2000` | 2000 | — | 2.040 | 478.2 | 1219.9 | | `backoff_8` | (legacy) | 8 | 3.268 | 303.4 | 1257.4 | | `spin_1000_backoff_4` | 1000 | 4 | 1.849 | 513.8 | 1221.4 | | **`spin_1000_backoff_8`** | **1000** | **8** | **1.835** | **534.5** | **1221.1** | | `spin_2000_backoff_8` | 2000 | 8 | 2.050 | 470.3 | 1223.2 | **Best: `spin_1000_backoff_8`** — **43% lower latency**, **76% higher throughput** vs default, while using **2% less CPU**. ### SqueezeNet — 8 intra-op threads | Config | Avg Latency (ms) | Throughput (IPS) | CPU % | |--------|:-:|:-:|:-:| | `default` | 1.578 | 628.7 | 826.5 | | `no_spin` | 3.742 | 261.1 | 322.9 | | `spin_1000` | 1.547 | 618.3 | 826.2 | | `backoff_8` | 1.545 | 628.7 | 830.1 | | `spin_1000_backoff_8` | **1.519** | **657.5** | 838.6 | | `spin_2000_backoff_8` | **1.503** | 634.9 | 832.8 | **Best: `spin_1000_backoff_8`** — **3.7% lower latency**, **4.6% higher throughput** vs default. ### DistilBERT (254 MB Transformer) — 4 intra-op threads | Config | Avg Latency (ms) | Throughput (IPS) | CPU % | |--------|:-:|:-:|:-:| | `default` | 30.468 | 31.4 | 329.3 | | `no_spin` | 33.483 | 28.8 | 284.5 | | `spin_1000` | 30.421 | 31.5 | 338.1 | | `backoff_8` | 30.254 | 31.3 | 344.4 | | **`spin_1000_backoff_8`** | **29.583** | **31.8** | 340.5 | **Best: `spin_1000_backoff_8`** — **2.9% lower latency** vs default. ### DistilBERT — 8 intra-op threads | Config | Avg Latency (ms) | Throughput (IPS) | CPU % | |--------|:-:|:-:|:-:| | `default` | 23.194 | 41.4 | 672.1 | | `no_spin` | 32.548 | 32.2 | 395.0 | | `spin_1000` | 23.291 | 41.3 | 675.3 | | **`backoff_8`** | **22.995** | **43.2** | 705.3 | | `spin_1000_backoff_8` | 23.535 | 41.1 | 662.8 | **Best: `backoff_8`** — **0.9% lower latency**, **4.3% higher throughput** vs default. ### Summary - **`spin_1000_backoff_8`** is the most consistent best performer across models and thread counts. - Benefits grow with thread count: from ~3% at 4T to **43–76%** at 16T. - No throughput regressions observed in any backoff configuration vs its non-backoff equivalent. - Backoff configs use slightly less CPU than raw spinning while achieving higher throughput — a win-win on power/efficiency. ## Testing - **Backward compatibility:** Default `spin_backoff_max = 1` produces identical spin behavior to `main`. Existing thread pool tests (`SpinDurationDefault`, `SpinDurationZero_NoSpinning`, `SpinDurationPositive_TimeBased`) continue to pass unmodified since the default backoff is 1. - **Benchmark script:** Use the new benchmark tool to compare settings on a model: ```bash python tools/perftest/benchmark_spin_settings.py \ --perf_test build/Release/onnxruntime_perf_test \ --model path/to/model.onnx \ --intra_op 4 --duration 10 --repeats 3 \ --configs default spin_1000 spin_1000_backoff_8 ``` - **Build verification:** All modified translation units compile cleanly under `-Wall -Wextra -Werror` in the existing cu128 Release build.

…ckQuantBDataSize_Lasx (microsoft#28179) ### Description PR microsoft#27136 introduced a new `const MLAS_BACKEND_KERNEL_SELECTOR_CONFIG*` parameter to the `Q4BitGemmPackQuantBDataSize_Fn` typedef in `MLAS_QNBIT_GEMM_DISPATCH`, and updated the three `PackQuantBData*` functions in `sqnbitgemm_kernel_lasx.cpp` accordingly. However, `QNBitGemmPackQuantBDataSize_Lasx` was missed, leaving it with a 5-parameter signature that no longer matches the typedef. This causes a compile error on LoongArch LASX builds when the function's address is assigned to `d.Q4BitGemmPackQuantBDataSize`. This change adds the missing 6th parameter to `QNBitGemmPackQuantBDataSize_Lasx`, matching the signature used by the three sibling functions in the same file (parameter name commented out since it is unused by this implementation). ### Motivation and Context Fixes microsoft#28157. Restores the LoongArch LASX build without altering runtime behavior on any platform. ### Changes - `onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp`: add `const MLAS_BACKEND_KERNEL_SELECTOR_CONFIG* /*BackendKernelSelectorConfig*/` as the 6th parameter of `QNBitGemmPackQuantBDataSize_Lasx`. ### Test Plan - The change is a signature-only fix with no behavioral delta; the added parameter is unused (consistent with the three sibling functions in the same file). - Existing x86 and ARM CI should continue to pass since no logic was changed. - LoongArch LASX build, which previously failed, now matches the dispatch typedef and compiles.

) To avoid "azp run Windows GPU Doc Gen CI Pipeline" for PR from external contributors. The "Windows GPU Doc Gen CI Pipeline" can be removed after this PR is merged. Example run that caught doc mismatch, and correct docs are uploaded in "Upload updated documentation": https://github.com/microsoft/onnxruntime/actions/runs/24813688480/job/72623549343?pr=28185 Artifact download URL: https://github.com/microsoft/onnxruntime/actions/runs/24813688480/artifacts/6593754537

…t#28176) ### Description `EmbedLayerNormalizationShapeInference` unconditionally wrote to output index 2 when `getNumOutputs() == 2 && mask_index_type == 0`, causing a heap out-of-bounds write during model loading — no `session.run()` required. **Fix:** Replace the flawed condition with a simple bounds check: ```cpp // Before (vulnerable): if (ctx.getNumOutputs() == 3 || (ctx.getNumOutputs() == 2 && mask_index_type == 0)) { updateOutputShape(ctx, 2, output_shape); // OOB when numOutputs == 2 propagateElemTypeFromInputToOutput(ctx, 0, 2); } // After (fixed): if (ctx.getNumOutputs() > 2) { updateOutputShape(ctx, 2, output_shape); propagateElemTypeFromInputToOutput(ctx, 0, 2); } ``` A regression test `EmbedLayerNormBatch1_NoMaskIndex_NoSumOutput` has been added to `embed_layer_norm_op_test.cc` to cover the previously vulnerable path: `mask_index_type=0` with exactly 2 outputs (no `embedding_sum`). ### Motivation and Context A crafted ONNX model with an `EmbedLayerNormalization` node declaring 2 outputs and `mask_index_type=0` triggers the vulnerable path. `getOutputType(2)` returns a pointer one past the end of the internal `node_output_types_` vector; subsequent writes through that pointer corrupt adjacent heap memory. In release builds this is silent — no assertion, no crash, exploitable via heap shaping. The `embedding_sum` output is always at index 2 by definition (confirmed in the CPU kernel). The old special-casing for `mask_index_type == 0` was both incorrect and unnecessary. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com> Co-authored-by: Xavier Dupré <xadupre@microsoft.com>

…8211) ### Description  This pull request improves error handling in the `CreateEpFactories` function in `onnxruntime/core/providers/webgpu/ep/api.cc`. The main enhancement is the addition of a more conservative mechanism for creating `OrtStatus` objects when exceptions occur during API initialization. The existing code was attempting to create a C++ API `Ort::Status` instance even if the API was not initialized successfully. ### Motivation and Context  Fix crash that was hiding underlying error message.

### Description Adds `HardSigmoid` to the CoreML Execution Provider's activation op builder. Both MLProgram (`sigmoid_hard`) and NeuralNetwork (`ActivationSigmoidHard`) code paths are implemented; the op's ONNX definition matches CoreML MIL's `sigmoid_hard` exactly, so no decomposition is required. Adds a dedicated CoreML-EP test (`CoreMLExecutionProviderTest.HardSigmoidTest`) that builds a single-node HardSigmoid model with non-default `alpha`/`beta` and uses `RunAndVerifyOutputsWithEP` with `ExpectedEPNodeAssignment::All` to confirm (a) the entire graph is claimed by the CoreML EP in both NN and MLProgram formats, and (b) the output matches the CPU reference. I verified the test is not trivially passing by temporarily unregistering HardSigmoid from the activation builder — the test fails with `VerifyEPNodeAssignment` emitting a fatal failure, proving it genuinely exercises the CoreML path. (The existing multi-EP test in `activation_op_test.cc` silently falls back to CPU when an EP rejects the node, so it does not give CoreML coverage on its own.) Also updates `coreml_supported_mlprogram_ops.md`. ### Motivation and Context Fixes microsoft#28181. On a DWPose pose-estimation model (`dw-ll_ucoco_384.onnx`), 4 HardSigmoid ops were each forcing a CoreML → CPU → CoreML round-trip, and also caused downstream ops to be rejected with "unsupported inputs" because their producers had been sent to CPU. Adding HardSigmoid collapses the graph from 5 CoreML subgraphs to 1, and drops inference from 9.22 ms to 6.92 ms (−25%) on Apple Silicon with MLProgram + ComputeUnits=ALL. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sts (microsoft#28065) Add #include core/framework/tensorprotoutils.h to test_nv_trt_rtx_ep_util.cc to resolve build error where onnxruntime::utils::SetRawDataInTensorProto was not found. ### Description  ### Motivation and Context

…#28218) Those pipelines have been replaced by workflows in microsoft#28192 and microsoft#28194 TODO: remove those pipelines from required pipelines for pull request, and add "ONNX Runtime Windows GPU Doc Gen CI / Windows GPU Kernel Documentation Validation" instead.

## Description This PR adds an Azure Pipelines packaging flow for the CUDA plugin EP, following the existing WebGPU plugin packaging pipeline pattern. The new pipeline can package Windows x64 and Linux x64 builds for both CUDA 12.8 and 13.0, and optionally package Linux aarch64 builds when CUDA 13.0 is selected. The flow is parameterized for CUDA version, package version, build type, and Python configuration so the packaging matrix can be expanded without duplicating pipeline logic. It also adds validation to reject unsupported combinations such as Linux aarch64 with CUDA 12.8. ## Summary of Changes ### Azure Pipelines packaging flow | File | Change | |---|---| | `tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml` | Adds the top-level official pipeline with CUDA 12.8/13.0 selection, package/build-type validation, and aarch64 gating for CUDA 13.0 only. | | `tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml` | Adds the packaging orchestrator that fans out per-platform/per-Python build stages and merges Linux artifacts. | | `tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml` | Adds the Linux packaging stage template for x64 and aarch64, parameterized by CUDA version, Python executable, Docker image, and CUDA architectures. | | `tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml` | Adds the Windows packaging stage template with CUDA-version-specific SDK setup and CUDA 13.0 cuDNN handling. | ### Linux build script | File | Change | |---|---| | `tools/ci_build/github/linux/build_cuda_plugin_package.sh` | Adds a Docker-based CUDA plugin packaging script with parameters for build config, Python executable, CUDA version, and `CMAKE_CUDA_ARCHITECTURES`. | ### Packaging behavior - Supports `cuda_version` = `12.8` or `13.0`. - Restricts Linux aarch64 packaging to CUDA 13.0 because the aarch64 CUDA Docker image is only available for CUDA 13.x. - Uses CUDA-version-specific Docker base images and CUDA architecture lists. - Threads Python configuration through the Linux packaging path so wheel-producing builds can be selected per Python version. - Merges Linux per-version artifacts into a combined Linux artifact for downstream consumption. ## Testing - Not run locally. This change adds CI pipeline definitions and packaging scripts only. ## Motivation and Context The CUDA plugin EP already has GitHub Actions CI coverage for Linux and Windows builds, but it did not yet have a matching Azure Pipelines packaging flow like the WebGPU plugin EP. Adding this packaging pipeline makes it possible to publish packaged CUDA plugin artifacts through the same official packaging infrastructure, while also supporting the newer CUDA 13.0 configuration and Linux aarch64 packaging where the required Docker image exists. ## Checklist - [x] Tests added/updated or not required - [x] Documentation updated or not applicable - [x] No breaking changes - [ ] CI passes

…#28045) ### Description Implements `ai.onnx.ml.LabelEncoder` on the CUDA execution provider for numeric key/value types using sorted arrays + binary search (O(log n) per element). **New files** (`onnxruntime/core/providers/cuda/ml/`): - `label_encoder_impl.cu` / `.h` — CUDA kernel: per-thread binary search on sorted keys, NaN-aware for float/double - `label_encoder.cc` / `.h` — Host-side op classes (`CudaLabelEncoder` for opset 2-3, `CudaLabelEncoder_4` for opset 4+). Constructor sorts keys, copies to GPU; `ComputeInternal` launches kernel. **Modified files**: - `cuda_execution_provider.cc` — Register 11 kernel variants (4 versioned opset 2-3, 7 opset 4+) - `provider_api.h` — Add missing `kMLDomain` constant (first ML-domain op on CUDA EP) - `docs/OperatorKernels.md` — Add `ai.onnx.ml` section to CUDA provider table **Supported type combinations**: | Opset | Types | |-------|-------| | 2-3 | `int64↔float`, `int64↔int64`, `float↔float` | | 4+ | Above + `double↔double`, `double↔int64`, `int64↔double` | String types remain CPU-only. NaN keys are placed at end of sorted array and short-circuited before binary search. **Tests**: 5 new test cases covering NaN-key-to-numeric-value mappings and double type combinations. Existing numeric tests (`FloatToInt64Opset2`, `Int64ToFloatOpset2`, etc.) will automatically run on CUDA via `OpTester::Run()`. ### Motivation and Context Models with large LabelEncoder nodes (>100k entries) force a CPU round-trip when all other nodes run on GPU. This adds the CUDA implementation to eliminate that data transfer bottleneck. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

microsoft#28109) …+ add NVIDIA 16x16x16 Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel) to a vendor-agnostic config-based approach. Any GPU reporting a matching subgroup matrix config from Dawn is now automatically supported. Key changes: - Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct containing {componentType, resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize, needsPrepack}. No architecture or backendType required. - Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16). - Remove vendor gate in matmul_nbits.cc call site. - Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16. - Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080). - 4 subgroups x 32 lanes = 128 threads per workgroup - 64x64 tile with 16x16 subgroup matrices - Bounds-checked output via scratch buffer for partial M tiles - Fix prepack shader OOB reads: add scalar fallback with zero-fill for partial blocks where M is not a multiple of kSgMatM. - Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match. Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend): - Correctness: model-qa.py with phi4-graph-prune produces identical output to D3D12 baseline - Prefill (phi4, l=1024): phi4-graph-prune | D3D12 DP4A | Vulkan DP4A | Vulkan TC (16x16x16) | Vulkan TC (16x16x16_128) -- | -- | -- | -- | -- Prefill (tps) | 3,134 | 6,389 | 7,089 | 10,744 - NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config ### Description  ### Motivation and Context

… (microsoft#28198) ## Description Add a GQA-capable unfused CUDA attention kernel that writes Q·K^T to an FP32 scratch buffer, fixing fp16/bf16 overflow producing NaN when `head_size > 256` at `scale=1.0` (issue microsoft#28195, e.g. Gemma 4 global attention layers with `head_dim=512`). ### Motivation Gemma 4 uses `head_dim=512` for its global attention layers (`num_attention_heads=8, num_key_value_heads=4`). Flash Attention and Memory-Efficient Attention cap at `head_size=256`, so these fall through to the unfused path. The existing unfused MHA runner produces NaN because even though cuBLAS accumulates in FP32, the Q·K^T output tensor is fp16 and overflows. Additionally, the MHA unfused runner cannot handle GQA (`q_num_heads != kv_num_heads`). ### Key Changes **New kernel** (`contrib_ops/cuda/bert/gqa_unfused_attention.cu/.h`): - 3-stage pipeline: QK GEMM → softmax → AV GEMM - QK GEMM uses `CUBLAS_COMPUTE_32F` with `CUDA_R_32F` output type — raw Q·K^T scores written to FP32 scratch, eliminating fp16 overflow - Reshape-Q trick for native GQA support (no K/V head replication needed) - Softmax supports causal mask, sliding window (`local_window_size`), softcap, additive attention bias, and per-batch `seqlens_k` - Per-batch `past` calculation for correct sliding-window masking with variable-length sequences **GQA contrib op integration** (`group_query_attention.cc`, `group_query_attention_impl.cu`): - Activates when Flash/MEA/XQA are all ineligible and KV cache is not quantized - Uses `PrepareQKV` for RoPE and K/V cache management, then routes to the new kernel **ONNX Attention op integration** (`attention.cc`, `attention.h`): - New `RunGqaUnfusedAttention` path for GQA and fp16/bf16 with `head_size > 128` - Handles BSNH↔BNSH transposes, past_key concatenation, attn_mask→bias conversion, `nonpad_kv_seqlen` - Optimized: transposes BSNH K/V directly into `present_key`/`present_value` when available **`UnpackRoPEAppend` kernel** (`group_query_attention_qkv.cuh`): - Raised `MAX_HEAD_SIZE` from 256 to 512 to support Gemma 4 head dimensions **Safety improvements**: - `SafeInt<size_t>` for workspace size arithmetic (overflow protection) - `static_assert` guarding GQA transpose paths against accidental float instantiation ### Testing - 8 new Gemma 4 regression tests in `test_gqa.py`: prompt/decode × fp16/bf16, softcap, sliding window, long past sequences - 2 new Gemma 4 benchmark configs in `benchmark_gqa.py` (global + local attention) - All `TestGQARegressions` tests pass locally (12/12) ### Fixes Fixes microsoft#28195

…crosoft#23268) (microsoft#28201) ### Description Adds a regression test covering the bug reported in microsoft#23268 and fixed in microsoft#23322. Before microsoft#23322, calling `quantize_static()` with an in-memory `ModelProto` whose weights were large enough (>= 1024 bytes) to trigger ONNX's external-data serialization would mutate the caller's `ModelProto` inside `save_and_reload_model_with_shape_infer` (via `onnx.save_model(..., save_as_external_data=True)`), clearing `raw_data` and pointing the tensor at a temp-directory path that was then deleted. Subsequent calibration would load the now-invalid proto and raise `onnx.onnx_cpp2py_export.checker.ValidationError`. PR microsoft#23322 addressed the issue by wrapping the input proto in `copy.deepcopy` inside `save_and_reload_model_with_shape_infer`, but no regression test was added. This PR adds that test. ### Motivation and Context Fixes microsoft#23268 (adds the missing regression test for the original bug). ### Changes - `onnxruntime/test/python/quantization/test_quant_issues.py` — adds `test_issue_23268_quantize_static_modelproto_no_validation_error` to `TestQuantIssues`. The test builds an in-memory Add model with a 32x32 float32 initializer (4096 bytes, above the 1024-byte external-data threshold), runs `quantize_static` against it with an inline minimal calibration reader, and asserts the call completes without raising and produces a quantized output file. No production code changes. ### Test Plan ``` python -m pytest onnxruntime/test/python/quantization/test_quant_issues.py -v ``` Both tests pass locally (`test_minimal_model` and the new `test_issue_23268_quantize_static_modelproto_no_validation_error`) in ~0.6s. `lintrunner -a` is clean.

…uested (microsoft#28027) ### Description When MultiHeadAttention has only 1 output (no present_key/present_value outputs), past key/value inputs should be completely ignored, matching CPU EP semantics. The WebGPU EP was passing pastKey/pastValue TensorViews to shader creation functions even when outputCount <= 1, which affected shader cache keys and allowed past data to leak into the attention computation. This caused the test "MultiHeadAttention Basic, one head and head-size=4 with pastKey and pastValue" to fail with output [17,18,19,20] (pastValue data) instead of expected [9,10,11,12] (V data). The failing output matches exactly what happens when past IS used: Q·pastKey=75 dominates Q·K=35, so softmax gives ~100% weight to pastValue. ### Fix In `applyAttention()`, introduce `effectivePastKey`/`effectivePastValue` that are set to `undefined` when `outputCount <= 1`. All downstream usage (shader creation, input arrays) uses these effective values instead of the raw parameters. This ensures: - Shader cache keys correctly reflect the "no past" configuration - Past tensors are never passed to any shader creation function - Behavior matches CPU EP (which ignores past when present outputs are null) - GQA is unaffected (always has outputCount >= 3) - Vanilla Attention is unaffected (always passes undefined for past)

### Description In the CPU RNN operator's \\Assign_Y_h\\ function, when \\sequence_lens\\ contains a value of 0, the computation \\sequence_lens[batch] - 1 = -1\\ produces a negative offset into the Y output buffer. \\CopyVector\\ then reads \\hidden_size\\ floats from heap memory before the buffer, leaking heap data into the \\Y_h\\ output tensor. LSTM and GRU already handle zero-length sequences correctly (early return + zero-fill in compute path), but the basic RNN operator had neither protection. ### Changes - **rnn.cc \\Compute()\\**: Add early return when \\max_sequence_length == 0\\ — zero-fills Y and Y_h outputs and returns immediately (matches existing LSTM/GRU pattern) - **rnn.cc \\Assign_Y_h()\\**: Add bounds check on \\last_time_step\\ before computing buffer offset — guards against both negative index (\\seq_lens=0\\) and index >= seq_length, zero-fills Y_h for invalid entries Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…icrosoft#28241) ### Description CI Python packaging pipelines now specify their packaging type (nightly vs. release) via an explicit pipeline parameter rather than the implicitly defined pipeline var `NIGHTLY_BUILD`. ### Motivation and Context Much less error prone than an implicitly defined pipeline variable.

### Description Fixes 3 ICM fixes: https://portal.microsofticm.com/imp/v5/incidents/details/31000000572208/summary https://portal.microsofticm.com/imp/v5/incidents/details/31000000573313/summary https://portal.microsofticm.com/imp/v5/incidents/details/31000000575583/summary ### Motivation and Context Fix ICM issues --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

This pull request makes a small change to the CUDA label encoder kernel to address unused parameter warnings. The change marks the `attr_name` parameter as unused in the `TryGetScalarTensorAttribute` function when building with the plugin execution provider. * Code quality improvement: * Marked the `attr_name` parameter as unused with `ORT_UNUSED_PARAMETER(attr_name);` to suppress compiler warnings when building with `BUILD_CUDA_EP_AS_PLUGIN`.

### Description Pass base timestamp for vitisai profiling Notify EP that profiling has started with the base timestamp (in nanoseconds since epoch) The VitisAI EP can use this to: 1. Calculate relative timestamps (event_ts - base_ts) for the profiling timeline 2. Store the absolute base timestamp if needed for other purposes ### Motivation and Context Due to onnxruntime default profiling json file just have the offset timestamp, it doesn't provider the base timestamp for VitisAI EP, To combine the VaitisAI timeline profiling info and the onnxruntime default profiling json file info, We need pass the timestamp for VitisAI EP. --------- Signed-off-by: Andrew Luo <junpengl@amd.com> Co-authored-by: Andrew Luo <junpengl@amd.com>

…nd (microsoft#28083) ## Summary Fixes a critical security vulnerability in the ONNX Runtime Python backend where user-controlled `kwargs` were applied to `SessionOptions` and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file overwrites. ## Vulnerability The `prepare()` method in `onnxruntime/python/backend/backend.py` iterated over user-controlled `kwargs` and used `setattr()` to apply them directly to a `SessionOptions` instance. The `hasattr()` check was not a security guard — it returned `True` for all exposed properties including dangerous ones like `optimized_model_filepath`. **Attack vector:** ```python onnxruntime.backend.prepare( model_path, optimized_model_filepath="/etc/passwd", # overwrites any file with protobuf binary graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL ) ``` The same pattern existed in `backend_rep.py` for `RunOptions`. ## Fix Replaced the unrestricted `hasattr/setattr` loop in both files with strict allowlists: - **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py` - **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py` **Both `SessionOptions` and `RunOptions` use identical validation logic** with three outcomes for each kwarg key: - **Allowlisted** — Applied via `setattr()` (e.g. `graph_optimization_level`, `log_severity_level`) - **Known-but-blocked** (real attribute on the object, but not on allowlist) — Raises `RuntimeError` (e.g. `optimized_model_filepath`, `terminate`) - **Completely unknown** (not a property on the object at all) — Silently ignored for forward compatibility (e.g. `nonexistent_option_xyz`) **Blocked dangerous attributes:** - `optimized_model_filepath` — triggers `Model::Save()`, overwrites arbitrary files with protobuf binary - `profile_file_prefix` — writes profiling JSON to arbitrary path - `enable_profiling` — causes uncontrolled file writes to cwd - `terminate` (RunOptions) — denies the current inference call - `training_mode` (RunOptions) — silently switches inference behavior in training builds ## Tests Added `TestBackendKwargsAllowlist` with 13 new test methods covering all exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted attrs (accepted), unknown attrs (silently ignored), and end-to-end `run_model()` paths for both session and run options. All 15 tests pass (13 new + 2 pre-existing in `TestBackend`), no regressions. ## Files Changed - `onnxruntime/python/backend/backend.py` - `onnxruntime/python/backend/backend_rep.py` - `onnxruntime/test/python/onnxruntime_test_python_backend.py` - `.agents/skills/python-kwargs-setattr-security/SKILL.md` --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@tianleiwu

…ttention (microsoft#28200) ### Description Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA PagedAttention op, enabling the operator on **sm<80 (Turing / Volta / Pascal) with fp16** for the first time. On sm>=80 the default FlashAttention path is unchanged; MEA is reachable via `ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider option for debugging and perf comparison. | Environment | Before | After | |---|:---:|:---:| | sm<80 + fp16 | ❌ error | ✅ MEA | | sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) | | sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) | | sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA | ### Motivation and Context The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80 support" — the op errors out immediately whenever FlashAttention isn't available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review, @tianleiwu flagged that the interface was too FlashAttention-specific (*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado agreed the FA-specific dependencies could be lifted at the kernel level. This PR closes that gap for sm<80 fp16 by mirroring the exact pattern established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for sm<80 GQA"). The same CUTLASS memory-efficient attention backend that covers GQA's sm<80 path now covers PagedAttention. Related work: - microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback) - microsoft#24595 — original PagedAttention PR - microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering - microsoft#27880 — ONNX Attention CUDA fallback coverage gaps - microsoft#27992 — MEA decode + unfused softcap work (same flavor) ### Implementation **Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred; fall back to MemoryEfficientAttention via `has_memory_efficient_attention(sm, is_half, is_bf16, head_size, head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 == 0`. This keeps us forward-compatible with any future expansion of MEA's supported range. **MEA path** (`UnfusedAttention<T>`): 1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV` (hoisted to `paged_attention.cc` so both FA and MEA paths consume a pre-populated buffer — single-producer refactor), rotary, packed-QKV unpack, `ReshapeAndCache`. 2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads, head_size]` buffer, folding in GQA head expansion (so downstream MEA sees `num_heads` uniformly). 3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via `seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr = cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No padding required; layout matches the kernel's expected `[total_tokens, num_heads, head_size]` with BSNH strides. **Scratch allocation**: the MEA path D->H syncs `cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain `total_kv_tokens` on the host for tight `gathered_key` / `gathered_value` / `fmha_buffer` allocation. This adds a forward-per-call `cudaStreamSynchronize` — acceptable for a compatibility fallback (FA remains the hot path on supported hardware). Over-allocation (the no-sync alternative) would consume `B × max_num_blocks_per_seq × block_size × num_heads × head_size × 2 × sizeof(T)`, which reaches GB-scale for realistic GQA models and was rejected. `fmha_buffer` is sized with `sizeof(float)` (matching the GQA EfficientAttention pattern at `group_query_attention.cc:482`) because MEA's output accumulator is fp32 regardless of input dtype. ### Testing New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs the existing parity matrix (rotary on/off, rotary_interleaved on/off, packed-QKV on/off, local window on/off, softcap 0/50, varied head sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider option set to `EFFICIENT_ATTENTION` (=2, from `AttentionBackend` enum). Using a per-session provider option instead of an env var means both FA and MEA test classes coexist in the same pytest process — each InferenceSession creates its own CUDA EP with its own `attention_kernel_options_`. The existing `TestPagedAttention` class is skipped wholesale on sm<80 by its `has_flash_attention()` gate, so without the new MEA class the fallback path would have no CI coverage. **Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3): ``` TestPagedAttention: 24/24 passed (~60s) # FA baseline — no regression TestPagedAttentionMEA: 24/24 passed (~59s) # new MEA path ``` Tolerance: `rtol = atol = 5e-3` against the same torch reference used by the FA parity test. All combinations match. **sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal hardware, so real-SM coverage relies on MS CI. The code path exercised on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on sm<80; only the underlying CUTLASS kernel (`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and those are upstream and unmodified by this change. **Build note**: built with `--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20 define was needed because the initial configure resolved `CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval` usage fails to compile. Unrelated to this change.

…pkg set (microsoft#28254) ### Description Remove `react-native` package from set of packages required for RC/release publishing. We will need to revisit this and decide whether to remove it entirely or properly fix it. ### Motivation and Context The React Native package is having build issues and we don't need it for the next few immediate releases.

### Description Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to the CoreML Execution Provider's MLProgram path. The builder decomposes QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the op's own schema function-body in `contrib_defs.cc:605-631` and the approach the QNN EP already uses in `qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram path is implemented; NeuralNetwork is deprecated on Apple Silicon. Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single `com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies the entire graph is claimed by the CoreML EP via `ExpectedEPNodeAssignment::All`. Verified with a negative test: temporarily removing the `CreateQuickGeluOpBuilder` registration causes the new test to fail with a `VerifyEPNodeAssignment` fatal failure, proving it genuinely exercises the CoreML path. Also updates `coreml_supported_mlprogram_ops.md`. ### Motivation and Context Fixes microsoft#28183. QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass (`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at `ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the default session optimization level. So any model that contains the `x * sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the DWPose pose estimator) gets silently mutated by ORT into a graph with `QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported primitives into 1 unsupported op, making the fusion strictly harmful for CoreML. On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and `ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the result on the CoreML EP: | ORT build | CoreML subgraphs | Inference (ms) | | --- | --- | --- | | main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 | | this PR (QuickGelu supported) | 10 | 13.91 | The remaining breaks are other ops — see "Related gaps" below. A ~4× speedup at EXTENDED level from this patch alone. Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where partial shape inference inhibits most fusions), 3 `QuickGelu` nodes still get produced — so this patch helps any CoreML user who hasn't explicitly downgraded to `ORT_ENABLE_BASIC`. ### Related CoreML EP gaps observed (out of scope for this PR) With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the EXTENDED-optimized DWPose pose model are: - **`com.microsoft:FusedConv`** (×4) — produced by `ConvActivationFusion`. Fuses `Conv + activation` into one node. Same failure mode as QuickGelu: `Conv` and the activations (`Relu`, `Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but the fused form isn't. Decomposition is straightforward — emit the underlying `conv` MIL op, then the corresponding activation. - **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) — `MatMul * alpha` with an optional transpose. Decomposition: `matmul` + scalar `mul`. - **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to fusion. CoreML MIL has a native `split` op; this one is a straight op-builder omission. Happy to send follow-up PRs for any of these after this one lands, following the same pattern. Flagging here so they're on the EP coverage roadmap. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tilities (microsoft#28227) This pull request significantly improves the safety and robustness of sparse tensor handling in ONNX Runtime. The main focus is on adding thorough bounds checking and using safe integer arithmetic to prevent overflows and invalid memory accesses when working with sparse tensor indices. Additionally, the Python bindings for sparse tensors are refactored to ensure correct object lifetimes and memory management when exposing data to NumPy. **Sparse Tensor Index Validation and Safety** * Added comprehensive bounds checks for COO and CSR sparse tensor indices in both the C API (`onnxruntime_c_api.cc`) and core conversion utilities, ensuring indices are within valid ranges and, for CSR, that outer indices are non-decreasing and within bounds. [[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485) [[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547) [[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696) [[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747) [[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273) [[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376) * Replaced direct arithmetic with `SafeInt` for all index and size calculations to prevent integer overflows, especially when converting between types or computing dense tensor offsets. [[1]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273) [[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2077-R2077) [[3]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2091-R2091) [[4]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2110-R2110) [[5]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2291-R2298) * Improved error messages for invalid indices, making debugging easier by providing more context about the specific error. [[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485) [[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547) [[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696) [[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747) [[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273) [[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376) **Python Bindings Improvements** * Refactored the pybind11 bindings for sparse tensor views so that NumPy arrays referencing sparse tensor memory correctly keep the parent Python object alive, preventing potential memory issues when the sparse tensor is on the GPU or managed by Python. [[1]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL98-R120) [[2]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL299-R304) [[3]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL314-R319) **General Code Quality** * Added missing header include for `safeint.h` to ensure `SafeInt` is available where needed. * Minor cleanups and improved assertions to clarify intent and ensure correctness. These changes collectively make sparse tensor support in ONNX Runtime safer, more reliable, and easier to use from both C++ and Python.

### Description vector_per_class dimension was not verified, it could lead to illegal memory access ### Motivation and Context security issue --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

…icrosoft#28248) ### Description  - Correct misleading 'SemVer 1.0.0' label; the universal version regex actually validates SemVer 2.0.0 syntax without build metadata, which is what Azure Universal Packages requires. - Prefix the dev short SHA with 'commit-' in universal_version so the pre-release identifier always contains a non-digit, avoiding spurious validation failures for all-numeric SHAs with leading zeros. ### Motivation and Context  Fix invalid version when we have an all-numeric commit SHA starting with 0.

…ls (microsoft#28214) This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR microsoft#27597 (commit 056bab3) which covered CPU and CUDA. ## Problem The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR microsoft#27597 fixed CPU and CUDA paths, WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads). ## Changes - **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation (ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB) - **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side validation with format-0 awareness - **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript validation using getBigInt64Array - **7 new C++ OOB test cases** across contrib and ONNX domains targeting WebGPU EP ## Security Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers. ## Testing - 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when WebGPU EP unavailable - JS/TS error tests not feasible with current JSONC test format (documented) - Build environment lacks C++20/emsdk for full compilation verification; validated structurally --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081) ### Description  Remove reinterpret_cast of OrtKernelInfo* to internal OpKernelInfo* that breaks ABI across DLL boundaries (vtable mismatch between plugin EP and ORT core). - KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp() - GetAllocator: use C API KernelInfoGetAllocator + IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo* - Remove #include core/framework/op_kernel_info.h (no longer needed) - Add IAllocatorWrappingOrtAllocator adapter ### Motivation and Context  Address crash observed when testing WebGPU plugin EP with older ORT 1.24.4 binary where the number of `onnxruntime::IExecutionProvider` virtual functions had changed between the two builds. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

## Changes - Update `requirements.txt` to `protobuf>=4.25.8` - Update `requirements-training.txt` to `protobuf>=4.25.8` - Update `requirements-dev.txt` to `protobuf>=4.25.8` - Update `docs/python/requirements.txt` to `protobuf>=4.25.8` ## Notes This change addresses the direct Python manifest surface only. It does not claim to resolve every transitive Component Governance finding. Co-authored-by: arajendra <arajendra@users.noreply.github.com>

### Description Update GatherBlockQuantized to support 2-bits. Updated op schema, implemented the CPU and WebGPU EP. This helps to make the model smaller.

rev some npm packages --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: guschmue <22941064+guschmue@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>

…osoft#28659) ### Description Fix the CUDA plugin EP package test pipeline failure where the plugin is built with the latest code (which includes `float8e8m0` and other newer data types), but the host ORT 1.26 release does not support these types. When the plugin attempts to register kernel type constraints containing unsupported types, `GetTensorDataType` fails and the plugin load crashes. ### Motivation and Context The plugin EP architecture allows plugins to be built against a newer version of the ONNX Runtime headers while being loaded into an older host ORT. However, the existing `KernelDefBuilder::TypeConstraint` methods call `GetTensorType` (which throws on unsupported types), making it impossible for a forward-compatible plugin to register kernels that include newer data types in their type constraint lists. ### Changes - Add `TryGetTensorType()` — a non-throwing variant of `GetTensorType()` that returns `nullptr` when the host ORT does not recognize a tensor element type. - Add `TryMLDataTypeToOrtDataType()` — a non-throwing variant of `MLDataTypeToOrtDataType()` that returns `nullptr` instead of asserting/throwing. - Update `KernelDefBuilder::TypeConstraint` (both vector and single-type overloads) to use the `Try` variants and gracefully skip unsupported types rather than failing. ### Impact - Plugins built with newer headers can now load into older host ORT releases without crashing on unknown data types. - If all types in a constraint list are unsupported, the constraint is simply not registered (the kernel will not match, which is the correct fallback behavior). - No behavioral change when the host supports all types — the code path is identical to before.

…osoft#28675) ## Description The `windows_x64_asan / build_x64` CI pipeline has been failing with OOM (out-of-memory) because ASan-instrumented test binaries consume significantly more memory than normal builds, and CTest was running them at full CPU-count parallelism. This PR adds a `--test_parallel` argument to `build.py` that allows CTest concurrency to be configured independently from the build parallelism (`--parallel`). It then uses `--test_parallel 4` in the Windows x64 ASan workflow to cap test execution to 4 parallel jobs, preventing OOM while keeping build parallelism at full speed. ## Motivation and Context - ASan instrumentation inflates per-process memory usage by ~2-3x. - The existing `--parallel` flag controls both MSBuild and CTest concurrency together; there was no way to keep fast parallel builds while limiting test concurrency. - The CI runner has limited memory, and running all tests in parallel under ASan exceeded available RAM. ## Changes | File | Change | |------|--------| | `tools/ci_build/build_args.py` | Add `--test_parallel` argument (default: `None`, falls back to `--parallel`) | | `tools/ci_build/build.py` | Add `number_of_test_parallel_jobs()` helper; use it for CTest `--parallel`; validate negative values | | `.github/workflows/windows_build_x64_asan.yml` | Pass `--test_parallel 4` to cap ASan test concurrency | ## Testing - `python -m py_compile tools/ci_build/build.py tools/ci_build/build_args.py` — passes - `python tools/ci_build/build.py --help` shows the new `--test_parallel` option - `git diff --check` — no whitespace issues - When `--test_parallel` is omitted, behavior is unchanged (falls back to `--parallel` value)

…6345 (microsoft#28524) This pull request introduces comprehensive validation and error handling improvements for the ConvTranspose operator across CPU, CUDA, WebGPU, and XNNPACK backends, as well as in shape inference and unit tests. The main focus is to ensure that invalid input shapes (especially rank-0 or rank-1 tensors) are properly detected and reported, preventing undefined behavior and improving robustness. Additionally, error messages are clarified, and several helper functions now return `Status` for better error propagation. **Validation and Error Handling Improvements:** * All ConvTranspose implementations (CPU, CUDA, WebGPU) now explicitly check that input `X` and filter `W` tensors have at least 3 dimensions, returning clear error messages if not. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R65-R79)`, `[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dR273-R289)`, `[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafR22-L28)`) * The shape inference function for `ConvTransposeWithDynamicPads` now fails gracefully with descriptive errors if input or weight tensors have fewer than 2 dimensions. (`[onnxruntime/core/graph/contrib_ops/contrib_defs.ccL62-R67](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L62-R67)`) * Additional validation ensures that `output_padding` and dynamic pads have correct sizes, and that `output_padding` values are within ONNX-specified limits. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R138-R153)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`) **Refactoring for Robustness:** * Helper functions such as `ComputePadsAndOutputShape` and `ComputeTransposePadAndOutputShape` now return `Status`, allowing errors to propagate and be handled appropriately rather than causing crashes or silent failures. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L165-R234)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L194-R262)`, `[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L220-R282)`, `[[4]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`) * All call sites (CPU, CUDA, WebGPU, XNNPACK) are updated to handle and propagate these errors using `ORT_RETURN_IF_ERROR` or `ORT_THROW_IF_ERROR`. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`, `[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dL362-R379)`, `[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafL48-R60)`, `[[4]](diffhunk://#diff-6a2f8672090f25850b90b266aff3c7212552fc81b14bb7b539e9e5161c9fd526L494-R497)`) **Unit Test Enhancements:** * New negative tests are added to verify that rank-0 and rank-1 weight tensors are properly rejected and produce the expected error messages, increasing test coverage and reliability. (`[onnxruntime/test/contrib_ops/conv_transpose_with_dynamic_pads_test.ccR22-R56](diffhunk://#diff-cb5bfc51d0c8096922eb674d142f0e970d5becd140b47bdfd7729a06a818b598R22-R56)`) **Minor Code Quality Improvements:** * Improved memory management in the CPU implementation by wrapping the allocated buffer in `BufferUniquePtr` immediately to prevent leaks if exceptions are thrown. (`[onnxruntime/core/providers/cpu/nn/conv_transpose.ccR79-R89](diffhunk://#diff-0dcb5a9c8ba0c4e67940e9d77f77cb706bbf82d67bf6757967099b0a69c797b5R79-R89)`) * Minor includes and type safety improvements (e.g., use of `SafeInt` for overflow protection). (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R22)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`) **Summary of Most Important Changes:** **1. Validation and Error Handling** - All ConvTranspose implementations now check that input and filter tensors have at least 3 dimensions, returning clear errors if not. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R65-R79)`, `[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dR273-R289)`, `[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafR22-L28)`) - Shape inference for `ConvTransposeWithDynamicPads` fails with descriptive errors for invalid input or weight tensor ranks. (`[onnxruntime/core/graph/contrib_ops/contrib_defs.ccL62-R67](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L62-R67)`) - Additional checks for `output_padding` and dynamic pads sizes and values, with ONNX spec compliance. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R138-R153)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`) **2. Error Propagation and Refactoring** - Helper functions now return `Status` and propagate errors; all call sites updated to handle these errors. (`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L165-R234)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L194-R262)`, `[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L220-R282)`, `[[4]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`, `[[5]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dL362-R379)`, `[[6]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafL48-R60)`, `[[7]](diffhunk://#diff-6a2f8672090f25850b90b266aff3c7212552fc81b14bb7b539e9e5161c9fd526L494-R497)`) **3. Unit Testing** - Added tests to ensure invalid weight tensor ranks are rejected with proper error messages. (`[onnxruntime/test/contrib_ops/conv_transpose_with_dynamic_pads_test.ccR22-R56](diffhunk://#diff-cb5bfc51d0c8096922eb674d142f0e970d5becd140b47bdfd7729a06a818b598R22-R56)`) **4. Code Quality** - Improved buffer management and type safety in CPU backend. (`[[1]](diffhunk://#diff-0dcb5a9c8ba0c4e67940e9d77f77cb706bbf82d67bf6757967099b0a69c797b5R79-R89)`, `[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R22)`, `[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`)

- disable dynamic wgsl template tesst - disable shader cache key checks

…osoft#28369) BiasLoader hardcoded 128-bit vectorized loads (`ElementsPerAccess = 128/sizeof_bits = 8` for fp16) regardless of the `isAligned` template flag. When bias stride was not a multiple of 8, the unaligned kernel was selected but BiasLoader still used 128-bit loads → `cudaErrorMisalignedAddress`. **Fix**: Use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of hardcoded 8. Tested with Gemma4 Attention + mask at all seq lengths 1–32. --------- Signed-off-by: Justin Chu <justinchu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…rosoft#27707) ## Summary - Fix ORT raising "does not have type information set by parent node" when a subgraph references an initializer declared in the outer (parent) graph without explicit `value_info` in the subgraph - Propagate type info from implicit input defs to subgraph NodeArgs before subgraph verification in `VerifyNodeAndOpMatch` - Add regression test with an `If` node whose subgraph references an outer scope initializer without `value_info` ## Motivation Fixes microsoft#24880 When a node's op schema type inference function does not invoke subgraph inferencing (e.g., contrib ops like `BeamSearch`, `GreedySearch`, `WhisperBeamSearch`, `Sampling`), `InferAndVerifySubgraphTypes` is never called. This means type info from outer scope values — such as initializers declared in the parent graph — is never propagated to the subgraph's NodeArgs. When the subgraph is later verified in the second pass of `VerifyNodeAndOpMatch`, nodes referencing these outer scope values fail with a null type error. The existing workaround in `convert_generation.py` (manually adding `value_info` entries for moved initializers) confirms this gap in the type propagation path. ## Changes **`onnxruntime/core/graph/graph.cc`**: In `VerifyNodeAndOpMatch`'s subgraph verification loop, propagate type info from the containing node's `implicit_input_defs` to the subgraph's NodeArgs before calling `VerifyNodeAndOpMatch` on the subgraph. The propagation is guarded by `subgraph_nodearg->Type() == nullptr`, making it a safe no-op for standard ONNX ops (If/Loop/Scan) where `InferAndVerifySubgraphTypes` already set the types. For nested subgraphs, the recursive call to `VerifyNodeAndOpMatch` handles propagation at each level. **`onnxruntime/test/ir/graph_test.cc`**: Add `OuterScopeInitializerTypeInfoPropagatedToSubgraph` test that constructs a model proto with an `If` node whose subgraphs reference an outer graph initializer without `value_info`, and verifies `Model::Load` (which calls `Graph::Resolve`) succeeds. ## Test Plan - [ ] New C++ unit test `OuterScopeInitializerTypeInfoPropagatedToSubgraph` verifies model resolution succeeds - [ ] Existing `graph_test.cc` tests continue to pass (no regression in type inference for standard ONNX ops) - [ ] Existing control flow tests (If/Loop/Scan) continue to pass - [ ] CI lint checks pass (verified locally with `lintrunner`)

…lash_nvcc_threads, and enable quick build mode (microsoft#28645) ## Description Speed up CUDA CI build times by splitting the monolithic CUDA provider into architecture-specific OBJECT libraries with independent `nvcc --threads` control, and introducing a quick build mode (`onnxruntime_QUICK_BUILD`) that reduces kernel instantiations for CI validation. ## Motivation and Context CUDA builds were bottlenecked by `--nvcc_threads 1` across all targets because flash attention (48 .cu files, SM80+) requires ~4GB per nvcc thread and caused OOM when compiled with higher thread counts. The old heuristic in `build.py` used `psutil` to auto-detect memory but was unreliable and always conservative. By splitting flash attention into its own OBJECT library, the rest of the build can safely use `--threads 4` while flash attention stays at `--threads 2`. Combined with quick build mode (fewer kernel variants), this significantly reduces CI wall-clock time. ## CI Time Saving * N1F1: `--nvcc_threads 1`. CI time is from checks of [PR 28607](https://github.com/microsoft/onnxruntime/pull/28607/checks). * N4F2: `--nvcc_threads 4 --flash_nvcc_threads 2`: CI time is from this PR. * N8F4: `--nvcc_threads 8 --flash_nvcc_threads 4`: CI time is from this PR. * N4F4: `--nvcc_threads 4 --flash_nvcc_threads 4`: CI time is from this PR. This is the final candidate. * Saving = N1F1 - N4F4 * Saving Ratio = (N1F1 - N4F4) / N1F1 Here is CI time (Build + Test time in minutes) saving: CI | N1F1 | N4F2 | N8F4 | N4F4 | Saved Minutes | Saving Ratio -- | -- | -- | -- | -- | -- | -- Linux CI | 35 + 38 | 35 + 32 | 35 + 32 | 36 + 27 | 10 | 14% Windows CI | 58 + 36 | 53 + 38 | 54 + 38 | 48 + 36 | 10 | 11% Plugin Linux CI | 53 + 26 | 38 + 17 | 39 + 39 | 39 + 15 | 25 | 32% Plugin Windows CI | 77 + 16 | 57 + 14 | 54 + 14 | 53 + 12 | 28 | 30% Windows TRT CI | 54 + 43 | 38 + 38 | 42 + 43 | 41 + 37 | 19 | 20% Note that this is only one time comparison. Cache might take effect with more runs, and might change the statistics. The CI time is reduced in the range of 11% to 32%. Total CI time saving is more than 90 minutes. ## Key Changes ### 1. CMake: Architecture-specific OBJECT Libraries | File | Change | |------|--------| | `cmake/onnxruntime_cuda_source_filters.cmake` | New macros: `onnxruntime_extract_flash_attention_sources()`, `onnxruntime_extract_llm_sources()`, `onnxruntime_extract_sm_specific_cuda_sources()` to partition sources by SM arch | | `cmake/onnxruntime_providers_cuda.cmake` | Create `flash_attention` (SM80+), `llm` (SM75+), `sm90_tma`, and `sm120_tma` OBJECT libraries with per-target `--threads`; merge fpA_intB SM90 launchers into SM90 TMA lib | | `cmake/onnxruntime_providers_cuda_plugin.cmake` | Mirror OBJECT library pattern for plugin EP build; consolidate shared compile options into a variable; fix `-Xcudafe --diag_suppress=550,2810` and `--std c++20` for CUDA 12.8 compatibility | | `cmake/onnxruntime_unittests.cmake` | Link new OBJECT libraries into test target | ### 2. Build Script: `--flash_nvcc_threads` and Default 4 | File | Change | |------|--------| | `tools/ci_build/build.py` | Remove `psutil`-based memory heuristic; add `--flash_nvcc_threads` forwarding; default `nvcc_threads` to 4 | | `tools/ci_build/build_args.py` | Add `--flash_nvcc_threads` CLI argument (default: same as `--nvcc_threads`) | ### 3. Quick Build Mode (`onnxruntime_QUICK_BUILD`) - Reduces flash attention kernels to hdim128 fp16 only (skips hdim32/64/96/192/256) - Guards some MoE SM90 generated launchers with `#ifndef ORT_QUICK_BUILD` - Restricts CUTLASS SM80 tile configs to 3 instantiations - Skips test cases that depend on excluded kernel variants (e.g., `test_gqa_fp8_fallback_unsupported_head_size` needs hdim64) - Applied to all CI pipelines **except** Linux CUDA CI (full build) and packaging pipelines ### 4. CI and Packaging Pipeline Updates All CUDA CI pipelines updated from `--nvcc_threads 1` to `--nvcc_threads 4 --flash_nvcc_threads 4`: - `.github/workflows/linux_cuda_ci.yml` - `.github/workflows/linux_cuda_plugin_ci.yml` (+ `QUICK_BUILD=ON`) - `.github/workflows/linux_tensorrt_ci.yml` (+ `QUICK_BUILD=ON`) - `.github/workflows/windows_cuda.yml` (+ `QUICK_BUILD=ON`) - `.github/workflows/windows_cuda_plugin.yml` (+ `QUICK_BUILD=ON`) - `.github/workflows/windows_tensorrt.yml` (+ `QUICK_BUILD=ON`) Packaging pipeline updated to use `--nvcc_threads 4 --flash_nvcc_threads 2`, except `--nvcc_threads 2 --flash_nvcc_threads 1` for cuda plugin: - Azure Pipelines: `custom-nuget-packaging-pipeline.yml`, `nuget-win-cuda-packaging-stage.yml`, `plugin-win-cuda-stage.yml`, `py-win-gpu-stage.yml` - Linux scripts: `build_cuda_plugin_package.sh`, `build_linux_python_package.sh` ### 5. Bug Fix: CUTLASS Heuristic for SIMT Kernels - `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc`: Fixed `ORT_QUICK_BUILD` path to return proper tile config for SIMT (float) gemm type instead of discarding the type info ## Architecture Mapping | OBJECT Library | Min SM | Sources | Threads | |---|---|---|---| | `*_flash_attention` | SM80+ | `bert/flash_attention/*.cu` (48 files) | `onnxruntime_FLASH_NVCC_THREADS` (default: same as nvcc_threads) | | `*_llm` | SM75+ | `contrib_ops/cuda/llm/*.cu` (excl. SM90/SM120 launchers) | `onnxruntime_NVCC_THREADS` (default 4) | | `*_sm90_tma` | 90a-real | MoE TMA + fpA_intB SM90 launchers | `onnxruntime_NVCC_THREADS` | | `*_sm120_tma` | SM120+ | MoE SM120 TMA generated files | `onnxruntime_NVCC_THREADS` | | Parent target | All archs | Everything else | `onnxruntime_NVCC_THREADS` | ## New Build Options - `--nvcc_threads N` (default 4) — threads for all CUDA targets except flash attention - `--flash_nvcc_threads N` (default: same as `--nvcc_threads`) — threads specifically for flash attention compilation CMake cache variables: `onnxruntime_NVCC_THREADS`, `onnxruntime_FLASH_NVCC_THREADS` ## Testing - Built locally with `CMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100;120"`, `--nvcc_threads 4 --flash_nvcc_threads 2` - Verified flash attention .cu files compile only for SM80+ (checked `build.ninja` / VS project) - Verified LLM .cu files compile for SM75+ - Ran `onnxruntime_provider_test` — all CUDA EP tests pass - Ran `python test_qmoe_cuda.py` (MoE kernels), flash attention / GQA tests - No link errors in both in-tree provider and plugin EP builds - No nvcc warnings about duplicate `--threads` flags - Plugin CI compile options verified: `--std c++20`, `-Xcudafe --diag_suppress=550,2810`, MSVC `/bigobj` all applied to OBJECT libraries

### Summary Lower ONNX `Sin` and `Cos` to the CoreML ML Program `sin` / `cos` elementwise ops via the existing `UnaryOpBuilder`, registered in the op builder factory. Like `Erf` / `Round` / `Exp`, these have no NeuralNetwork lowering (`UnaryFunctionLayerParams` has no sin/cos), so `IsOpSupportedImpl` rejects them on the NeuralNetwork format. ### Why `Sin` / `Cos` form the sinusoidal timestep embedding of diffusion UNets. Supporting them keeps that prologue on CoreML instead of splitting the graph — a tiny Stable-Diffusion UNet goes from **2 CoreML partitions → 1, zero graph breaks** with this change alone. This PR is **independent** of the rest of the series (it touches only the unary builder) and can be reviewed/merged in any order. ### Tests (`coreml_basic_test.cc`) - `SinCos_MLProgram` — a Sin + Cos graph runs fully on CoreML and matches the CPU reference. - `SinCosNeuralNetworkNotSupported` — the same graph falls back to CPU on the NeuralNetwork format. Doc: `coreml_supported_mlprogram_ops.md` lists `Sin` and `Cos`. ### Series — CoreML EP coverage for transformer / diffusion graphs - microsoft#28595 — Support bool Cast in ML Program *(prerequisite)* - **microsoft#28596 — Add Sin and Cos unary ops** *(this PR — independent)* - microsoft#28597 — Add Where and And builders *(depends on microsoft#28595)* - microsoft#28598 — Add GatherND builder *(depends on microsoft#28595)* Together with microsoft#28278 (scalar-`Gather`), the series takes BERT / GPT-2 / ViT / diffusion-UNet graphs — tiny and full-size — from 2 CoreML partitions to 1, with zero graph breaks. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

### Description TRT 11.0 removes many deprecated APIs, so guard TRT-EP code accordingly to support TRT 11 builds. ### Motivation and Context Fixes compilation with TRT 11.0 Signed-off-by: Kevin Chen <kevinch@nvidia.com>

…p-webgpu/MIN_ONNXRUNTIME_VERSION` (microsoft#28687) ### Description  Bake the contents of `plugin-ep-webgpu/MIN_ONNXRUNTIME_VERSION` into the plugin EP library as the `ORT_PLUGIN_EP_MIN_ORT_VERSION` preprocessor definition and pass it to `ApiInit()` so the EP refuses to load against an older ORT runtime. Rework `ApiInit()` to strictly parse the runtime version string as "MAJOR.MINOR.PATCH", optionally enforce a caller-supplied minimum, require MAJOR == 1, and use MINOR as the API version. All failure modes now throw `std::runtime_error` with a descriptive message. ### Motivation and Context  Enforce minimum ORT version for WebGPU plugin EP as specified in the minimum ORT version file. Previously, the version was hardcoded in `ApiInit()`. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…icrosoft#28682) ## Description Use fp32 accumulation in SkipLayerNormalization, SkipSimplifiedLayerNormalization, and EmbedLayerNormalization CUDA kernels to avoid overflow and improve numerical accuracy when processing fp16/bf16 data. The original implementation accumulated mean and variance statistics in the input data type (fp16/bf16), which can overflow for large hidden sizes or when input values have large magnitude. This change promotes all intermediate accumulation (mean, variance, normalization math) to fp32, matching the approach used by TensorRT-LLM's LayerNorm kernels. ## Motivation - fp16 has limited range (max ~65504) and precision (10-bit mantissa). Accumulating `x²/ld` across thousands of elements in fp16 easily overflows or loses precision. - bf16 has even less precision (7-bit mantissa), making accumulation errors more severe. - The fix is straightforward: cast to float before accumulating, compute normalization in float, cast back to the output type. ## Key Changes | File | Change | |------|--------| | `layer_norm.cuh` | Changed `LayerNorm`, `SimplifiedLayerNorm`, `LayerNormSmall`, `SimplifiedLayerNormSmall` to accept and operate on `float` for thread_data, epsilon, mu, rsigma. Removed unused `KeyValuePairSum` overloads for half/bfloat16. | | `skip_layer_norm_impl.cu` | Changed `SkipLayerNormKernel` and `SkipLayerNormKernelSmall` to accumulate in fp32 (`cub::KeyValuePair<float, float>`). Removed `maybe2half` helper (no longer needed). | | `embed_layer_norm_impl.cu` | Changed epsilon from `T` to `float`, accumulation to use `float` thread_data. | | `profile_skip_layer_norm.py` | New profiling script for nsys-based kernel timing analysis. | | `profile_skip_layer_norm.sh` | Shell wrapper for running nsys profiling. | | `parse_nsys.py` | Utility to parse nsys SQLite output and extract CUDA kernel timings. | ## Performance Results Profiled on NVIDIA GPU with nsys (B=1, seq_len=2048, fp16 data, 200 iterations, skip first 5 warmup): | Hidden Size | fp16 accum (μs) | fp32 accum (μs) | Regression | |---|---|---|---| | 768 | 3.81 | 3.81 | **0.0%** | | 1024 | 4.22 | 4.22 | **0.0%** | | 4096 | 13.01 | 13.03 | **+0.15%** (noise) | | 8192 | 28.94 | 28.94 | **0.0%** | **No measurable performance regression.** The kernel is memory-bandwidth-bound, so fp32 arithmetic is completely hidden behind memory latency. ## Testing - Existing unit tests pass (SkipLayerNorm, EmbedLayerNorm ops). - Profiling scripts added for reproducible performance measurement: ```bash cd onnxruntime/test/python/transformers nsys profile -o sln_fp16 --export=sqlite python profile_skip_layer_norm.py --mode fp16 --warmup 5 --repeat 100 python parse_nsys.py sln_fp16.sqlite --skip-first 5 ``` ## Related PRs microsoft#28442 microsoft#15660

This pull request makes a targeted update to the operator schema in the ONNX Runtime codebase, specifically clarifying the optional nature of certain outputs. Schema definition improvements: * Marked the `present_key` and `present_value` outputs as optional in the `ONNX_MS_OPERATOR_SET_SCHEMA` macro within `bert_defs.cc`, making the operator schema clearer and more flexible for consumers.

…rosoft#28260) ### Summary Add per-annotation-ID buffer managers and captured command storage so multiple generators can each capture and replay their own graph independently without cross-contamination Add ReleaseGraph API through the full ORT stack (EP base → C API → InferenceSession → plugin EP) to release captured commands and GPU buffers when a generator is destroyed Replace the single graph_buffer_mgr_ / is_graph_captured_ bool with per_graph_buffer_mgrs_ map and captured_graph_ids_ set keyed by annotation ID Use a std::function getter with cached pointer pattern in GpuBufferAllocator to dynamically route allocations to the active per-graph buffer manager during runs, while keeping Alloc/Free as simple pointer dereferences ### Motivation Edge's Prompt API speed benchmark creates multiple sessions/generators sequentially with graph capture enabled. With the existing single-graph design, the second generator replays the first generator's captured commands with wrong buffers, producing incorrect output and ultimately a QuotaExceededError in the browser. This PR isolates each generator's graph capture state so they don't interfere with each other. ### Related PR The GenAI side change is in microsoft/onnxruntime-genai#2106, which calls SessionReleaseGraph when a generator is destroyed to release the captured graph's GPU buffers. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: qjia7 <4221210+qjia7@users.noreply.github.com>

### Description We have internal alerts asking to update protobuf version in response to CVE-2026-0994. The alert asks tools/ci_build/github/linux/docker/inference/aarch64/python/cpu/scripts/requirements.txt to be updated to 5.29.6 while the rest are asked to be set to 6.33.5.

…28280) ### Description ## Summary Adds two WebGPU-only graph fusions and the contrib ops they target, plus a small refactor of the existing `MatMulNBits` dispatch logic so the new fused kernels can share its predicates. | Component | Files | Purpose | |---|---|---| | **`MatMulNBitsMlp` op + kernel** | `contrib_ops/webgpu/quantization/matmul_nbits_mlp.{cc,h}`, `*.wgsl.template` (3) | Fuses the SwiGLU MLP block: optional `(Skip)SimplifiedLayerNormalization` + two `MatMulNBits` projections (gate, up) + optional biases + `Sigmoid`/`Mul` (SiLU) + element-wise `Mul`. Single dispatch instead of 5–7. | | **`MatMulNBitsQkv` op + kernel** | `contrib_ops/webgpu/quantization/matmul_nbits_qkv.{cc,h}`, `*.wgsl.template` | Fuses `(Skip)SimplifiedLayerNormalization` + three `MatMulNBits` projections (Q, K, V) sharing the same input. Single dispatch instead of 4. | | **Op schemas** | `core/graph/contrib_ops/contrib_defs.cc` | `MatMulNBitsMlp` and `MatMulNBitsQkv` contrib op schemas (kMSDomain, opset 1). | | **Graph transformers** | `core/optimizer/matmul_nbits_{mlp,qkv}_fusion.{cc,h}` | Pattern-match the source subgraphs and emit the fused ops. EP-gated to WebGPU only — no impact on other EPs. Registered in `graph_transformer_utils.cc`. | | **Dispatch helpers** | `contrib_ops/webgpu/quantization/matmul_nbits_common.{cc,h}` + `matmul_nbits.cc` | Extracts the "would this dispatch use Subgroup-Matrix / DP4A / WideTile?" predicates into pure functions reusable by the fused kernels. No behavior change in the unfused `MatMulNBits` path. | | **Tests** | `test/optimizer/matmul_nbits_{mlp,qkv}_fusion_test.cc`, `graph_transform_utils_test.cc` | Unit tests for the new transformers (positive + negative cases). | ### Motivation and Context ~25-30% decode TPS throughput improvement on WebGPU + D3D backend on Windows. GPU used: RTX 5060Ti for Qwe3-1.7B. BEFORE (**95 decode TPS**): main branch <img width="344" height="140" alt="image" src="https://github.com/user-attachments/assets/0f5d7cfb-05f9-4f25-acb5-4becb8f5addd" /> AFTER (**120+ decode TPS**): PR branch <img width="359" height="134" alt="image" src="https://github.com/user-attachments/assets/f1254d8e-a400-4dbb-9d06-ab6116f929bb" /> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

…ft#28681) This pull request significantly improves the numerical correctness and robustness of L1 and L2 reduction operations (norms) for integer types on both CPU and CUDA backends. The main changes address integer overflow, undefined behavior, and precision loss in norm calculations, especially for edge cases like minimum representable integers and large accumulations. The changes also ensure consistency between CPU and CUDA implementations, and add detailed documentation for future maintainability. **Numerical correctness and overflow handling for integer norm reductions:** * On CPU, L1 and L2 reductions (`ReduceAggregatorL1` and `ReduceAggregatorL2` in `reduction_ops.h`) now accumulate in double precision to avoid integer overflow and undefined behavior, with Kahan summation for int64+ to minimize precision loss. Results are clamped to the maximum representable value to prevent overflow. [[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805) [[2]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R814-R893) * Introduced a `saturating_abs` function on CPU and a device-side `Impl_SaturatingAbs` kernel on CUDA to safely compute the absolute value for signed integer types, clamping to `max()` if `abs(min())` would overflow. [[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805) [[2]](diffhunk://#diff-f7138acd21464814d1793c9d334bee07d0cbe69719691e67efe3b2e23e4d06c7R516-R566) [[3]](diffhunk://#diff-945ab1deb57e1ff44b790cb2054537c252e23b7c7c374f44da66475361910abdR110-R119) **CUDA backend improvements and consistency:** * For integer reductions on CUDA, the input is cast to double before reduction, and the result is cast back to integer with saturating semantics (using PTX `cvt.sat`), matching the CPU's explicit clamping. This avoids precision loss and undefined behavior. * For no-op reductions (where input and output counts are equal), norm operations now use the saturating absolute value kernel to ensure non-negative results, even for edge-case values like `INT_MIN`. [[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L209-R217) [[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L592-R607) [[3]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L778-R794) **Documentation and maintainability:** * Added detailed comments explaining the rationale and numerical properties of the new implementations, including why double precision is used, the limitations of float, and the behavior for large reductions and edge cases. [[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805) [[2]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R814-R893) [[3]](diffhunk://#diff-f7138acd21464814d1793c9d334bee07d0cbe69719691e67efe3b2e23e4d06c7R516-R566) [[4]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L807-R861) These changes make norm reductions for integer types safe, mathematically correct, and consistent across CPU and CUDA, even for extreme or previously problematic inputs.

…ove shader key validation to nightly build (microsoft#28674) ### Description  Allow shader code to be dumped to the file specified in the `ORT_WEBGPU_EP_SHADER_DUMP_FILE` environment variable. Previously, shader code was only dumped by verbose logging. Create new nightly CI pipeline to run shader key validation test. That test is removed from the CI pipeline in microsoft#28642. ### Motivation and Context  More shader dump output options. Moving shader key validation test.

…u_inc/cub.cuh" wrapper. (microsoft#28705) ### Description  Replace direct inclusion of `<cub/cub.cuh>` with `"core/providers/cuda/cu_inc/cub.cuh"` wrapper. The wrapper accounts for a problematic macro definition which causes issues. ### Motivation and Context  Fix pipeline build error.

…mic quantization (microsoft#28228) ## Summary - Fix `quantize_dynamic(per_channel=True)` so weights quantized per-channel produce a `DequantizeLinear` node with the correct `axis` attribute. - Stop dropping the channel axis when `quantize_weight_per_channel` populates `QuantizedValue` (was hardcoded to `None`). - Gate the scalar-scale assertion in `_dequantize_value` on `axis is None` so per-channel scales (1-D tensors) are accepted. ## Motivation Fixes microsoft#19997. When a model is quantized with `quantize_dynamic(..., per_channel=True)` and a per-channel weight reaches `_dequantize_value` (e.g. via `_dequantize_outputs` when the weight is in the graph outputs), two bugs surface: 1. `quantize_weight_per_channel` stores `QuantizedValue.axis = None` even though it received a real `channel_axis`, so the per-channel information is lost. 2. `_dequantize_value` (a) asserts `scale_init.size == 1`, which fails for a 1-D per-channel scale, and (b) builds the `DequantizeLinear` node without an `axis` attribute, producing an invalid ONNX node when the model is consumed. PR microsoft#22283 (Nov 2024) softened the assertion against `None`-typed scales but left the underlying axis-propagation bug in place. ## Changes - `onnxruntime/python/tools/quantization/onnx_quantizer.py` - `quantize_weight_per_channel`: pass `channel_axis` (was `None`) into `QuantizedValue`. - `_dequantize_value`: only require a scalar scale on the per-tensor path (`axis is None`); forward `axis=quantized_value.axis` to `onnx.helper.make_node("DequantizeLinear", ...)`. `make_node` silently omits the attribute when `axis` is `None`, so the per-tensor path is unchanged. - `onnxruntime/test/python/quantization/test_quant_issues.py` - New regression test `test_dynamic_quantize_per_channel_emits_axis_attribute` that builds a minimal MatMul model with the weight routed to a graph output (to force the `_dequantize_outputs` -> `_dequantize_value` path), runs `quantize_dynamic(per_channel=True)`, and asserts the emitted `DequantizeLinear` has the `axis` attribute and a 1-D multi-element scale initializer. ## Test Plan - `python -m pytest onnxruntime/test/python/quantization/test_quant_issues.py -xvs` — new test passes; existing test skipped as before. - `python -m pytest onnxruntime/test/python/quantization/test_op_matmul.py` — 7 passed, 8 skipped (no regression). - `python -m pytest onnxruntime/test/python/quantization/test_qdq.py -k per_channel` — 1 passed. - `lintrunner -a` on changed files: clean.

…Def (microsoft#28608) ## Summary `utils::MakeComputeCapability` is the shared helper used by `utils::CreateSupportedPartitions` to build an `IndexedSubGraph::MetaDef` from a group of supported nodes. When a supported group contains a control-flow op (`Loop`, `If`, `Scan`), `MakeComputeCapability` currently walks only `node->InputDefs()` and silently drops the outer-scope captures (`node->ImplicitInputDefs()`). The captures never enter `meta_def->inputs`, so after `Graph::FinalizeFuseSubGraph` the fused node's `InputDefs()` is missing them — the EP that owns the fused subgraph has no boundary value-info for the captured tensors and cannot bind them at Compute time. This PR adds a second loop in `MakeComputeCapability` that walks `node->ImplicitInputDefs()` with the same "produced inside the partition → skip, otherwise add to subgraph inputs" semantics already applied to `InputDefs()`. ## Why this is the right fix `onnxruntime::Node` partitions inputs into two arrays by design: - `InputDefs()` — formal operand list as declared in the op's ONNX schema. - `ImplicitInputDefs()` — outer-scope SSA values referenced from inside body subgraphs of `Loop` / `If` / `Scan`. These are real boundary inputs at runtime (the body kernel reads them) but they don't appear in the op's formal operand list. `Graph::FinalizeFuseSubGraph` consumes only `meta_def->inputs` to populate the fused node's `InputDefs()` and rewire outer-scope edges. So whatever `MakeComputeCapability` puts in `meta_def->inputs` is what ends up at the fused-node boundary. Omitting `ImplicitInputDefs()` here means the captures are unreachable downstream — there is no other place that can patch them back in. The fix is intentionally a mirror of the existing `InputDefs()` loop (same `Contains(node_outputs, ...)` produced-inside check, same `ordered_subgraph_inputs.push_back` ordering). The new loop runs after the explicit loop so explicit-operand index ordering in `meta_def->inputs` is preserved (EPs that have implicitly relied on `meta_def->inputs[i].name == node.InputDefs()[i].name` for non-control-flow op groups are not perturbed). ## Scope of impact Only EPs that consume `utils::MakeComputeCapability` / `utils::CreateSupportedPartitions` are affected. A quick audit: | EP | Uses `partitioning_utils::MakeComputeCapability`? | Affected by bug? | |---|---|---| | Plugin EPs (`EpGraphSupportInfo_AddNodesToFuse` → `PluginExecutionProvider::GetCapability`) | yes, in `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | **yes** | | `internal_testing_ep` (used by ORT's own unit tests) | yes, in `onnxruntime/test/internal_testing_ep/internal_testing_execution_provider.cc` | **yes** | | TensorRT, MIGraphX, NV-TRT-RTX, VitisAI | no — they build `MetaDef::inputs` themselves and already walk `ImplicitInputDefs()` (e.g. `tensorrt_execution_provider.cc:2084`, `migraphx_execution_provider.cc:735`) | no | | DML / CPU / CUDA / ROCm / OpenVINO / QNN / CANN / WebGPU / CoreML | don't use it for Loop/If/Scan fusion paths | no | So the impact is bounded to the plugin EP architecture (ORT 1.23+) and the in-tree testing EP — both of which delegate boundary calculation to this shared helper. ## Reproduction The bug is reproducible against this repo's `internal_testing_ep`. No external code required. A minimal repro model with a Loop body that captures an outer-scope tensor `B`: ```python # build_repro.py — produces a ~1.5 KB onnx import numpy as np, onnx from onnx import TensorProto, helper as h, numpy_helper as nph A = h.make_tensor_value_info("A", TensorProto.FLOAT, ["N", 2, 2]) B = h.make_tensor_value_info("B", TensorProto.FLOAT, [2, 2]) out = h.make_tensor_value_info("v_final", TensorProto.FLOAT, [2, 2]) acc_init = nph.from_array(np.zeros((2, 2), np.float32), name="acc_init") cond_init = nph.from_array(np.array([1], np.bool_), name="cond_init") sq_ax = nph.from_array(np.array([0], np.int64), name="sq_ax") body = h.make_graph( nodes=[ h.make_node("Gather", ["A", "iter"], ["slice"], axis=0), h.make_node("Add", ["slice", "B"], ["tmp"]), # captures outer B h.make_node("Add", ["acc_in", "tmp"], ["acc_out"]), h.make_node("Identity", ["cond_in"], ["cond_out"]), ], name="loop_body", inputs=[h.make_tensor_value_info("iter", TensorProto.INT64, []), h.make_tensor_value_info("cond_in", TensorProto.BOOL, []), h.make_tensor_value_info("acc_in", TensorProto.FLOAT, [2, 2])], outputs=[h.make_tensor_value_info("cond_out", TensorProto.BOOL, []), h.make_tensor_value_info("acc_out", TensorProto.FLOAT, [2, 2])], ) g = h.make_graph( nodes=[ h.make_node("Shape", ["A"], ["M_1d"], start=0, end=1), h.make_node("Squeeze", ["M_1d", "sq_ax"], ["M"]), h.make_node("Loop", ["M", "cond_init", "acc_init"], ["v_final"], body=body), ], name="loop_with_outer_capture", inputs=[A, B], outputs=[out], initializer=[acc_init, cond_init, sq_ax], ) onnx.save(h.make_model(g, opset_imports=[h.make_opsetid("", 16)]), "loop_with_outer_capture.onnx") ``` Observable bug path (against any EP using `CreateSupportedPartitions`, e.g. `InternalTestingExecutionProvider`): ```cpp // Claim every node (Shape/Squeeze/Constant/Loop) as compiled. SessionOptions so; InferenceSession session(so, env); session.RegisterExecutionProvider( std::make_unique<InternalTestingExecutionProvider>(/*supported=*/{...})); session.Load("loop_with_outer_capture.onnx"); session.Initialize(); // In EP::Compile, iterate fused_node.InputDefs(): // for (const auto* in : fused_node.InputDefs()) std::cerr << in->Name() << "\n"; // BEFORE this fix: only "A" is printed (Shape(A) makes A explicit; // B is consumed only via Loop's ImplicitInputDefs and gets dropped). // AFTER this fix: both "A" and "B" are printed. ``` A small unit-test fixture exercising the same path can be added to `onnxruntime/test/providers/partitioning_utils_test.cc` following the existing `CheckAllNodesProcessed` pattern, asserting that `result[0]->sub_graph->GetMetaDef()->inputs` contains `B` when the supported group includes the Loop. ## What this PR changes A single hunk in `onnxruntime/core/providers/partitioning_utils.cc::MakeComputeCapability`, immediately after the existing `for (const auto* input : node->InputDefs()) { ... }`: ```cpp // Region-bearing ops (Loop/If/Scan) reference outer-scope SSA values via // ImplicitInputDefs rather than InputDefs. When an EP claims the whole // control-flow op, those implicit captures must also be in MetaDef::inputs // so FinalizeFuseSubGraph can rewire the outer-scope edges onto the fused // node's InputDefs. Without this, plugin EPs that fuse Loop/If/Scan lose // the captures at the fused-node boundary and cannot resolve them at // Compute time. for (const auto* input : node->ImplicitInputDefs()) { if (!input->Exists()) { continue; } if (!Contains(node_outputs, input)) { if (!Contains(subgraph_inputs, input)) { subgraph_inputs.insert(input); ordered_subgraph_inputs.push_back(input); } } } ``` ## Risks / migration - **No ABI change.** `MakeComputeCapability` signature unchanged. `IndexedSubGraph::MetaDef` schema unchanged. - **No semantic regression for op groups without control flow.** The new loop only adds elements; for partitions that contain no `Loop` / `If` / `Scan`, `ImplicitInputDefs()` is empty on every node and the new loop is a no-op. - **Behavior change for plugin EPs that fuse Loop/If/Scan.** Their fused node's `InputDefs()` gains the captures. EPs that were silently fishing out captures via a workaround (e.g. walking the original Loop node's `ImplicitInputDefs()` themselves at Compile time) would see those names show up via the standard fused-node `InputDefs()` API. Audit above shows no in-tree EP that uses `partitioning_utils` had such a workaround — TRT / MIGraphX / etc. roll their own MetaDef without calling `MakeComputeCapability`. ## Validation - Verified the fix end-to-end against a downstream plugin EP that claims a `Loop` node as part of a fused partition (Loop body captures an outer-scope tensor): without this fix, the EP cannot resolve the captured tensor name at the fused-node boundary; with the fix the captured tensor appears in `fused_node.InputDefs()` and session initialization + the EP's Compile both succeed. - No `partitioning_utils.cc` changes between `origin/main` and the patch base, so it applies cleanly. - Existing `onnxruntime_test_all --gtest_filter=PartitioningUtilsTest.*` cases still pass (the fix only adds behavior for control-flow ops; non-control-flow partitions are byte-for-byte identical to before).

This pull request strengthens security checks around loading external tensor data in ONNX Runtime, particularly to prevent malicious models from referencing unsafe file paths or in-memory address markers that could lead to arbitrary file access or unsafe memory dereferencing. The changes introduce stricter validation for external data paths and add explicit rejections for ORT in-memory address markers found in model protobufs, along with new and improved regression tests to verify this behavior. **Security hardening for external data loading:** * Added `ValidateExternalFilePathForTensor` to enforce that external data paths are validated for all code paths loading external data (including those outside `Graph::Resolve`), rejecting absolute or directory-escaping paths and passing through only trusted in-memory markers. This is now called in `GetExtDataFromTensorProto` and `LoadExtDataToTensorFromTensorProto` to ensure defense-in-depth. [[1]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1568-R1596) [[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1760-R1762) * Updated the validation logic for sparse tensor sub-tensors with `ValidateSparseSubTensorExternalDataPath`, clarifying the handling of in-memory markers and ensuring only legitimate file paths are accepted. * Changed `SparseTensorProtoToDenseTensorProto` to use the new sparse sub-tensor validation for both values and indices. **Model loading and graph construction protections:** * In `Graph::Graph`, added explicit rejection of ORT in-memory address markers in sparse tensor attributes and initializers when loading from a protobuf, preventing attackers from crafting models that could cause unsafe memory access during sparse-to-dense conversion or initializer resolution. [[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1268-R1282) [[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1322-R1331) [[3]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1373-R1380) **Expanded and improved testing:** * Added new unit tests to verify that absolute and directory-escaping external paths are rejected even when loading tensors directly (not via graph resolution), and that in-memory address markers are not accepted in dense or sparse initializers loaded from protobufs. [[1]](diffhunk://#diff-d75ec5db9cc4642f78b6ff568aff6d10398fc211b0fb7c862d3ec88738e3eda6R1156-R1217) [[2]](diffhunk://#diff-1d3978c99d95a56af0f2603bdd0b10cf02bdc1cecbd4fe5db353a8c8388696efR1365-R1484) * Updated an optimizer initializer test to reflect the new error handling for invalid external data paths.

microsoft#28695) ## Description Add a flash attention-style tiled computation path to the CPU GroupQueryAttention operator for quantized KV cache (INT8/INT4). Instead of materializing the full `[B, N, S, T]` attention probability matrix, this processes K/V in L2-cache-sized blocks with online softmax — reducing peak memory from O(S×T) to O(S×Bc) per head where Bc is the KV block size. Additionally, implements **flash decoding** for the decode phase (S=1): when `batch×heads < threads`, idle threads are repurposed to partition the KV sequence across parallel workers. Each worker computes partial softmax statistics on its KV chunk, then a lightweight reduce phase merges the partials — achieving 2–5x decode speedup for long sequences. ### Motivation For long-sequence LLM inference with quantized KV cache on CPU: - **Prefill**: The full attention matrix allocation becomes a significant memory bottleneck. With 16 heads and S=4096, the naive path allocates ~1 GB for attention scores alone. The tiled approach reduces peak memory by 13–24x and latency by 1.2–2.7x. - **Decode**: When batch size is small relative to available threads, many threads sit idle. Flash decoding partitions the KV sequence across these idle threads, achieving 2–5x speedup for long KV lengths. ## Key Changes | File | Change | |------|--------| | `onnxruntime/core/mlas/lib/flashattn_qkv.cpp` | MLAS kernel: tiled prefill with online softmax, flash decoding (two-phase KV partitioning), and reduce | | `onnxruntime/core/mlas/inc/mlas_qkv_quant.h` | `MlasFlashAttentionQuantizedKVArgs` struct with `flash_decoding_partials` and `kv_chunk_count` fields | | `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` | `ApplyAttentionQuantizedFlash()` with L2-cache-aware block sizing, KV concat, flash decoding setup | | `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Dispatch logic: activates flash path when no softcap/smooth softmax/output_qk | | `cmake/onnxruntime_mlas.cmake` | Added `flashattn_qkv.cpp` to the MLAS build | | `docs/contrib_ops/cpu/gqa.md` | Documentation with algorithm details, benchmark results, and reproduction steps | | `onnxruntime/test/mlas/bench/bench_qkv_quant.cpp` | MLAS-level C++ benchmark (`BM_GQA_Naive` vs `BM_GQA_Flash`) | | `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py` | Operator-level Python benchmark | ## Algorithm ### Prefill (S > 1): Tiled Flash Attention Per (batch, head, q_block) tile: 1. **QK GEMM** — `MlasQKGemm` on a block slice of quantized K cache 2. **Causal + local window masking** — Set masked positions to -inf before softmax 3. **Online softmax** — Track running max `m` and sum `l`, rescale accumulated output with `exp(m_old - m_new)` 4. **SV accumulation** — Dequantize V block to FP32, then accumulate weighted V into output ### Decode (S = 1): Flash Decoding When `sequence_length == 1 && batch_size * num_heads < thread_count && kv_chunk_count > 1`: **Phase 1 — Parallel KV scan**: Each idle thread processes a disjoint KV chunk for a (batch, head) pair. For each chunk: compute QK dot products, find local max, compute local softmax sum, and accumulate partial weighted V output. Store per-chunk `(max_score, sum_exp, partial_output[head_size])` into a partials buffer. **Phase 2 — Reduce**: One thread per (batch, head) merges all chunk partials using the log-sum-exp trick: find global max, rescale each chunk's sum and partial output, then normalize by global sum. This is analogous to GPU flash decoding (Dao et al.) but adapted for CPU threading. ### Activation Conditions Flash path activates when ALL of: - `ORT_GQA_DISABLE_FLASH_ATTENTION` env var is not set - `total_sequence_length > 1` - No softcap, no smooth softmax, no output_qk (attention bias IS supported) Flash decoding additionally requires: - `sequence_length == 1` (decode phase) - `batch_size * num_heads < thread_count` (idle threads available) - `kv_chunk_count > 1` (enough KV to partition) ## Benchmark Results Measured on Intel Xeon Platinum 8480C, 96 CPUs, threads=8. MLAS-level C++ benchmark. ### Latency — Prefill (S = T) Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128. | Seq Length | Naive (ms) | Flash (ms) | Speedup | Quant | |---:|---:|---:|---:|:---| | 512 | 9.9 | 8.1 | 1.2x | per-tensor | | 1024 | 44.4 | 27.0 | 1.6x | per-tensor | | 2048 | 190.9 | 116.9 | 1.6x | per-tensor | | 4096 | 1257.8 | 461.6 | 2.7x | per-tensor | | 512 | 10.7 | 10.8 | 1.0x | per-channel | | 1024 | 49.5 | 41.7 | 1.2x | per-channel | | 2048 | 212.1 | 164.1 | 1.3x | per-channel | | 4096 | 1223.9 | 607.8 | 2.0x | per-channel | ### Latency — Decode (S = 1, no flash decoding) Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128. Flash decoding NOT active (batch×heads=16 > threads=8). | Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant | |---:|---:|---:|---:|:---| | 512 | 32 | 22 | 1.4x | per-tensor | | 1024 | 71 | 47 | 1.5x | per-tensor | | 2048 | 120 | 87 | 1.4x | per-tensor | | 4096 | 210 | 174 | 1.2x | per-tensor | | 512 | 53 | 31 | 1.7x | per-channel | | 1024 | 86 | 52 | 1.7x | per-channel | | 2048 | 172 | 97 | 1.8x | per-channel | | 4096 | 299 | 191 | 1.6x | per-channel | ### Latency — Flash Decoding (S = 1, KV partitioned across threads) Shape: B=1, num_heads=4, kv_num_heads=4 (MHA), head_size=128. Flash decoding IS active (batch×heads=4 < threads=8). | Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant | |---:|---:|---:|---:|:---| | 512 | 31 | 25 | 1.2x | per-tensor | | 1024 | 41 | 25 | 1.6x | per-tensor | | 2048 | 67 | 34 | 2.0x | per-tensor | | 4096 | 197 | 54 | 3.7x | per-tensor | | 512 | 25 | 28 | 0.9x | per-channel | | 1024 | 72 | 27 | 2.7x | per-channel | | 2048 | 144 | 37 | 3.9x | per-channel | | 4096 | 304 | 60 | 5.1x | per-channel | ### Peak Memory — Prefill | Seq Length | Naive Peak | Flash Peak | Memory Reduction | |---:|---:|---:|---:| | 2048 (N=16) | +294 MB | +44 MB | 6.7x | | 4096 (N=16) | +1107 MB | +82 MB | 13.5x | | 4096 (N=32) | +2131 MB | +87 MB | 24.5x | **Summary**: Prefill gains 1.2–2.7x latency + 7–24x memory reduction from tiled online softmax. Decode gains 1.2–1.8x from fused dequant+dot alone. Flash decoding adds 2–5x for long sequences when idle threads are available to partition the KV scan. ### How to Reproduce ```bash # Build ORT python tools/ci_build/build.py --build_dir build/cpu --config Release \ --parallel --build_wheel --skip_tests # MLAS-level C++ benchmark: cd build/cpu/Release ./onnxruntime_mlas_benchmark \ --benchmark_filter='BM_GQA_(Naive|Flash)' \ --benchmark_min_time=0.5s \ --benchmark_repetitions=3 \ --benchmark_report_aggregates_only=true ``` ## Testing - All 35 CPU `GroupQueryAttentionTest.*` tests pass (INT8/INT4, per-tensor/per-channel, multi-batch, large head, GQA ratio variants) - Set `ORT_GQA_DISABLE_FLASH_ATTENTION=1` to verify fallback path still works - End-to-end verified with `quantized_kv_cache_cpu_demo.py` - Numerical agreement between flash and naive paths: max diff < 1e-7

…t#28710) ### Description  Add a new `--paths` option to `compile_contributors.py` to limit git history queries using pathspecs. Apply the path filter to both base and target git log collection and log the active path filter in logs.txt. ### Motivation and Context  Allow `compile_contributors.py` to be used for releases where relevant changes are largely limited to a subset of the codebase. E.g., we can limit the paths to WebGPU EP-related files for the WebGPU plugin EP release. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Address Copilot review feedback on initial submission: - migraphx_stream_handle.cc: set own_stream_=false when wrapping an external user_compute_stream so ~MIGraphXStream does not destroy a stream it did not create (was: own_stream_ defaults true, causing hipStreamDestroy on the caller's stream at session teardown). - migraphx_execution_provider.cc (OnRunEnd): honor the sync_stream parameter instead of always querying/synchronizing. ORT passes sync_stream=false in certain paths (e.g. during HIP graph capture); ignoring it caused unexpected blocking. - migraphx_execution_provider.cc (Sync): correct misleading comment that claimed 'no default-stream sync' — when stream_ is nullptr the fallback is still hipStreamSynchronize(nullptr), a full-device sync. Comment now accurately documents the fallback behaviour. - migraphx_execution_provider_info.cc (ToProviderOptions): add user_compute_stream to the round-trip map so the option is not silently dropped when provider options are queried via GetProviderOptions. - migraphx_execution_provider_info.h (std::hash): include user_compute_stream in the hash so info structs that differ only in external stream are not collapsed to the same map key.

yuslepukhin and others added 30 commits April 23, 2026 15:59

anzzraju1997-glitch and others added 27 commits May 26, 2026 15:29

Update GatherBlockQuantized to support 2-bits (microsoft#28530)

97cfc86

### Description Update GatherBlockQuantized to support 2-bits. Updated op schema, implemented the CPU and WebGPU EP. This helps to make the model smaller.

rev some npm packages (microsoft#28547)

e0a8dac

rev some npm packages --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

remove some webgpu ci tasks to speedup ci (microsoft#28642)

a42351e

- disable dynamic wgsl template tesst - disable shader cache key checks

Guard removed APIs in TRT 11 (microsoft#28611)

53fcead

### Description TRT 11.0 removes many deprecated APIs, so guard TRT-EP code accordingly to support TRT 11 builds. ### Motivation and Context Fixes compilation with TRT 11.0 Signed-off-by: Kevin Chen <kevinch@nvidia.com>

zhihuidu-amd force-pushed the migraphx-fix-stream-synchronization branch from 7b6955d to 20e6217 Compare May 29, 2026 11:24

zhihuidu-amd force-pushed the migraphx-fix-stream-synchronization branch from 20e6217 to 60c5ecc Compare May 29, 2026 13:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MIGraphX EP] Fix stream_ nullptr bug and enable end-to-end HIP graph capture. Identical diff and commit message as PR 2a. The AMD ROCm fork #241

[MIGraphX EP] Fix stream_ nullptr bug and enable end-to-end HIP graph capture. Identical diff and commit message as PR 2a. The AMD ROCm fork #241
zhihuidu-amd wants to merge 841 commits into
ROCm:mainfrom
zhihuidu-amd:migraphx-fix-stream-synchronization

zhihuidu-amd commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

zhihuidu-amd commented May 29, 2026

Problem

Bug 1: stream_ is never assigned — every sync hits the null stream

Bug 2: RegisterStreamHandlers ignores the external stream — HIP graph capture is incomplete

Bug 3 (consequence of Bug 1+2): user_compute_stream is silently ignored

Fix

Backward compatibility

Measured impact

Files changed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Bug 1: `stream_` is never assigned — every sync hits the null stream

Bug 2: `RegisterStreamHandlers` ignores the external stream — HIP graph capture is incomplete

Bug 3 (consequence of Bug 1+2): `user_compute_stream` is silently ignored