Skip to content

[MIGraphX EP] Fix stream_ nullptr bug and enable end-to-end HIP graph capture. Identical diff and commit message as PR 2a. The AMD ROCm fork #241

Open
zhihuidu-amd wants to merge 841 commits into
ROCm:mainfrom
zhihuidu-amd:migraphx-fix-stream-synchronization
Open

[MIGraphX EP] Fix stream_ nullptr bug and enable end-to-end HIP graph capture. Identical diff and commit message as PR 2a. The AMD ROCm fork #241
zhihuidu-amd wants to merge 841 commits into
ROCm:mainfrom
zhihuidu-amd:migraphx-fix-stream-synchronization

Conversation

@zhihuidu-amd
Copy link
Copy Markdown

Problem

Three related bugs in the MIGraphX EP cause incorrect stream synchronization
and prevent end-to-end HIP graph capture when an external stream is provided
via the user_compute_stream ProviderOptions key.

Bug 1: stream_ is never assigned — every sync hits the null stream

hipStream_t stream_ = nullptr is declared in the header and never assigned
in the constructor, even when user_compute_stream is present in
ProviderOptions. As a result:

  • Sync() calls hipStreamSynchronize(nullptr) — the HIP legacy default
    stream
    , which has implicit synchronization with all other streams on
    the device. Under concurrent request serving this serialises every inference
    through a single global sync point.
  • OnRunEnd() calls hipStreamQuery(stream_) and
    hipStreamSynchronize(stream_) on the null stream — always sees it complete
    immediately (the null stream is never "busy" from the EP's perspective),
    masking the case where the actual compute is still in flight.

Bug 2: RegisterStreamHandlers ignores the external stream — HIP graph capture is incomplete

RegisterStreamHandlers hardcodes use_existing_stream=false (acknowledged
by the /*TODO:external_stream_*/ comment). ORT therefore creates a fresh
hipStreamNonBlocking stream for H2D/D2H memory copies, separate from the
MIGraphX compute stream. When a HIP graph capture is in progress, memory
copies on this separate stream are not recorded, producing an incomplete
graph that replays compute but re-issues host-device transfers on every call.

Bug 3 (consequence of Bug 1+2): user_compute_stream is silently ignored

The ProviderOptions parsing infrastructure (migraphx_execution_provider_info)
has no user_compute_stream key, so the pointer passed by the caller is never
stored. Bugs 1 and 2 would exist even if it were.

Fix

migraphx_execution_provider_info.h

  • Add constexpr auto kUserComputeStream = "user_compute_stream"sv;
  • Add void* user_compute_stream{nullptr}; field to MIGraphXExecutionProviderInfo

migraphx_execution_provider_info.cc

  • Add AddValueParser for kUserComputeStream, parsing the address string
    and storing it in user_compute_stream. Follows the exact pattern used for
    kGpuExternalAlloc / kGpuExternalFree.

migraphx_execution_provider.cc

  • Constructor: if (info.user_compute_stream) stream_ = static_cast<hipStream_t>(info.user_compute_stream);
  • RegisterStreamHandlers: const bool use_existing_stream = (stream_ != nullptr); — resolves the TODO.
  • Sync(): sync stream_ (or null if unset); remove the unconditional null-stream sync that caused the global stall.
  • OnRunEnd(): query/sync check_stream = stream_ ? stream_ : nullptr — same fix.

Backward compatibility

When user_compute_stream is not provided (the common case), stream_
remains nullptr. RegisterStreamHandlers creates a new
hipStreamNonBlocking stream as before. Sync() falls back to syncing
that session stream (via nullptr), which is at worst equivalent to the
old behaviour and strictly more correct for single-session use.

Measured impact

Tested on AMD MI300X (gfx942), ROCm 7.x, ORT 1.23.2, via Triton Inference
Server 25.x, production ONNX ranking model (843 nodes, dynamic batch 1–256):

  • Eliminates the global device stall on every session.Run().
  • Enables end-to-end HIP graph capture: H2D copy → MIGraphX compute → D2H
    copy all captured together, allowing zero-overhead graph replay.
  • Quantified latency improvement: results forthcoming from an in-progress A/B
    experiment (will update this PR).

Files changed

onnxruntime/core/providers/migraphx/migraphx_execution_provider_info.h (+2)
onnxruntime/core/providers/migraphx/migraphx_execution_provider_info.cc (+11)
onnxruntime/core/providers/migraphx/migraphx_execution_provider.cc (+21/-10)

This is a cherry-pick of the same fix submitted to microsoft/onnxruntime
as PR microsoft#28715. Filing separately because rocm/vllm ships the AMD fork and
the fix is needed there to land in production before the upstream merge.

yuslepukhin and others added 30 commits April 23, 2026 15:59
…soft#28189)

This pull request addresses a bug where in-memory-only initializer
markers (such as `_ORT_MEM_ADDR_`) could be incorrectly serialized into
ONNX models when using custom initializer handling. The changes ensure
that any pre-existing initializers are cleared before re-adding them via
the custom handler, preventing duplicate or stale data. Additionally, a
regression test is added to verify correct behavior after large
initializers are externalized.

**Bug Fixes:**

* Clear pre-existing initializers (and sparse initializers, if enabled)
from `GraphProto` and subgraph protos before re-adding them with custom
initializer handling, preventing duplicate or stale initializers
(including `_ORT_MEM_ADDR_` markers) from being serialized.
(`onnxruntime/core/graph/graph.cc`)
[[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR5107-R5116)
[[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR5208-R5217)

**Testing Improvements:**

* Add a regression test to verify that after converting large
initializers into OrtValues (which creates in-memory markers), custom
initializer handling does not serialize `_ORT_MEM_ADDR_` markers into
the output model, and that all initializers are correctly inlined.
(`onnxruntime/test/ir/graph_test.cc`)

Closing microsoft#28158
…n fusion (microsoft#28168)

### Description
Add correctness tests for fusions introduced in
microsoft#27883 and
microsoft#27747. The tests
introduced in those PRs only check if fusion went through but not if the
fused nodes produced semantically right results as the unfused
subgraphs. Adding those tests to prevent accidental breakage in case
something changed in the fused node's backing kernel implementation.

### Motivation and Context
Adress test coverage gap
…8096)

# Add exponential-backoff option for thread pool spin loop

## Description

This PR adds an opt-in exponential-backoff mode to the thread pool's
idle spin loop, complementing the configurable `spin_duration_us`
introduced in microsoft#27916. When enabled, each spin iteration emits a
geometrically increasing number of `SpinPause()` calls (1, 2, 4, …
capped at `spin_backoff_max`), which reduces pause-instruction density
and lowers CPU/power usage during the spin window—particularly on hybrid
(P/E core) and mobile platforms. The iteration count is automatically
scaled so the wall-clock spin budget targeted by `spin_duration_us` is
preserved.

The idea is adapted from microsoft#23278
(microsoft#21545 and
microsoft#22315) which showed
measurable power and latency improvements on Intel Meteor Lake by
reducing busy-wait density. This PR makes the technique opt-in and
composable with the time-bounded spin knob from microsoft#27916, so users can
independently control *how long* to spin and *how densely* to spin.

## Summary of Changes

### Core thread pool (`EigenNonBlockingThreadPool.h`)

| File | Change |
|------|--------|
| `include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h` | Add
`ThreadPoolWaiter` inner class implementing exponential backoff; add
`NormalizeBackoff()` and `ScaleSpinCountForBackoff()` helpers; replace
bare `SpinPause()` in `WorkerLoop` with `waiter.wait()`; store
`spin_backoff_max_` member and accept it in constructor |

### Configuration plumbing

| File | Change |
|------|--------|
|
`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`
| New config keys `session.intra_op.spin_backoff_max` and
`session.inter_op.spin_backoff_max` |
| `include/onnxruntime/core/platform/threadpool.h` | Add
`spin_backoff_max` parameter to `ThreadPool` constructor (default `1`,
backward-compatible) |
| `onnxruntime/core/common/threadpool.cc` | Forward `spin_backoff_max`
to `ThreadPoolTempl` |
| `onnxruntime/core/util/thread_utils.h` | Add `spin_backoff_max` field
to `OrtThreadPoolParams` |
| `onnxruntime/core/util/thread_utils.cc` | Pass `spin_backoff_max` into
`ThreadPool`; log it in `operator<<(OrtThreadPoolParams)` |
| `onnxruntime/core/session/inference_session.cc` | Add
`ParseSpinBackoffMax()` helper; parse & apply both intra-op and inter-op
config keys |

### Perf test CLI & benchmark script

| File | Change |
|------|--------|
| `onnxruntime/test/perftest/command_args_parser.cc` | New
`--spin_backoff_max` flag |
| `onnxruntime/test/perftest/ort_test_session.cc` | Apply flag to
session options |
| `onnxruntime/test/perftest/test_configuration.h` | New
`spin_backoff_max` field in `RunConfig` |
| `tools/perftest/benchmark_spin_settings.py` | New benchmark script
that runs `onnxruntime_perf_test` across a matrix of spin settings
(duration × backoff) and reports latency, throughput, CPU% |

## Key Design Decisions

1. **Default preserves existing behavior.** `spin_backoff_max = 1` means
one `SpinPause()` per iteration—identical to today. No performance
change for users who don't opt in.
2. **Wall-clock budget preservation.** When backoff is enabled, the
iteration count is divided by `spin_backoff_max` so the total number of
`SpinPause()` calls—and therefore the approximate spin duration—stays
the same as the non-backoff path.
3. **Composable with `spin_duration_us`.** Backoff and time-bounded
spinning are orthogonal knobs. Users can use either independently or
combine them (e.g., `spin_duration_us=1000` + `spin_backoff_max=8`).
4. **Subordinate to `allow_spinning`.** When spinning is disabled,
`spin_backoff_max` is ignored—same as `spin_duration_us`.

## Session option usage

```cpp
// Enable exponential backoff with cap 8, combined with 1ms time-bounded spinning
session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000");
session_options.AddConfigEntry("session.intra_op.spin_backoff_max", "8");
```

## Benchmark Results

Benchmarks run on an Intel i9-13900KF (6P cores / 12 threads under
WSL2), 32 GB RAM, Release build with CPU EP, using
`tools/perftest/benchmark_spin_settings.py`. Each configuration was
repeated 3–5 times (median latency, mean throughput/CPU reported).
Duration: 10 seconds per run.

### SqueezeNet (5 MB CNN) — 16 intra-op threads, 5 repeats

High thread count amplifies spin contention, making this the most
illustrative test:

| Config | `spin_duration_us` | `spin_backoff_max` | Avg Latency (ms) |
Throughput (IPS) | CPU % |
|--------|---|---|:-:|:-:|:-:|
| `default` | (legacy) | (legacy) | 3.243 | 303.0 | 1245.8 |
| `no_spin` | — | — | 5.489 | 176.3 | 332.7 |
| `spin_1000` | 1000 | — | 1.870 | 514.5 | 1214.6 |
| `spin_2000` | 2000 | — | 2.040 | 478.2 | 1219.9 |
| `backoff_8` | (legacy) | 8 | 3.268 | 303.4 | 1257.4 |
| `spin_1000_backoff_4` | 1000 | 4 | 1.849 | 513.8 | 1221.4 |
| **`spin_1000_backoff_8`** | **1000** | **8** | **1.835** | **534.5** |
**1221.1** |
| `spin_2000_backoff_8` | 2000 | 8 | 2.050 | 470.3 | 1223.2 |

**Best: `spin_1000_backoff_8`** — **43% lower latency**, **76% higher
throughput** vs default, while using **2% less CPU**.

### SqueezeNet — 8 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 1.578 | 628.7 | 826.5 |
| `no_spin` | 3.742 | 261.1 | 322.9 |
| `spin_1000` | 1.547 | 618.3 | 826.2 |
| `backoff_8` | 1.545 | 628.7 | 830.1 |
| `spin_1000_backoff_8` | **1.519** | **657.5** | 838.6 |
| `spin_2000_backoff_8` | **1.503** | 634.9 | 832.8 |

**Best: `spin_1000_backoff_8`** — **3.7% lower latency**, **4.6% higher
throughput** vs default.

### DistilBERT (254 MB Transformer) — 4 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 30.468 | 31.4 | 329.3 |
| `no_spin` | 33.483 | 28.8 | 284.5 |
| `spin_1000` | 30.421 | 31.5 | 338.1 |
| `backoff_8` | 30.254 | 31.3 | 344.4 |
| **`spin_1000_backoff_8`** | **29.583** | **31.8** | 340.5 |

**Best: `spin_1000_backoff_8`** — **2.9% lower latency** vs default.

### DistilBERT — 8 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 23.194 | 41.4 | 672.1 |
| `no_spin` | 32.548 | 32.2 | 395.0 |
| `spin_1000` | 23.291 | 41.3 | 675.3 |
| **`backoff_8`** | **22.995** | **43.2** | 705.3 |
| `spin_1000_backoff_8` | 23.535 | 41.1 | 662.8 |

**Best: `backoff_8`** — **0.9% lower latency**, **4.3% higher
throughput** vs default.

### Summary

- **`spin_1000_backoff_8`** is the most consistent best performer across
models and thread counts.
- Benefits grow with thread count: from ~3% at 4T to **43–76%** at 16T.
- No throughput regressions observed in any backoff configuration vs its
non-backoff equivalent.
- Backoff configs use slightly less CPU than raw spinning while
achieving higher throughput — a win-win on power/efficiency.

## Testing

- **Backward compatibility:** Default `spin_backoff_max = 1` produces
identical spin behavior to `main`. Existing thread pool tests
(`SpinDurationDefault`, `SpinDurationZero_NoSpinning`,
`SpinDurationPositive_TimeBased`) continue to pass unmodified since the
default backoff is 1.
- **Benchmark script:** Use the new benchmark tool to compare settings
on a model:
  ```bash
  python tools/perftest/benchmark_spin_settings.py \
    --perf_test build/Release/onnxruntime_perf_test \
    --model path/to/model.onnx \
    --intra_op 4 --duration 10 --repeats 3 \
    --configs default spin_1000 spin_1000_backoff_8
  ```
- **Build verification:** All modified translation units compile cleanly
under `-Wall -Wextra -Werror` in the existing cu128 Release build.
…ckQuantBDataSize_Lasx (microsoft#28179)

### Description

PR microsoft#27136 introduced a new `const MLAS_BACKEND_KERNEL_SELECTOR_CONFIG*`
parameter to the `Q4BitGemmPackQuantBDataSize_Fn` typedef in
`MLAS_QNBIT_GEMM_DISPATCH`, and updated the three `PackQuantBData*`
functions in `sqnbitgemm_kernel_lasx.cpp` accordingly. However,
`QNBitGemmPackQuantBDataSize_Lasx` was missed, leaving it with a
5-parameter signature that no longer matches the typedef. This causes a
compile error on LoongArch LASX builds when the function's address is
assigned to `d.Q4BitGemmPackQuantBDataSize`.

This change adds the missing 6th parameter to
`QNBitGemmPackQuantBDataSize_Lasx`, matching the signature used by the
three sibling functions in the same file (parameter name commented out
since it is unused by this implementation).

### Motivation and Context

Fixes microsoft#28157.

Restores the LoongArch LASX build without altering runtime behavior on
any platform.

### Changes

- `onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp`: add `const
MLAS_BACKEND_KERNEL_SELECTOR_CONFIG* /*BackendKernelSelectorConfig*/` as
the 6th parameter of `QNBitGemmPackQuantBDataSize_Lasx`.

### Test Plan

- The change is a signature-only fix with no behavioral delta; the added
parameter is unused (consistent with the three sibling functions in the
same file).
- Existing x86 and ARM CI should continue to pass since no logic was
changed.
- LoongArch LASX build, which previously failed, now matches the
dispatch typedef and compiles.
)

To avoid "azp run Windows GPU Doc Gen CI Pipeline" for PR from external
contributors.

The "Windows GPU Doc Gen CI Pipeline" can be removed after this PR is
merged.

Example run that caught doc mismatch, and correct docs are uploaded in
"Upload updated documentation":
https://github.com/microsoft/onnxruntime/actions/runs/24813688480/job/72623549343?pr=28185
Artifact download URL:
https://github.com/microsoft/onnxruntime/actions/runs/24813688480/artifacts/6593754537
…t#28176)

### Description

`EmbedLayerNormalizationShapeInference` unconditionally wrote to output
index 2 when `getNumOutputs() == 2 && mask_index_type == 0`, causing a
heap out-of-bounds write during model loading — no `session.run()`
required.

**Fix:** Replace the flawed condition with a simple bounds check:

```cpp
// Before (vulnerable):
if (ctx.getNumOutputs() == 3 || (ctx.getNumOutputs() == 2 && mask_index_type == 0)) {
    updateOutputShape(ctx, 2, output_shape);  // OOB when numOutputs == 2
    propagateElemTypeFromInputToOutput(ctx, 0, 2);
}

// After (fixed):
if (ctx.getNumOutputs() > 2) {
    updateOutputShape(ctx, 2, output_shape);
    propagateElemTypeFromInputToOutput(ctx, 0, 2);
}
```

A regression test `EmbedLayerNormBatch1_NoMaskIndex_NoSumOutput` has
been added to `embed_layer_norm_op_test.cc` to cover the previously
vulnerable path: `mask_index_type=0` with exactly 2 outputs (no
`embedding_sum`).

### Motivation and Context

A crafted ONNX model with an `EmbedLayerNormalization` node declaring 2
outputs and `mask_index_type=0` triggers the vulnerable path.
`getOutputType(2)` returns a pointer one past the end of the internal
`node_output_types_` vector; subsequent writes through that pointer
corrupt adjacent heap memory. In release builds this is silent — no
assertion, no crash, exploitable via heap shaping.

The `embedding_sum` output is always at index 2 by definition (confirmed
in the CPU kernel). The old special-casing for `mask_index_type == 0`
was both incorrect and unnecessary.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@microsoft.com>
…8211)

### Description
<!-- Describe your changes. -->

This pull request improves error handling in the `CreateEpFactories`
function in `onnxruntime/core/providers/webgpu/ep/api.cc`.

The main enhancement is the addition of a more conservative mechanism
for creating `OrtStatus` objects when exceptions occur during API
initialization.

The existing code was attempting to create a C++ API `Ort::Status`
instance even if the API was not initialized successfully.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix crash that was hiding underlying error message.
### Description

Adds `HardSigmoid` to the CoreML Execution Provider's activation op
builder. Both MLProgram (`sigmoid_hard`) and NeuralNetwork
(`ActivationSigmoidHard`) code paths are implemented; the op's ONNX
definition matches CoreML MIL's `sigmoid_hard` exactly, so no
decomposition is required.

Adds a dedicated CoreML-EP test
(`CoreMLExecutionProviderTest.HardSigmoidTest`) that builds a
single-node HardSigmoid model with non-default `alpha`/`beta` and uses
`RunAndVerifyOutputsWithEP` with `ExpectedEPNodeAssignment::All` to
confirm (a) the entire graph is claimed by the CoreML EP in both NN and
MLProgram formats, and (b) the output matches the CPU reference. I
verified the test is not trivially passing by temporarily unregistering
HardSigmoid from the activation builder — the test fails with
`VerifyEPNodeAssignment` emitting a fatal failure, proving it genuinely
exercises the CoreML path. (The existing multi-EP test in
`activation_op_test.cc` silently falls back to CPU when an EP rejects
the node, so it does not give CoreML coverage on its own.)

Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28181.

On a DWPose pose-estimation model (`dw-ll_ucoco_384.onnx`), 4
HardSigmoid ops were each forcing a CoreML → CPU → CoreML round-trip,
and also caused downstream ops to be rejected with "unsupported inputs"
because their producers had been sent to CPU. Adding HardSigmoid
collapses the graph from 5 CoreML subgraphs to 1, and drops inference
from 9.22 ms to 6.92 ms (−25%) on Apple Silicon with MLProgram +
ComputeUnits=ALL.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sts (microsoft#28065)

Add #include core/framework/tensorprotoutils.h to
test_nv_trt_rtx_ep_util.cc to resolve build error where
onnxruntime::utils::SetRawDataInTensorProto was not found.



### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…#28218)

Those pipelines have been replaced by workflows in
microsoft#28192 and
microsoft#28194

TODO: remove those pipelines from required pipelines for pull request,
and add "ONNX Runtime Windows GPU Doc Gen CI / Windows GPU Kernel
Documentation Validation" instead.
## Description

This PR adds an Azure Pipelines packaging flow for the CUDA plugin EP,
following the existing WebGPU plugin packaging pipeline pattern. The new
pipeline can package Windows x64 and Linux x64 builds for both CUDA 12.8
and 13.0, and optionally package Linux aarch64 builds when CUDA 13.0 is
selected.

The flow is parameterized for CUDA version, package version, build type,
and Python configuration so the packaging matrix can be expanded without
duplicating pipeline logic. It also adds validation to reject
unsupported combinations such as Linux aarch64 with CUDA 12.8.

## Summary of Changes

### Azure Pipelines packaging flow

| File | Change |
|---|---|
| `tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml` |
Adds the top-level official pipeline with CUDA 12.8/13.0 selection,
package/build-type validation, and aarch64 gating for CUDA 13.0 only. |
|
`tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml`
| Adds the packaging orchestrator that fans out per-platform/per-Python
build stages and merges Linux artifacts. |
|
`tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml`
| Adds the Linux packaging stage template for x64 and aarch64,
parameterized by CUDA version, Python executable, Docker image, and CUDA
architectures. |
|
`tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml`
| Adds the Windows packaging stage template with CUDA-version-specific
SDK setup and CUDA 13.0 cuDNN handling. |

### Linux build script

| File | Change |
|---|---|
| `tools/ci_build/github/linux/build_cuda_plugin_package.sh` | Adds a
Docker-based CUDA plugin packaging script with parameters for build
config, Python executable, CUDA version, and `CMAKE_CUDA_ARCHITECTURES`.
|

### Packaging behavior

- Supports `cuda_version` = `12.8` or `13.0`.
- Restricts Linux aarch64 packaging to CUDA 13.0 because the aarch64
CUDA Docker image is only available for CUDA 13.x.
- Uses CUDA-version-specific Docker base images and CUDA architecture
lists.
- Threads Python configuration through the Linux packaging path so
wheel-producing builds can be selected per Python version.
- Merges Linux per-version artifacts into a combined Linux artifact for
downstream consumption.

## Testing

- Not run locally. This change adds CI pipeline definitions and
packaging scripts only.

## Motivation and Context

The CUDA plugin EP already has GitHub Actions CI coverage for Linux and
Windows builds, but it did not yet have a matching Azure Pipelines
packaging flow like the WebGPU plugin EP. Adding this packaging pipeline
makes it possible to publish packaged CUDA plugin artifacts through the
same official packaging infrastructure, while also supporting the newer
CUDA 13.0 configuration and Linux aarch64 packaging where the required
Docker image exists.

## Checklist

- [x] Tests added/updated or not required
- [x] Documentation updated or not applicable
- [x] No breaking changes
- [ ] CI passes
…#28045)

### Description

Implements `ai.onnx.ml.LabelEncoder` on the CUDA execution provider for
numeric key/value types using sorted arrays + binary search (O(log n)
per element).

**New files** (`onnxruntime/core/providers/cuda/ml/`):
- `label_encoder_impl.cu` / `.h` — CUDA kernel: per-thread binary search
on sorted keys, NaN-aware for float/double
- `label_encoder.cc` / `.h` — Host-side op classes (`CudaLabelEncoder`
for opset 2-3, `CudaLabelEncoder_4` for opset 4+). Constructor sorts
keys, copies to GPU; `ComputeInternal` launches kernel.

**Modified files**:
- `cuda_execution_provider.cc` — Register 11 kernel variants (4
versioned opset 2-3, 7 opset 4+)
- `provider_api.h` — Add missing `kMLDomain` constant (first ML-domain
op on CUDA EP)
- `docs/OperatorKernels.md` — Add `ai.onnx.ml` section to CUDA provider
table

**Supported type combinations**:
| Opset | Types |
|-------|-------|
| 2-3 | `int64↔float`, `int64↔int64`, `float↔float` |
| 4+ | Above + `double↔double`, `double↔int64`, `int64↔double` |

String types remain CPU-only. NaN keys are placed at end of sorted array
and short-circuited before binary search.

**Tests**: 5 new test cases covering NaN-key-to-numeric-value mappings
and double type combinations. Existing numeric tests
(`FloatToInt64Opset2`, `Int64ToFloatOpset2`, etc.) will automatically
run on CUDA via `OpTester::Run()`.

### Motivation and Context

Models with large LabelEncoder nodes (>100k entries) force a CPU
round-trip when all other nodes run on GPU. This adds the CUDA
implementation to eliminate that data transfer bottleneck.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
microsoft#28109)

…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific
(Apple/Intel) to a vendor-agnostic config-based approach. Any GPU
reporting a matching subgroup matrix config from Dawn is now
automatically supported.

Key changes:
- Replace vendor-specific config table with
SupportedSubgroupMatrixConfig struct containing {componentType,
resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize,
needsPrepack}. No architecture or backendType required.
- Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader
selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
- Remove vendor gate in matmul_nbits.cc call site.
- Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
- Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
  - 4 subgroups x 32 lanes = 128 threads per workgroup
  - 64x64 tile with 16x16 subgroup matrices
  - Bounds-checked output via scratch buffer for partial M tiles
- Fix prepack shader OOB reads: add scalar fallback with zero-fill for
partial blocks where M is not a multiple of kSgMatM.
- Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple
match.

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):
- Correctness: model-qa.py with phi4-graph-prune produces identical
output to D3D12 baseline
- Prefill (phi4, l=1024):

phi4-graph-prune | D3D12 DP4A | Vulkan DP4A | Vulkan TC (16x16x16) |
Vulkan TC (16x16x16_128)
-- | -- | -- | -- | --
Prefill (tps) | 3,134 | 6,389 | 7,089 | 10,744

- NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16
16x16x16 config

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
… (microsoft#28198)

## Description

Add a GQA-capable unfused CUDA attention kernel that writes Q·K^T to an
FP32 scratch buffer, fixing fp16/bf16 overflow producing NaN when
`head_size > 256` at `scale=1.0` (issue microsoft#28195, e.g. Gemma 4 global
attention layers with `head_dim=512`).

### Motivation

Gemma 4 uses `head_dim=512` for its global attention layers
(`num_attention_heads=8, num_key_value_heads=4`). Flash Attention and
Memory-Efficient Attention cap at `head_size=256`, so these fall through
to the unfused path. The existing unfused MHA runner produces NaN
because even though cuBLAS accumulates in FP32, the Q·K^T output tensor
is fp16 and overflows. Additionally, the MHA unfused runner cannot
handle GQA (`q_num_heads != kv_num_heads`).

### Key Changes

**New kernel** (`contrib_ops/cuda/bert/gqa_unfused_attention.cu/.h`):
- 3-stage pipeline: QK GEMM → softmax → AV GEMM
- QK GEMM uses `CUBLAS_COMPUTE_32F` with `CUDA_R_32F` output type — raw
Q·K^T scores written to FP32 scratch, eliminating fp16 overflow
- Reshape-Q trick for native GQA support (no K/V head replication
needed)
- Softmax supports causal mask, sliding window (`local_window_size`),
softcap, additive attention bias, and per-batch `seqlens_k`
- Per-batch `past` calculation for correct sliding-window masking with
variable-length sequences

**GQA contrib op integration** (`group_query_attention.cc`,
`group_query_attention_impl.cu`):
- Activates when Flash/MEA/XQA are all ineligible and KV cache is not
quantized
- Uses `PrepareQKV` for RoPE and K/V cache management, then routes to
the new kernel

**ONNX Attention op integration** (`attention.cc`, `attention.h`):
- New `RunGqaUnfusedAttention` path for GQA and fp16/bf16 with
`head_size > 128`
- Handles BSNH↔BNSH transposes, past_key concatenation, attn_mask→bias
conversion, `nonpad_kv_seqlen`
- Optimized: transposes BSNH K/V directly into
`present_key`/`present_value` when available

**`UnpackRoPEAppend` kernel** (`group_query_attention_qkv.cuh`):
- Raised `MAX_HEAD_SIZE` from 256 to 512 to support Gemma 4 head
dimensions

**Safety improvements**:
- `SafeInt<size_t>` for workspace size arithmetic (overflow protection)
- `static_assert` guarding GQA transpose paths against accidental float
instantiation

### Testing

- 8 new Gemma 4 regression tests in `test_gqa.py`: prompt/decode ×
fp16/bf16, softcap, sliding window, long past sequences
- 2 new Gemma 4 benchmark configs in `benchmark_gqa.py` (global + local
attention)
- All `TestGQARegressions` tests pass locally (12/12)

### Fixes

Fixes microsoft#28195
…crosoft#23268) (microsoft#28201)

### Description

Adds a regression test covering the bug reported in microsoft#23268 and fixed in
microsoft#23322.

Before microsoft#23322, calling `quantize_static()` with an in-memory
`ModelProto` whose weights were large enough (>= 1024 bytes) to trigger
ONNX's external-data serialization would mutate the caller's
`ModelProto` inside `save_and_reload_model_with_shape_infer` (via
`onnx.save_model(..., save_as_external_data=True)`), clearing `raw_data`
and pointing the tensor at a temp-directory path that was then deleted.
Subsequent calibration would load the now-invalid proto and raise
`onnx.onnx_cpp2py_export.checker.ValidationError`.

PR microsoft#23322 addressed the issue by wrapping the input proto in
`copy.deepcopy` inside `save_and_reload_model_with_shape_infer`, but no
regression test was added. This PR adds that test.

### Motivation and Context

Fixes microsoft#23268 (adds the missing regression test for the original bug).

### Changes

- `onnxruntime/test/python/quantization/test_quant_issues.py` — adds
`test_issue_23268_quantize_static_modelproto_no_validation_error` to
`TestQuantIssues`. The test builds an in-memory Add model with a 32x32
float32 initializer (4096 bytes, above the 1024-byte external-data
threshold), runs `quantize_static` against it with an inline minimal
calibration reader, and asserts the call completes without raising and
produces a quantized output file.

No production code changes.

### Test Plan

```
python -m pytest onnxruntime/test/python/quantization/test_quant_issues.py -v
```

Both tests pass locally (`test_minimal_model` and the new
`test_issue_23268_quantize_static_modelproto_no_validation_error`) in
~0.6s. `lintrunner -a` is clean.
…uested (microsoft#28027)

### Description
When MultiHeadAttention has only 1 output (no present_key/present_value
outputs), past key/value inputs should be completely ignored, matching
CPU EP semantics. The WebGPU EP was passing pastKey/pastValue
TensorViews to shader creation functions even when outputCount <= 1,
which affected shader cache keys and allowed past data to leak into the
attention computation.

This caused the test "MultiHeadAttention Basic, one head and head-size=4
with pastKey and pastValue" to fail with output [17,18,19,20] (pastValue
data) instead of expected [9,10,11,12] (V data). The failing output
matches exactly what happens when past IS used: Q·pastKey=75 dominates
Q·K=35, so softmax gives ~100% weight to pastValue.

### Fix
In `applyAttention()`, introduce `effectivePastKey`/`effectivePastValue`
that are set to `undefined` when `outputCount <= 1`. All downstream
usage (shader creation, input arrays) uses these effective values
instead of the raw parameters. This ensures:
- Shader cache keys correctly reflect the "no past" configuration
- Past tensors are never passed to any shader creation function
- Behavior matches CPU EP (which ignores past when present outputs are
null)
- GQA is unaffected (always has outputCount >= 3)
- Vanilla Attention is unaffected (always passes undefined for past)
### Description
In the CPU RNN operator's \\Assign_Y_h\\ function, when
\\sequence_lens\\ contains a value of 0, the computation
\\sequence_lens[batch] - 1 = -1\\ produces a negative offset into the Y
output buffer. \\CopyVector\\ then reads \\hidden_size\\ floats from
heap memory before the buffer, leaking heap data into the \\Y_h\\ output
tensor.

LSTM and GRU already handle zero-length sequences correctly (early
return + zero-fill in compute path), but the basic RNN operator had
neither protection.


### Changes
- **rnn.cc \\Compute()\\**: Add early return when \\max_sequence_length
== 0\\ — zero-fills Y and Y_h outputs and returns immediately (matches
existing LSTM/GRU pattern)
- **rnn.cc \\Assign_Y_h()\\**: Add bounds check on \\last_time_step\\
before computing buffer offset — guards against both negative index
(\\seq_lens=0\\) and index >= seq_length, zero-fills Y_h for invalid
entries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28241)

### Description

CI Python packaging pipelines now specify their packaging type (nightly
vs. release) via an explicit pipeline parameter rather than the
implicitly defined pipeline var `NIGHTLY_BUILD`.

### Motivation and Context

Much less error prone than an implicitly defined pipeline variable.
### Description
Fixes 3 ICM fixes:


https://portal.microsofticm.com/imp/v5/incidents/details/31000000572208/summary

https://portal.microsofticm.com/imp/v5/incidents/details/31000000573313/summary

https://portal.microsofticm.com/imp/v5/incidents/details/31000000575583/summary


### Motivation and Context
Fix ICM issues

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request makes a small change to the CUDA label encoder kernel
to address unused parameter warnings. The change marks the `attr_name`
parameter as unused in the `TryGetScalarTensorAttribute` function when
building with the plugin execution provider.

* Code quality improvement:
* Marked the `attr_name` parameter as unused with
`ORT_UNUSED_PARAMETER(attr_name);` to suppress compiler warnings when
building with `BUILD_CUDA_EP_AS_PLUGIN`.
### Description
Pass base timestamp for vitisai profiling

Notify EP that profiling has started with the base timestamp (in
nanoseconds since epoch)
The VitisAI EP can use this to:
1. Calculate relative timestamps (event_ts - base_ts) for the profiling
timeline
2. Store the absolute base timestamp if needed for other purposes



### Motivation and Context
Due to onnxruntime default profiling json file just have the offset
timestamp, it doesn't provider the base timestamp for VitisAI EP, To
combine the VaitisAI timeline profiling info and the onnxruntime default
profiling json file info, We need pass the timestamp for VitisAI EP.

---------

Signed-off-by: Andrew Luo <junpengl@amd.com>
Co-authored-by: Andrew Luo <junpengl@amd.com>
…nd (microsoft#28083)

## Summary

Fixes a critical security vulnerability in the ONNX Runtime Python
backend where user-controlled `kwargs` were applied to `SessionOptions`
and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file
overwrites.

## Vulnerability

The `prepare()` method in `onnxruntime/python/backend/backend.py`
iterated over user-controlled `kwargs` and used `setattr()` to apply
them directly to a `SessionOptions` instance. The `hasattr()` check was
not a security guard — it returned `True` for all exposed properties
including dangerous ones like `optimized_model_filepath`.

**Attack vector:**
```python
onnxruntime.backend.prepare(
    model_path,
    optimized_model_filepath="/etc/passwd",  # overwrites any file with protobuf binary
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
)
```

The same pattern existed in `backend_rep.py` for `RunOptions`.

## Fix

Replaced the unrestricted `hasattr/setattr` loop in both files with
strict allowlists:

- **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py`
- **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py`

**Both `SessionOptions` and `RunOptions` use identical validation
logic** with three outcomes for each kwarg key:

- **Allowlisted** — Applied via `setattr()` (e.g.
`graph_optimization_level`, `log_severity_level`)
- **Known-but-blocked** (real attribute on the object, but not on
allowlist) — Raises `RuntimeError` (e.g. `optimized_model_filepath`,
`terminate`)
- **Completely unknown** (not a property on the object at all) —
Silently ignored for forward compatibility (e.g.
`nonexistent_option_xyz`)

**Blocked dangerous attributes:**
- `optimized_model_filepath` — triggers `Model::Save()`, overwrites
arbitrary files with protobuf binary
- `profile_file_prefix` — writes profiling JSON to arbitrary path
- `enable_profiling` — causes uncontrolled file writes to cwd
- `terminate` (RunOptions) — denies the current inference call
- `training_mode` (RunOptions) — silently switches inference behavior in
training builds

## Tests

Added `TestBackendKwargsAllowlist` with 13 new test methods covering all
exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted
attrs (accepted), unknown attrs (silently ignored), and end-to-end
`run_model()` paths for both session and run options. All 15 tests pass
(13 new + 2 pre-existing in `TestBackend`), no regressions.

## Files Changed

- `onnxruntime/python/backend/backend.py`
- `onnxruntime/python/backend/backend_rep.py`
- `onnxruntime/test/python/onnxruntime_test_python_backend.py`
- `.agents/skills/python-kwargs-setattr-security/SKILL.md`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ttention (microsoft#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA
PagedAttention op, enabling the operator on **sm<80 (Turing / Volta /
Pascal) with fp16** for the first time. On sm>=80 the default
FlashAttention path is unchanged; MEA is reachable via
`ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider
option for debugging and perf comparison.


| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` /
`sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80
support" — the op errors out immediately whenever FlashAttention isn't
available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review,
@tianleiwu flagged that the interface was too FlashAttention-specific
(*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado
agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the exact pattern
established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for
sm<80 GQA"). The same CUTLASS memory-efficient attention backend that
covers GQA's sm<80 path now covers PagedAttention.

Related work:
- microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback)
- microsoft#24595 — original PagedAttention PR
- microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering
- microsoft#27880 — ONNX Attention CUDA fallback coverage gaps
- microsoft#27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred;
fall back to MemoryEfficientAttention via
`has_memory_efficient_attention(sm, is_half, is_bf16, head_size,
head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own
helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 ==
0`. This keeps us forward-compatible with any future expansion of MEA's
supported range.

**MEA path** (`UnfusedAttention<T>`):
1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV`
(hoisted to `paged_attention.cc` so both FA and MEA paths consume a
pre-populated buffer — single-producer refactor), rotary, packed-QKV
unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to
gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads,
head_size]` buffer, folding in GQA head expansion (so downstream MEA
sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via
`seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr =
cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No
padding required; layout matches the kernel's expected `[total_tokens,
num_heads, head_size]` with BSNH strides.

**Scratch allocation**: the MEA path D->H syncs
`cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain
`total_kv_tokens` on the host for tight `gathered_key` /
`gathered_value` / `fmha_buffer` allocation. This adds a
forward-per-call `cudaStreamSynchronize` — acceptable for a
compatibility fallback (FA remains the hot path on supported hardware).
Over-allocation (the no-sync alternative) would consume `B ×
max_num_blocks_per_seq × block_size × num_heads × head_size × 2 ×
sizeof(T)`, which reaches GB-scale for realistic GQA models and was
rejected.

`fmha_buffer` is sized with `sizeof(float)` (matching the GQA
EfficientAttention pattern at `group_query_attention.cc:482`) because
MEA's output accumulator is fp32 regardless of input dtype.

### Testing

New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs
the existing parity matrix (rotary on/off, rotary_interleaved on/off,
packed-QKV on/off, local window on/off, softcap 0/50, varied head
sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider
option set to `EFFICIENT_ATTENTION` (=2, from `AttentionBackend` enum).
Using a per-session provider option instead of an env var means both FA
and MEA test classes coexist in the same pytest process — each
InferenceSession creates its own CUDA EP with its own
`attention_kernel_options_`.

The existing `TestPagedAttention` class is skipped wholesale on sm<80 by
its `has_flash_attention()` gate, so without the new MEA class the
fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:       24/24 passed (~60s)   # FA baseline — no regression
TestPagedAttentionMEA:    24/24 passed (~59s)   # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by
the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal
hardware, so real-SM coverage relies on MS CI. The code path exercised
on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on
sm<80; only the underlying CUTLASS kernel
(`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and
those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20
define was needed because the initial configure resolved
`CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval`
usage fails to compile. Unrelated to this change.
…pkg set (microsoft#28254)

### Description

Remove `react-native` package from set of packages required for
RC/release publishing.
We will need to revisit this and decide whether to remove it entirely or
properly fix it.

### Motivation and Context

The React Native package is having build issues and we don't need it for
the next few immediate releases.
### Description

Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to
the CoreML Execution Provider's MLProgram path. The builder decomposes
QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the
op's own schema function-body in `contrib_defs.cc:605-631` and the
approach the QNN EP already uses in
`qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram
path is implemented; NeuralNetwork is deprecated on Apple Silicon.

Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single
`com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies
the entire graph is claimed by the CoreML EP via
`ExpectedEPNodeAssignment::All`. Verified with a negative test:
temporarily removing the `CreateQuickGeluOpBuilder` registration causes
the new test to fail with a `VerifyEPNodeAssignment` fatal failure,
proving it genuinely exercises the CoreML path.

Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28183.

QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass
(`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at
`ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the
default session optimization level. So any model that contains the `x *
sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the
DWPose pose estimator) gets silently mutated by ORT into a graph with
`QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported
primitives into 1 unsupported op, making the fusion strictly harmful for
CoreML.

On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and
`ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the
result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) |
54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks are other ops — see "Related gaps" below. A ~4×
speedup at EXTENDED level from this patch alone.

Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where
partial shape inference inhibits most fusions), 3 `QuickGelu` nodes
still get produced — so this patch helps any CoreML user who hasn't
explicitly downgraded to `ORT_ENABLE_BASIC`.

### Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the
EXTENDED-optimized DWPose pose model are:

- **`com.microsoft:FusedConv`** (×4) — produced by
`ConvActivationFusion`. Fuses `Conv + activation` into one node. Same
failure mode as QuickGelu: `Conv` and the activations (`Relu`,
`Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but
the fused form isn't. Decomposition is straightforward — emit the
underlying `conv` MIL op, then the corresponding activation.
- **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) —
`MatMul * alpha` with an optional transpose. Decomposition: `matmul` +
scalar `mul`.
- **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to
fusion. CoreML MIL has a native `split` op; this one is a straight
op-builder omission.

Happy to send follow-up PRs for any of these after this one lands,
following the same pattern. Flagging here so they're on the EP coverage
roadmap.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tilities (microsoft#28227)

This pull request significantly improves the safety and robustness of
sparse tensor handling in ONNX Runtime. The main focus is on adding
thorough bounds checking and using safe integer arithmetic to prevent
overflows and invalid memory accesses when working with sparse tensor
indices. Additionally, the Python bindings for sparse tensors are
refactored to ensure correct object lifetimes and memory management when
exposing data to NumPy.

**Sparse Tensor Index Validation and Safety**

* Added comprehensive bounds checks for COO and CSR sparse tensor
indices in both the C API (`onnxruntime_c_api.cc`) and core conversion
utilities, ensuring indices are within valid ranges and, for CSR, that
outer indices are non-decreasing and within bounds.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)
* Replaced direct arithmetic with `SafeInt` for all index and size
calculations to prevent integer overflows, especially when converting
between types or computing dense tensor offsets.
[[1]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2077-R2077)
[[3]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2091-R2091)
[[4]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2110-R2110)
[[5]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2291-R2298)
* Improved error messages for invalid indices, making debugging easier
by providing more context about the specific error.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)

**Python Bindings Improvements**

* Refactored the pybind11 bindings for sparse tensor views so that NumPy
arrays referencing sparse tensor memory correctly keep the parent Python
object alive, preventing potential memory issues when the sparse tensor
is on the GPU or managed by Python.
[[1]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL98-R120)
[[2]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL299-R304)
[[3]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL314-R319)

**General Code Quality**

* Added missing header include for `safeint.h` to ensure `SafeInt` is
available where needed.
* Minor cleanups and improved assertions to clarify intent and ensure
correctness.

These changes collectively make sparse tensor support in ONNX Runtime
safer, more reliable, and easier to use from both C++ and Python.
### Description
vector_per_class dimension was not verified, it could lead to illegal
memory access



### Motivation and Context
security issue

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…icrosoft#28248)

### Description
<!-- Describe your changes. -->

- Correct misleading 'SemVer 1.0.0' label; the universal version regex
actually validates SemVer 2.0.0 syntax without build metadata, which is
what Azure Universal Packages requires.

- Prefix the dev short SHA with 'commit-' in universal_version so the
pre-release identifier always contains a non-digit, avoiding spurious
validation failures for all-numeric SHAs with leading zeros.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix invalid version when we have an all-numeric commit SHA starting with
0.
…ls (microsoft#28214)

This PR adds position_ids bounds checking to WebGPU and JS
RotaryEmbedding implementations, completing the security fix started in
PR microsoft#27597 (commit 056bab3) which covered CPU and CUDA.

## Problem
The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row
indices into cos_cache/sin_cache without bounds validation. While PR
microsoft#27597 fixed CPU and CUDA paths, WebGPU and JS implementations were
still missing bounds checks, which could produce silently wrong results
(WebGPU hardware clamps OOB reads).

## Changes
- **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation
(ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side
validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript
validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting
WebGPU EP

## Security
Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids,
CVSS 7.5-9.1) for WebGPU/JS execution providers.

## Testing
- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when
WebGPU EP unavailable
- JS/TS error tests not feasible with current JSONC test format
(documented)
- Build environment lacks C++20/emsdk for full compilation verification;
validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081)

### Description
<!-- Describe your changes. -->

Remove reinterpret_cast of OrtKernelInfo* to internal OpKernelInfo* that
breaks ABI across DLL boundaries (vtable mismatch between plugin EP and
ORT core).

- KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting
to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp()

- GetAllocator: use C API KernelInfoGetAllocator +
IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo*

- Remove #include core/framework/op_kernel_info.h (no longer needed)

- Add IAllocatorWrappingOrtAllocator adapter

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Address crash observed when testing WebGPU plugin EP with older ORT
1.24.4 binary where the number of `onnxruntime::IExecutionProvider`
virtual functions had changed between the two builds.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
anzzraju1997-glitch and others added 27 commits May 26, 2026 15:29
## Changes
- Update `requirements.txt` to `protobuf>=4.25.8`
- Update `requirements-training.txt` to `protobuf>=4.25.8`
- Update `requirements-dev.txt` to `protobuf>=4.25.8`
- Update `docs/python/requirements.txt` to `protobuf>=4.25.8`

## Notes
This change addresses the direct Python manifest surface only. It does
not claim to resolve every transitive Component Governance finding.

Co-authored-by: arajendra <arajendra@users.noreply.github.com>
### Description
Update GatherBlockQuantized to support 2-bits.
Updated op schema, implemented the CPU and WebGPU EP.
This helps to make the model smaller.
rev some npm packages

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: guschmue <22941064+guschmue@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
…osoft#28659)

### Description

Fix the CUDA plugin EP package test pipeline failure where the plugin is
built with the latest code (which includes `float8e8m0` and other newer
data types), but the host ORT 1.26 release does not support these types.
When the plugin attempts to register kernel type constraints containing
unsupported types, `GetTensorDataType` fails and the plugin load
crashes.

### Motivation and Context

The plugin EP architecture allows plugins to be built against a newer
version of the ONNX Runtime headers while being loaded into an older
host ORT. However, the existing `KernelDefBuilder::TypeConstraint`
methods call `GetTensorType` (which throws on unsupported types), making
it impossible for a forward-compatible plugin to register kernels that
include newer data types in their type constraint lists.

### Changes

- Add `TryGetTensorType()` — a non-throwing variant of `GetTensorType()`
that returns `nullptr` when the host ORT does not recognize a tensor
element type.
- Add `TryMLDataTypeToOrtDataType()` — a non-throwing variant of
`MLDataTypeToOrtDataType()` that returns `nullptr` instead of
asserting/throwing.
- Update `KernelDefBuilder::TypeConstraint` (both vector and single-type
overloads) to use the `Try` variants and gracefully skip unsupported
types rather than failing.

### Impact

- Plugins built with newer headers can now load into older host ORT
releases without crashing on unknown data types.
- If all types in a constraint list are unsupported, the constraint is
simply not registered (the kernel will not match, which is the correct
fallback behavior).
- No behavioral change when the host supports all types — the code path
is identical to before.
…osoft#28675)

## Description

The `windows_x64_asan / build_x64` CI pipeline has been failing with OOM
(out-of-memory) because ASan-instrumented test binaries consume
significantly more memory than normal builds, and CTest was running them
at full CPU-count parallelism.

This PR adds a `--test_parallel` argument to `build.py` that allows
CTest concurrency to be configured independently from the build
parallelism (`--parallel`). It then uses `--test_parallel 4` in the
Windows x64 ASan workflow to cap test execution to 4 parallel jobs,
preventing OOM while keeping build parallelism at full speed.

## Motivation and Context

- ASan instrumentation inflates per-process memory usage by ~2-3x.
- The existing `--parallel` flag controls both MSBuild and CTest
concurrency together; there was no way to keep fast parallel builds
while limiting test concurrency.
- The CI runner has limited memory, and running all tests in parallel
under ASan exceeded available RAM.

## Changes

| File | Change |
|------|--------|
| `tools/ci_build/build_args.py` | Add `--test_parallel` argument
(default: `None`, falls back to `--parallel`) |
| `tools/ci_build/build.py` | Add `number_of_test_parallel_jobs()`
helper; use it for CTest `--parallel`; validate negative values |
| `.github/workflows/windows_build_x64_asan.yml` | Pass `--test_parallel
4` to cap ASan test concurrency |

## Testing

- `python -m py_compile tools/ci_build/build.py
tools/ci_build/build_args.py` — passes
- `python tools/ci_build/build.py --help` shows the new
`--test_parallel` option
- `git diff --check` — no whitespace issues
- When `--test_parallel` is omitted, behavior is unchanged (falls back
to `--parallel` value)
…6345 (microsoft#28524)

This pull request introduces comprehensive validation and error handling
improvements for the ConvTranspose operator across CPU, CUDA, WebGPU,
and XNNPACK backends, as well as in shape inference and unit tests. The
main focus is to ensure that invalid input shapes (especially rank-0 or
rank-1 tensors) are properly detected and reported, preventing undefined
behavior and improving robustness. Additionally, error messages are
clarified, and several helper functions now return `Status` for better
error propagation.

**Validation and Error Handling Improvements:**

* All ConvTranspose implementations (CPU, CUDA, WebGPU) now explicitly
check that input `X` and filter `W` tensors have at least 3 dimensions,
returning clear error messages if not.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R65-R79)`,
`[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dR273-R289)`,
`[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafR22-L28)`)
* The shape inference function for `ConvTransposeWithDynamicPads` now
fails gracefully with descriptive errors if input or weight tensors have
fewer than 2 dimensions.
(`[onnxruntime/core/graph/contrib_ops/contrib_defs.ccL62-R67](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L62-R67)`)
* Additional validation ensures that `output_padding` and dynamic pads
have correct sizes, and that `output_padding` values are within
ONNX-specified limits.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R138-R153)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`)

**Refactoring for Robustness:**

* Helper functions such as `ComputePadsAndOutputShape` and
`ComputeTransposePadAndOutputShape` now return `Status`, allowing errors
to propagate and be handled appropriately rather than causing crashes or
silent failures.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L165-R234)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L194-R262)`,
`[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L220-R282)`,
`[[4]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`)
* All call sites (CPU, CUDA, WebGPU, XNNPACK) are updated to handle and
propagate these errors using `ORT_RETURN_IF_ERROR` or
`ORT_THROW_IF_ERROR`.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`,
`[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dL362-R379)`,
`[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafL48-R60)`,
`[[4]](diffhunk://#diff-6a2f8672090f25850b90b266aff3c7212552fc81b14bb7b539e9e5161c9fd526L494-R497)`)

**Unit Test Enhancements:**

* New negative tests are added to verify that rank-0 and rank-1 weight
tensors are properly rejected and produce the expected error messages,
increasing test coverage and reliability.
(`[onnxruntime/test/contrib_ops/conv_transpose_with_dynamic_pads_test.ccR22-R56](diffhunk://#diff-cb5bfc51d0c8096922eb674d142f0e970d5becd140b47bdfd7729a06a818b598R22-R56)`)

**Minor Code Quality Improvements:**

* Improved memory management in the CPU implementation by wrapping the
allocated buffer in `BufferUniquePtr` immediately to prevent leaks if
exceptions are thrown.
(`[onnxruntime/core/providers/cpu/nn/conv_transpose.ccR79-R89](diffhunk://#diff-0dcb5a9c8ba0c4e67940e9d77f77cb706bbf82d67bf6757967099b0a69c797b5R79-R89)`)
* Minor includes and type safety improvements (e.g., use of `SafeInt`
for overflow protection).
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R22)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`)

**Summary of Most Important Changes:**

**1. Validation and Error Handling**
- All ConvTranspose implementations now check that input and filter
tensors have at least 3 dimensions, returning clear errors if not.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R65-R79)`,
`[[2]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dR273-R289)`,
`[[3]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafR22-L28)`)
- Shape inference for `ConvTransposeWithDynamicPads` fails with
descriptive errors for invalid input or weight tensor ranks.
(`[onnxruntime/core/graph/contrib_ops/contrib_defs.ccL62-R67](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L62-R67)`)
- Additional checks for `output_padding` and dynamic pads sizes and
values, with ONNX spec compliance.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R138-R153)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R171-R187)`)

**2. Error Propagation and Refactoring**
- Helper functions now return `Status` and propagate errors; all call
sites updated to handle these errors.
(`[[1]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L165-R234)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L194-R262)`,
`[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8L220-R282)`,
`[[4]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`,
`[[5]](diffhunk://#diff-d1bbcb0542b5acea587ac929cd6362cfd11172c522505c6db8b457a9d470c63dL362-R379)`,
`[[6]](diffhunk://#diff-b615243d0702e9613bd815173108306495b0f690294001e606823b77322f6fafL48-R60)`,
`[[7]](diffhunk://#diff-6a2f8672090f25850b90b266aff3c7212552fc81b14bb7b539e9e5161c9fd526L494-R497)`)

**3. Unit Testing**
- Added tests to ensure invalid weight tensor ranks are rejected with
proper error messages.
(`[onnxruntime/test/contrib_ops/conv_transpose_with_dynamic_pads_test.ccR22-R56](diffhunk://#diff-cb5bfc51d0c8096922eb674d142f0e970d5becd140b47bdfd7729a06a818b598R22-R56)`)

**4. Code Quality**
- Improved buffer management and type safety in CPU backend.
(`[[1]](diffhunk://#diff-0dcb5a9c8ba0c4e67940e9d77f77cb706bbf82d67bf6757967099b0a69c797b5R79-R89)`,
`[[2]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R22)`,
`[[3]](diffhunk://#diff-72fa27d94d5d92dd1e78ff510ef9a84d1ad74426c19af9722cf6511f8d38a5a8R291-R302)`)
- disable dynamic wgsl template tesst
- disable shader cache key checks
…osoft#28369)

BiasLoader hardcoded 128-bit vectorized loads (`ElementsPerAccess =
128/sizeof_bits = 8` for fp16) regardless of the `isAligned` template
flag. When bias stride was not a multiple of 8, the unaligned kernel was
selected but BiasLoader still used 128-bit loads →
`cudaErrorMisalignedAddress`.

**Fix**: Use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of
hardcoded 8.

Tested with Gemma4 Attention + mask at all seq lengths 1–32.

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…rosoft#27707)

## Summary

- Fix ORT raising "does not have type information set by parent node"
when a subgraph references an initializer declared in the outer (parent)
graph without explicit `value_info` in the subgraph
- Propagate type info from implicit input defs to subgraph NodeArgs
before subgraph verification in `VerifyNodeAndOpMatch`
- Add regression test with an `If` node whose subgraph references an
outer scope initializer without `value_info`

## Motivation

Fixes microsoft#24880

When a node's op schema type inference function does not invoke subgraph
inferencing (e.g., contrib ops like `BeamSearch`, `GreedySearch`,
`WhisperBeamSearch`, `Sampling`), `InferAndVerifySubgraphTypes` is never
called. This means type info from outer scope values — such as
initializers declared in the parent graph — is never propagated to the
subgraph's NodeArgs. When the subgraph is later verified in the second
pass of `VerifyNodeAndOpMatch`, nodes referencing these outer scope
values fail with a null type error.

The existing workaround in `convert_generation.py` (manually adding
`value_info` entries for moved initializers) confirms this gap in the
type propagation path.

## Changes

**`onnxruntime/core/graph/graph.cc`**: In `VerifyNodeAndOpMatch`'s
subgraph verification loop, propagate type info from the containing
node's `implicit_input_defs` to the subgraph's NodeArgs before calling
`VerifyNodeAndOpMatch` on the subgraph. The propagation is guarded by
`subgraph_nodearg->Type() == nullptr`, making it a safe no-op for
standard ONNX ops (If/Loop/Scan) where `InferAndVerifySubgraphTypes`
already set the types. For nested subgraphs, the recursive call to
`VerifyNodeAndOpMatch` handles propagation at each level.

**`onnxruntime/test/ir/graph_test.cc`**: Add
`OuterScopeInitializerTypeInfoPropagatedToSubgraph` test that constructs
a model proto with an `If` node whose subgraphs reference an outer graph
initializer without `value_info`, and verifies `Model::Load` (which
calls `Graph::Resolve`) succeeds.

## Test Plan

- [ ] New C++ unit test
`OuterScopeInitializerTypeInfoPropagatedToSubgraph` verifies model
resolution succeeds
- [ ] Existing `graph_test.cc` tests continue to pass (no regression in
type inference for standard ONNX ops)
- [ ] Existing control flow tests (If/Loop/Scan) continue to pass
- [ ] CI lint checks pass (verified locally with `lintrunner`)
…lash_nvcc_threads, and enable quick build mode (microsoft#28645)

## Description

Speed up CUDA CI build times by splitting the monolithic CUDA provider
into architecture-specific OBJECT libraries with independent `nvcc
--threads` control, and introducing a quick build mode
(`onnxruntime_QUICK_BUILD`) that reduces kernel instantiations for CI
validation.

## Motivation and Context

CUDA builds were bottlenecked by `--nvcc_threads 1` across all targets
because flash attention (48 .cu files, SM80+) requires ~4GB per nvcc
thread and caused OOM when compiled with higher thread counts. The old
heuristic in `build.py` used `psutil` to auto-detect memory but was
unreliable and always conservative.

By splitting flash attention into its own OBJECT library, the rest of
the build can safely use `--threads 4` while flash attention stays at
`--threads 2`. Combined with quick build mode (fewer kernel variants),
this significantly reduces CI wall-clock time.

## CI Time Saving

* N1F1: `--nvcc_threads 1`. CI time is from checks of [PR
28607](https://github.com/microsoft/onnxruntime/pull/28607/checks).
* N4F2: `--nvcc_threads 4 --flash_nvcc_threads 2`: CI time is from this
PR.
* N8F4: `--nvcc_threads 8 --flash_nvcc_threads 4`: CI time is from this
PR.
* N4F4: `--nvcc_threads 4 --flash_nvcc_threads 4`: CI time is from this
PR. This is the final candidate.
* Saving = N1F1 - N4F4
* Saving Ratio = (N1F1  - N4F4) / N1F1

Here is CI time (Build + Test time in minutes) saving:

CI | N1F1 | N4F2 | N8F4 | N4F4 | Saved Minutes | Saving Ratio
-- | -- | -- | -- | --  | -- | --
Linux CI | 35 + 38 | 35 + 32 | 35 + 32 | 36 + 27 | 10 | 14%
Windows CI | 58 + 36 | 53 + 38 | 54 + 38 | 48 + 36 | 10 | 11%
Plugin Linux CI | 53 + 26 | 38 + 17 | 39 + 39 | 39 + 15 | 25 | 32%
Plugin Windows CI | 77 + 16 | 57 + 14 | 54 + 14 | 53 + 12 | 28 | 30%
Windows TRT CI | 54 + 43 | 38 + 38 | 42 + 43 | 41 + 37 | 19 | 20%

Note that this is only one time comparison. Cache might take effect with
more runs, and might change the statistics. The CI time is reduced in
the range of 11% to 32%. Total CI time saving is more than 90 minutes.

## Key Changes

### 1. CMake: Architecture-specific OBJECT Libraries

| File | Change |
|------|--------|
| `cmake/onnxruntime_cuda_source_filters.cmake` | New macros:
`onnxruntime_extract_flash_attention_sources()`,
`onnxruntime_extract_llm_sources()`,
`onnxruntime_extract_sm_specific_cuda_sources()` to partition sources by
SM arch |
| `cmake/onnxruntime_providers_cuda.cmake` | Create `flash_attention`
(SM80+), `llm` (SM75+), `sm90_tma`, and `sm120_tma` OBJECT libraries
with per-target `--threads`; merge fpA_intB SM90 launchers into SM90 TMA
lib |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | Mirror OBJECT
library pattern for plugin EP build; consolidate shared compile options
into a variable; fix `-Xcudafe --diag_suppress=550,2810` and `--std
c++20` for CUDA 12.8 compatibility |
| `cmake/onnxruntime_unittests.cmake` | Link new OBJECT libraries into
test target |

### 2. Build Script: `--flash_nvcc_threads` and Default 4

| File | Change |
|------|--------|
| `tools/ci_build/build.py` | Remove `psutil`-based memory heuristic;
add `--flash_nvcc_threads` forwarding; default `nvcc_threads` to 4 |
| `tools/ci_build/build_args.py` | Add `--flash_nvcc_threads` CLI
argument (default: same as `--nvcc_threads`) |

### 3. Quick Build Mode (`onnxruntime_QUICK_BUILD`)

- Reduces flash attention kernels to hdim128 fp16 only (skips
hdim32/64/96/192/256)
- Guards some MoE SM90 generated launchers with `#ifndef
ORT_QUICK_BUILD`
- Restricts CUTLASS SM80 tile configs to 3 instantiations
- Skips test cases that depend on excluded kernel variants (e.g.,
`test_gqa_fp8_fallback_unsupported_head_size` needs hdim64)
- Applied to all CI pipelines **except** Linux CUDA CI (full build) and
packaging pipelines

### 4. CI and Packaging Pipeline Updates

All CUDA CI pipelines updated from `--nvcc_threads 1` to `--nvcc_threads
4 --flash_nvcc_threads 4`:
- `.github/workflows/linux_cuda_ci.yml`
- `.github/workflows/linux_cuda_plugin_ci.yml` (+ `QUICK_BUILD=ON`)
- `.github/workflows/linux_tensorrt_ci.yml` (+ `QUICK_BUILD=ON`)
- `.github/workflows/windows_cuda.yml` (+ `QUICK_BUILD=ON`)
- `.github/workflows/windows_cuda_plugin.yml` (+ `QUICK_BUILD=ON`)
- `.github/workflows/windows_tensorrt.yml` (+ `QUICK_BUILD=ON`)

Packaging pipeline updated to use `--nvcc_threads 4 --flash_nvcc_threads
2`, except `--nvcc_threads 2 --flash_nvcc_threads 1` for cuda plugin:
- Azure Pipelines: `custom-nuget-packaging-pipeline.yml`,
`nuget-win-cuda-packaging-stage.yml`, `plugin-win-cuda-stage.yml`,
`py-win-gpu-stage.yml`
- Linux scripts: `build_cuda_plugin_package.sh`,
`build_linux_python_package.sh`

### 5. Bug Fix: CUTLASS Heuristic for SIMT Kernels

- `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc`: Fixed
`ORT_QUICK_BUILD` path to return proper tile config for SIMT (float)
gemm type instead of discarding the type info

## Architecture Mapping

| OBJECT Library | Min SM | Sources | Threads |
|---|---|---|---|
| `*_flash_attention` | SM80+ | `bert/flash_attention/*.cu` (48 files) |
`onnxruntime_FLASH_NVCC_THREADS` (default: same as nvcc_threads) |
| `*_llm` | SM75+ | `contrib_ops/cuda/llm/*.cu` (excl. SM90/SM120
launchers) | `onnxruntime_NVCC_THREADS` (default 4) |
| `*_sm90_tma` | 90a-real | MoE TMA + fpA_intB SM90 launchers |
`onnxruntime_NVCC_THREADS` |
| `*_sm120_tma` | SM120+ | MoE SM120 TMA generated files |
`onnxruntime_NVCC_THREADS` |
| Parent target | All archs | Everything else |
`onnxruntime_NVCC_THREADS` |

## New Build Options

- `--nvcc_threads N` (default 4) — threads for all CUDA targets except
flash attention
- `--flash_nvcc_threads N` (default: same as `--nvcc_threads`) — threads
specifically for flash attention compilation

CMake cache variables: `onnxruntime_NVCC_THREADS`,
`onnxruntime_FLASH_NVCC_THREADS`

## Testing

- Built locally with
`CMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100;120"`, `--nvcc_threads 4
--flash_nvcc_threads 2`
- Verified flash attention .cu files compile only for SM80+ (checked
`build.ninja` / VS project)
- Verified LLM .cu files compile for SM75+
- Ran `onnxruntime_provider_test` — all CUDA EP tests pass
- Ran `python test_qmoe_cuda.py` (MoE kernels), flash attention / GQA
tests
- No link errors in both in-tree provider and plugin EP builds
- No nvcc warnings about duplicate `--threads` flags
- Plugin CI compile options verified: `--std c++20`, `-Xcudafe
--diag_suppress=550,2810`, MSVC `/bigobj` all applied to OBJECT
libraries
### Summary

Lower ONNX `Sin` and `Cos` to the CoreML ML Program `sin` / `cos`
elementwise ops
via the existing `UnaryOpBuilder`, registered in the op builder factory.
Like
`Erf` / `Round` / `Exp`, these have no NeuralNetwork lowering
(`UnaryFunctionLayerParams` has no sin/cos), so `IsOpSupportedImpl`
rejects them on
the NeuralNetwork format.

### Why

`Sin` / `Cos` form the sinusoidal timestep embedding of diffusion UNets.
Supporting
them keeps that prologue on CoreML instead of splitting the graph — a
tiny
Stable-Diffusion UNet goes from **2 CoreML partitions → 1, zero graph
breaks** with
this change alone.

This PR is **independent** of the rest of the series (it touches only
the unary
builder) and can be reviewed/merged in any order.

### Tests (`coreml_basic_test.cc`)

- `SinCos_MLProgram` — a Sin + Cos graph runs fully on CoreML and
matches the CPU
  reference.
- `SinCosNeuralNetworkNotSupported` — the same graph falls back to CPU
on the
  NeuralNetwork format.

Doc: `coreml_supported_mlprogram_ops.md` lists `Sin` and `Cos`.

### Series — CoreML EP coverage for transformer / diffusion graphs

- microsoft#28595 — Support bool Cast in ML Program *(prerequisite)*
- **microsoft#28596 — Add Sin and Cos unary ops** *(this PR — independent)*
- microsoft#28597 — Add Where and And builders *(depends on microsoft#28595)*
- microsoft#28598 — Add GatherND builder *(depends on microsoft#28595)*

Together with microsoft#28278 (scalar-`Gather`), the series takes BERT / GPT-2 /
ViT /
diffusion-UNet graphs — tiny and full-size — from 2 CoreML partitions to
1, with
zero graph breaks.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Description
TRT 11.0 removes many deprecated APIs, so guard TRT-EP code accordingly
to support TRT 11 builds.


### Motivation and Context
Fixes compilation with TRT 11.0

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
…p-webgpu/MIN_ONNXRUNTIME_VERSION` (microsoft#28687)

### Description
<!-- Describe your changes. -->

Bake the contents of `plugin-ep-webgpu/MIN_ONNXRUNTIME_VERSION` into the
plugin EP library as the `ORT_PLUGIN_EP_MIN_ORT_VERSION` preprocessor
definition and pass it to `ApiInit()` so the EP refuses to load against
an older ORT runtime.

Rework `ApiInit()` to strictly parse the runtime version string as
"MAJOR.MINOR.PATCH", optionally enforce a caller-supplied minimum,
require MAJOR == 1, and use MINOR as the API version. All failure modes
now throw `std::runtime_error` with a descriptive message.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enforce minimum ORT version for WebGPU plugin EP as specified in the
minimum ORT version file. Previously, the version was hardcoded in
`ApiInit()`.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#28682)

## Description

Use fp32 accumulation in SkipLayerNormalization,
SkipSimplifiedLayerNormalization, and EmbedLayerNormalization CUDA
kernels to avoid overflow and improve numerical accuracy when processing
fp16/bf16 data.

The original implementation accumulated mean and variance statistics in
the input data type (fp16/bf16), which can overflow for large hidden
sizes or when input values have large magnitude. This change promotes
all intermediate accumulation (mean, variance, normalization math) to
fp32, matching the approach used by TensorRT-LLM's LayerNorm kernels.

## Motivation

- fp16 has limited range (max ~65504) and precision (10-bit mantissa).
Accumulating `x²/ld` across thousands of elements in fp16 easily
overflows or loses precision.
- bf16 has even less precision (7-bit mantissa), making accumulation
errors more severe.
- The fix is straightforward: cast to float before accumulating, compute
normalization in float, cast back to the output type.

## Key Changes

| File | Change |
|------|--------|
| `layer_norm.cuh` | Changed `LayerNorm`, `SimplifiedLayerNorm`,
`LayerNormSmall`, `SimplifiedLayerNormSmall` to accept and operate on
`float` for thread_data, epsilon, mu, rsigma. Removed unused
`KeyValuePairSum` overloads for half/bfloat16. |
| `skip_layer_norm_impl.cu` | Changed `SkipLayerNormKernel` and
`SkipLayerNormKernelSmall` to accumulate in fp32
(`cub::KeyValuePair<float, float>`). Removed `maybe2half` helper (no
longer needed). |
| `embed_layer_norm_impl.cu` | Changed epsilon from `T` to `float`,
accumulation to use `float` thread_data. |
| `profile_skip_layer_norm.py` | New profiling script for nsys-based
kernel timing analysis. |
| `profile_skip_layer_norm.sh` | Shell wrapper for running nsys
profiling. |
| `parse_nsys.py` | Utility to parse nsys SQLite output and extract CUDA
kernel timings. |

## Performance Results

Profiled on NVIDIA GPU with nsys (B=1, seq_len=2048, fp16 data, 200
iterations, skip first 5 warmup):

| Hidden Size | fp16 accum (μs) | fp32 accum (μs) | Regression |
|---|---|---|---|
| 768 | 3.81 | 3.81 | **0.0%** |
| 1024 | 4.22 | 4.22 | **0.0%** |
| 4096 | 13.01 | 13.03 | **+0.15%** (noise) |
| 8192 | 28.94 | 28.94 | **0.0%** |

**No measurable performance regression.** The kernel is
memory-bandwidth-bound, so fp32 arithmetic is completely hidden behind
memory latency.

## Testing

- Existing unit tests pass (SkipLayerNorm, EmbedLayerNorm ops).
- Profiling scripts added for reproducible performance measurement:
  ```bash
  cd onnxruntime/test/python/transformers
nsys profile -o sln_fp16 --export=sqlite python
profile_skip_layer_norm.py --mode fp16 --warmup 5 --repeat 100
  python parse_nsys.py sln_fp16.sqlite --skip-first 5
  ```
  
## Related PRs

microsoft#28442
microsoft#15660
This pull request makes a targeted update to the operator schema in the
ONNX Runtime codebase, specifically clarifying the optional nature of
certain outputs.

Schema definition improvements:

* Marked the `present_key` and `present_value` outputs as optional in
the `ONNX_MS_OPERATOR_SET_SCHEMA` macro within `bert_defs.cc`, making
the operator schema clearer and more flexible for consumers.
…rosoft#28260)

### Summary
Add per-annotation-ID buffer managers and captured command storage so
multiple generators can each capture and replay their own graph
independently without cross-contamination
Add ReleaseGraph API through the full ORT stack (EP base → C API →
InferenceSession → plugin EP) to release captured commands and GPU
buffers when a generator is destroyed
Replace the single graph_buffer_mgr_ / is_graph_captured_ bool with
per_graph_buffer_mgrs_ map and captured_graph_ids_ set keyed by
annotation ID
Use a std::function getter with cached pointer pattern in
GpuBufferAllocator to dynamically route allocations to the active
per-graph buffer manager during runs, while keeping Alloc/Free as simple
pointer dereferences
### Motivation
Edge's Prompt API speed benchmark creates multiple sessions/generators
sequentially with graph capture enabled. With the existing single-graph
design, the second generator replays the first generator's captured
commands with wrong buffers, producing incorrect output and ultimately a
QuotaExceededError in the browser. This PR isolates each generator's
graph capture state so they don't interfere with each other.

### Related PR
The GenAI side change is in
microsoft/onnxruntime-genai#2106, which calls
SessionReleaseGraph when a generator is destroyed to release the
captured graph's GPU buffers.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: qjia7 <4221210+qjia7@users.noreply.github.com>
### Description
We have internal alerts asking to update protobuf version in response to
CVE-2026-0994.

The alert asks 

tools/ci_build/github/linux/docker/inference/aarch64/python/cpu/scripts/requirements.txt
to be updated to 5.29.6
while the rest are asked to be set to 6.33.5.
…28280)

### Description
## Summary

Adds two WebGPU-only graph fusions and the contrib ops they target, plus
a small
refactor of the existing `MatMulNBits` dispatch logic so the new fused
kernels
can share its predicates.

| Component | Files | Purpose |
|---|---|---|
| **`MatMulNBitsMlp` op + kernel** |
`contrib_ops/webgpu/quantization/matmul_nbits_mlp.{cc,h}`,
`*.wgsl.template` (3) | Fuses the SwiGLU MLP block: optional
`(Skip)SimplifiedLayerNormalization` + two `MatMulNBits` projections
(gate, up) + optional biases + `Sigmoid`/`Mul` (SiLU) + element-wise
`Mul`. Single dispatch instead of 5–7. |
| **`MatMulNBitsQkv` op + kernel** |
`contrib_ops/webgpu/quantization/matmul_nbits_qkv.{cc,h}`,
`*.wgsl.template` | Fuses `(Skip)SimplifiedLayerNormalization` + three
`MatMulNBits` projections (Q, K, V) sharing the same input. Single
dispatch instead of 4. |
| **Op schemas** | `core/graph/contrib_ops/contrib_defs.cc` |
`MatMulNBitsMlp` and `MatMulNBitsQkv` contrib op schemas (kMSDomain,
opset 1). |
| **Graph transformers** |
`core/optimizer/matmul_nbits_{mlp,qkv}_fusion.{cc,h}` | Pattern-match
the source subgraphs and emit the fused ops. EP-gated to WebGPU only —
no impact on other EPs. Registered in `graph_transformer_utils.cc`. |
| **Dispatch helpers** |
`contrib_ops/webgpu/quantization/matmul_nbits_common.{cc,h}` +
`matmul_nbits.cc` | Extracts the "would this dispatch use
Subgroup-Matrix / DP4A / WideTile?" predicates into pure functions
reusable by the fused kernels. No behavior change in the unfused
`MatMulNBits` path. |
| **Tests** | `test/optimizer/matmul_nbits_{mlp,qkv}_fusion_test.cc`,
`graph_transform_utils_test.cc` | Unit tests for the new transformers
(positive + negative cases). |


### Motivation and Context
~25-30% decode TPS throughput improvement on WebGPU + D3D backend on
Windows. GPU used: RTX 5060Ti for Qwe3-1.7B.

BEFORE (**95 decode TPS**): main branch
<img width="344" height="140" alt="image"
src="https://github.com/user-attachments/assets/0f5d7cfb-05f9-4f25-acb5-4becb8f5addd"
/>

AFTER (**120+ decode TPS**): PR branch
<img width="359" height="134" alt="image"
src="https://github.com/user-attachments/assets/f1254d8e-a400-4dbb-9d06-ab6116f929bb"
/>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…ft#28681)

This pull request significantly improves the numerical correctness and
robustness of L1 and L2 reduction operations (norms) for integer types
on both CPU and CUDA backends. The main changes address integer
overflow, undefined behavior, and precision loss in norm calculations,
especially for edge cases like minimum representable integers and large
accumulations. The changes also ensure consistency between CPU and CUDA
implementations, and add detailed documentation for future
maintainability.

**Numerical correctness and overflow handling for integer norm
reductions:**

* On CPU, L1 and L2 reductions (`ReduceAggregatorL1` and
`ReduceAggregatorL2` in `reduction_ops.h`) now accumulate in double
precision to avoid integer overflow and undefined behavior, with Kahan
summation for int64+ to minimize precision loss. Results are clamped to
the maximum representable value to prevent overflow.
[[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805)
[[2]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R814-R893)
* Introduced a `saturating_abs` function on CPU and a device-side
`Impl_SaturatingAbs` kernel on CUDA to safely compute the absolute value
for signed integer types, clamping to `max()` if `abs(min())` would
overflow.
[[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805)
[[2]](diffhunk://#diff-f7138acd21464814d1793c9d334bee07d0cbe69719691e67efe3b2e23e4d06c7R516-R566)
[[3]](diffhunk://#diff-945ab1deb57e1ff44b790cb2054537c252e23b7c7c374f44da66475361910abdR110-R119)

**CUDA backend improvements and consistency:**

* For integer reductions on CUDA, the input is cast to double before
reduction, and the result is cast back to integer with saturating
semantics (using PTX `cvt.sat`), matching the CPU's explicit clamping.
This avoids precision loss and undefined behavior.
* For no-op reductions (where input and output counts are equal), norm
operations now use the saturating absolute value kernel to ensure
non-negative results, even for edge-case values like `INT_MIN`.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L209-R217)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L592-R607)
[[3]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L778-R794)

**Documentation and maintainability:**

* Added detailed comments explaining the rationale and numerical
properties of the new implementations, including why double precision is
used, the limitations of float, and the behavior for large reductions
and edge cases.
[[1]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R722-R805)
[[2]](diffhunk://#diff-ca0c9224442a3c46251b0fb7326aacc1469bdee20ab409b930556f439d560015R814-R893)
[[3]](diffhunk://#diff-f7138acd21464814d1793c9d334bee07d0cbe69719691e67efe3b2e23e4d06c7R516-R566)
[[4]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L807-R861)

These changes make norm reductions for integer types safe,
mathematically correct, and consistent across CPU and CUDA, even for
extreme or previously problematic inputs.
…ove shader key validation to nightly build (microsoft#28674)

### Description
<!-- Describe your changes. -->

Allow shader code to be dumped to the file specified in the
`ORT_WEBGPU_EP_SHADER_DUMP_FILE` environment variable. Previously,
shader code was only dumped by verbose logging.

Create new nightly CI pipeline to run shader key validation test. That
test is removed from the CI pipeline in microsoft#28642.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

More shader dump output options. Moving shader key validation test.
…u_inc/cub.cuh" wrapper. (microsoft#28705)

### Description
<!-- Describe your changes. -->

Replace direct inclusion of `<cub/cub.cuh>` with
`"core/providers/cuda/cu_inc/cub.cuh"` wrapper. The wrapper accounts for
a problematic macro definition which causes issues.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix pipeline build error.
…mic quantization (microsoft#28228)

## Summary
- Fix `quantize_dynamic(per_channel=True)` so weights quantized
per-channel produce a `DequantizeLinear` node with the correct `axis`
attribute.
- Stop dropping the channel axis when `quantize_weight_per_channel`
populates `QuantizedValue` (was hardcoded to `None`).
- Gate the scalar-scale assertion in `_dequantize_value` on `axis is
None` so per-channel scales (1-D tensors) are accepted.

## Motivation
Fixes microsoft#19997.

When a model is quantized with `quantize_dynamic(..., per_channel=True)`
and a per-channel weight reaches `_dequantize_value` (e.g. via
`_dequantize_outputs` when the weight is in the graph outputs), two bugs
surface:

1. `quantize_weight_per_channel` stores `QuantizedValue.axis = None`
even though it received a real `channel_axis`, so the per-channel
information is lost.
2. `_dequantize_value` (a) asserts `scale_init.size == 1`, which fails
for a 1-D per-channel scale, and (b) builds the `DequantizeLinear` node
without an `axis` attribute, producing an invalid ONNX node when the
model is consumed.

PR microsoft#22283 (Nov 2024) softened the assertion against `None`-typed scales
but left the underlying axis-propagation bug in place.

## Changes
- `onnxruntime/python/tools/quantization/onnx_quantizer.py`
- `quantize_weight_per_channel`: pass `channel_axis` (was `None`) into
`QuantizedValue`.
- `_dequantize_value`: only require a scalar scale on the per-tensor
path (`axis is None`); forward `axis=quantized_value.axis` to
`onnx.helper.make_node("DequantizeLinear", ...)`. `make_node` silently
omits the attribute when `axis` is `None`, so the per-tensor path is
unchanged.
- `onnxruntime/test/python/quantization/test_quant_issues.py`
- New regression test
`test_dynamic_quantize_per_channel_emits_axis_attribute` that builds a
minimal MatMul model with the weight routed to a graph output (to force
the `_dequantize_outputs` -> `_dequantize_value` path), runs
`quantize_dynamic(per_channel=True)`, and asserts the emitted
`DequantizeLinear` has the `axis` attribute and a 1-D multi-element
scale initializer.

## Test Plan
- `python -m pytest
onnxruntime/test/python/quantization/test_quant_issues.py -xvs` — new
test passes; existing test skipped as before.
- `python -m pytest
onnxruntime/test/python/quantization/test_op_matmul.py` — 7 passed, 8
skipped (no regression).
- `python -m pytest onnxruntime/test/python/quantization/test_qdq.py -k
per_channel` — 1 passed.
- `lintrunner -a` on changed files: clean.
…Def (microsoft#28608)

## Summary

`utils::MakeComputeCapability` is the shared helper used by
`utils::CreateSupportedPartitions` to build an
`IndexedSubGraph::MetaDef` from a group of supported nodes. When a
supported group contains a control-flow op (`Loop`, `If`, `Scan`),
`MakeComputeCapability` currently walks only `node->InputDefs()` and
silently drops the outer-scope captures (`node->ImplicitInputDefs()`).
The captures never enter `meta_def->inputs`, so after
`Graph::FinalizeFuseSubGraph` the fused node's `InputDefs()` is missing
them — the EP that owns the fused subgraph has no boundary value-info
for the captured tensors and cannot bind them at Compute time.

This PR adds a second loop in `MakeComputeCapability` that walks
`node->ImplicitInputDefs()` with the same "produced inside the partition
→ skip, otherwise add to subgraph inputs" semantics already applied to
`InputDefs()`.

## Why this is the right fix

`onnxruntime::Node` partitions inputs into two arrays by design:

- `InputDefs()` — formal operand list as declared in the op's ONNX
schema.
- `ImplicitInputDefs()` — outer-scope SSA values referenced from inside
body subgraphs of `Loop` / `If` / `Scan`. These are real boundary inputs
at runtime (the body kernel reads them) but they don't appear in the
op's formal operand list.

`Graph::FinalizeFuseSubGraph` consumes only `meta_def->inputs` to
populate the fused node's `InputDefs()` and rewire outer-scope edges. So
whatever `MakeComputeCapability` puts in `meta_def->inputs` is what ends
up at the fused-node boundary. Omitting `ImplicitInputDefs()` here means
the captures are unreachable downstream — there is no other place that
can patch them back in.

The fix is intentionally a mirror of the existing `InputDefs()` loop
(same `Contains(node_outputs, ...)` produced-inside check, same
`ordered_subgraph_inputs.push_back` ordering). The new loop runs after
the explicit loop so explicit-operand index ordering in
`meta_def->inputs` is preserved (EPs that have implicitly relied on
`meta_def->inputs[i].name == node.InputDefs()[i].name` for
non-control-flow op groups are not perturbed).

## Scope of impact

Only EPs that consume `utils::MakeComputeCapability` /
`utils::CreateSupportedPartitions` are affected. A quick audit:

| EP | Uses `partitioning_utils::MakeComputeCapability`? | Affected by
bug? |
|---|---|---|
| Plugin EPs (`EpGraphSupportInfo_AddNodesToFuse` →
`PluginExecutionProvider::GetCapability`) | yes, in
`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` |
**yes** |
| `internal_testing_ep` (used by ORT's own unit tests) | yes, in
`onnxruntime/test/internal_testing_ep/internal_testing_execution_provider.cc`
| **yes** |
| TensorRT, MIGraphX, NV-TRT-RTX, VitisAI | no — they build
`MetaDef::inputs` themselves and already walk `ImplicitInputDefs()`
(e.g. `tensorrt_execution_provider.cc:2084`,
`migraphx_execution_provider.cc:735`) | no |
| DML / CPU / CUDA / ROCm / OpenVINO / QNN / CANN / WebGPU / CoreML |
don't use it for Loop/If/Scan fusion paths | no |

So the impact is bounded to the plugin EP architecture (ORT 1.23+) and
the in-tree testing EP — both of which delegate boundary calculation to
this shared helper.

## Reproduction

The bug is reproducible against this repo's `internal_testing_ep`. No
external code required.

A minimal repro model with a Loop body that captures an outer-scope
tensor `B`:

```python
# build_repro.py — produces a ~1.5 KB onnx
import numpy as np, onnx
from onnx import TensorProto, helper as h, numpy_helper as nph

A   = h.make_tensor_value_info("A", TensorProto.FLOAT, ["N", 2, 2])
B   = h.make_tensor_value_info("B", TensorProto.FLOAT, [2, 2])
out = h.make_tensor_value_info("v_final", TensorProto.FLOAT, [2, 2])

acc_init  = nph.from_array(np.zeros((2, 2), np.float32), name="acc_init")
cond_init = nph.from_array(np.array([1], np.bool_), name="cond_init")
sq_ax     = nph.from_array(np.array([0], np.int64), name="sq_ax")

body = h.make_graph(
    nodes=[
        h.make_node("Gather", ["A", "iter"], ["slice"], axis=0),
        h.make_node("Add", ["slice", "B"], ["tmp"]),     # captures outer B
        h.make_node("Add", ["acc_in", "tmp"], ["acc_out"]),
        h.make_node("Identity", ["cond_in"], ["cond_out"]),
    ],
    name="loop_body",
    inputs=[h.make_tensor_value_info("iter", TensorProto.INT64, []),
            h.make_tensor_value_info("cond_in", TensorProto.BOOL, []),
            h.make_tensor_value_info("acc_in", TensorProto.FLOAT, [2, 2])],
    outputs=[h.make_tensor_value_info("cond_out", TensorProto.BOOL, []),
             h.make_tensor_value_info("acc_out", TensorProto.FLOAT, [2, 2])],
)

g = h.make_graph(
    nodes=[
        h.make_node("Shape", ["A"], ["M_1d"], start=0, end=1),
        h.make_node("Squeeze", ["M_1d", "sq_ax"], ["M"]),
        h.make_node("Loop", ["M", "cond_init", "acc_init"], ["v_final"], body=body),
    ],
    name="loop_with_outer_capture",
    inputs=[A, B], outputs=[out],
    initializer=[acc_init, cond_init, sq_ax],
)
onnx.save(h.make_model(g, opset_imports=[h.make_opsetid("", 16)]),
          "loop_with_outer_capture.onnx")
```

Observable bug path (against any EP using `CreateSupportedPartitions`,
e.g. `InternalTestingExecutionProvider`):

```cpp
// Claim every node (Shape/Squeeze/Constant/Loop) as compiled.
SessionOptions so;
InferenceSession session(so, env);
session.RegisterExecutionProvider(
    std::make_unique<InternalTestingExecutionProvider>(/*supported=*/{...}));
session.Load("loop_with_outer_capture.onnx");
session.Initialize();

// In EP::Compile, iterate fused_node.InputDefs():
//   for (const auto* in : fused_node.InputDefs()) std::cerr << in->Name() << "\n";
// BEFORE this fix: only "A" is printed (Shape(A) makes A explicit;
// B is consumed only via Loop's ImplicitInputDefs and gets dropped).
// AFTER this fix:  both "A" and "B" are printed.
```

A small unit-test fixture exercising the same path can be added to
`onnxruntime/test/providers/partitioning_utils_test.cc` following the
existing `CheckAllNodesProcessed` pattern, asserting that
`result[0]->sub_graph->GetMetaDef()->inputs` contains `B` when the
supported group includes the Loop.

## What this PR changes

A single hunk in
`onnxruntime/core/providers/partitioning_utils.cc::MakeComputeCapability`,
immediately after the existing `for (const auto* input :
node->InputDefs()) { ... }`:

```cpp
// Region-bearing ops (Loop/If/Scan) reference outer-scope SSA values via
// ImplicitInputDefs rather than InputDefs. When an EP claims the whole
// control-flow op, those implicit captures must also be in MetaDef::inputs
// so FinalizeFuseSubGraph can rewire the outer-scope edges onto the fused
// node's InputDefs. Without this, plugin EPs that fuse Loop/If/Scan lose
// the captures at the fused-node boundary and cannot resolve them at
// Compute time.
for (const auto* input : node->ImplicitInputDefs()) {
  if (!input->Exists()) {
    continue;
  }
  if (!Contains(node_outputs, input)) {
    if (!Contains(subgraph_inputs, input)) {
      subgraph_inputs.insert(input);
      ordered_subgraph_inputs.push_back(input);
    }
  }
}
```

## Risks / migration

- **No ABI change.** `MakeComputeCapability` signature unchanged.
`IndexedSubGraph::MetaDef` schema unchanged.
- **No semantic regression for op groups without control flow.** The new
loop only adds elements; for partitions that contain no `Loop` / `If` /
`Scan`, `ImplicitInputDefs()` is empty on every node and the new loop is
a no-op.
- **Behavior change for plugin EPs that fuse Loop/If/Scan.** Their fused
node's `InputDefs()` gains the captures. EPs that were silently fishing
out captures via a workaround (e.g. walking the original Loop node's
`ImplicitInputDefs()` themselves at Compile time) would see those names
show up via the standard fused-node `InputDefs()` API. Audit above shows
no in-tree EP that uses `partitioning_utils` had such a workaround — TRT
/ MIGraphX / etc. roll their own MetaDef without calling
`MakeComputeCapability`.

## Validation

- Verified the fix end-to-end against a downstream plugin EP that claims
a `Loop` node as part of a fused partition (Loop body captures an
outer-scope tensor): without this fix, the EP cannot resolve the
captured tensor name at the fused-node boundary; with the fix the
captured tensor appears in `fused_node.InputDefs()` and session
initialization + the EP's Compile both succeed.
- No `partitioning_utils.cc` changes between `origin/main` and the patch
base, so it applies cleanly.
- Existing `onnxruntime_test_all --gtest_filter=PartitioningUtilsTest.*`
cases still pass (the fix only adds behavior for control-flow ops;
non-control-flow partitions are byte-for-byte identical to before).
This pull request strengthens security checks around loading external
tensor data in ONNX Runtime, particularly to prevent malicious models
from referencing unsafe file paths or in-memory address markers that
could lead to arbitrary file access or unsafe memory dereferencing. The
changes introduce stricter validation for external data paths and add
explicit rejections for ORT in-memory address markers found in model
protobufs, along with new and improved regression tests to verify this
behavior.

**Security hardening for external data loading:**

* Added `ValidateExternalFilePathForTensor` to enforce that external
data paths are validated for all code paths loading external data
(including those outside `Graph::Resolve`), rejecting absolute or
directory-escaping paths and passing through only trusted in-memory
markers. This is now called in `GetExtDataFromTensorProto` and
`LoadExtDataToTensorFromTensorProto` to ensure defense-in-depth.
[[1]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1568-R1596)
[[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1760-R1762)
* Updated the validation logic for sparse tensor sub-tensors with
`ValidateSparseSubTensorExternalDataPath`, clarifying the handling of
in-memory markers and ensuring only legitimate file paths are accepted.
* Changed `SparseTensorProtoToDenseTensorProto` to use the new sparse
sub-tensor validation for both values and indices.

**Model loading and graph construction protections:**

* In `Graph::Graph`, added explicit rejection of ORT in-memory address
markers in sparse tensor attributes and initializers when loading from a
protobuf, preventing attackers from crafting models that could cause
unsafe memory access during sparse-to-dense conversion or initializer
resolution.
[[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1268-R1282)
[[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1322-R1331)
[[3]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1373-R1380)

**Expanded and improved testing:**

* Added new unit tests to verify that absolute and directory-escaping
external paths are rejected even when loading tensors directly (not via
graph resolution), and that in-memory address markers are not accepted
in dense or sparse initializers loaded from protobufs.
[[1]](diffhunk://#diff-d75ec5db9cc4642f78b6ff568aff6d10398fc211b0fb7c862d3ec88738e3eda6R1156-R1217)
[[2]](diffhunk://#diff-1d3978c99d95a56af0f2603bdd0b10cf02bdc1cecbd4fe5db353a8c8388696efR1365-R1484)
* Updated an optimizer initializer test to reflect the new error
handling for invalid external data paths.
microsoft#28695)

## Description

Add a flash attention-style tiled computation path to the CPU
GroupQueryAttention operator for quantized KV cache (INT8/INT4). Instead
of materializing the full `[B, N, S, T]` attention probability matrix,
this processes K/V in L2-cache-sized blocks with online softmax —
reducing peak memory from O(S×T) to O(S×Bc) per head where Bc is the KV
block size.

Additionally, implements **flash decoding** for the decode phase (S=1):
when `batch×heads < threads`, idle threads are repurposed to partition
the KV sequence across parallel workers. Each worker computes partial
softmax statistics on its KV chunk, then a lightweight reduce phase
merges the partials — achieving 2–5x decode speedup for long sequences.

### Motivation

For long-sequence LLM inference with quantized KV cache on CPU:
- **Prefill**: The full attention matrix allocation becomes a
significant memory bottleneck. With 16 heads and S=4096, the naive path
allocates ~1 GB for attention scores alone. The tiled approach reduces
peak memory by 13–24x and latency by 1.2–2.7x.
- **Decode**: When batch size is small relative to available threads,
many threads sit idle. Flash decoding partitions the KV sequence across
these idle threads, achieving 2–5x speedup for long KV lengths.

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/flashattn_qkv.cpp` | MLAS kernel: tiled
prefill with online softmax, flash decoding (two-phase KV partitioning),
and reduce |
| `onnxruntime/core/mlas/inc/mlas_qkv_quant.h` |
`MlasFlashAttentionQuantizedKVArgs` struct with
`flash_decoding_partials` and `kv_chunk_count` fields |
| `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` |
`ApplyAttentionQuantizedFlash()` with L2-cache-aware block sizing, KV
concat, flash decoding setup |
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Dispatch
logic: activates flash path when no softcap/smooth softmax/output_qk |
| `cmake/onnxruntime_mlas.cmake` | Added `flashattn_qkv.cpp` to the MLAS
build |
| `docs/contrib_ops/cpu/gqa.md` | Documentation with algorithm details,
benchmark results, and reproduction steps |
| `onnxruntime/test/mlas/bench/bench_qkv_quant.cpp` | MLAS-level C++
benchmark (`BM_GQA_Naive` vs `BM_GQA_Flash`) |
| `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py` |
Operator-level Python benchmark |

## Algorithm

### Prefill (S > 1): Tiled Flash Attention

Per (batch, head, q_block) tile:
1. **QK GEMM** — `MlasQKGemm` on a block slice of quantized K cache
2. **Causal + local window masking** — Set masked positions to -inf
before softmax
3. **Online softmax** — Track running max `m` and sum `l`, rescale
accumulated output with `exp(m_old - m_new)`
4. **SV accumulation** — Dequantize V block to FP32, then accumulate
weighted V into output

### Decode (S = 1): Flash Decoding

When `sequence_length == 1 && batch_size * num_heads < thread_count &&
kv_chunk_count > 1`:

**Phase 1 — Parallel KV scan**: Each idle thread processes a disjoint KV
chunk for a (batch, head) pair. For each chunk: compute QK dot products,
find local max, compute local softmax sum, and accumulate partial
weighted V output. Store per-chunk `(max_score, sum_exp,
partial_output[head_size])` into a partials buffer.

**Phase 2 — Reduce**: One thread per (batch, head) merges all chunk
partials using the log-sum-exp trick: find global max, rescale each
chunk's sum and partial output, then normalize by global sum.

This is analogous to GPU flash decoding (Dao et al.) but adapted for CPU
threading.

### Activation Conditions

Flash path activates when ALL of:
- `ORT_GQA_DISABLE_FLASH_ATTENTION` env var is not set
- `total_sequence_length > 1`
- No softcap, no smooth softmax, no output_qk (attention bias IS
supported)

Flash decoding additionally requires:
- `sequence_length == 1` (decode phase)
- `batch_size * num_heads < thread_count` (idle threads available)
- `kv_chunk_count > 1` (enough KV to partition)

## Benchmark Results

Measured on Intel Xeon Platinum 8480C, 96 CPUs, threads=8. MLAS-level
C++ benchmark.

### Latency — Prefill (S = T)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128.

| Seq Length | Naive (ms) | Flash (ms) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 9.9 | 8.1 | 1.2x | per-tensor |
| 1024 | 44.4 | 27.0 | 1.6x | per-tensor |
| 2048 | 190.9 | 116.9 | 1.6x | per-tensor |
| 4096 | 1257.8 | 461.6 | 2.7x | per-tensor |
| 512 | 10.7 | 10.8 | 1.0x | per-channel |
| 1024 | 49.5 | 41.7 | 1.2x | per-channel |
| 2048 | 212.1 | 164.1 | 1.3x | per-channel |
| 4096 | 1223.9 | 607.8 | 2.0x | per-channel |

### Latency — Decode (S = 1, no flash decoding)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128. Flash decoding
NOT active (batch×heads=16 > threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 32 | 22 | 1.4x | per-tensor |
| 1024 | 71 | 47 | 1.5x | per-tensor |
| 2048 | 120 | 87 | 1.4x | per-tensor |
| 4096 | 210 | 174 | 1.2x | per-tensor |
| 512 | 53 | 31 | 1.7x | per-channel |
| 1024 | 86 | 52 | 1.7x | per-channel |
| 2048 | 172 | 97 | 1.8x | per-channel |
| 4096 | 299 | 191 | 1.6x | per-channel |

### Latency — Flash Decoding (S = 1, KV partitioned across threads)

Shape: B=1, num_heads=4, kv_num_heads=4 (MHA), head_size=128. Flash
decoding IS active (batch×heads=4 < threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 31 | 25 | 1.2x | per-tensor |
| 1024 | 41 | 25 | 1.6x | per-tensor |
| 2048 | 67 | 34 | 2.0x | per-tensor |
| 4096 | 197 | 54 | 3.7x | per-tensor |
| 512 | 25 | 28 | 0.9x | per-channel |
| 1024 | 72 | 27 | 2.7x | per-channel |
| 2048 | 144 | 37 | 3.9x | per-channel |
| 4096 | 304 | 60 | 5.1x | per-channel |

### Peak Memory — Prefill

| Seq Length | Naive Peak | Flash Peak | Memory Reduction |
|---:|---:|---:|---:|
| 2048 (N=16) | +294 MB | +44 MB | 6.7x |
| 4096 (N=16) | +1107 MB | +82 MB | 13.5x |
| 4096 (N=32) | +2131 MB | +87 MB | 24.5x |

**Summary**: Prefill gains 1.2–2.7x latency + 7–24x memory reduction
from tiled online softmax. Decode gains 1.2–1.8x from fused dequant+dot
alone. Flash decoding adds 2–5x for long sequences when idle threads are
available to partition the KV scan.

### How to Reproduce

```bash
# Build ORT
python tools/ci_build/build.py --build_dir build/cpu --config Release \
  --parallel --build_wheel --skip_tests

# MLAS-level C++ benchmark:
cd build/cpu/Release
./onnxruntime_mlas_benchmark \
  --benchmark_filter='BM_GQA_(Naive|Flash)' \
  --benchmark_min_time=0.5s \
  --benchmark_repetitions=3 \
  --benchmark_report_aggregates_only=true
```

## Testing

- All 35 CPU `GroupQueryAttentionTest.*` tests pass (INT8/INT4,
per-tensor/per-channel, multi-batch, large head, GQA ratio variants)
- Set `ORT_GQA_DISABLE_FLASH_ATTENTION=1` to verify fallback path still
works
- End-to-end verified with `quantized_kv_cache_cpu_demo.py`
- Numerical agreement between flash and naive paths: max diff < 1e-7
…t#28710)

### Description
<!-- Describe your changes. -->

Add a new `--paths` option to `compile_contributors.py` to limit git
history queries using pathspecs.

Apply the path filter to both base and target git log collection and log
the active path filter in logs.txt.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Allow `compile_contributors.py` to be used for releases where relevant
changes are largely limited to a subset of the codebase. E.g., we can
limit the paths to WebGPU EP-related files for the WebGPU plugin EP
release.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@zhihuidu-amd zhihuidu-amd force-pushed the migraphx-fix-stream-synchronization branch from 7b6955d to 20e6217 Compare May 29, 2026 11:24
Address Copilot review feedback on initial submission:

- migraphx_stream_handle.cc: set own_stream_=false when wrapping an
  external user_compute_stream so ~MIGraphXStream does not destroy a
  stream it did not create (was: own_stream_ defaults true, causing
  hipStreamDestroy on the caller's stream at session teardown).

- migraphx_execution_provider.cc (OnRunEnd): honor the sync_stream
  parameter instead of always querying/synchronizing. ORT passes
  sync_stream=false in certain paths (e.g. during HIP graph capture);
  ignoring it caused unexpected blocking.

- migraphx_execution_provider.cc (Sync): correct misleading comment
  that claimed 'no default-stream sync' — when stream_ is nullptr the
  fallback is still hipStreamSynchronize(nullptr), a full-device sync.
  Comment now accurately documents the fallback behaviour.

- migraphx_execution_provider_info.cc (ToProviderOptions): add
  user_compute_stream to the round-trip map so the option is not
  silently dropped when provider options are queried via
  GetProviderOptions.

- migraphx_execution_provider_info.h (std::hash): include
  user_compute_stream in the hash so info structs that differ only
  in external stream are not collapsed to the same map key.
@zhihuidu-amd zhihuidu-amd force-pushed the migraphx-fix-stream-synchronization branch from 20e6217 to 60c5ecc Compare May 29, 2026 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.