Enable Qwen3.5 TRT-RTX EP path with CUDA graph by yen-shi · Pull Request #2139 · microsoft/onnxruntime-genai

yen-shi · 2026-05-07T19:04:11Z

This PR enables Qwen3.5 text-only INT4 QDQ export and TRT-RTX EP inference with CUDA graph/shared past-present buffers.

Structure

The branch is rebased on latest main and intentionally split into two commits:

1. Add Qwen3.5 text-only export support
   - Mirrors the overlapping Qwen3.5 text-only builder/model-type work from PR #2157 (Add text-only mode support for Qwen 3.5 model builder).
   - Keeps the Qwen3.5 text-only ONNX input position_ids as [B, S] and expands it inside the graph to [3, B, S] for mRoPE.
2. Enable Qwen3.5 TRT-RTX shared-buffer inference
   - Shares Qwen3.5 recurrent/conv state buffers when past_present_share_buffer is enabled, preserving stable input/output addresses for TRT-RTX graph replay.
   - Fixes QDQ SkipLayerNorm output_3 producer wiring.
   - Keeps mixed-precision quantization logic in the shared base k_quant_linear path rather than Qwen-local code.
   - Adds the canonical NvTensorRtRtx name to the example EP choices.
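
As a rough illustration of the position_ids handling in the first commit, the sketch below expands a [B, S] input to [3, B, S] using plain Python lists. It assumes that text-only tokens carry the same position value on all three mRoPE axes; the helper name is hypothetical, and the actual expansion in this PR is done with ONNX ops inside the exported graph.

```python
def expand_position_ids_for_mrope(position_ids):
    """Expand [B, S] nested lists to [3, B, S] (hypothetical helper)."""
    if not position_ids or not all(isinstance(row, list) for row in position_ids):
        raise ValueError("expected position_ids as a non-empty [B, S] nested list")
    # Assumption: for text-only input, every mRoPE axis sees the same positions,
    # so the expansion is three independent copies of the [B, S] tensor.
    return [[list(row) for row in position_ids] for _ in range(3)]


batch = [[0, 1, 2, 3]]  # B=1, S=4
expanded = expand_position_ids_for_mrope(batch)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))  # 3 1 4
```

Keeping the model input at [B, S] means callers can feed the same position_ids they would feed to a non-mRoPE text model.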

PR #2157 compatibility

This branch was compared against #2157 using git merge-tree. The same Qwen files are touched, but Git auto-merges them cleanly and the simulation produced no conflict markers.
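
The "no conflict markers" check can be automated by scanning the text that git merge-tree prints. The sketch below is one possible way to do that; it assumes the output has been captured as a string and that marker lines may carry a leading '+' from diff-style output. The function name is hypothetical and the exact output format depends on the Git version and merge-tree mode used.

```python
MARKER_PREFIXES = ("<<<<<<< ", ">>>>>>> ")


def find_conflict_markers(merge_output):
    """Return 1-based line numbers that look like Git conflict markers."""
    hits = []
    for lineno, line in enumerate(merge_output.splitlines(), start=1):
        # Assumption: diff-style output may prefix marker lines with '+' or spaces.
        stripped = line.lstrip("+ ")
        if stripped.startswith(MARKER_PREFIXES):
            hits.append(lineno)
    return hits


sample = "+<<<<<<< .our\n+a = 1\n+=======\n+a = 2\n+>>>>>>> .their\n"
print(find_conflict_markers(sample))  # [1, 5]
```

An empty result corresponds to the clean auto-merge described above.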

If #2157 merges first, the first commit in this branch is the overlap and can be dropped/rebased away; the second commit contains the TRT-RTX-specific delta.

Validation

Rebased onto latest upstream main (bf6cf3fe).
Built CUDA Release wheel with CUDA 13.2:
python build.py --use_cuda --cuda_home="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.2" --config Release --update --build --parallel --skip_tests --skip_examples
Installed the rebuilt wheel in the minimal TRT-RTX package environment.
Exported and ran Qwen3.5 0.8B and 9B text-only INT4 QDQ models with TRT-RTX EP + CUDA graph enabled.
- 0.8B: TTFT 1.18s, decode 64.44 tok/s, answer starts: "The history of artificial intelligence..."
- 9B: TTFT 1.47s, decode 48.00 tok/s, answer is a coherent reasoning-style response.
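
The PR does not state how TTFT and decode throughput were measured. The sketch below shows one common convention, assuming wall-clock timestamps in seconds: TTFT is the delay until the first generated token, and decode throughput counts only the tokens after the first. The function and key names are hypothetical.

```python
def summarize_latency(request_time, token_times):
    """Compute TTFT and decode tokens/sec from wall-clock timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time
    # Assumption: the first token belongs to prefill, so it is excluded from decode.
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    decode_tps = decode_tokens / decode_span if decode_span > 0 else float("nan")
    return {"ttft_s": ttft, "decode_tok_per_s": decode_tps}


# Request sent at t=0.0; tokens arrive at 1.0, 1.5 and 2.0 seconds.
print(summarize_latency(0.0, [1.0, 1.5, 2.0]))
```

Other tools may include the first token in the decode count, so figures are only comparable when the same convention is used.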

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables more reliable CUDA-graph-style replay/reuse for Qwen3.5 on the TRT-RTX execution provider by stabilizing input/output buffer addresses and fixing a couple of TRT-RTX QDQ export edge cases.

Changes:

Add shared past/present recurrent-state buffer mode to keep bindings stable for TRT-RTX graph replay.
Keep Qwen2VL-style attention_mask and 3D position_ids at decode-stable shapes for graph capture/shared-buffer runs, updating contents in place.
Adjust TRT-RTX QDQ export behavior for SkipLayerNorm output naming and avoid mixed INT8 weight-only overrides on explicit QDQ paths.
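
Graph replay reuses the addresses that were bound at capture time, which is why the changes above keep shapes fixed and update contents in place. The sketch below illustrates that idea with a ctypes array standing in for a device buffer. It is a simplified CPU-only analogy with hypothetical names, not the recurrent_state.cpp or position_inputs.cpp implementation.

```python
import ctypes


class StaticBuffer:
    """Fixed-capacity int64 buffer whose address never changes (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = (ctypes.c_int64 * capacity)()  # zero-initialized, allocated once

    @property
    def address(self):
        return ctypes.addressof(self._data)

    def write(self, values):
        """Overwrite the leading elements in place instead of reallocating."""
        if len(values) > self.capacity:
            raise ValueError("values exceed the preallocated capacity")
        for i, value in enumerate(values):
            self._data[i] = value

    def read(self, count):
        return list(self._data[:count])


mask = StaticBuffer(capacity=8)        # sized for the maximum sequence length
address_at_capture = mask.address
mask.write([1, 1, 1, 0, 0, 0, 0, 0])   # prompt of three tokens
mask.write([1, 1, 1, 1, 0, 0, 0, 0])   # one decoded token later
assert mask.address == address_at_capture
```

Allocating a new buffer per step would give each step a different address, so a captured graph could not be replayed against it.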

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

File	Description
src/python/py/models/builders/qwen.py	Avoid applying mixed INT8 (weight-only/QOperator) overrides when exporting explicit QDQ.
src/python/py/models/builders/base.py	Pass redirected SkipLayerNorm output_3 name to primitives to avoid duplicate producers in QDQ export.
src/models/recurrent_state.h	Track whether past/present buffers are shared.
src/models/recurrent_state.cpp	Implement shared-buffer recurrent-state allocation/binding and disable swapping/rewind rebinding in that mode.
src/models/position_inputs.h	Add static-shape handling APIs for attention mask and 3D position IDs.
src/models/position_inputs.cpp	Implement static mask + stable decode position_ids tensors and in-place updates for TRT-RTX graph replay.
examples/python/model-qa.py	Add `NvTensorRtRtx` EP option to the example CLI.
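
The "duplicate producers" note for base.py relates to ONNX requiring each value name to be produced by exactly one node. The sketch below checks that rule on a simplified graph represented as a list of dicts; it does not use the onnx package, and the node and value names are made up for illustration rather than taken from the exported Qwen3.5 model.

```python
from collections import defaultdict


def find_duplicate_producers(nodes):
    """Map each output name produced by more than one node to its producers."""
    producers = defaultdict(list)
    for node in nodes:
        for output_name in node["outputs"]:
            if output_name:  # an empty string marks an omitted optional output
                producers[output_name].append(node["name"])
    return {name: names for name, names in producers.items() if len(names) > 1}


# Hypothetical graph: SkipLayerNorm's fourth output and a Cast share one name.
graph = [
    {"name": "skip_ln", "outputs": ["ln_out", "", "", "skip_sum"]},
    {"name": "cast", "outputs": ["skip_sum"]},
]
print(find_duplicate_producers(graph))  # {'skip_sum': ['skip_ln', 'cast']}
```

Giving the primitive the redirected output name, as the base.py change describes, would leave each name with a single producer in a graph like this one.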

yen-shi · 2026-05-07T21:46:39Z

@microsoft-github-policy-service agree company="NVIDIA Corporation"

anskumar01 · 2026-05-08T09:03:02Z

@kunal-vaishnavi, @baijumeswani, could you please help review?
Cc @anujj

kunal-vaishnavi · 2026-05-15T16:08:41Z

  }
 };

+bool IsTextOnlyQwen3_5(const Config& config) {


There is a similar PR for enabling a text-only Qwen-3.5 model. Let's align the two PRs so they are on the same page.

Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.

kunal-vaishnavi · 2026-05-15T16:11:56Z


-  // Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds
-  if (model_.session_info_.HasInput(model_.config_->model.decoder.inputs.input_ids)) {
+  // Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds.


There was a PR fix for a regression introduced by these changes. Let's align the two PRs so we can agree on the same fix.

Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.

kunal-vaishnavi · 2026-05-15T16:40:57Z

  }
+  ONNXTensorElementDataType posid_type = mask_type;
+  if (has_posid_input_) {
+    posid_type = model_.session_info_.GetInputDataType(model_.config_->model.decoder.inputs.position_ids);


Why are these changes for the position ids type needed?

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

kunal-vaishnavi · 2026-05-15T17:10:22Z

+        #
+        # Linear attention recurrence accumulates errors across the full sequence,
+        # unlike softmax attention which normalizes per-step.
+        int8_nodes = {}


There is an algorithm called k_quant_linear that was added for mixed-precision quantization. This logic was moved in this PR to the base class for re-usability. Can you sync with the main branch and use that instead?

Addressed in the rewritten branch. I removed the Qwen-local mixed-precision quantization block and now rely on the shared base k_quant_linear path from latest main; the remaining Qwen builder delta is limited to text-only/TRT-RTX export plumbing.

kunal-vaishnavi · 2026-05-15T17:10:59Z

    position_ids_shape_[2] = 0;  // Will be set during first update

    position_ids_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type);
+    position_ids_next_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type);


We shouldn't have both position_ids_ and position_ids_next_ for representing the position ids during different steps of the decoding loop. We should find a way to re-use position_ids_.

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

kunal-vaishnavi · 2026-05-15T17:16:09Z

 }

+template <typename T>
+void Qwen2VLPositionInputs::InitializeStaticMask(OrtValue& cpu_attention_mask) {


Can we move all of the logic for Qwen-specific position inputs inside the Qwen-specific files? This file should ideally just have logic that is shared across all types of models.

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

kunal-vaishnavi · 2026-05-15T17:24:34Z

+  // past/present buffers rather than the generic CUDA EP path.
+  return state_.params_->use_graph_capture ||
+         (state_.params_->IsPastPresentShareBufferEnabled(model_.config_->model.type) &&
+          model_.p_device_->GetType() == DeviceType::NvTensorRtRtx);


Why do we need to specify the device is TRT-RTX? Multiple EPs support graph capture and will require static inputs.

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

apsonawane · 2026-05-15T17:54:49Z

+        }
+        typed_span.CopyCpuToDevice();
+      };
+      if (type_ == Ort::TypeToTensorType<int32_t>) {


Can we use DispatchOnType here to keep it consistent with the other code?

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

yen-shi · 2026-05-19T05:39:26Z

Thanks for the reviews. I rewrote/rebased the branch and addressed the outstanding threads in a smaller two-commit form:

The first commit is the Qwen3.5 text-only export overlap with PR #2157 (Add text-only mode support for Qwen 3.5 model builder).
The second commit is only the TRT-RTX-specific delta needed for Qwen3.5 shared-buffer/CUDA-graph inference.
The previous Qwen-specific runtime edits in position_inputs.cpp were removed; the current branch no longer touches that file.
The previous Qwen-local mixed-precision quantization block was removed; the branch now uses the shared base k_quant_linear path from latest main.
I checked compatibility with PR #2157 (Add text-only mode support for Qwen 3.5 model builder) via git merge-tree; the same Qwen files are touched, but Git auto-merges them with no conflict markers.
Rebuilt and verified Qwen3.5 0.8B and 9B export + TRT-RTX inference with the rebuilt wheel.

yen-shi · 2026-05-19T08:09:40Z

Hi @apsonawane and @kunal-vaishnavi,
I have updated my changes based on PR #2100, PR #2148, and PR #2157, and trimmed them down to two commits: one contains the changes borrowed from #2157, and the other enables shared past/present buffers on the TRT-RTX EP for optimized inference with CUDA graph enabled.

Can you review again please? Thanks!

kunal-vaishnavi · 2026-05-20T00:19:37Z

            skip_input = inputs[1] if skip else None

+        # Cast insertion can redirect SkipLayerNorm's fourth output to a casted
+        # value. Pass that redirected name to the primitive so QDQ export does


Can you explain this further? Why is this an issue exactly? QDQ should impact the MatMul ops used in the model and not Cast ops. Some screenshots of the invalid ONNX model and valid ONNX model in Netron would be helpful to visualize this comparison.

Copilot AI review requested due to automatic review settings May 7, 2026 19:04

yen-shi requested a review from a team as a code owner May 7, 2026 19:04

Copilot AI reviewed May 7, 2026

Comment thread src/models/position_inputs.cpp Outdated

Copilot started reviewing on behalf of yen-shi May 7, 2026 19:55 View session

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 1a7742c to a837888 Compare May 8, 2026 08:59

kunal-vaishnavi reviewed May 15, 2026

apsonawane reviewed May 15, 2026

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 6b44e87 to 5630dd6 Compare May 15, 2026 23:11

kunal-vaishnavi added the 0.14.0 label May 18, 2026

Add Qwen3.5 text-only export support

e61bf57

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from bc30bb1 to 1d4fae7 Compare May 19, 2026 05:34

Enable Qwen3.5 TRT-RTX shared-buffer inference

7a152a1

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 1d4fae7 to 7a152a1 Compare May 19, 2026 07:01

kunal-vaishnavi reviewed May 20, 2026

5 participants
