Enable Qwen3.5 TRT-RTX EP path with CUDA graph #2139

Open: yen-shi wants to merge 2 commits into microsoft:main from yen-shi:yenshiw/qwen3.5-trtrtx

yen-shi commented May 7, 2026

This PR enables Qwen3.5 text-only INT4 QDQ export and TRT-RTX EP inference with CUDA graph/shared past-present buffers.

Structure

The branch is rebased on latest main and intentionally split into two commits:

  1. Add Qwen3.5 text-only export support
  2. Enable Qwen3.5 TRT-RTX shared-buffer inference
    • Shares Qwen3.5 recurrent/conv state buffers when past_present_share_buffer is enabled, preserving stable input/output addresses for TRT-RTX graph replay.
    • Fixes QDQ SkipLayerNorm output_3 producer wiring.
    • Keeps mixed-precision quantization logic in the shared base k_quant_linear path rather than Qwen-local code.
    • Adds the canonical NvTensorRtRtx name to the example EP choices.
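
As a rough illustration of why a shared past/present buffer helps graph replay, here is a minimal Python sketch. It is not the code in recurrent_state.cpp: the class and function names are hypothetical, and Python object identity stands in for a device address. A captured CUDA graph replays against the addresses recorded at capture time, so a scheme that swaps past and present buffers changes which buffer the next step reads, while an in-place update does not.

```python
class SwappedState:
    """Ping-pong scheme: write 'present', then swap past and present."""

    def __init__(self, size):
        self.past = [0.0] * size
        self.present = [0.0] * size

    def step(self, delta):
        # Write the new state into the separate output buffer.
        for i, d in enumerate(delta):
            self.present[i] = self.past[i] + d
        # After the swap, the next step reads from a different buffer.
        self.past, self.present = self.present, self.past
        return self.past


class SharedState:
    """Shared scheme: one buffer is both the input and the output."""

    def __init__(self, size):
        self.state = [0.0] * size

    def step(self, delta):
        # Update in place, so the bound buffer never changes.
        for i, d in enumerate(delta):
            self.state[i] += d
        return self.state


def bound_input_ids(state_obj, steps):
    """Record the identity of the input buffer that each step would bind."""
    seen = []
    for _ in range(steps):
        if isinstance(state_obj, SwappedState):
            buf = state_obj.past
        else:
            buf = state_obj.state
        seen.append(id(buf))
        state_obj.step([1.0] * len(buf))
    return seen


swapped_ids = bound_input_ids(SwappedState(4), 3)
shared_ids = bound_input_ids(SharedState(4), 3)
print("swapped buffers stable:", len(set(swapped_ids)) == 1)
print("shared buffer stable:", len(set(shared_ids)) == 1)
```

In the swapped scheme the input binding alternates between two buffers, so a replayed graph would need rebinding between steps; in the shared scheme the binding is the same on every step.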

PR #2157 compatibility

This branch was compared against #2157 using git merge-tree. The same Qwen files are touched, but Git auto-merges them cleanly and the simulation produced no conflict markers.

If #2157 merges first, the first commit in this branch is the overlap and can be dropped/rebased away; the second commit contains the TRT-RTX-specific delta.

Validation

  • Rebased onto latest upstream main (bf6cf3fe).
  • Built CUDA Release wheel with CUDA 13.2:
    python build.py --use_cuda --cuda_home="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.2" --config Release --update --build --parallel --skip_tests --skip_examples
  • Installed the rebuilt wheel in the minimal TRT-RTX package environment.
  • Exported and ran Qwen3.5 0.8B and 9B text-only INT4 QDQ models with TRT-RTX EP + CUDA graph enabled.
    • 0.8B: TTFT 1.18s, decode 64.44 tok/s, answer starts: "The history of artificial intelligence..."
    • 9B: TTFT 1.47s, decode 48.00 tok/s, answer is a coherent reasoning-style response.
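
The PR does not say how TTFT and decode throughput were measured. One common definition, assumed here, takes TTFT as the delay from the request to the first generated token, and the decode rate as the number of subsequent tokens divided by the time spent producing them. The function name and the timestamps below are hypothetical and are not the measurements reported above.

```python
def ttft_and_decode_rate(request_time, token_times):
    """Return (TTFT in seconds, decode rate in tokens per second).

    request_time: time at which generation was requested.
    token_times: arrival time of each generated token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    # Time to first token covers prompt processing plus the first decode step.
    ttft = token_times[0] - request_time
    # The decode rate excludes the first token, which is counted under TTFT.
    decode_seconds = token_times[-1] - token_times[0]
    decode_rate = (len(token_times) - 1) / decode_seconds
    return ttft, decode_rate


# Synthetic timestamps, for illustration only.
ttft, rate = ttft_and_decode_rate(0.0, [1.0, 1.5, 2.0, 2.5, 3.0])
print(ttft, rate)
```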

Copilot AI review requested due to automatic review settings May 7, 2026 19:04
yen-shi requested a review from a team as a code owner May 7, 2026 19:04

Copilot AI left a comment

Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Enables more reliable CUDA-graph-style replay/reuse for Qwen3.5 on the TRT-RTX execution provider by stabilizing input/output buffer addresses and fixing a couple of TRT-RTX QDQ export edge cases.

Changes:

  • Add shared past/present recurrent-state buffer mode to keep bindings stable for TRT-RTX graph replay.
  • Keep Qwen2VL-style attention_mask and 3D position_ids at decode-stable shapes for graph capture/shared-buffer runs, updating contents in place.
  • Adjust TRT-RTX QDQ export behavior for SkipLayerNorm output naming and avoid mixed INT8 weight-only overrides on explicit QDQ paths.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/python/py/models/builders/qwen.py | Avoid applying mixed INT8 (weight-only/QOperator) overrides when exporting explicit QDQ. |
| src/python/py/models/builders/base.py | Pass redirected SkipLayerNorm output_3 name to primitives to avoid duplicate producers in QDQ export. |
| src/models/recurrent_state.h | Track whether past/present buffers are shared. |
| src/models/recurrent_state.cpp | Implement shared-buffer recurrent-state allocation/binding and disable swapping/rewind rebinding in that mode. |
| src/models/position_inputs.h | Add static-shape handling APIs for attention mask and 3D position IDs. |
| src/models/position_inputs.cpp | Implement static mask + stable decode position_ids tensors and in-place updates for TRT-RTX graph replay. |
| examples/python/model-qa.py | Add NvTensorRtRtx EP option to the example CLI. |
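
The "duplicate producers" problem mentioned for base.py can be pictured with a small check. ONNX graphs are required to be in single-assignment form, so each value name may be produced by only one node. The sketch below uses plain dictionaries rather than the onnx package, and the node and tensor names are hypothetical; how the builder actually wires the Cast and SkipLayerNorm outputs is an assumption made for illustration.

```python
from collections import defaultdict


def find_duplicate_producers(nodes):
    """Map each value name produced by more than one node to its producers."""
    producers = defaultdict(list)
    for node in nodes:
        for out in node["outputs"]:
            if out:  # an empty string marks an omitted optional output
                producers[out].append(node["name"])
    return {name: names for name, names in producers.items() if len(names) > 1}


# Hypothetical broken wiring: the Cast output reuses the name that the
# SkipLayerNorm node still declares as its fourth output (output_3).
broken = [
    {"name": "skip_layernorm", "outputs": ["ln_out", "", "", "residual_out"]},
    {"name": "cast_residual", "outputs": ["residual_out"]},
]

# Hypothetical fixed wiring: SkipLayerNorm is given the redirected name,
# so only the Cast produces "residual_out".
fixed = [
    {"name": "skip_layernorm", "outputs": ["ln_out", "", "", "residual_out_precast"]},
    {"name": "cast_residual", "outputs": ["residual_out"]},
]

print(find_duplicate_producers(broken))
print(find_duplicate_producers(fixed))
```

A check of this kind could be run over an exported graph to confirm that no value name has two producers before handing the model to an execution provider.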

Comment threads on src/models/position_inputs.cpp (5, all outdated)

yen-shi commented May 7, 2026

@microsoft-github-policy-service agree company="NVIDIA Corporation"

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 1a7742c to a837888 on May 8, 2026 08:59
@anskumar01

@kunal-vaishnavi, @baijumeswani, can you please help review?
Cc @anujj

Comment thread src/models/model.cpp Outdated
}
};

bool IsTextOnlyQwen3_5(const Config& config) {
Contributor

There is a similar PR for enabling a text-only Qwen-3.5 model. Let's align the two PRs so they are on the same page.

Author

Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.


// Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds
if (model_.session_info_.HasInput(model_.config_->model.decoder.inputs.input_ids)) {
// Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds.
Contributor

There was a PR fix for a regression introduced by these changes. Let's align the two PRs so we can agree on the same fix.

Author

Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.

Comment thread src/models/position_inputs.cpp Outdated
}
ONNXTensorElementDataType posid_type = mask_type;
if (has_posid_input_) {
posid_type = model_.session_info_.GetInputDataType(model_.config_->model.decoder.inputs.position_ids);
Contributor

Why are these changes for the position ids type needed?

Author

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

Comment thread src/python/py/models/builders/qwen.py Outdated
#
# Linear attention recurrence accumulates errors across the full sequence,
# unlike softmax attention which normalizes per-step.
int8_nodes = {}
Contributor

There is an algorithm called k_quant_linear that was added for mixed-precision quantization. This logic was moved in this PR to the base class for re-usability. Can you sync with the main branch and use that instead?

Author

Addressed in the rewritten branch. I removed the Qwen-local mixed-precision quantization block and now rely on the shared base k_quant_linear path from latest main; the remaining Qwen builder delta is limited to text-only/TRT-RTX export plumbing.

Comment thread src/models/position_inputs.cpp Outdated
position_ids_shape_[2] = 0; // Will be set during first update

position_ids_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type);
position_ids_next_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type);
Contributor

We shouldn't have both position_ids_ and position_ids_next_ for representing the position ids during different steps of the decoding loop. We should find a way to re-use position_ids_.

Author

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

Comment thread src/models/position_inputs.cpp Outdated
}

template <typename T>
void Qwen2VLPositionInputs::InitializeStaticMask(OrtValue& cpu_attention_mask) {
Contributor

Can we move all of the logic for Qwen-specific position inputs inside the Qwen-specific files? This file should ideally just have logic that is shared across all types of models.

Author

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

Comment thread src/models/position_inputs.cpp Outdated
// past/present buffers rather than the generic CUDA EP path.
return state_.params_->use_graph_capture ||
(state_.params_->IsPastPresentShareBufferEnabled(model_.config_->model.type) &&
model_.p_device_->GetType() == DeviceType::NvTensorRtRtx);
Contributor

Why do we need to specify the device is TRT-RTX? Multiple EPs support graph capture and will require static inputs.

Author

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

Comment thread src/models/position_inputs.cpp Outdated
}
typed_span.CopyCpuToDevice();
};
if (type_ == Ort::TypeToTensorType<int32_t>) {
Contributor

Can we use DispatchOnType here to keep it consistent with other code?

Author

Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 6b44e87 to 5630dd6 on May 15, 2026 23:11
yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from bc30bb1 to 1d4fae7 on May 19, 2026 05:34

yen-shi commented May 19, 2026

Thanks for the reviews. I rewrote/rebased the branch and addressed the outstanding threads in a smaller two-commit form:

  • The first commit is the Qwen3.5 text-only export overlap with #2157 (Add text-only mode support for Qwen 3.5 model builder).
  • The second commit is only the TRT-RTX-specific delta needed for Qwen3.5 shared-buffer/CUDA-graph inference.
  • The previous Qwen-specific runtime edits in position_inputs.cpp were removed; the current branch no longer touches that file.
  • The previous Qwen-local mixed-precision quantization block was removed; the branch now uses the shared base k_quant_linear path from latest main.
  • I checked compatibility with #2157 (Add text-only mode support for Qwen 3.5 model builder) via git merge-tree; the same Qwen files are touched, but Git auto-merges them with no conflict markers.
  • Rebuilt and verified Qwen3.5 0.8B and 9B export + TRT-RTX inference with the rebuilt wheel.

yen-shi force-pushed the yenshiw/qwen3.5-trtrtx branch from 1d4fae7 to 7a152a1 on May 19, 2026 07:01

yen-shi commented May 19, 2026

Hi @apsonawane and @kunal-vaishnavi,
I have updated my changes based on PR #2100, PR #2148, and PR #2157. I trimmed the branch down to two commits: one contains the changes borrowed from #2157, and the other enables shared past/present buffers on the TRT-RTX EP for optimized inference with CUDA graph enabled.

Can you review again please? Thanks!

skip_input = inputs[1] if skip else None

# Cast insertion can redirect SkipLayerNorm's fourth output to a casted
# value. Pass that redirected name to the primitive so QDQ export does
Contributor

Can you explain this further? Why is this an issue exactly? QDQ should impact the MatMul ops used in the model and not Cast ops. Some screenshots of the invalid ONNX model and valid ONNX model in Netron would be helpful to visualize this comparison.
