Enable Qwen3.5 TRT-RTX EP path with CUDA graph#2139
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Enables more reliable CUDA-graph-style replay/reuse for Qwen3.5 on the TRT-RTX execution provider by stabilizing input/output buffer addresses and fixing a couple of TRT-RTX QDQ export edge cases.
Changes:
- Add shared past/present recurrent-state buffer mode to keep bindings stable for TRT-RTX graph replay.
- Keep Qwen2VL-style
attention_maskand 3Dposition_idsat decode-stable shapes for graph capture/shared-buffer runs, updating contents in place. - Adjust TRT-RTX QDQ export behavior for SkipLayerNorm output naming and avoid mixed INT8 weight-only overrides on explicit QDQ paths.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/python/py/models/builders/qwen.py | Avoid applying mixed INT8 (weight-only/QOperator) overrides when exporting explicit QDQ. |
| src/python/py/models/builders/base.py | Pass redirected SkipLayerNorm output_3 name to primitives to avoid duplicate producers in QDQ export. |
| src/models/recurrent_state.h | Track whether past/present buffers are shared. |
| src/models/recurrent_state.cpp | Implement shared-buffer recurrent-state allocation/binding and disable swapping/rewind rebinding in that mode. |
| src/models/position_inputs.h | Add static-shape handling APIs for attention mask and 3D position IDs. |
| src/models/position_inputs.cpp | Implement static mask + stable decode position_ids tensors and in-place updates for TRT-RTX graph replay. |
| examples/python/model-qa.py | Add NvTensorRtRtx EP option to the example CLI. |
|
@microsoft-github-policy-service agree company="NVIDIA Corporation" |
1a7742c to
a837888
Compare
|
@kunal-vaishnavi , @baijumeswani , can you please help review. |
| } | ||
| }; | ||
|
|
||
| bool IsTextOnlyQwen3_5(const Config& config) { |
There was a problem hiding this comment.
There is a similar PR for enabling a text-only Qwen-3.5 model. Let's align the two PRs so they are on the same page.
There was a problem hiding this comment.
Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.
|
|
||
| // Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds | ||
| if (model_.session_info_.HasInput(model_.config_->model.decoder.inputs.input_ids)) { | ||
| // Some multimodal decoders (e.g., Gemma4) require input_ids alongside inputs_embeds. |
There was a problem hiding this comment.
There was a PR fix for a regression introduced by these changes. Let's align the two PRs so we can agree on the same fix.
There was a problem hiding this comment.
Addressed in the two-commit rewrite and latest-main rebase. The overlap with #2157 is isolated in the first commit, while the second commit contains only the TRT-RTX-specific Qwen3.5 delta. I also checked this branch against #2157 with git merge-tree; the touched files auto-merge with no conflict markers.
| } | ||
| ONNXTensorElementDataType posid_type = mask_type; | ||
| if (has_posid_input_) { | ||
| posid_type = model_.session_info_.GetInputDataType(model_.config_->model.decoder.inputs.position_ids); |
There was a problem hiding this comment.
Why are these changes for the position ids type needed?
There was a problem hiding this comment.
Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.
| # | ||
| # Linear attention recurrence accumulates errors across the full sequence, | ||
| # unlike softmax attention which normalizes per-step. | ||
| int8_nodes = {} |
There was a problem hiding this comment.
There is an algorithm called k_quant_linear that was added for mixed-precision quantization. This logic was moved in this PR to the base class for re-usability. Can you sync with the main branch and use that instead?
There was a problem hiding this comment.
Addressed in the rewritten branch. I removed the Qwen-local mixed-precision quantization block and now rely on the shared base k_quant_linear path from latest main; the remaining Qwen builder delta is limited to text-only/TRT-RTX export plumbing.
| position_ids_shape_[2] = 0; // Will be set during first update | ||
|
|
||
| position_ids_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type); | ||
| position_ids_next_ = std::make_unique<Tensor>(model_.p_device_inputs_, posid_type); |
There was a problem hiding this comment.
We shouldn't have both position_ids_ and position_ids_next_ for representing the position ids during different steps of the decoding loop. We should find a way to re-use position_ids_.
There was a problem hiding this comment.
Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.
| } | ||
|
|
||
| template <typename T> | ||
| void Qwen2VLPositionInputs::InitializeStaticMask(OrtValue& cpu_attention_mask) { |
There was a problem hiding this comment.
Can we move all of the logic for Qwen-specific position inputs inside the Qwen-specific files? This file should ideally just have logic that is shared across all types of models.
There was a problem hiding this comment.
Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.
| // past/present buffers rather than the generic CUDA EP path. | ||
| return state_.params_->use_graph_capture || | ||
| (state_.params_->IsPastPresentShareBufferEnabled(model_.config_->model.type) && | ||
| model_.p_device_->GetType() == DeviceType::NvTensorRtRtx); |
There was a problem hiding this comment.
Why do we need to specify the device is TRT-RTX? Multiple EPs support graph capture and will require static inputs.
There was a problem hiding this comment.
Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.
| } | ||
| typed_span.CopyCpuToDevice(); | ||
| }; | ||
| if (type_ == Ort::TypeToTensorType<int32_t>) { |
There was a problem hiding this comment.
Can we use DispatchOnType here, to keep it consistent with other code
There was a problem hiding this comment.
Addressed by the rewrite. The previous Qwen-specific runtime changes in position_inputs.cpp were removed, so the current branch no longer touches this file; Qwen3.5 text-only position handling is kept in the builder/export path instead of shared runtime position-input code.
6b44e87 to
5630dd6
Compare
bc30bb1 to
1d4fae7
Compare
|
Thanks for the reviews. I rewrote/rebased the branch and addressed the outstanding threads in a smaller two-commit form:
|
1d4fae7 to
7a152a1
Compare
|
Hi @apsonawane and @kunal-vaishnavi, Can you review again please? Thanks! |
| skip_input = inputs[1] if skip else None | ||
|
|
||
| # Cast insertion can redirect SkipLayerNorm's fourth output to a casted | ||
| # value. Pass that redirected name to the primitive so QDQ export does |
There was a problem hiding this comment.
Can you explain this further? Why is this an issue exactly? QDQ should impact the MatMul ops used in the model and not Cast ops. Some screenshots of the invalid ONNX model and valid ONNX model in Netron would be helpful to visualize this comparison.
This PR enables Qwen3.5 text-only INT4 QDQ export and TRT-RTX EP inference with CUDA graph/shared past-present buffers.
Structure
The branch is rebased on latest
mainand intentionally split into two commits:Add Qwen3.5 text-only export supportposition_idsas[B, S]and expands inside the graph to[3, B, S]for mRoPE.Enable Qwen3.5 TRT-RTX shared-buffer inferencepast_present_share_bufferis enabled, preserving stable input/output addresses for TRT-RTX graph replay.output_3producer wiring.k_quant_linearpath rather than Qwen-local code.NvTensorRtRtxname to the example EP choices.PR #2157 compatibility
This branch was compared against #2157 using
git merge-tree. The same Qwen files are touched, but Git auto-merges them cleanly and the simulation produced no conflict markers.If #2157 merges first, the first commit in this branch is the overlap and can be dropped/rebased away; the second commit contains the TRT-RTX-specific delta.
Validation
main(bf6cf3fe).python build.py --use_cuda --cuda_home="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.2" --config Release --update --build --parallel --skip_tests --skip_examples