Add text-only mode support for Qwen 3.5 model builder #2157
Pull request overview
Adds support for building and running Qwen 3.5 as a standalone text-only LLM (without multimodal embedding/vision pipeline), including runtime-side fixes to avoid incorrectly injecting input_ids into decoders that only accept inputs_embeds.
Changes:
- Add “text-only mode” to the Qwen 3.5 builder, including 2D `position_ids` support with internal expansion for mRoPE and a tokenizer-regex post-export patch.
- Fix multimodal runtime logic to check decoder inputs against decoder-only session metadata (avoids false positives from the embedding session).
- Register `qwen3_5_text` as an LLM model type in C++.
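
As a rough illustration of the tokenizer-regex patch in the first item, the sketch below removes the `\p{M}` property class from a pattern string. The function name and the example pattern are hypothetical and are not taken from the builder or from the actual Qwen 3.5 tokenizer.

```python
def strip_mark_category(pattern: str) -> str:
    r"""Remove every literal \p{M} token from a regex pattern string.

    Hypothetical helper: the real builder may locate and rewrite
    patterns differently.
    """
    return pattern.replace(r"\p{M}", "")


# Illustrative pattern only (an assumption, not the real tokenizer regex).
example = r"[\p{L}\p{M}]+|[^\s\p{L}\p{M}\p{N}]+"
patched = strip_mark_category(example)
print(patched)
```

A plain string replacement like this is only safe when `\p{M}` appears inside a character class next to other members; a standalone `\p{M}+` would leave a dangling quantifier and would need separate handling.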
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/python/py/models/builders/qwen.py | Adds text-only build path for Qwen 3.5, adjusts position_ids handling, and patches exported tokenizer regex. |
| src/models/multi_modal.cpp | Refines decoder input detection to avoid injecting input_ids when the decoder session doesn’t accept it. |
| src/models/model_type.h | Adds qwen3_5_text to the LLM model-type allowlist. |
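
The decoder input detection change in `multi_modal.cpp` can be pictured with a small Python sketch. It is not the C++ implementation: the session names, input names, and helper functions are all assumptions chosen to show why checking only the decoder session avoids a false positive from the embedding session.

```python
from typing import Dict, Set

# Hypothetical per-session input names for a multimodal pipeline.
SessionInputs = Dict[str, Set[str]]


def any_session_accepts(sessions: SessionInputs, name: str) -> bool:
    """Looser check: true if any session declares the input."""
    return any(name in inputs for inputs in sessions.values())


def decoder_accepts(sessions: SessionInputs, name: str) -> bool:
    """Stricter check: true only if the decoder session declares the input."""
    return name in sessions.get("decoder", set())


sessions: SessionInputs = {
    "embedding": {"input_ids", "image_features"},
    "decoder": {"inputs_embeds", "attention_mask", "position_ids"},
}

# The looser check matches because the embedding session takes input_ids,
# even though the decoder itself only accepts inputs_embeds.
looser = any_session_accepts(sessions, "input_ids")
stricter = decoder_accepts(sessions, "input_ids")
```

With these example sessions, only the stricter check reports that `input_ids` should not be passed to the decoder.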

    # Position IDs input.
    # In text-only mode the runtime provides standard 2D [B, S] position_ids.
    # We expand them to 3D [3, B, S] inside the graph so mRoPE works unchanged.
    # In VL mode the pipeline provides 3D position_ids directly.

Why do the position ids have to differ in text-only mode vs. vision-language mode? It should be an identical decoder besides the input_ids vs inputs_embeds input.
The difference exists because of who provides them, not the decoder itself.
- VL mode: The multimodal pipeline computes 3D [3, B, S] position IDs externally — image patches need distinct spatial (H/W) positions vs text tokens (temporal only).
- Text-only mode: The onnxruntime-genai runtime only generates standard 2D [B, S] position IDs. For pure text, all 3 mRoPE dimensions are identical (sequential), so we Tile [B, S] → [3, B, S] in the graph.
To eliminate the Tile, we'd need to:
- Add `qwen3_5_text` to `IsQwenVLFamily()` in `model_type.h`
- Guard vision token config reads in `Qwen2VLPositionInputs` for the no-vision case in `position_inputs.cpp`
- Remove the `is_text_only` branching in the Python builder
There is no perf regression from the Tile, and moving this into the runtime would add code complexity, I would say. Also, given the changes listed above, keeping the text model out of the VL model type makes sense to me.
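
A minimal sketch of the expansion discussed in this thread, written with plain Python lists rather than ONNX graph ops. The function names are hypothetical. It only shows the shape change from `[B, S]` to `[3, B, S]`, where each of the three mRoPE planes is an identical copy of the sequential text positions.

```python
from typing import List

PositionIds2D = List[List[int]]        # shape [B, S]
PositionIds3D = List[List[List[int]]]  # shape [3, B, S]


def make_text_position_ids(batch: int, seq_len: int) -> PositionIds2D:
    """Sequential [B, S] position IDs, as a text-only runtime would supply."""
    return [list(range(seq_len)) for _ in range(batch)]


def expand_for_mrope(position_ids: PositionIds2D, num_axes: int = 3) -> PositionIds3D:
    """Copy the [B, S] IDs once per mRoPE axis, giving [num_axes, B, S]."""
    return [[list(row) for row in position_ids] for _ in range(num_axes)]


pos_2d = make_text_position_ids(batch=2, seq_len=4)
pos_3d = expand_for_mrope(pos_2d)

# Every plane equals the original 2D input, which is why a Tile is enough
# for pure text.
planes_identical = all(plane == pos_2d for plane in pos_3d)
```

In VL mode the three planes would generally differ for image patches, so the same shortcut would not apply there.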
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Description
Adds support for running Qwen 3.5 as a standalone text-only LLM (without vision/embedding pipelines).
Changes
- When `exclude_embeds=false`, the builder creates a 2D `[B, S]` `position_ids` graph input and internally expands it to 3D `[3, B, S]` for mRoPE compatibility. This allows the standard onnxruntime-genai runtime to provide `position_ids` without requiring the multimodal pipeline.
- Adds a `save_processing` override that patches unsupported `\p{M}` (Unicode Mark category) out of tokenizer regex patterns after export. The C++ `std::regex` engine in onnxruntime-extensions does not support this Unicode property class.
- Uses a `_pos_ids_3d` attribute so `self.input_names["position_ids"]` remains `"position_ids"`, ensuring genai_config.json references the actual graph input.

Usage