Add VideoChat-Flash (OpenGVLab) language model support#2147
anilmartha wants to merge 24 commits into
Adds text decoder export support for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B.

Architecture: VideoChatFlashQwenForCausalLM is a VLM whose language backbone is standard Qwen2.5-7B (flat config, standard weight keys, 2D RoPE, rope_theta=1e6, 28L / 28h / 4kv / hidden=3584). The model does NOT use MRoPE, so the builder inherits QwenModel directly rather than Qwen25VLTextModel.

Changes:
- src/python/py/models/builders/qwen.py: Add VideoChatFlashQwenModel subclass of QwenModel. Sets exclude_embeds=True (text decoder receives inputs_embeds from the embedding merger) and model_type="videochat_flash_qwen".
- src/python/py/models/builders/__init__.py: Export VideoChatFlashQwenModel.
- src/python/py/models/builder.py: Map "VideoChatFlashQwenForCausalLM" architecture string to VideoChatFlashQwenModel with exclude_embeds=True.
- src/models/model_type.h: Register "videochat_flash_qwen" in IsVLM() (size 6->7).
- examples/python/videochat-flash/builder.py: New example export script.

Phase 1 (this PR): text decoder only. Vision encoder (InternVideo2-1B) and embedding merger export are Phase 2 (scaffolded as TODOs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…inference support

- builder.py (example): use local OGA builder via sys.path; fix create_model args when no local input dir; move prepare_model() inside Phase-2 block (trust_remote_code=False is safe for config-only loads)
- builder.py (core): add VCF config bypass using hf_hub_download + Qwen2Config to avoid av/cv2/decord/imageio imports triggered by AutoConfig; set config._name_or_path so load_weights() resolves the model correctly; make exclude_embeds opt-in (guard with 'not in extra_options') so standalone (input_ids) export also works
- qwen.py: add make_genai_config() override that writes a temp Qwen2Config and patches genai_config.json type back to videochat_flash_qwen; add load_weights() override using Qwen2ForCausalLM.from_pretrained() directly

Validated: text-only inference with vcf-oga-fp32-standalone/ produces correct answers (Paris, 56, ONNX Runtime description) using append_tokens() API on the exported Qwen2.5-7B backbone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
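The opt-in guard for exclude_embeds mentioned in this commit can be illustrated with a small sketch. It is not the PR's code: the helper name `with_default_exclude_embeds` is hypothetical, and storing the option as a Python bool is an assumption.

```python
def with_default_exclude_embeds(extra_options=None):
    """Return a copy of extra_options with exclude_embeds defaulted to True.

    Hypothetical helper: the name and the bool value type are assumptions.
    """
    options = dict(extra_options) if extra_options else {}
    # Fill in the default only when the caller made no choice, so an explicit
    # exclude_embeds=False (standalone input_ids export) is preserved.
    if "exclude_embeds" not in options:
        options["exclude_embeds"] = True
    return options


vlm_options = with_default_exclude_embeds({})
standalone_options = with_default_exclude_embeds({"exclude_embeds": False})
```

Checking for key presence rather than truthiness is what lets a caller pass an explicit false value without it being overwritten by the default.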
examples/python/videochat-flash/run.py:
- Text-only inference for the exported OGA model
- Uses HF AutoTokenizer (og.Tokenizer fails for TokenizersBackend models)
- Feeds tokens via generator.append_tokens(np.int32 array)
- --batch flag runs 4 built-in QA prompts for quick validation
- --prompt / --max-length for custom single-prompt runs
- Notes in docstring on genai_config.json type patch workaround for
installed OGA binaries that predate videochat_flash_qwen support
Validated: all 4 batch prompts produce correct answers on vcf-oga-fp32-standalone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
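A rough sketch of what such a text-only runner could look like follows. It is not the run.py from this PR: the flag spellings --model-dir and --tokenizer, the prompt texts, and the helper names are assumptions, and the generation loop uses the onnxruntime-genai Python API as commonly documented (og.Model, og.GeneratorParams, og.Generator), which may differ between releases.

```python
import argparse

# Hypothetical prompts: the actual four built-in prompts are not shown here.
BATCH_PROMPTS = [
    "What is the capital of France?",
    "What is 7 times 8?",
    "What is ONNX Runtime?",
    "Name one primary color.",
]


def build_parser():
    parser = argparse.ArgumentParser(description="Text-only inference sketch")
    parser.add_argument("--model-dir", required=True)   # exported OGA folder (assumed flag)
    parser.add_argument("--tokenizer", required=True)   # HF tokenizer name or path (assumed flag)
    parser.add_argument("--prompt", default=None)
    parser.add_argument("--max-length", type=int, default=256)
    parser.add_argument("--batch", action="store_true")
    return parser


def select_prompts(args):
    """Return the built-in prompts for --batch, else the single custom prompt."""
    if args.batch:
        return list(BATCH_PROMPTS)
    if args.prompt is None:
        raise SystemExit("pass --prompt or --batch")
    return [args.prompt]


def generate(model_dir, tokenizer_name, prompt, max_length):
    # Imported lazily so argument handling works without these packages.
    import numpy as np
    import onnxruntime_genai as og
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    input_ids = np.array(tokenizer.encode(prompt), dtype=np.int32)

    model = og.Model(model_dir)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    generator = og.Generator(model, params)
    generator.append_tokens(input_ids)  # token IDs, not inputs_embeds

    output_ids = []
    while not generator.is_done():
        generator.generate_next_token()
        output_ids.append(int(generator.get_next_tokens()[0]))
    return tokenizer.decode(output_ids, skip_special_tokens=True)


def main(argv=None):
    args = build_parser().parse_args(argv)
    for prompt in select_prompts(args):
        print(generate(args.model_dir, args.tokenizer, prompt, args.max_length))
```

Calling `main(["--model-dir", "<dir>", "--tokenizer", "<name>", "--batch"])` would run all built-in prompts, assuming an export whose decoder takes input_ids.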
…xt-only mode

Root cause of teammate's inference failure:
1. builder.py set type=videochat_flash_qwen even in --text-only mode. OGA loads this via MultiModalLanguageModel, which requires vision.onnx + embedding.onnx (not present in text-only export) → 'File doesn't exist' error on og.Model().
2. exclude_embeds defaulted to true → decoder expected inputs_embeds, but run.py feeds token IDs via append_tokens() → 'input not found' error.

Fix:
- builder.py export_text_model(): pass exclude_embeds=false in text-only mode (decoder uses input_ids for standalone inference).
- builder.py update_genai_config(): set type=qwen2 in text-only mode so OGA loads the model as a plain decoder (LM backbone is identical to Qwen2.5-7B). type=videochat_flash_qwen is reserved for Phase 2 full VLM pipeline.
- run.py: simplify docstring now that no manual patching is needed.

Validated with compiled OGA build (onnxruntime-genai conda env, Python 3.11): all 4 batch prompts produce correct answers without any manual config changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
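The type selection described in this fix can be sketched as below. This is a minimal illustration rather than the example script's actual code: the function signature, the `text_only` parameter, and the assumption that the type lives at `config["model"]["type"]` in genai_config.json are all stated here as assumptions.

```python
import json
from pathlib import Path


def update_genai_config(out_dir, text_only):
    """Rewrite model.type in genai_config.json (signature is an assumption)."""
    config_path = Path(out_dir) / "genai_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Text-only exports are loaded as a plain decoder; the VLM type is kept
    # for the full pipeline that also ships vision.onnx and embedding.onnx.
    config["model"]["type"] = "qwen2" if text_only else "videochat_flash_qwen"
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config["model"]["type"]
```

Keeping the decision in one place means the exporter and the runtime agree on which loader path a given output folder is meant for.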
Documents Phase 1 (text decoder, done) and Phase 2 (vision + embedding, TODO):
- Architecture overview: InternVideo2-1B ViT → mm_projector → embedding merger → Qwen2.5-7B
- Phase 1 summary: what was implemented, design decisions, usage instructions
- Phase 2 roadmap: vision.onnx export (39-block ViT + MLP projector), embedding.onnx (embed_tokens + image-pad replacement), genai_config.json wiring, video preprocessing, and optional in-LLM HiCo compression (llm_compress_layer_list)
- Key challenges: 3D spatiotemporal attention ONNX compat, ToMe token merging ops, dynamic temporal position embeddings
- File map and reference links

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Olive Recipe: microsoft/olive-recipes#406

@microsoft-github-policy-service agree company="AMD"
Pull request overview
Adds first-class support for the OpenGVLab VideoChat-Flash Qwen-based video-language model by extending the Python model builder to export its (text) decoder without pulling video dependencies, and extending the C++ runtime to recognize/run the new multimodal model type.
Changes:
- Python: detect `VideoChatFlashQwenForCausalLM` early from raw `config.json`, route to a new `VideoChatFlashQwenModel`, and avoid `trust_remote_code` video-library imports.
- C++: introduce `VideoChatFlashProcessor` and register `videochat_flash_qwen` as a VLM model type/processor.
- Config: add and parse `vision.num_visual_tokens`, used by the new processor for fixed visual-token padding.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/python/py/models/builders/qwen.py | Adds VideoChatFlashQwenModel subclass to export the LM backbone while bypassing remote-code video deps. |
| src/python/py/models/builders/\_\_init\_\_.py | Exports the new builder class from the builders package. |
| src/python/py/models/builder.py | Adds config “peek” detection for VideoChat-Flash and routes to the new builder with default exclude_embeds=True. |
| src/models/videochat_flash_processor.h | Declares VideoChatFlashProcessor (new multimodal processor). |
| src/models/videochat_flash_processor.cpp | Implements preprocessing/token padding + emits image_grid_thw for batching inference. |
| src/models/model.h | Updates header/license comments. |
| src/models/model.cpp | Registers videochat_flash_qwen → VideoChatFlashProcessor in the multimodal processor factory. |
| src/models/model_type.h | Adds videochat_flash_qwen to the VLM model-type list. |
| src/config.h | Adds vision.num_visual_tokens to config schema. |
| src/config.cpp | Parses vision.num_visual_tokens from genai_config.json. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- builder.py: introduce _has_vcf_architecture() helper that handles architectures as a list, a bare string, or absent/unknown type, then use it in both the local-dir and HF-hub code paths (Copilot comment #2)
- config.h: correct misleading comment on num_visual_tokens: remove the false "0 = compute from image_grid_thw" claim; the field must be > 0 for videochat_flash_qwen (Copilot comment #3)
- videochat_flash_processor.cpp: add clarifying comment to empty catch block: the exception is expected when only the language decoder session is loaded (Copilot comment #4)
- builders/__init__.py: move VideoChatFlashQwenModel to its correct alphabetical position in __all__ (after SmolLM3Model, before WhisperModel) per kunal-vaishnavi comment #6

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
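A helper of the kind described in the first bullet might look like the sketch below. It is an illustration under assumptions, not the code from this commit: the function takes an already-parsed config dictionary, and membership (rather than first-element) matching for the list case is a guess.

```python
VCF_ARCHITECTURE = "VideoChatFlashQwenForCausalLM"


def has_vcf_architecture(raw_config):
    """Return True if a parsed config.json names the VideoChat-Flash architecture.

    Illustrative only: argument shape and list-matching rule are assumptions.
    """
    architectures = raw_config.get("architectures")
    if isinstance(architectures, str):
        # Bare string instead of the usual list.
        return architectures == VCF_ARCHITECTURE
    if isinstance(architectures, (list, tuple)):
        return VCF_ARCHITECTURE in architectures
    # Key absent, null, or an unexpected type.
    return False
```

Handling the three shapes explicitly avoids indexing `architectures[0]` on a string, which would silently compare a single character.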
Hi @kunal-vaishnavi
    ort_extensions::OrtxObjectPtr<OrtxTensorResult> result;
    CheckResult(OrtxImagePreProcess(processor_.get(), images->images_.get(), result.ToBeAssigned()));

    OrtxTensor* pixel_values = nullptr;

This appears to be a memory leak. Can you fix this? There was a recent PR that fixes this issue.
    std::to_string(pv_ndims) + " (expected 3 or 4)");
    }

    // Transpose HWC → CHW and reshape to [1, num_frames, C, H, W]

Let's do this transpose as another step inside ORT Extensions before returning the tensor to ORT GenAI.
    }
    }

    auto converted_pv = ConvertPixelValues(*float_tensor, pixel_values_type_, allocator);

Rather than creating a new method called ConvertPixelValues, can we re-use existing APIs that already convert dtypes? Here is an example.
onnxruntime-genai/src/models/whisper_processor.cpp
Lines 33 to 39 in ca9ecd2
    grid_ptr[i * 3 + 1] = height;
    grid_ptr[i * 3 + 2] = width;
    }
    named_tensors->emplace("image_grid_thw",

If there are new input names, let's add their default names in the config (e.g. Config::Defaults::ImageGridThw) and reference them here.
    Qwen25VLTextModel,
    Qwen35TextModel,
    QwenModel,
    VideoChatFlashQwenModel,

Let's preserve any existing alphabetical orders when inserting new items.
    # hf_remote=False so base class helpers (make_genai_config,
    # save_processing) use standard, non-remote-code paths.
    extra_options = dict(extra_options) if extra_options else {}
    extra_options["hf_remote"] = False

To be consistent with other models, let's set any default extra options flags during architecture parsing. For example:
onnxruntime-genai/src/python/py/models/builder.py
Lines 292 to 299 in ca9ecd2
    # establishes these attributes on first assignment instead of us
    # overwriting inherited values.
    #
    # The custom remote code requires video libraries (av, cv2, decord,

It's fine if video libraries need to be temporarily installed. These libraries are on a per-model basis and are based on Hugging Face's requirements for loading a model. For example, Phi-4 mm requires PEFT for loading the model and we don't remove those library dependencies even though they are not technically needed for producing the ONNX model.
    super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)

    def make_genai_config(self, model_name_or_path, extra_kwargs, out_dir):

Same comment as here. We should not need to override this method at all.
    from transformers import Qwen2Config

    vcf_config = Qwen2Config.from_pretrained(model_name_or_path, token=self.hf_token, **extra_kwargs)

If you want to avoid loading the video library imports by using a Qwen2Config for importing, can you create a generic load_config method in the base class (similar to load_weights below)? Then, we can call that method in the various locations that require a config. That method can then be overridden in this class to be Qwen-2 specific.
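One way to read this suggestion is sketched below. The class names `BuilderBase` and `VideoChatFlashQwenBuilder`, the constructor arguments, and the `config_summary()` caller are hypothetical stand-ins for the real builder classes; only `AutoConfig.from_pretrained` and `Qwen2Config.from_pretrained` are actual Transformers calls.

```python
class BuilderBase:
    """Hypothetical, heavily simplified stand-in for the builder base class."""

    def __init__(self, model_name_or_path, hf_token=None, hf_remote=True):
        self.model_name_or_path = model_name_or_path
        self.hf_token = hf_token
        self.hf_remote = hf_remote

    def load_config(self, **extra_kwargs):
        # Generic path: let Transformers resolve the config class.
        from transformers import AutoConfig

        return AutoConfig.from_pretrained(
            self.model_name_or_path,
            token=self.hf_token,
            trust_remote_code=self.hf_remote,
            **extra_kwargs,
        )

    def config_summary(self, **extra_kwargs):
        # Any code that needs a config calls load_config() instead of AutoConfig.
        config = self.load_config(**extra_kwargs)
        return {
            "model_type": config.model_type,
            "num_hidden_layers": config.num_hidden_layers,
        }


class VideoChatFlashQwenBuilder(BuilderBase):
    def load_config(self, **extra_kwargs):
        # Model-specific path: read the same config.json as a plain Qwen2 config,
        # which does not execute the repository's remote code.
        from transformers import Qwen2Config

        return Qwen2Config.from_pretrained(
            self.model_name_or_path, token=self.hf_token, **extra_kwargs
        )
```

With this shape, the shared dispatch code stays generic and the model-specific behavior lives entirely in the subclass override.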
    hf_remote = extra_options.get("hf_remote", True)

    config = AutoConfig.from_pretrained(hf_name, token=hf_token, trust_remote_code=hf_remote, **extra_kwargs)
    # VideoChat-Flash uses custom remote code that imports heavy video libraries

All of the below changes should be reverted. They are too specific for this model and this section needs to remain generic. We can try the load_config suggestion from earlier to resolve this.
        onnx_model = Phi4MMModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
    elif config.architectures[0] == "Qwen2ForCausalLM":
        onnx_model = QwenModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
    elif config.architectures[0] == "VideoChatFlashQwenForCausalLM":

Can we remove the if condition?
onnxruntime-genai/src/python/py/models/builder.py
Lines 292 to 299 in f32444d
Multi-modal models do not need to check this since the embedding ONNX model will go from input_ids to inputs_embeds.
Address PR #2147 review: keep the config loading section generic by removing the VideoChat-Flash architecture peek/Qwen2Config workaround. Use the standard AutoConfig.from_pretrained path for all models.

Also remove the forced hf_remote=False in the VCF dispatch branch so HF model dependencies (video libraries) are loaded normally, consistent with how other models handle their library requirements.

Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @kunal-vaishnavi
    # and genai_config.json. Same pattern as Qwen3VLTextModel.
    # Base class transforms this to "videochat_flash_qwen" via:
    #     model_type[:model_type.find("For")].lower()
    self.model_type = "VideoChat_Flash_QwenForCausalLM"
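The transformation quoted in the comment above can be checked in isolation. The function name `derive_model_type` is hypothetical; the slicing expression is the one from the comment.

```python
def derive_model_type(architecture_name):
    # Keep everything before the first "For", then lowercase it.
    return architecture_name[: architecture_name.find("For")].lower()


with_underscores = derive_model_type("VideoChat_Flash_QwenForCausalLM")
without_underscores = derive_model_type("VideoChatFlashQwenForCausalLM")
print(with_underscores)     # videochat_flash_qwen
print(without_underscores)  # videochatflashqwen
```

The second call shows why the underscored spelling is assigned: the Hugging Face architecture string on its own would lowercase to a name without underscores, which would not match the registered `videochat_flash_qwen` type.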
Summary
Adds support for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B, a video-language model whose architecture is `VideoChatFlashQwenForCausalLM`. The language backbone is standard Qwen2.5-7B (flat config, standard weight keys, 2D RoPE, `rope_theta=1e6`, 28 layers / 28 heads / 4 KV heads / hidden=3584). Because it does not use MRoPE, the new builder inherits from `QwenModel` directly rather than `Qwen25VLTextModel`.

Changes

Python builder (`src/python/py/models/`)

- `builders/qwen.py`: New `VideoChatFlashQwenModel` subclass of `QwenModel`.
  - Overrides `make_genai_config()` to bypass the model's custom remote code (which imports `av`, `cv2`, `decord`, `imageio` even for config-only loads) by writing a temporary `Qwen2Config`-based `config.json` and patching `genai_config.json["model"]["type"]` back to `videochat_flash_qwen`.
  - Overrides `load_weights()` to load the LM backbone directly via `Qwen2ForCausalLM.from_pretrained()`, avoiding video deps.
  - Sets `model_type = "videochat_flash_qwen"` and `hf_remote = False`.
- `builders/__init__.py`: Exports `VideoChatFlashQwenModel`.
- `builder.py`:
  - Detects `VideoChatFlashQwenForCausalLM` by peeking at raw `config.json` (local dir or via `hf_hub_download`) before calling `AutoConfig`, then loads as `Qwen2Config` to avoid pulling in video libraries.
  - Maps `VideoChatFlashQwenForCausalLM` → `VideoChatFlashQwenModel` and defaults `exclude_embeds=True` (the text decoder receives `inputs_embeds` from the embedding merger). `exclude_embeds` is now opt-in (guarded with `not in extra_options`) so standalone `input_ids` export still works.

C++ runtime (`src/`)

- `models/videochat_flash_processor.{h,cpp}` (new): `VideoChatFlashProcessor` runs ORT-Extensions image preprocessing, transposes HWC→CHW, reshapes to `[1, num_frames, C, H, W]`, casts pixel values to the session input dtype, builds `input_ids` by replacing each `<|vision_start|>…<|vision_end|>` block with a fixed number of `<|image_pad|>` tokens, and emits `image_grid_thw` so `GetImageFeatureBatchSize` can determine `num_images` after the `pixel_values` name remap.
- `models/model.cpp`: Registers `"videochat_flash_qwen" → VideoChatFlashProcessor` in `MultiModalProcessor`.
- `models/model_type.h`: Adds `"videochat_flash_qwen"` to the `IsVLM()` list (size 6 → 7).
- `config.{h,cpp}`: Adds `Vision::num_visual_tokens` (default `0`), the fixed number of visual tokens per image; `0` falls back to computing from `image_grid_thw`. Required (>0) by the new processor.

Files changed

| File | Description |
|---|---|
| `src/config.cpp` | Parses `num_visual_tokens` from vision config |
| `src/config.h` | Adds `num_visual_tokens` field on `Vision` |
| `src/models/model.cpp` | Registers `VideoChatFlashProcessor` |
| `src/models/model.h` | |
| `src/models/model_type.h` | Adds `videochat_flash_qwen` to VLM list |
| `src/models/videochat_flash_processor.cpp` | |
| `src/models/videochat_flash_processor.h` | |
| `src/python/py/models/builder.py` | |
| `src/python/py/models/builders/__init__.py` | Exports `VideoChatFlashQwenModel` |
| `src/python/py/models/builders/qwen.py` | Adds `VideoChatFlashQwenModel` |

Validation

- `python -m onnxruntime_genai.models.builder -m OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B -o <dir> -p fp32 -e cpu --extra_options exclude_embeds=false` succeeds without `av`/`cv2`/`decord`/`imageio` installed.
- The `vcf-oga-fp32-standalone` model produces correct answers across 4 batched QA prompts (e.g. Paris, 56, ONNX Runtime description) using `generator.append_tokens()` on the Qwen2.5-7B backbone.

Test plan

- `IsVLM("videochat_flash_qwen")` returns `true`.
- Export with `exclude_embeds=false` and run text-only inference end-to-end.
- `genai_config.json` contains `"type": "videochat_flash_qwen"` (full VLM) or `"type": "qwen2"` (text-only standalone).
- […] `vision.onnx` and `embedding.onnx` land.
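The fixed visual-token padding described for the processor can be illustrated at the string level. This is a Python sketch of the idea only, not the C++ `VideoChatFlashProcessor`: the function name `pad_vision_blocks` is hypothetical, and it assumes the whole block, delimiters included, is replaced, which is one reading of the description above.

```python
import re

# Non-greedy match so each vision block is handled separately.
VISION_BLOCK = re.compile(r"<\|vision_start\|>.*?<\|vision_end\|>", re.DOTALL)
IMAGE_PAD = "<|image_pad|>"


def pad_vision_blocks(prompt, num_visual_tokens):
    """Replace each vision block with num_visual_tokens pad tokens (assumed semantics)."""
    if num_visual_tokens <= 0:
        # Mirrors the stated requirement that the value be > 0 for this model type.
        raise ValueError("num_visual_tokens must be > 0")
    padding = IMAGE_PAD * num_visual_tokens
    return VISION_BLOCK.sub(lambda match: padding, prompt)


padded = pad_vision_blocks("Describe <|vision_start|>clip<|vision_end|> briefly.", 3)
```

Because the pad count is fixed, the number of placeholder positions in `input_ids` is known before the vision encoder runs, which is what allows the embedding merger to line up visual features with those positions.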