Add VideoChat-Flash (OpenGVLab) language model support #2147

Open
anilmartha wants to merge 24 commits into microsoft:main from anilmartha:add-opengv-support_updated

Conversation

@anilmartha
Contributor

@anilmartha anilmartha commented May 8, 2026

Summary

Adds support for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B, a video-language model whose architecture is VideoChatFlashQwenForCausalLM. The language backbone is standard Qwen2.5-7B (flat config, standard weight keys, 2D RoPE, rope_theta=1e6, 28 layers / 28 heads / 4 KV heads / hidden=3584). Because it does not use MRoPE, the new builder inherits from QwenModel directly rather than Qwen25VLTextModel.

Changes

Python builder (src/python/py/models/)

  • builders/qwen.py — New VideoChatFlashQwenModel subclass of QwenModel.
    • Overrides make_genai_config() to bypass the model's custom remote code (which imports av, cv2, decord, imageio even for config-only loads) by writing a temporary Qwen2Config-based config.json and patching genai_config.json["model"]["type"] back to videochat_flash_qwen.
    • Overrides load_weights() to load the LM backbone directly via Qwen2ForCausalLM.from_pretrained(), avoiding video deps.
    • Sets model_type = "videochat_flash_qwen" and hf_remote = False.
  • builders/__init__.py — Exports VideoChatFlashQwenModel.
  • builder.py
    • Detects VideoChatFlashQwenForCausalLM by peeking at raw config.json (local dir or via hf_hub_download) before calling AutoConfig, then loads as Qwen2Config to avoid pulling in video libraries.
    • Routes VideoChatFlashQwenForCausalLM → VideoChatFlashQwenModel and defaults exclude_embeds=True (the text decoder receives inputs_embeds from the embedding merger). The default is applied only when exclude_embeds is absent from extra_options, so a standalone input_ids export (exclude_embeds=false) still works.
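
A minimal sketch of the two builder.py behaviours described above: reading config.json as plain JSON to spot the VideoChat-Flash architecture without importing any model code, and defaulting exclude_embeds only when the caller has not set it. The function names (has_vcf_architecture, peek_local_config, apply_vcf_defaults) are illustrative assumptions, not the PR's actual code, and the sketch covers only the local-directory path, not the hf_hub_download path.

```python
import json
from pathlib import Path

VCF_ARCH = "VideoChatFlashQwenForCausalLM"


def has_vcf_architecture(raw_config: dict) -> bool:
    """Return True if a raw config dict names the VideoChat-Flash architecture.

    Tolerates `architectures` being a list, a bare string, or missing.
    """
    archs = raw_config.get("architectures")
    if isinstance(archs, str):
        archs = [archs]
    if not isinstance(archs, (list, tuple)):
        return False
    return VCF_ARCH in archs


def peek_local_config(model_dir: str) -> dict:
    """Read config.json as plain JSON; returns {} if the file is absent."""
    path = Path(model_dir) / "config.json"
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def apply_vcf_defaults(extra_options: dict) -> dict:
    """Default exclude_embeds to True only when the caller did not set it."""
    options = dict(extra_options)  # copy so the caller's dict is untouched
    if "exclude_embeds" not in options:
        options["exclude_embeds"] = True
    return options
```

With this shape, a user who passes exclude_embeds=false on the command line keeps that value, while a plain export of the VLM decoder picks up the inputs_embeds default.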

C++ runtime (src/)

  • models/videochat_flash_processor.{h,cpp} (new): VideoChatFlashProcessor runs ORT-Extensions image preprocessing, transposes HWC→CHW, reshapes to [1, num_frames, C, H, W], casts pixel values to the session input dtype, builds input_ids by replacing each <|vision_start|>…<|vision_end|> block with a fixed number of <|image_pad|> tokens, and emits image_grid_thw so GetImageFeatureBatchSize can determine num_images after the pixel_values name remap.
  • models/model.cpp — Registers "videochat_flash_qwen" → VideoChatFlashProcessor in MultiModalProcessor.
  • models/model_type.h — Adds "videochat_flash_qwen" to the IsVLM() list (size 6 → 7).
  • config.{h,cpp}: Adds Vision::num_visual_tokens (default 0), the fixed number of visual tokens per image. The new processor requires a value > 0; a later commit in this PR removes the claim that 0 falls back to computing the count from image_grid_thw.
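
The input_ids step of the processor can be sketched in Python, operating on token IDs. This is an illustration under stated assumptions, not a port of the C++ code: the token IDs are made up, expand_vision_blocks is a hypothetical name, and the description does not say whether the <|vision_start|> and <|vision_end|> markers survive the replacement, so the sketch exposes that choice as a keep_markers flag.

```python
# Hypothetical token IDs, chosen for illustration only.
VISION_START, VISION_END, IMAGE_PAD = 1001, 1002, 1003


def expand_vision_blocks(token_ids, num_visual_tokens, keep_markers=True):
    """Replace each VISION_START ... VISION_END block with a fixed pad run.

    num_visual_tokens must be > 0, mirroring the requirement placed on
    vision.num_visual_tokens by the new processor.
    """
    if num_visual_tokens <= 0:
        raise ValueError("num_visual_tokens must be > 0")

    token_ids = list(token_ids)
    out = []
    i = 0
    while i < len(token_ids):
        tok = token_ids[i]
        if tok != VISION_START:
            out.append(tok)
            i += 1
            continue
        # Locate the end marker that closes this block.
        try:
            end = token_ids.index(VISION_END, i + 1)
        except ValueError:
            raise ValueError("unterminated vision block") from None
        if keep_markers:
            out.append(VISION_START)
        out.extend([IMAGE_PAD] * num_visual_tokens)
        if keep_markers:
            out.append(VISION_END)
        i = end + 1  # resume after the end marker
    return out
```

Because every block expands to the same length, the number of image-pad tokens in the output is always num_visual_tokens times the number of images, which is what lets a fixed-size embedding merger line up visual features with pad positions.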

Files changed

| File | Change |
| --- | --- |
| src/config.cpp | Parse num_visual_tokens from vision config |
| src/config.h | Add num_visual_tokens field on Vision |
| src/models/model.cpp | Register VideoChatFlashProcessor |
| src/models/model.h | Header / license update |
| src/models/model_type.h | Add videochat_flash_qwen to VLM list |
| src/models/videochat_flash_processor.cpp | New (image processor) |
| src/models/videochat_flash_processor.h | New (header) |
| src/python/py/models/builder.py | Detect VCF arch, bypass remote-code video deps, route to new builder |
| src/python/py/models/builders/__init__.py | Export VideoChatFlashQwenModel |
| src/python/py/models/builders/qwen.py | Add VideoChatFlashQwenModel |

Validation

  • Text-only export of the Qwen2.5-7B backbone via python -m onnxruntime_genai.models.builder -m OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B -o <dir> -p fp32 -e cpu --extra_options exclude_embeds=false succeeds without av/cv2/decord/imageio installed.
  • Standalone inference on the exported vcf-oga-fp32-standalone model produces correct answers across 4 batched QA prompts (e.g. Paris, 56, ONNX Runtime description) using generator.append_tokens() on the Qwen2.5-7B backbone.

Test plan

  • Build OGA with the new processor compiled in and confirm IsVLM("videochat_flash_qwen") returns true.
  • Export the text decoder with exclude_embeds=false and run text-only inference end-to-end.
  • Confirm genai_config.json contains "type": "videochat_flash_qwen" (full VLM) or "type": "qwen2" (text-only standalone).
  • Image-input smoke test once vision.onnx and embedding.onnx land.
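
The genai_config.json check in the test plan could be automated along these lines. This is a sketch only: the helper names are hypothetical, and the only facts taken from the PR are the file name, the ["model"]["type"] key, and the two expected values.

```python
import json
from pathlib import Path


def read_model_type(export_dir):
    """Return genai_config.json["model"]["type"] from an export directory."""
    path = Path(export_dir) / "genai_config.json"
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config["model"]["type"]


def check_model_type(export_dir, text_only):
    """Raise ValueError if the exported type does not match the export mode."""
    expected = "qwen2" if text_only else "videochat_flash_qwen"
    actual = read_model_type(export_dir)
    if actual != expected:
        raise ValueError(f"expected type {expected!r}, found {actual!r}")
    return actual
```

A call such as check_model_type("<dir>", text_only=True) would then fail fast if a text-only export was accidentally written with the full VLM type.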

amdrajeevp1 and others added 16 commits March 24, 2026 17:53
Adds text decoder export support for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B.

Architecture: VideoChatFlashQwenForCausalLM is a VLM whose language backbone is
standard Qwen2.5-7B (flat config, standard weight keys, 2D RoPE, rope_theta=1e6,
28L / 28h / 4kv / hidden=3584). The model does NOT use MRoPE, so the builder
inherits QwenModel directly rather than Qwen25VLTextModel.

Changes:
- src/python/py/models/builders/qwen.py: Add VideoChatFlashQwenModel subclass
  of QwenModel. Sets exclude_embeds=True (text decoder receives inputs_embeds
  from the embedding merger) and model_type="videochat_flash_qwen".
- src/python/py/models/builders/__init__.py: Export VideoChatFlashQwenModel.
- src/python/py/models/builder.py: Map "VideoChatFlashQwenForCausalLM"
  architecture string to VideoChatFlashQwenModel with exclude_embeds=True.
- src/models/model_type.h: Register "videochat_flash_qwen" in IsVLM() (size 6->7).
- examples/python/videochat-flash/builder.py: New example export script.
  Phase 1 (this PR): text decoder only. Vision encoder (InternVideo2-1B)
  and embedding merger export are Phase 2 (scaffolded as TODOs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…inference support

- builder.py (example): use local OGA builder via sys.path; fix create_model
  args when no local input dir; move prepare_model() inside Phase-2 block
  (trust_remote_code=False is safe for config-only loads)
- builder.py (core): add VCF config bypass using hf_hub_download + Qwen2Config
  to avoid av/cv2/decord/imageio imports triggered by AutoConfig; set
  config._name_or_path so load_weights() resolves the model correctly; make
  exclude_embeds opt-in (guard with 'not in extra_options') so standalone
  (input_ids) export also works
- qwen.py: add make_genai_config() override that writes a temp Qwen2Config
  and patches genai_config.json type back to videochat_flash_qwen; add
  load_weights() override using Qwen2ForCausalLM.from_pretrained() directly

Validated: text-only inference with vcf-oga-fp32-standalone/ produces
correct answers (Paris, 56, ONNX Runtime description) using append_tokens()
API on the exported Qwen2.5-7B backbone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
examples/python/videochat-flash/run.py:
  - Text-only inference for the exported OGA model
  - Uses HF AutoTokenizer (og.Tokenizer fails for TokenizersBackend models)
  - Feeds tokens via generator.append_tokens(np.int32 array)
  - --batch flag runs 4 built-in QA prompts for quick validation
  - --prompt / --max-length for custom single-prompt runs
  - Notes in docstring on genai_config.json type patch workaround for
    installed OGA binaries that predate videochat_flash_qwen support

Validated: all 4 batch prompts produce correct answers on vcf-oga-fp32-standalone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xt-only mode

Root cause of teammate's inference failure:
1. builder.py set type=videochat_flash_qwen even in --text-only mode. OGA loads
   this via MultiModalLanguageModel which requires vision.onnx + embedding.onnx
   (not present in text-only export) → 'File doesn't exist' error on og.Model().
2. exclude_embeds defaulted to true → decoder expected inputs_embeds, but
   run.py feeds token IDs via append_tokens() → 'input not found' error.

Fix:
- builder.py export_text_model(): pass exclude_embeds=false in text-only mode
  (decoder uses input_ids for standalone inference).
- builder.py update_genai_config(): set type=qwen2 in text-only mode so OGA
  loads the model as a plain decoder (LM backbone is identical to Qwen2.5-7B).
  type=videochat_flash_qwen is reserved for Phase 2 full VLM pipeline.
- run.py: simplify docstring now that no manual patching is needed.

Validated with compiled OGA build (onnxruntime-genai conda env, Python 3.11):
all 4 batch prompts produce correct answers without any manual config changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
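
The fix in the commit above boils down to choosing three settings together based on the export mode. A compact sketch of that decision, with a hypothetical function name and dictionary keys (only the values come from the commit message):

```python
def export_settings(text_only):
    """Return the settings that must agree for a given export mode."""
    if text_only:
        return {
            "genai_type": "qwen2",          # loaded as a plain decoder
            "exclude_embeds": False,        # decoder keeps its embedding layer
            "decoder_input": "input_ids",   # what append_tokens() feeds
            "extra_required_files": [],
        }
    return {
        "genai_type": "videochat_flash_qwen",  # loaded as a multimodal model
        "exclude_embeds": True,                # embeddings come from embedding.onnx
        "decoder_input": "inputs_embeds",
        "extra_required_files": ["vision.onnx", "embedding.onnx"],
    }
```

Both failures described in the root cause are mismatches between rows of this table: a VLM type without the extra files, and an inputs_embeds decoder fed with token IDs.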
Documents Phase 1 (text decoder, done) and Phase 2 (vision + embedding, TODO):
- Architecture overview: InternVideo2-1B ViT → mm_projector → embedding merger → Qwen2.5-7B
- Phase 1 summary: what was implemented, design decisions, usage instructions
- Phase 2 roadmap: vision.onnx export (39-block ViT + MLP projector), embedding.onnx
  (embed_tokens + image-pad replacement), genai_config.json wiring, video preprocessing,
  and optional in-LLM HiCo compression (llm_compress_layer_list)
- Key challenges: 3D spatiotemporal attention ONNX compat, ToMe token merging ops,
  dynamic temporal position embeddings
- File map and reference links

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@VishalX
Contributor

VishalX commented May 8, 2026

Olive Recipe: microsoft/olive-recipes#406

@anilmartha anilmartha marked this pull request as ready for review May 11, 2026 09:51
@anilmartha anilmartha requested a review from a team as a code owner May 11, 2026 09:51
Copilot AI review requested due to automatic review settings May 11, 2026 09:51
@anilmartha
Contributor Author

@microsoft-github-policy-service agree company="AMD"

Copilot AI left a comment

Pull request overview

Adds first-class support for the OpenGVLab VideoChat-Flash Qwen-based video-language model by extending the Python model builder to export its (text) decoder without pulling video dependencies, and extending the C++ runtime to recognize/run the new multimodal model type.

Changes:

  • Python: detect VideoChatFlashQwenForCausalLM early from raw config.json, route to a new VideoChatFlashQwenModel, and avoid trust_remote_code video-library imports.
  • C++: introduce VideoChatFlashProcessor and register videochat_flash_qwen as a VLM model type/processor.
  • Config: add and parse vision.num_visual_tokens used by the new processor for fixed visual-token padding.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/python/py/models/builders/qwen.py | Adds VideoChatFlashQwenModel subclass to export the LM backbone while bypassing remote-code video deps. |
| src/python/py/models/builders/__init__.py | Exports the new builder class from the builders package. |
| src/python/py/models/builder.py | Adds config “peek” detection for VideoChat-Flash and routes to the new builder with default exclude_embeds=True. |
| src/models/videochat_flash_processor.h | Declares VideoChatFlashProcessor (new multimodal processor). |
| src/models/videochat_flash_processor.cpp | Implements preprocessing/token padding + emits image_grid_thw for batching inference. |
| src/models/model.h | Updates header/license comments. |
| src/models/model.cpp | Registers videochat_flash_qwen → VideoChatFlashProcessor in the multimodal processor factory. |
| src/models/model_type.h | Adds videochat_flash_qwen to the VLM model-type list. |
| src/config.h | Adds vision.num_visual_tokens to config schema. |
| src/config.cpp | Parses vision.num_visual_tokens from genai_config.json. |

Comment thread src/python/py/models/builders/__init__.py Outdated
Comment thread src/python/py/models/builder.py Outdated
Comment thread src/config.h Outdated
Comment thread src/models/videochat_flash_processor.cpp
Comment thread src/python/py/models/builders/__init__.py Outdated
anilmartha and others added 2 commits May 18, 2026 11:03
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- builder.py: introduce _has_vcf_architecture() helper that handles
  architectures as a list, a bare string, or absent/unknown type, then
  use it in both the local-dir and HF-hub code paths (Copilot comment #2)

- config.h: correct misleading comment on num_visual_tokens — remove the
  false "0 = compute from image_grid_thw" claim; the field must be > 0
  for videochat_flash_qwen (Copilot comment #3)

- videochat_flash_processor.cpp: add clarifying comment to empty catch
  block: exception is expected when only the language decoder session is
  loaded (Copilot comment #4)

- builders/__init__.py: move VideoChatFlashQwenModel to its correct
  alphabetical position in __all__ (after SmolLM3Model, before
  WhisperModel) per kunal-vaishnavi comment #6

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Comment thread src/python/py/models/builders/qwen.py Fixed
Comment thread src/python/py/models/builders/qwen.py Fixed
Comment thread src/python/py/models/builder.py Fixed
@anilmartha
Contributor Author

Hi @kunal-vaishnavi
Could you please review it again?

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

ort_extensions::OrtxObjectPtr<OrtxTensorResult> result;
CheckResult(OrtxImagePreProcess(processor_.get(), images->images_.get(), result.ToBeAssigned()));

OrtxTensor* pixel_values = nullptr;

This appears to be a memory leak. Can you fix this? There was a recent PR that fixes this issue.

std::to_string(pv_ndims) + " (expected 3 or 4)");
}

// Transpose HWC → CHW and reshape to [1, num_frames, C, H, W]

Let's do this transpose as another step inside ORT Extensions before returning the tensor to ORT GenAI.

}
}

auto converted_pv = ConvertPixelValues(*float_tensor, pixel_values_type_, allocator);

Rather than creating a new method called ConvertPixelValues, can we re-use existing APIs that already convert dtypes? Here is an example.

if (audio_features_type_ == ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT) {
  named_tensors->emplace(std::string(Config::Defaults::AudioFeaturesName),
                         std::make_shared<Tensor>(ProcessTensor<float>(mel.get(), allocator)));
} else {
  named_tensors->emplace(std::string(Config::Defaults::AudioFeaturesName),
                         std::make_shared<Tensor>(ProcessTensor<Ort::Float16_t>(mel.get(), allocator)));
}

grid_ptr[i * 3 + 1] = height;
grid_ptr[i * 3 + 2] = width;
}
named_tensors->emplace("image_grid_thw",

If there are new input names, let's add their default names in the config (e.g. Config::Defaults::ImageGridThw) and reference them here.

Comment thread src/python/py/models/builder.py Outdated
Qwen25VLTextModel,
Qwen35TextModel,
QwenModel,
VideoChatFlashQwenModel,

Let's preserve any existing alphabetical orders when inserting new items.

Comment thread src/python/py/models/builders/qwen.py Outdated
# hf_remote=False so base class helpers (make_genai_config,
# save_processing) use standard, non-remote-code paths.
extra_options = dict(extra_options) if extra_options else {}
extra_options["hf_remote"] = False

To be consistent with other models, let's set any default extra options flags during architecture parsing. For example:

elif config.architectures[0] == "Qwen2_5_VLForConditionalGeneration":
    text_config = config.text_config
    for key in text_config:
        if not hasattr(config, key):
            setattr(config, key, getattr(text_config, key))
    print("WARNING: This is only generating the text component of the model. Setting `--extra_options exclude_embeds=true` by default.")
    extra_options["exclude_embeds"] = True
    onnx_model = Qwen25VLTextModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)

Comment thread src/python/py/models/builders/qwen.py Outdated
Comment thread src/python/py/models/builders/qwen.py Outdated
# establishes these attributes on first assignment instead of us
# overwriting inherited values.
#
# The custom remote code requires video libraries (av, cv2, decord,

It's fine if video libraries need to be temporarily installed. These libraries are on a per-model basis and are based on Hugging Face's requirements for loading a model. For example, Phi-4 mm requires PEFT for loading the model and we don't remove those library dependencies even though they are not technically needed for producing the ONNX model.

Comment thread src/python/py/models/builders/qwen.py Outdated

super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)

def make_genai_config(self, model_name_or_path, extra_kwargs, out_dir):

Same comment as here. We should not need to override this method at all.

Comment thread src/python/py/models/builders/qwen.py Outdated

from transformers import Qwen2Config

vcf_config = Qwen2Config.from_pretrained(model_name_or_path, token=self.hf_token, **extra_kwargs)

If you want to avoid loading the video library imports by using a Qwen2Config for importing, can you create a generic load_config method in the base class (similar to load_weights below)? Then, we can call that method in the various locations that require a config. That method can then be overridden in this class to be Qwen-2 specific.
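
One possible shape for the load_config suggestion, sketched with stand-in classes and stub loaders so it runs without transformers installed. Everything here is hypothetical: the class names, the stub loaders, and the returned dictionaries are placeholders for whatever the real base class and AutoConfig / Qwen2Config calls would do.

```python
def stub_auto_config(name):
    """Stand-in for a generic config load (e.g. via AutoConfig)."""
    return {"loaded_with": "AutoConfig", "name": name}


def stub_qwen2_config(name):
    """Stand-in for a Qwen2-specific config load (e.g. via Qwen2Config)."""
    return {"loaded_with": "Qwen2Config", "name": name}


class SketchBaseModel:
    """Hypothetical base builder: every config consumer calls load_config()."""

    model_type = "base"

    def __init__(self, model_name_or_path):
        self.model_name_or_path = model_name_or_path

    def load_config(self):
        # Generic path shared by all models.
        return stub_auto_config(self.model_name_or_path)

    def make_genai_config(self):
        # No override needed in subclasses: the hook supplies the config.
        config = self.load_config()
        return {"model": {"type": self.model_type}, "config_source": config["loaded_with"]}


class SketchVideoChatFlashModel(SketchBaseModel):
    """Hypothetical subclass: overrides only the config-loading hook."""

    model_type = "videochat_flash_qwen"

    def load_config(self):
        return stub_qwen2_config(self.model_name_or_path)
```

The design point is that model-specific behaviour lives in one overridden hook, so shared code such as make_genai_config and the dispatch section in builder.py can stay generic.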

Comment thread src/python/py/models/builder.py Outdated
hf_remote = extra_options.get("hf_remote", True)

config = AutoConfig.from_pretrained(hf_name, token=hf_token, trust_remote_code=hf_remote, **extra_kwargs)
# VideoChat-Flash uses custom remote code that imports heavy video libraries

All of the below changes should be reverted. They are too specific for this model and this section needs to remain generic. We can try the load_config suggestion from earlier to resolve this.

onnx_model = Phi4MMModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "Qwen2ForCausalLM":
onnx_model = QwenModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "VideoChatFlashQwenForCausalLM":

Can we remove the if condition?

elif config.architectures[0] == "Qwen2_5_VLForConditionalGeneration":
    text_config = config.text_config
    for key in text_config:
        if not hasattr(config, key):
            setattr(config, key, getattr(text_config, key))
    print("WARNING: This is only generating the text component of the model. Setting `--extra_options exclude_embeds=true` by default.")
    extra_options["exclude_embeds"] = True
    onnx_model = Qwen25VLTextModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)

Multi-modal models do not need to check this since the embedding ONNX model will go from input_ids to inputs_embeds.

Anil Kumar Martha and others added 4 commits May 20, 2026 11:48
Address PR #2147 review: keep the config loading section generic by removing the VideoChat-Flash architecture peek/Qwen2Config workaround. Use the standard AutoConfig.from_pretrained path for all models.

Also remove the forced hf_remote=False in the VCF dispatch branch so HF model dependencies (video libraries) are loaded normally, consistent with how other models handle their library requirements.

Co-authored-by: Cursor <cursoragent@cursor.com>
@anilmartha
Contributor Author

Hi @kunal-vaishnavi
I have addressed the review comments. Could you please take another look?

# and genai_config.json. Same pattern as Qwen3VLTextModel.
# Base class transforms this to "videochat_flash_qwen" via:
# model_type[:model_type.find("For")].lower()
self.model_type = "VideoChat_Flash_QwenForCausalLM"
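
To see why the class-style string above yields the registered type name, the transformation quoted in the comment can be run on its own. The helper name derive_model_type is hypothetical, and the guard for a missing "For" is an addition for safety, not something the quoted base-class expression does.

```python
def derive_model_type(class_style_name):
    """Apply model_type[:model_type.find("For")].lower() with a safety guard."""
    cut = class_style_name.find("For")
    if cut == -1:
        # str.find returns -1 when "For" is absent; slicing with [:-1]
        # would silently drop the last character, so skip the slice.
        return class_style_name.lower()
    return class_style_name[:cut].lower()


print(derive_model_type("VideoChat_Flash_QwenForCausalLM"))  # videochat_flash_qwen
```

The underscores in "VideoChat_Flash_Qwen" are what make the result match the "videochat_flash_qwen" string registered in IsVLM() and MultiModalProcessor.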