This commit adds an example for Qwen3-VL, a 2.2B-parameter vision-language model. It includes a README covering prerequisites, export instructions, and usage examples, plus a runtime script for multimodal inference that runs the text path through ExecuTorch and the vision encoder in PyTorch eager mode.
Key features:
- Instructions for exporting the model using optimum-executorch.
- Example usage for running inference with image and text inputs.
- Details on exported methods and quantization configurations.
This addition enhances the functionality of ExecuTorch for multimodal applications.
Companion PR (optimum-executorch): Add Qwen3-VL export support for multimodal text-to-text pipeline
Overview
Adds export and runtime support for Qwen3-VL-2B-Instruct, a 2.2B parameter vision-language model. Export goes through optimum-executorch via the existing multimodal-text-to-text task, producing a single .pte with vision_encoder, text_decoder, and token_embedding methods.

The optimum-executorch changes (in a companion PR) handle three Qwen3-VL-specific concerns during torch.export:
- pre-computing the M-RoPE vision positions, whose computation uses data-dependent ops
- injecting position_ids via a forward hook so the text decoder export doesn't hit get_rope_index
- falling back to AutoModelForImageTextToText when AutoModelForPreTraining doesn't resolve
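For readers unfamiliar with M-RoPE, the sketch below shows the three-axis (temporal, height, width) position layout, following the scheme published for Qwen2-VL and simplified to one image inside a single sequence. The helper name and its arguments are hypothetical, and this is not the code from the companion PR. It only illustrates that the ids depend on where the image sits in the prompt and on the size of its token grid, which is the kind of input-dependent computation that is hard to trace.

```python
from typing import List, Tuple


def mrope_position_ids(
    text_before: int, grid_h: int, grid_w: int, text_after: int
) -> Tuple[List[int], List[int], List[int]]:
    """Build (temporal, height, width) ids for text + one image + text.

    Hypothetical helper for illustration only. grid_h and grid_w are the
    image's token grid (assumed >= 1) after any spatial merging.
    """
    temporal: List[int] = []
    height: List[int] = []
    width: List[int] = []

    # Text tokens use the same index on all three axes.
    for i in range(text_before):
        temporal.append(i)
        height.append(i)
        width.append(i)

    # Image tokens share one temporal index (a single frame) and take their
    # row and column from the grid, all offset by the preceding text length.
    start = text_before
    for row in range(grid_h):
        for col in range(grid_w):
            temporal.append(start)
            height.append(start + row)
            width.append(start + col)

    # Text after the image resumes one past the largest id used so far.
    resume = start + max(grid_h, grid_w)
    for i in range(text_after):
        temporal.append(resume + i)
        height.append(resume + i)
        width.append(resume + i)

    return temporal, height, width


t, h, w = mrope_position_ids(text_before=2, grid_h=2, grid_w=3, text_after=2)
print(t)  # [0, 1, 2, 2, 2, 2, 2, 2, 5, 6]
print(h)  # [0, 1, 2, 2, 2, 3, 3, 3, 5, 6]
print(w)  # [0, 1, 2, 3, 4, 2, 3, 4, 5, 6]
```

Because the grid size and image offset are only known once a concrete input arrives, computing these ids ahead of time and passing them in as plain tensors keeps the exported graph free of that logic.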
This PR adds the ExecuTorch-side example:
- examples/models/qwen3_vl/run_qwen3_vl.py: Python runtime that loads the .pte via ExecuTorchModule.run_method, driving token_embedding and text_decoder through ExecuTorch. The vision encoder runs in PyTorch eager because the portable runtime's aten::convolution.out does not yet support 5D inputs (Conv3d).
- examples/models/qwen3_vl/README.md: Export command, runtime usage, method shapes, quantization config, and architecture notes.

Quantized model is ~1.4 GB (8da4w decoder, 8da4w/8da8w encoder, 8w embeddings).
Decode rate is ~25 tokens/sec on Apple Silicon M-series via XNNPACK.
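The split between the eager vision encoder and the ExecuTorch text path can be pictured as a greedy decode loop. The following is a minimal sketch under stated assumptions, not the contents of run_qwen3_vl.py: `embed` and `decode` are hypothetical stand-ins for calls into the token_embedding and text_decoder methods, plain Python lists stand in for tensors, and the start-position argument, placeholder handling, and greedy sampling are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def greedy_generate(
    embed: Callable[[List[int]], List[Vector]],
    decode: Callable[[List[Vector], int], List[float]],
    prompt_ids: Sequence[int],
    image_embeds: Sequence[Vector],
    image_token_id: int,
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy decoding with image features spliced into the prompt.

    embed(ids) returns one vector per token id (stand-in for token_embedding).
    decode(vectors, start_pos) returns logits for the next token (stand-in
    for text_decoder). Both signatures are assumptions for this sketch.
    """
    # 1. Embed every prompt token, image placeholders included.
    embeds = embed(list(prompt_ids))

    # 2. Overwrite each placeholder with the next vision feature. This
    #    assumes the placeholder count matches len(image_embeds).
    features = iter(image_embeds)
    merged = [
        next(features) if tok == image_token_id else vec
        for tok, vec in zip(prompt_ids, embeds)
    ]

    # 3. Prefill: run the whole merged prompt starting at position 0.
    logits = decode(merged, 0)
    position = len(merged)

    # 4. Decode one token at a time until EOS or the token budget is spent.
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == eos_id:
            break
        generated.append(token)
        logits = decode(embed([token]), position)
        position += 1
    return generated
```

In this picture the vision encoder only has to produce `image_embeds` once per image, so running it in eager mode affects prompt processing time but not the per-token decode rate.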
Run Qwen3-VL-2B
python qwen3/run_qwen3_vl.py \
  --model_path qwen3/Qwen3-VL-2B-Instruct-xnnpack/model.pte \
  --image_path qwen3/test_image.jpg \
  --prompt "What is in this image?" \
  --max_new_tokens 200

Output