feat(mtmd): add Qwen2VL SMT patch preprocessing #9
Merged
Add a Qwen2VL preprocessing path that converts resized RGB images into the flattened patch tensor expected by model.visual ONNX exports. Detect qwen2vl/qwen2_vl architectures in the SMT vision preprocessor and route them through rescale, normalization, and merge-ordered patch flattening instead of CHW image tensors.
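The routing decision described above can be sketched roughly as follows. This is an illustration only: the helper `is_qwen2vl_arch`, the enum `smt_output_layout`, and `select_layout` are hypothetical names, and the exact, case-insensitive string match is an assumption rather than the PR's actual detection logic.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical output layouts: the default CHW image tensor and the
// Qwen2VL flattened patch tensor.
enum class smt_output_layout { CHW_IMAGE, QWEN2VL_PATCH_FLATTEN };

// Hypothetical helper: true when the architecture string names Qwen2VL.
// Assumption: an exact, case-insensitive match on "qwen2vl" or "qwen2_vl".
static bool is_qwen2vl_arch(std::string arch) {
    std::transform(arch.begin(), arch.end(), arch.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return arch == "qwen2vl" || arch == "qwen2_vl";
}

// Pick the preprocessing output layout for a given architecture name.
static smt_output_layout select_layout(const std::string & arch) {
    return is_qwen2vl_arch(arch) ? smt_output_layout::QWEN2VL_PATCH_FLATTEN
                                 : smt_output_layout::CHW_IMAGE;
}
```

Any architecture string other than the two Qwen2VL spellings falls through to the CHW path in this sketch.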
xiaoguorou pushed a commit to xiaoguorou/llama.cpp that referenced this pull request on Jun 22, 2026:
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (spacemit-com#7)
* MTP: clean-up (spacemit-com#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding

  Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback

  Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend.

  - Add K (snapshot slot count) as a function constant
  - Read input state from slot 0 of the 3D state tensor
  - Write intermediate states to different slots during token loop
  - For K=1, maintain backward-compatible single-slot behavior

  Ref: ggml-org@8c05923
  Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (spacemit-com#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (ggml-org#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Pull request overview
Adds a Qwen2VL-specific vision preprocessing path to the SMT (ONNX) vision pipeline so that Qwen2VL model.visual exports can consume the flattened, merged patch tensor layout they expect (instead of a CHW image tensor). This supports split deployments where llama.cpp runs the text GGUF while SMT runs the Qwen2VL vision ONNX.
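As a rough illustration of the rescale and normalization step that precedes patch flattening, the sketch below converts interleaved 8-bit RGB (HWC) into a normalized planar (CHW) float buffer. The struct `norm_params` and the function `rgb_to_normalized_chw` are hypothetical names, and the formula `(pixel * rescale - mean) / std` is the conventional image-processor form, assumed here rather than taken from the PR's code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameter bundle; in practice these values would come from
// the resolved preprocessing config.
struct norm_params {
    float                rescale;  // multiplier applied to raw 0..255 values
    std::array<float, 3> mean;     // per-channel mean (R, G, B)
    std::array<float, 3> stddev;   // per-channel standard deviation (R, G, B)
};

// Convert interleaved RGB bytes (HWC) to normalized planar floats (CHW).
static std::vector<float> rgb_to_normalized_chw(const std::vector<uint8_t> & rgb,
                                                size_t width, size_t height,
                                                const norm_params & p) {
    std::vector<float> chw(3 * width * height);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            for (size_t c = 0; c < 3; ++c) {
                const float v = rgb[(y * width + x) * 3 + c] * p.rescale;
                chw[(c * height + y) * width + x] = (v - p.mean[c]) / p.stddev[c];
            }
        }
    }
    return chw;
}
```

The resulting CHW buffer is only an intermediate in the Qwen2VL path; the patch-flatten step reorders it into the layout the vision encoder consumes.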
Changes:
- Extend SMT vision preprocessing spec resolution to detect `qwen2vl`/`qwen2_vl` architectures and enable a new patch-flatten mode.
- Implement RGB->normalized CHW conversion + Qwen2VL patch flattening in Qwen2VL merge order, producing `[grid_h * grid_w, 3 * temporal_patch_size * 14 * 14]`-style packed float data.
- Route preprocessing output selection through the new `qwen2vl_patch_flatten` flag and reuse a single resolved `preprocess_config`.
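The merge-ordered flattening in the second item can be sketched as below. This is not the PR's code: `qwen2vl_flatten_patches` is a hypothetical name, and the ordering is an assumption modeled on the Hugging Face Qwen2VL image processor, where rows are grouped by merge block (block row, block column, then position inside the block) and each row is laid out as channel, temporal index, patch row, patch column. A still image is assumed to be repeated `temporal` times along the temporal axis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flatten a normalized CHW image into [grid_h * grid_w, 3 * temporal * patch * patch].
// Defaults (patch = 14, merge = 2, temporal = 2) are assumed Qwen2VL values.
static std::vector<float> qwen2vl_flatten_patches(const std::vector<float> & chw,
                                                  size_t width, size_t height,
                                                  size_t patch = 14, size_t merge = 2,
                                                  size_t temporal = 2) {
    assert(chw.size() == 3 * width * height);
    assert(width % (patch * merge) == 0 && height % (patch * merge) == 0);
    const size_t grid_h = height / patch;
    const size_t grid_w = width / patch;

    std::vector<float> out;
    out.reserve(grid_h * grid_w * 3 * temporal * patch * patch);

    for (size_t by = 0; by < grid_h / merge; ++by) {          // merge-block row
        for (size_t bx = 0; bx < grid_w / merge; ++bx) {      // merge-block column
            for (size_t my = 0; my < merge; ++my) {           // patch row inside block
                for (size_t mx = 0; mx < merge; ++mx) {       // patch column inside block
                    const size_t gy = by * merge + my;
                    const size_t gx = bx * merge + mx;
                    for (size_t c = 0; c < 3; ++c) {
                        for (size_t t = 0; t < temporal; ++t) {   // same frame repeated
                            for (size_t py = 0; py < patch; ++py) {
                                for (size_t px = 0; px < patch; ++px) {
                                    out.push_back(chw[(c * height + gy * patch + py) * width
                                                      + gx * patch + px]);
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return out;
}
```

Emitting the four patches of each merge block consecutively keeps them adjacent in the output, which is what a 2x2 spatial merge downstream would rely on under this assumed ordering.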
Merged commit 2415c9a into spacemit-com:spacemit-mtmd. 10 of 11 checks passed.
feat(mtmd): add Qwen2VL SMT patch preprocessing
Overview
This PR adds a Qwen2VL-specific SMT vision preprocessing path for ONNX vision encoders exported from `model.visual`.

Qwen2VL `model.visual` does not consume a normal CHW image tensor. It expects the processor output tensor shaped as:

The SMT preprocessor now detects `qwen2vl`/`qwen2_vl` architectures and converts resized RGB input into the flattened patch tensor expected by the exported ONNX model. The path applies the configured rescale and normalization values, then lays out patches in Qwen2VL merge order instead of returning a CHW tensor.

This enables the MinerU2.5 split deployment path:
Additional information
Companion deployment/export project:
Model artifacts have been uploaded externally and are not included in this PR:
Validation context:
- `MinerU2.5-Pro-2605-1.2B`
- `mineru2.5-text-Q4_1.gguf`
- `mineru2.5-vision-224.f16.onnx`
- `mineru2.5-vision-504.f16.onnx`
- `mineru2.5-vision-shared.f16.data`

The 1008 static vision export was attempted but could not complete on the available local machine configuration. The 504 static graph was exported and validated on the target device.
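For a sense of scale, the patch tensor dimensions for each static export can be worked out with a small helper. This sketch assumes that 224, 504, and 1008 denote square input side lengths in pixels, and that the patch size is 14 with a temporal patch size of 2; `qwen2vl_patch_shape` and `qwen2vl_shape_for` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape summary for a square input image.
struct qwen2vl_patch_shape {
    size_t grid_h;  // patches along the height
    size_t grid_w;  // patches along the width
    size_t rows;    // grid_h * grid_w
    size_t cols;    // 3 * temporal * patch * patch
};

// Compute the flattened patch tensor shape for a square image of the given side.
static qwen2vl_patch_shape qwen2vl_shape_for(size_t side, size_t patch = 14,
                                             size_t temporal = 2) {
    assert(side % patch == 0);
    const size_t g = side / patch;
    return { g, g, g * g, 3 * temporal * patch * patch };
}
```

Under these assumptions a 224 input gives a 16 x 16 grid (256 rows), a 504 input gives 36 x 36 (1296 rows), and a 1008 input gives 72 x 72 (5184 rows), each with 1176 columns. The row count grows quadratically with the side length, so the 1008 graph handles roughly four times as many patches as the 504 graph.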
Requirements
- `git diff --check`
- `cmake --build build --target llama-server -j 8`
- `tools/mtmd/smt-vision-preprocess.cpp` compiles as part of `llama-server`
- `/health` returns `{"status":"ok"}`
- `input.pdf` inference through the split GGUF + ONNX deployment on the target device