
feat(mtmd): add Qwen2VL SMT patch preprocessing #9

Merged
alex-spacemit merged 1 commit into spacemit-com:spacemit-mtmd from LFF28:mineru2.5 on Jun 22, 2026

LFF28 commented Jun 18, 2026

feat(mtmd): add Qwen2VL SMT patch preprocessing

Overview

This PR adds a Qwen2VL-specific SMT vision preprocessing path for ONNX vision encoders exported from model.visual.

The Qwen2VL model.visual export does not take a conventional CHW image tensor. It expects the processor's output tensor, with shape:

[grid_h * grid_w, 3 * temporal_patch_size * 14 * 14]
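
To make the shape concrete, here is a minimal sketch of the arithmetic for a static input size. The helper name `qwen2vl_shape` and the struct are hypothetical and not taken from this PR; `temporal_patch_size = 2` is an assumption based on the usual Qwen2VL processor default and should be checked against the exported model's config.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only; not part of the PR.
struct qwen2vl_patch_shape {
    size_t grid_h;  // number of patch rows
    size_t grid_w;  // number of patch columns
    size_t rows;    // grid_h * grid_w
    size_t cols;    // 3 * temporal_patch_size * patch * patch
};

// Computes the flattened patch tensor shape for a resized image.
// Assumption: temporal_patch_size defaults to 2; patch size is 14 as in the PR text.
inline qwen2vl_patch_shape qwen2vl_shape(size_t height, size_t width,
                                         size_t temporal_patch_size = 2,
                                         size_t patch = 14) {
    assert(patch > 0 && height % patch == 0 && width % patch == 0);
    qwen2vl_patch_shape s;
    s.grid_h = height / patch;
    s.grid_w = width / patch;
    s.rows   = s.grid_h * s.grid_w;
    s.cols   = 3 * temporal_patch_size * patch * patch;
    return s;
}
```

Under these assumptions, a 504 x 504 input gives a 36 x 36 grid, so the tensor is [1296, 1176]; a 224 x 224 input gives [256, 1176].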

The SMT preprocessor now detects qwen2vl / qwen2_vl architectures and converts resized RGB input into the flattened patch tensor expected by the exported ONNX model. The path applies the configured rescale and normalization values, then lays out patches in Qwen2VL merge order instead of returning a CHW tensor.
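
The conversion described above can be sketched roughly as follows. This is not the code from tools/mtmd/smt-vision-preprocess.cpp: the names `qwen2vl_patch_params` and `qwen2vl_flatten_patches` are hypothetical, the merge size of 2 and temporal patch size of 2 are assumed Qwen2VL defaults, the mean/std values are placeholders for the configured ones, and a still image is assumed to be repeated across the temporal axis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters; real values come from the model's preprocess config.
struct qwen2vl_patch_params {
    size_t patch    = 14;             // patch edge, as stated in the PR
    size_t merge    = 2;              // assumed spatial merge size
    size_t temporal = 2;              // assumed temporal_patch_size
    float  rescale  = 1.0f / 255.0f;  // placeholder rescale factor
    float  mean[3]  = {0.5f, 0.5f, 0.5f};  // placeholder normalization mean
    float  stdv[3]  = {0.5f, 0.5f, 0.5f};  // placeholder normalization std
};

// rgb: interleaved HWC uint8 image, already resized.
// Returns [grid_h * grid_w, 3 * temporal * patch * patch] floats, row-major.
std::vector<float> qwen2vl_flatten_patches(const std::vector<uint8_t> & rgb,
                                           size_t height, size_t width,
                                           const qwen2vl_patch_params & p) {
    assert(rgb.size() == height * width * 3);
    assert(height % (p.patch * p.merge) == 0 && width % (p.patch * p.merge) == 0);
    const size_t grid_h = height / p.patch;
    const size_t grid_w = width / p.patch;
    const size_t cols   = 3 * p.temporal * p.patch * p.patch;
    std::vector<float> out(grid_h * grid_w * cols);

    size_t row = 0;
    // Merge order: iterate merge blocks first, then the patches inside each block.
    for (size_t bh = 0; bh < grid_h / p.merge; ++bh)
    for (size_t bw = 0; bw < grid_w / p.merge; ++bw)
    for (size_t mh = 0; mh < p.merge; ++mh)
    for (size_t mw = 0; mw < p.merge; ++mw, ++row) {
        const size_t ph = bh * p.merge + mh;  // patch row in the grid
        const size_t pw = bw * p.merge + mw;  // patch column in the grid
        float * dst = out.data() + row * cols;
        // Within a patch: channel, then temporal copy, then y, then x.
        for (size_t c = 0; c < 3; ++c)
        for (size_t t = 0; t < p.temporal; ++t)
        for (size_t y = 0; y < p.patch; ++y)
        for (size_t x = 0; x < p.patch; ++x) {
            const size_t  src = ((ph * p.patch + y) * width + (pw * p.patch + x)) * 3 + c;
            const float   v   = static_cast<float>(rgb[src]) * p.rescale;
            *dst++ = (v - p.mean[c]) / p.stdv[c];
        }
    }
    return out;
}
```

In this sketch the four patches of each 2 x 2 merge block occupy consecutive rows, so for a 4 x 4 patch grid the row order of raster patch indices starts 0, 1, 4, 5, 2, 3, 6, 7.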

This enables the MinerU2.5 split deployment path:

text side:   GGUF model loaded by llama.cpp
vision side: Qwen2VL visual ONNX loaded through the SMT vision backend

Additional information

Companion deployment/export project:

https://gitlab.dc.com:8443/liangjunzhao/mineru2.5-split

Model artifacts have been uploaded externally and are not included in this PR:

https://archive.spacemit.com/spacemit-ai/model_zoo/vlm/

Validation context:

  • Model: MinerU2.5-Pro-2605-1.2B
  • Text model: mineru2.5-text-Q4_1.gguf
  • Vision models:
    • mineru2.5-vision-224.f16.onnx
    • mineru2.5-vision-504.f16.onnx
    • shared external data: mineru2.5-vision-shared.f16.data
  • Current deployment config uses the 504 static vision graph.
  • 224 and 504 ONNX graphs share the same external initializer data; the graph files differ by static input/output shapes.

The 1008 static vision export was attempted but could not complete on the available local machine configuration. The 504 static graph was exported and validated on the target device.

Requirements

  • git diff --check
  • cmake --build build --target llama-server -j 8
  • Verified that tools/mtmd/smt-vision-preprocess.cpp compiles as part of llama-server
  • Verified the MinerU2.5 504 deployment starts with SMT media backend and /health returns {"status":"ok"}
  • Verified input.pdf inference through the split GGUF + ONNX deployment on the target device
  • Model binaries and exported ONNX/GGUF/data files are archived externally, not committed to source

Add a Qwen2VL preprocessing path that converts resized RGB images into the flattened patch tensor expected by model.visual ONNX exports.

Detect qwen2vl/qwen2_vl architectures in the SMT vision preprocessor and route them through rescale, normalization, and merge-ordered patch flattening instead of CHW image tensors.

xiaoguorou pushed a commit to xiaoguorou/llama.cpp that referenced this pull request Jun 22, 2026
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (spacemit-com#7)

* MTP: clean-up (spacemit-com#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently, speculative decoding must restart from a checkpoint
after some draft tokens are rejected, which wastes work by
running the target again. This PR adds the ability to roll back up to
`draft_max` tokens by storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (spacemit-com#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (ggml-org#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
alex-spacemit requested a review from Copilot on June 22, 2026 06:40

Copilot AI left a comment


Pull request overview

Adds a Qwen2VL-specific vision preprocessing path to the SMT (ONNX) vision pipeline so that Qwen2VL model.visual exports can consume the flattened, merged patch tensor layout they expect (instead of a CHW image tensor). This supports split deployments where llama.cpp runs the text GGUF while SMT runs the Qwen2VL vision ONNX.

Changes:

  • Extend SMT vision preprocessing spec resolution to detect qwen2vl / qwen2_vl architectures and enable a new patch-flatten mode.
  • Implement RGB->normalized CHW conversion + Qwen2VL patch flattening in Qwen2VL merge order, producing [grid_h * grid_w, 3 * temporal_patch_size * 14 * 14]-style packed float data.
  • Route preprocessing output selection through the new qwen2vl_patch_flatten flag and reuse a single resolved preprocess_config.
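
As an illustration of the architecture check in the first bullet, a minimal sketch could look like the following. The function name `is_qwen2vl_arch` is hypothetical, and the case-insensitive exact match is an assumption; the PR only states that qwen2vl and qwen2_vl are detected, not how the comparison is done.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not taken from the PR.
// Assumption: the architecture string is compared case-insensitively
// against the two spellings named in the PR description.
inline bool is_qwen2vl_arch(std::string arch) {
    std::transform(arch.begin(), arch.end(), arch.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    return arch == "qwen2vl" || arch == "qwen2_vl";
}
```

A preprocessor could use such a predicate to set a flag like qwen2vl_patch_flatten once during spec resolution and then branch on the flag when choosing between CHW output and the flattened patch tensor.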


alex-spacemit merged commit 2415c9a into spacemit-com:spacemit-mtmd on Jun 22, 2026
10 of 11 checks passed
3 participants