[Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by xiaofeihan1 · Pull Request #2137 · microsoft/onnxruntime-genai

xiaofeihan1 · 2026-05-07T02:26:27Z

Previously use_packed_matmul was disabled whenever q_norm or k_norm was set, so Qwen3 (and any other QK-Norm architecture) emitted three separate q_proj / k_proj / v_proj MatMulNBits nodes per layer.

This PR allows packed QKV MatMul for QK-Norm models by inserting a single ONNX Split after the packed projection so the existing per-head Q/K SimplifiedLayerNormalization path is unchanged. Math is equivalent; quantization is unchanged.
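To make the "existing per-head Q/K norm path is unchanged" point concrete, here is a minimal pure-Python sketch of per-head RMS normalization applied to Q and K after they are recovered from a packed row. The helper names, the toy sizes, and the assumption that one `head_dim`-sized weight is shared across heads are illustrative and not taken from the builder code.

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    """RMS-normalize one head: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(vec, weight)]

def per_head_norm(flat, num_heads, head_dim, weight, eps=1e-6):
    """Apply rms_norm independently to each head_dim-sized chunk of a flat Q or K row."""
    assert len(flat) == num_heads * head_dim
    out = []
    for h in range(num_heads):
        head = flat[h * head_dim:(h + 1) * head_dim]
        out.extend(rms_norm(head, weight, eps))
    return out

# Toy sizes (hypothetical): 2 Q heads, 1 KV head, head_dim = 2.
packed_row = [3.0, 4.0, 0.0, 2.0, 1.0, 1.0, 5.0, 6.0]  # laid out as [Q | K | V]
q, k, v = packed_row[:4], packed_row[4:6], packed_row[6:]
weight = [1.0, 1.0]  # assumed: one weight vector of length head_dim, shared by all heads

q_normed = per_head_norm(q, num_heads=2, head_dim=2, weight=weight)
k_normed = per_head_norm(k, num_heads=1, head_dim=2, weight=weight)
# V is not normalized; it passes through unchanged.
```

Because Q and K are normalized separately and V is not normalized at all, the packed output has to be separated into three tensors before the norm step, which is the role the Split plays.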

Graph change (per attention layer)

Before:

root -> q_proj/MatMul -> Q
root -> k_proj/MatMul -> K
root -> v_proj/MatMul -> V

After (Split placed after the optional packed bias-Add so packed bias fusion is preserved):

root -> qkv_proj/MatMul -> [qkv_proj/Add] -> Split -> Q, K, V
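The equivalence of the two graphs can be checked on toy data. The sketch below uses plain Python lists with small integer weights so the comparison is exact; it does not model MatMulNBits block quantization, and all helper names and shapes are illustrative assumptions.

```python
def matmul(x, w):
    """Reference matmul on nested lists: x is [m][k], w is [k][n]."""
    return [[sum(row[t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
            for row in x]

def add_bias(y, b):
    """Add a bias vector to every row."""
    return [[val + b[j] for j, val in enumerate(row)] for row in y]

def concat_cols(*mats):
    """Concatenate matrices along the last axis."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]

def split_cols(y, sizes):
    """Split along the last axis into chunks of the given sizes."""
    outs, start = [], 0
    for size in sizes:
        outs.append([row[start:start + size] for row in y])
        start += size
    return outs

x = [[1, 2], [3, 4]]                   # [tokens, hidden]
wq = [[1, 0, 2, 1], [0, 1, 1, 3]]      # hidden -> 4 (Q)
wk = [[2, 1], [1, 0]]                  # hidden -> 2 (K)
wv = [[0, 3], [1, 1]]                  # hidden -> 2 (V)
bq, bk, bv = [1, 1, 1, 1], [2, 2], [3, 3]

# Before: three independent projections, each with its own bias.
q_ref = add_bias(matmul(x, wq), bq)
k_ref = add_bias(matmul(x, wk), bk)
v_ref = add_bias(matmul(x, wv), bv)

# After: one packed projection, one packed bias Add, then Split.
packed = add_bias(matmul(x, concat_cols(wq, wk, wv)), bq + bk + bv)
q, k, v = split_cols(packed, [4, 2, 2])

assert (q, k, v) == (q_ref, k_ref, v_ref)
```

Placing the split after the bias Add works because the bias is added column-wise, so adding the concatenated bias to the packed output is the same as adding each bias to its own slice.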

Net per layer: −2 MatMulNBits, +1 Split. For Qwen3-4B (36 layers): −72 MatMulNBits, +36 Split, total ONNX nodes 700 → 665.
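The per-op counts follow directly from the layer count. This small sketch only tallies the two op types named above (the function name is hypothetical), so it is not meant to reproduce the total-node figure for the whole graph.

```python
def node_delta(num_layers, matmuls_before=3, matmuls_after=1, splits_added=1):
    """Change in node counts for the attention projections across all layers."""
    matmul_delta = (matmuls_after - matmuls_before) * num_layers
    split_delta = splits_added * num_layers
    return {
        "MatMulNBits": matmul_delta,
        "Split": split_delta,
        "net": matmul_delta + split_delta,
    }

delta = node_delta(num_layers=36)
print(delta)  # {'MatMulNBits': -72, 'Split': 36, 'net': -36}
```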

Why Split (not Slice)

A single Split reads the packed tensor once and writes all three outputs in one dispatch. Three Slice nodes would each re-read the same packed Q/K/V tensor on every decode step.
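A rough sketch of what the two options look like for a grouped-query model. The head counts below are assumed Qwen3-4B-like values (check the model config before relying on them), and the node descriptions are plain dicts for illustration, not objects from the onnx API or the builder.

```python
def qkv_split_sizes(num_heads, num_kv_heads, head_dim):
    """Widths of the Q, K and V chunks in the packed projection output."""
    q_size = num_heads * head_dim
    kv_size = num_kv_heads * head_dim
    return [q_size, kv_size, kv_size]

def slice_ranges(sizes):
    """(start, end) offsets along the last axis for each chunk."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# Assumed values: 32 query heads, 8 KV heads, head_dim 128.
sizes = qkv_split_sizes(num_heads=32, num_kv_heads=8, head_dim=128)
ranges = slice_ranges(sizes)

# Option taken in this PR: one node, one input read, three outputs.
split_node = {"op_type": "Split", "input": "qkv", "split": sizes,
              "axis": -1, "outputs": ["q", "k", "v"]}

# Alternative: three nodes, each reading the same packed input.
slice_nodes = [
    {"op_type": "Slice", "input": "qkv", "start": s, "end": e,
     "axis": -1, "outputs": [name]}
    for name, (s, e) in zip(["q", "k", "v"], ranges)
]

print(sizes)   # [4096, 1024, 1024]
print(ranges)  # [(0, 4096), (4096, 5120), (5120, 6144)]
```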

Verification — Qwen3-4B int4 (RelWithDebInfo build, accuracy_level=4)

Compared two builds of the same source model:

base = pre-PR builder (reset to 664a61b1, QK-Norm forces 3 independent MatMuls)
curr = this PR

Output token sequences are byte-identical on both GPUs (no accuracy drift).

WebGPU GPU profile (aggregate: 1 prefill + 50 decode × 2 phases)

| OpType | base (ms) | curr (ms) | Δ (ms) |
| --- | --- | --- | --- |
| MatMulNBits | 932.6 | 876.9 | −55.7 |
| Split | 0 | 10.8 | +10.8 |
| GroupQueryAttention | 440.2 | 442.5 | +2.3 |
| TOTAL GPU | 1509.0 | 1466.4 | −42.6 (−2.8%) |

Decode path (M=1): base (3× independent GEMV) = 96.8 ms → curr (packed GEMV + Split) = 64.5 ms; decode net −32.3 ms.
Prefill (M=1000): base (3× DP4A) = 71.3 ms → curr (packed DP4A) = 62.6 ms; prefill net −8.7 ms.
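The deltas in the profile can be recomputed from the reported base/curr figures. The sketch below only re-derives numbers already listed; the variable names are illustrative.

```python
# (base ms, curr ms) per op type, copied from the profile.
profile_ms = {
    "MatMulNBits": (932.6, 876.9),
    "Split": (0.0, 10.8),
    "GroupQueryAttention": (440.2, 442.5),
    "TOTAL GPU": (1509.0, 1466.4),
}

deltas = {op: round(curr - base, 1) for op, (base, curr) in profile_ms.items()}

total_base, total_curr = profile_ms["TOTAL GPU"]
total_pct = round(100.0 * (total_curr - total_base) / total_base, 1)

# The three listed op types account for the whole change in total GPU time.
listed_sum = round(
    deltas["MatMulNBits"] + deltas["Split"] + deltas["GroupQueryAttention"], 1
)

decode_delta = round(64.5 - 96.8, 1)    # packed GEMV + Split vs 3x GEMV
prefill_delta = round(62.6 - 71.3, 1)   # packed DP4A vs 3x DP4A

print(deltas, total_pct, listed_sum, decode_delta, prefill_delta)
```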

End-to-end perf — Qwen3-4B prefill-1000, 5 runs

| GPU | metric | base | curr | Δ |
| --- | --- | --- | --- | --- |
| NVIDIA RTX 5080 (steady P0) | gen tps | 105.4 | 108.5 | +2.9% |
| Intel(R) Graphics iGPU | gen tps | 12.44 | 13.06 | +5.0% |

(The NVIDIA gen-TPS gain is consistent with the −2.8% GPU-time reduction in the WebGPU profile; the iGPU gains more in relative terms, +5.0%. Prompt TPS and TTFT are unchanged on both vendors.)
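As a quick check on how the throughput numbers relate to the GPU-time figure, the sketch below recomputes the relative gains and the throughput gain that a 2.8% time reduction would imply. The last step assumes generation is entirely GPU-bound, which is a simplification and not something the PR states.

```python
def pct_gain(base, curr):
    """Relative change from base to curr, in percent."""
    return 100.0 * (curr - base) / base

nvidia_gain = round(pct_gain(105.4, 108.5), 1)
igpu_gain = round(pct_gain(12.44, 13.06), 1)

# Assumption: tokens/s is inversely proportional to GPU time per token.
gpu_time_reduction = 0.028
implied_tps_gain = round(100.0 * (1.0 / (1.0 - gpu_time_reduction) - 1.0), 1)

print(nvidia_gain, igpu_gain, implied_tps_gain)  # 2.9 5.0 2.9
```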

Compatibility

Models without QK-Norm: behavior unchanged (the new Split block is gated on q_norm && k_norm).
LoRA / packed-Attention path (use_matmul_in_attn): unchanged.
Packed bias fusion still applies for QK-Norm models when bias exists, because Split is placed after the packed Add.
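The compatibility rules above can be summarized as a small decision function. This is a sketch of the gating as described in this PR description, not the builder's actual code; the `AttnOptions` class, its field names, and the returned node labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AttnOptions:
    # Hypothetical flags mirroring the conditions named in the description.
    use_packed_matmul: bool
    has_q_norm: bool
    has_k_norm: bool
    has_bias: bool

def plan_qkv_nodes(opts):
    """Return the projection-side node sequence for one attention layer."""
    if not opts.use_packed_matmul:
        return ["q_proj/MatMul", "k_proj/MatMul", "v_proj/MatMul"]
    nodes = ["qkv_proj/MatMul"]
    if opts.has_bias:
        nodes.append("qkv_proj/Add")   # packed bias Add stays fused
    if opts.has_q_norm and opts.has_k_norm:
        nodes.append("Split")          # only QK-Norm models get the Split
    return nodes

qk_norm_model = plan_qkv_nodes(AttnOptions(True, True, True, has_bias=False))
plain_model = plan_qkv_nodes(AttnOptions(True, False, False, has_bias=True))
print(qk_norm_model)  # ['qkv_proj/MatMul', 'Split']
print(plain_model)    # ['qkv_proj/MatMul', 'qkv_proj/Add']
```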

Previously `use_packed_matmul` was disabled whenever q_norm or k_norm was set, so Qwen3 (and any other QK-Norm architecture) emitted three separate q_proj / k_proj / v_proj MatMulNBits nodes per layer.

Allow packed MatMul in this case and insert a single ONNX Split node right after the packed `qkv_proj/MatMul` to recover Q/K/V tensors that feed the existing q_norm/k_norm path. This keeps the per-head SimplifiedLayerNormalization semantics unchanged (math equivalent, quantization unchanged) while reducing 3 MatMulNBits per layer to 1.

A single `Split` is preferred over 3 `Slice` nodes because Split reads the packed output once and writes 3 outputs in a single dispatch, avoiding re-reading the same packed tensor 3x each decode step.

The packed-bias branch is also gated off when QK-Norm forces unpack, to avoid mismatched shape on a packed Add over a sliced Q tensor.

Verified on Qwen3-1.7B (int4, accuracy_level=4): generated text is byte-identical to the unpacked baseline, and on RTX 5080 (WebGPU EP) gen TPS improves +5.5% (121.6 -> 128.3) with no prefill regression.

Copilot

Pull request overview

This PR updates the Python model builder’s attention graph construction to allow packed QKV projection even for QK-Norm architectures (e.g., Qwen3), by inserting an ONNX Split immediately after the packed qkv_proj MatMul so downstream Q/K/V-specific paths (including per-head Q/K norm) remain unchanged.

Changes:

Stop disabling use_packed_matmul purely due to q_norm/k_norm being enabled.
Add a make_split(...) helper and use it to split packed QKV output into Q, K, V tensors when QK-Norm is active.
Disable the packed-bias Add fusion when QK-Norm is active (since the graph now operates on split Q/K/V tensors).

Copilot AI review requested due to automatic review settings May 7, 2026 02:26

xiaofeihan1 requested a review from a team as a code owner May 7, 2026 02:26

Copilot started reviewing on behalf of xiaofeihan1 May 7, 2026 02:27

Copilot AI reviewed May 7, 2026

qjia7 mentioned this pull request May 9, 2026: [WebGPU] QKV and MLP layer fusions for Qwen3-style models microsoft/onnxruntime#28280 (Open)

kunal-vaishnavi reviewed May 15, 2026 (3 reviews, each with one comment thread on src/python/py/models/builders/base.py, all marked Outdated)

kunal-vaishnavi added the 0.14.0 label May 18, 2026

resolve comments (d35c5ef)

xiaofeihan1 wants to merge 2 commits into main from xfh/qwen3-packed-qkv-with-qknorm

3 participants
