[Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split#2137

Open: xiaofeihan1 wants to merge 2 commits into main from xfh/qwen3-packed-qkv-with-qknorm

@xiaofeihan1 xiaofeihan1 commented May 7, 2026

Previously use_packed_matmul was disabled whenever q_norm or k_norm was set, so Qwen3 (and any other QK-Norm architecture) emitted three separate q_proj / k_proj / v_proj MatMulNBits nodes per layer.

This PR allows packed QKV MatMul for QK-Norm models by inserting a single ONNX Split after the packed projection so the existing per-head Q/K SimplifiedLayerNormalization path is unchanged. Math is equivalent; quantization is unchanged.

Graph change (per attention layer)

Before:

root -> q_proj/MatMul -> Q
root -> k_proj/MatMul -> K
root -> v_proj/MatMul -> V

After (Split placed after the optional packed bias-Add so packed bias fusion is preserved):

root -> qkv_proj/MatMul -> [qkv_proj/Add] -> Split -> Q, K, V
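
The equivalence claimed above can be checked with a small stand-alone sketch. It is plain Python rather than the builder's actual code: the helper names (`matmul`, `split`), the toy sizes, and the integer weights are all hypothetical, and the unequal Q versus K/V widths are only an assumption chosen to mimic a grouped-query layout. It shows that one packed projection, followed by the optional packed bias Add and a single three-way split, reproduces the three independent projections.

```python
# Illustrative sketch only: a plain-Python stand-in for packed MatMul -> Add -> Split.
# All names, sizes, and weights are made-up toy values, not Qwen3's real dimensions.

def matmul(x, w):
    """Multiply a row vector x (length in_dim) by a matrix w (in_dim x out_dim)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split(y, sizes):
    """Mimic a Split along the last axis: cut y into consecutive chunks."""
    assert sum(sizes) == len(y)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(y[start:start + size])
        start += size
    return chunks

# Hypothetical sizes: Q is wider than K and V (assumed grouped-query layout).
head_dim, num_heads, num_kv_heads = 2, 2, 1
q_size, kv_size = num_heads * head_dim, num_kv_heads * head_dim
in_dim = 3

# Toy integer weights and biases so the comparison is exact.
w_q = [[i + j for j in range(q_size)] for i in range(in_dim)]
w_k = [[2 * i - j for j in range(kv_size)] for i in range(in_dim)]
w_v = [[i * j + 1 for j in range(kv_size)] for i in range(in_dim)]
b_q, b_k, b_v = [1, 0, -1, 2], [3, -3], [0, 5]

# Packed parameters: concatenate along the output dimension.
w_qkv = [w_q[i] + w_k[i] + w_v[i] for i in range(in_dim)]
b_qkv = b_q + b_k + b_v

x = [1, -2, 3]

# Before: three independent projections, each with its own bias.
q_ref = [a + b for a, b in zip(matmul(x, w_q), b_q)]
k_ref = [a + b for a, b in zip(matmul(x, w_k), b_k)]
v_ref = [a + b for a, b in zip(matmul(x, w_v), b_v)]

# After: one packed projection, one packed bias Add, then a single Split.
packed = [a + b for a, b in zip(matmul(x, w_qkv), b_qkv)]
q, k, v = split(packed, [q_size, kv_size, kv_size])

assert (q, k, v) == (q_ref, k_ref, v_ref)
```

Because the split happens after the bias Add, the Q and K chunks that reach the per-head normalization are the same values the separate projections would have produced.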

Net per layer: −2 MatMulNBits, +1 Split. For Qwen3-4B (36 layers): −72 MatMulNBits, +36 Split, total ONNX nodes 700 → 665.
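
As a quick check of the per-layer bookkeeping, the sketch below multiplies the stated per-layer deltas by the layer count. The names `PER_LAYER_DELTA` and `node_delta` are hypothetical; the sketch counts only the two op types named above, so any other differences in the exported graph are outside its scope.

```python
# Per-layer change described above: three projections become one, plus one Split.
PER_LAYER_DELTA = {"MatMulNBits": 1 - 3, "Split": +1}

def node_delta(num_layers):
    """Return (per-op-type delta, net delta) for num_layers attention layers."""
    per_op = {op: delta * num_layers for op, delta in PER_LAYER_DELTA.items()}
    return per_op, sum(per_op.values())

per_op, net = node_delta(36)  # 36 layers, as stated for Qwen3-4B
print(per_op, net)
```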

Why Split (not Slice)

A single Split reads the packed tensor once and writes all three outputs in one dispatch. Three Slice nodes would instead re-read the same packed Q/K/V tensor three times per decode step.

Verification — Qwen3-4B int4 (RelWithDebInfo build, accuracy_level=4)

Compared two builds of the same source model:

  • base = pre-PR builder (reset to 664a61b1, QK-Norm forces 3 independent MatMuls)
  • curr = this PR

Output token sequences are byte-identical between base and curr on both GPUs tested (NVIDIA RTX 5080 and Intel iGPU), i.e. no accuracy drift.

WebGPU GPU profile (aggregate: 1 prefill + 50 decode × 2 phases)

| OpType | base (ms) | curr (ms) | Δ (ms) |
|---|---|---|---|
| MatMulNBits | 932.6 | 876.9 | −55.7 |
| Split | 0 | 10.8 | +10.8 |
| GroupQueryAttention | 440.2 | 442.5 | +2.3 |
| TOTAL GPU | 1509.0 | 1466.4 | −42.6 (−2.8%) |

  • Decode path (M=1): base (3× independent GEMV) = 96.8 ms → curr (packed GEMV + Split) = 64.5 ms → decode net ≈ −32 ms.
  • Prefill (M=1000): base (3× DP4A) = 71.3 ms → curr (packed DP4A) = 62.6 ms → prefill net −8.7 ms.
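
The profile figures can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table above; the variable names are hypothetical. It confirms that the three listed per-op deltas account for the whole change in total GPU time, which suggests the unlisted op types are unchanged in aggregate.

```python
# (base ms, curr ms) per op type, copied from the profile table above.
profile = {
    "MatMulNBits": (932.6, 876.9),
    "Split": (0.0, 10.8),
    "GroupQueryAttention": (440.2, 442.5),
}
total_base, total_curr = 1509.0, 1466.4

# Per-op deltas, rounded to the table's one-decimal precision.
deltas = {op: round(curr - base, 1) for op, (base, curr) in profile.items()}
listed_delta = round(sum(deltas.values()), 1)

# Change in total GPU time, absolute and relative to base.
total_delta = round(total_curr - total_base, 1)
pct = round(100 * total_delta / total_base, 1)

print(deltas, listed_delta, total_delta, pct)
```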

End-to-end perf — Qwen3-4B prefill-1000, 5 runs

| GPU | metric | base | curr | Δ |
|---|---|---|---|---|
| NVIDIA RTX 5080 (steady P0) | gen tps | 105.4 | 108.5 | +2.9% |
| Intel(R) Graphics iGPU | gen tps | 12.44 | 13.06 | +5.0% |

(The NVIDIA result is consistent with the −2.8% GPU-time reduction in the profile; the iGPU shows a larger relative gain of +5.0%. Prompt TPS and TTFT are unchanged on both vendors.)

Compatibility

  • Models without QK-Norm: behavior unchanged (the new Split block is gated on q_norm && k_norm).
  • LoRA / packed-Attention path (use_matmul_in_attn): unchanged.
  • Packed bias fusion still applies for QK-Norm models when bias exists, because Split is placed after the packed Add.

Previously `use_packed_matmul` was disabled whenever q_norm or k_norm
was set, so Qwen3 (and any other QK-Norm architecture) emitted three
separate q_proj / k_proj / v_proj MatMulNBits nodes per layer.

Allow packed MatMul in this case and insert a single ONNX Split node
right after the packed `qkv_proj/MatMul` to recover Q/K/V tensors that
feed the existing q_norm/k_norm path. This keeps the per-head
SimplifiedLayerNormalization semantics unchanged (math equivalent,
quantization unchanged) while reducing 3 MatMulNBits per layer to 1.

A single `Split` is preferred over 3 `Slice` nodes because Split reads
the packed output once and writes 3 outputs in a single dispatch,
avoiding re-reading the same packed tensor 3x each decode step.

The packed-bias branch is also gated off when QK-Norm forces unpack,
to avoid mismatched shape on a packed Add over a sliced Q tensor.

Verified on Qwen3-1.7B (int4, accuracy_level=4) — generated text is
byte-identical to the unpacked baseline, and on RTX 5080 (WebGPU EP)
gen TPS improves +5.5% (121.6 -> 128.3) with no prefill regression.
Copilot AI review requested due to automatic review settings May 7, 2026 02:26
@xiaofeihan1 xiaofeihan1 requested a review from a team as a code owner May 7, 2026 02:26

Copilot AI left a comment


Pull request overview

This PR updates the Python model builder’s attention graph construction to allow packed QKV projection even for QK-Norm architectures (e.g., Qwen3), by inserting an ONNX Split immediately after the packed qkv_proj MatMul so downstream Q/K/V-specific paths (including per-head Q/K norm) remain unchanged.

Changes:

  • Stop disabling use_packed_matmul purely due to q_norm/k_norm being enabled.
  • Add a make_split(...) helper and use it to split packed QKV output into Q, K, V tensors when QK-Norm is active.
  • Disable the packed-bias Add fusion when QK-Norm is active (since the graph now operates on split Q/K/V tensors).

Comment threads (3, all outdated): src/python/py/models/builders/base.py