Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries for INT8 embedding quantization by apsonawane · Pull Request #2464 · microsoft/Olive

apsonawane · 2026-05-14T00:17:08Z

Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries

Summary

Adds two new graph surgeries for post-hoc INT8 embedding quantization and weight sharing, along with evaluator fixes for hybrid attention architectures (e.g., Qwen3.5-2B with GatedDeltaNet + standard attention).

Motivation

Models with large vocabularies (e.g., Qwen3.5-2B with 248K tokens) have FP16 embeddings that dominate model size (~970 MB of the 2.0 GB model once the other weights are INT4). The ModelBuilder's default quantizer (Neural Compressor) only quantizes MatMul ops, leaving the Gather (embedding) op in FP16. RTN-based quantizers that support INT8 embeddings natively (k_quant_last) destroy accuracy on hybrid architectures (26% vs 59% MMLU).

Changes

New Graph Surgeries (`graph_surgeries.py`)

- QuantizeEmbeddingInt8: Converts the FP16 Gather embedding to an INT8 GatherBlockQuantized node with per-block asymmetric quantization (zero_point=128, block_size=32). Reduces the embedding from ~970 MB to ~530 MB with negligible accuracy loss.
- ShareEmbeddingLmHead: Replaces lm_head's INT4 MatMulNBits with an INT8 MatMulNBits that shares the embedding weight via Reshape, eliminating duplicate storage. Saves ~250 MB.
- Helper functions: _find_embed_node, _find_lm_head_node, _find_initializer, _get_node_attrs
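
For readers unfamiliar with block quantization, the following is a minimal NumPy sketch of per-block 8-bit quantization with a fixed zero point of 128 and a block size of 32. It is not the code from `graph_surgeries.py`: the function names, the choice of `max_abs / 127` as the scale, and the FP32 scale dtype are all assumptions made for illustration.

```python
import numpy as np


def quantize_embedding_int8(weight, block_size=32, zero_point=128):
    """Quantize a 2-D float table to uint8 in blocks along the last axis (illustrative)."""
    rows, cols = weight.shape
    if cols % block_size != 0:
        raise ValueError("hidden size must be a multiple of block_size")
    blocks = weight.astype(np.float32).reshape(rows, cols // block_size, block_size)
    # Assumption: one scale per block, chosen so the largest magnitude maps to +/-127.
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    # Shift by the fixed zero point so values are stored as unsigned bytes.
    q = np.clip(np.rint(blocks / scales) + zero_point, 0, 255).astype(np.uint8)
    zero_points = np.full(scales.shape[:2], zero_point, dtype=np.uint8)
    return q.reshape(rows, cols), scales.squeeze(-1), zero_points


def dequantize_embedding_int8(q, scales, zero_points, block_size=32):
    """Invert the quantization above: (q - zero_point) * scale, block by block."""
    rows, cols = q.shape
    blocks = q.reshape(rows, cols // block_size, block_size).astype(np.float32)
    out = (blocks - zero_points[..., None].astype(np.float32)) * scales[..., None]
    return out.reshape(rows, cols)
```

Under these assumptions the round-trip error of every element is bounded by half of its block's scale, which is one way to reason about why the accuracy loss can stay small.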

Evaluator Fixes

- lmeval_ort.py: Support for 3D position_ids (mRoPE) and hybrid conv_state/recurrent_state inputs for models with mixed attention + linear attention layers
- olive_evaluator.py: Fix metric parsing for lm-eval results with non-comma metric keys and non-numeric values
- onnx_io.py: Fix KV cache layer index detection for non-contiguous indices (e.g., attention at layers 3,7,11,15,19,23 only)
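
To make the metric-parsing fix concrete, here is a hedged sketch of tolerant parsing for one task's lm-eval result dict, where keys usually look like `acc,none` but may also lack a comma, and values may be strings such as an alias or `"N/A"`. The function name `parse_lmeval_metrics` is hypothetical and the logic is an illustration, not the code in `olive_evaluator.py`.

```python
from numbers import Number


def parse_lmeval_metrics(task_results):
    """Flatten one task's lm-eval results into {metric_name: float} (illustrative)."""
    metrics = {}
    for key, value in task_results.items():
        if key == "alias":
            continue  # display name of the task, not a metric
        # Skip non-numeric values; bool is excluded because it is a Number subclass.
        if isinstance(value, bool) or not isinstance(value, Number):
            continue
        # "acc,none" -> "acc"; a key without a comma is kept unchanged.
        name = key.split(",", 1)[0]
        metrics[name] = float(value)
    return metrics


example = {"alias": "mmlu", "acc,none": 0.5927, "acc_stderr,none": "N/A", "exact_match": 1}
print(parse_lmeval_metrics(example))
```

One design caveat of this sketch: two keys that differ only in the filter after the comma would collapse to the same metric name, so a real implementation may want to keep the filter when it is not `none`.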

Results (Qwen3.5-2B)

| Configuration | Size | MMLU | Δ vs FP16 |
| --- | --- | --- | --- |
| Baseline FP16 | 4.3 GB | 59.27% | — |
| INT4 weights + FP16 embed | 2.0 GB | 57.21% | -2.06% |
| INT4 weights + INT8 embed | 1.6 GB | 57.19% | -2.08% |
| INT4 weights + shared INT8 embed/lm_head | 1.4 GB | 57.11% | -2.16% |
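
The embedding sizes quoted in this description (~970 MB in FP16, ~530 MB in INT8) can be sanity-checked with simple arithmetic. The sketch below assumes a vocabulary of 248,320 tokens and a hidden size of 2048; neither number is stated in the PR (only "248K tokens" is), and the storage layout of one FP16 scale plus one uint8 zero point per 32-element block is likewise an assumption.

```python
MIB = 1024 * 1024


def embedding_sizes_mib(vocab_size, hidden_size, block_size=32):
    """Return (FP16 size, INT8 block-quantized size) of an embedding table in MiB."""
    n_elements = vocab_size * hidden_size
    n_blocks = n_elements // block_size
    fp16_bytes = n_elements * 2  # 2 bytes per FP16 element
    # Assumed layout: 1 byte per weight, plus per block an FP16 scale and a uint8 zero point.
    int8_bytes = n_elements + n_blocks * 2 + n_blocks
    return fp16_bytes / MIB, int8_bytes / MIB


# Assumed dimensions, chosen to be consistent with "248K tokens".
fp16_mib, int8_mib = embedding_sizes_mib(vocab_size=248_320, hidden_size=2048)
print(f"FP16: {fp16_mib:.0f} MiB, INT8: {int8_mib:.0f} MiB")  # FP16: 970 MiB, INT8: 530 MiB
```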

Testing

- 7 unit tests added in test/passes/onnx/test_quantize_embedding.py
- All tests pass
- End-to-end validated via Olive recipe with MMLU evaluation

Copilot

Pull request overview

This PR adds two new ONNX graph surgeries to enable post-hoc INT8 embedding quantization and embedding/lm_head weight sharing (to reduce model size for large-vocab LLMs), and updates the lm-eval ORT evaluator + IO utilities to better support hybrid attention architectures and pruned/non-contiguous KV-cache indices.

Changes:

- Add QuantizeEmbeddingInt8 (FP16/FP32 Gather → INT8 GatherBlockQuantized) and ShareEmbeddingLmHead (reuse embedding quantization params/weights for INT8 MatMulNBits) graph surgeries.
- Improve lmeval_ort runtime IO binding to support 3D position_ids (mRoPE) and hybrid state tensors (conv_state / recurrent_state).
- Fix KV-cache layer index detection for non-contiguous layer indices and make LM-eval metric parsing more robust to varied key formats/values.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `olive/passes/onnx/graph_surgeries.py` | Adds two new embedding-focused graph surgeries and helper functions. |
| `olive/passes/onnx/model_builder.py` | Removes a debug message about ignored tied-embedding flags in embedding construction. |
| `olive/evaluator/lmeval_ort.py` | Adds support for mRoPE `position_ids` rank detection and hybrid state IO binding/buffers. |
| `olive/evaluator/olive_evaluator.py` | Tightens parsing of lm-eval metric outputs (skip aliases/non-numeric, handle comma keys). |
| `olive/common/onnx_io.py` | Detects actual KV-cache layer indices from input names (supports non-contiguous indices). |
| `test/passes/onnx/test_quantize_embedding.py` | Adds unit tests covering the new embedding surgeries. |

Comments suppressed due to low confidence (1)

test/passes/onnx/test_quantize_embedding.py:176

old_init_names is assigned but never used, which will fail linting (ruff F841). Remove the variable or assert on it (e.g., compare old vs new initializers) so the assignment is meaningful.


        old_init_names = {init.name for init in model.graph.initializer}

shaahji · 2026-05-20T17:54:29Z

+    # find the actual layer indices (may be non-contiguous after pruning)
+    layer_indices = []
    for i_name in io_config["input_names"]:
-        num_layers += int(re.match(kv_format, i_name) is not None)
+        m = re.match(kv_format, i_name)
+        if m:
+            idx = int(m.group(1))
+            if idx not in layer_indices:
+                layer_indices.append(idx)
+    layer_indices.sort()


nit: declare layer_indices as a set and convert it to a list with sorted() after iteration.
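
A minimal sketch of the set-based variant suggested here. The function name `find_kv_layer_indices` and the default regular expression are hypothetical stand-ins; the real `kv_format` pattern in `onnx_io.py` is not shown in this thread.

```python
import re


def find_kv_layer_indices(input_names, kv_format=r"past_key_values\.(\d+)\.(?:key|value)"):
    """Return the sorted, de-duplicated layer indices found in KV-cache input names."""
    # A set makes the duplicate check O(1) and removes the explicit "not in" test.
    layer_indices = set()
    for i_name in input_names:
        m = re.match(kv_format, i_name)
        if m:
            layer_indices.add(int(m.group(1)))
    return sorted(layer_indices)


names = ["input_ids", "past_key_values.7.key", "past_key_values.3.key", "past_key_values.3.value"]
print(find_kv_layer_indices(names))  # [3, 7]
```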

shaahji · 2026-05-20T17:58:36Z

+        if "position_ids" in self.io_config["input_names"]:
+            idx = self.io_config["input_names"].index("position_ids")
+            self.position_ids_rank = len(self.io_config["input_shapes"][idx])


You could merge this condition with the loop below. That would avoid multiple iterations through the list.
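
One way this could look, as a sketch only: a single pass over the inputs that records the rank of `position_ids` and collects any other per-input information at the same time. It assumes the loop being referred to is the hybrid-state scan over `io_config["input_names"]`; the function name `scan_inputs` is hypothetical.

```python
def scan_inputs(io_config):
    """Single pass over model inputs: position_ids rank plus hybrid state inputs."""
    position_ids_rank = None
    hybrid_states = {}
    inputs = zip(io_config["input_names"], io_config["input_shapes"], io_config["input_types"])
    for name, shape, dtype in inputs:
        if name == "position_ids":
            # Rank 3 would indicate mRoPE-style position ids.
            position_ids_rank = len(shape)
        elif "conv_state" in name or "recurrent_state" in name:
            hybrid_states[name] = {"shape": shape, "dtype": dtype}
    return position_ids_rank, hybrid_states
```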

shaahji · 2026-05-20T17:59:11Z

+        self.hybrid_states = {}
+        for idx, name in enumerate(self.io_config["input_names"]):
+            if "conv_state" in name or "recurrent_state" in name:
+                shape = self.io_config["input_shapes"][idx]
+                dtype = self.io_config["input_types"][idx]
+                self.hybrid_states[name] = {"shape": shape, "dtype": dtype}
+
+        # detect hybrid state outputs
+        self.hybrid_state_outputs = {}
+        for idx, name in enumerate(self.io_config["output_names"]):
+            if "conv_state" in name or "recurrent_state" in name:
+                shape = self.io_config["output_shapes"][idx]
+                dtype = self.io_config["output_types"][idx]
+                self.hybrid_state_outputs[name] = {"shape": shape, "dtype": dtype}
+


These loops can be merged into one!
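
A hedged sketch of one way to merge the two loops: iterate over the `input` and `output` sides with the same body, keyed by the `io_config` field prefix. The helper name `collect_hybrid_states` is hypothetical; only the `io_config` keys are taken from the diff above.

```python
def collect_hybrid_states(io_config):
    """Collect conv_state / recurrent_state tensors for inputs and outputs in one loop."""
    found = {"input": {}, "output": {}}
    for kind, states in found.items():
        names = io_config[f"{kind}_names"]
        shapes = io_config[f"{kind}_shapes"]
        types = io_config[f"{kind}_types"]
        for name, shape, dtype in zip(names, shapes, types):
            if "conv_state" in name or "recurrent_state" in name:
                states[name] = {"shape": shape, "dtype": dtype}
    return found["input"], found["output"]
```

The two result dicts correspond to `self.hybrid_states` and `self.hybrid_state_outputs` in the quoted hunk.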

shaahji · 2026-05-20T18:06:07Z

+def _find_embed_node(model, op_type, label):
+    """Find the embed_tokens node of the given op_type and its index."""
+    for i, node in enumerate(model.graph.node):
+        if node.op_type == op_type and "embed_tokens" in node.name:
+            return node, i
+    logger.warning("No embed_tokens %s node found, skipping %s", op_type, label)
+    return None, None
+
+
+def _find_lm_head_node(model):
+    """Find the lm_head MatMulNBits node and its index."""
+    for i, node in enumerate(model.graph.node):
+        if node.op_type == "MatMulNBits" and "lm_head" in node.name:
+            return node, i
+    logger.warning("No lm_head MatMulNBits found")
+    return None, None
+
+
+def _find_initializer(model, name):
+    """Find an initializer by name."""
+    for init in model.graph.initializer:
+        if init.name == name:
+            return init
+    return None
+
+
+def _get_node_attrs(node, *attr_names):
+    """Extract integer attributes from a node by name."""
+    result = {}
+    for attr in node.attribute:
+        if attr.name in attr_names:
+            result[attr.name] = attr.i
+    return result
+
+
+def _ensure_msft_opset(model):
+    """Ensure com.microsoft opset import is present in the model."""
+    for opset in model.opset_import:
+        if opset.domain == "com.microsoft":
+            return
+    model.opset_import.append(onnx.helper.make_opsetid("com.microsoft", 1))


Use OnnxDAG instead.

shaahji · 2026-05-20T18:13:19Z

+        model.graph.initializer.append(numpy_helper.from_array(q_flat, name=qweight_name))
+        model.graph.initializer.append(numpy_helper.from_array(scales, name=scales_name))
+        model.graph.initializer.append(numpy_helper.from_array(zero_points, name=zp_name))


Use OnnxDAG to manipulate the graph.

apsonawane added 3 commits April 15, 2026 11:50

Update tie-word embedding surgery

5f66048

Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries fo…

14ec328

…r INT8 embedding quantization

Merge branch 'main' into asonawane/tieword

6ba1f23

Copilot AI review requested due to automatic review settings May 14, 2026 00:17

Copilot started reviewing on behalf of apsonawane May 14, 2026 00:18

github-advanced-security AI found potential problems May 14, 2026

Copilot AI reviewed May 14, 2026

Comment thread test/passes/onnx/test_quantize_embedding.py Outdated

Comment thread olive/passes/onnx/graph_surgeries.py

Comment thread olive/passes/onnx/graph_surgeries.py Outdated

github-advanced-security AI found potential problems May 14, 2026

apsonawane mentioned this pull request May 14, 2026

Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding microsoft/olive-recipes#422

Open

apsonawane added 2 commits May 14, 2026 03:58

Fix comments

cbb973a

Fix type issue

7541cf5

shaahji requested changes May 20, 2026

Address comments

1bfa744

apsonawane requested review from shaahji and xiaoyu-work May 20, 2026 22:49

apsonawane and others added 2 commits May 20, 2026 17:11

fix tests

ee840ef

Merge branch 'main' into asonawane/tieword

e5db35e
