[quantization] Implement PTQ wrapper for Gemma4VisionModel with static export support #793
Merged
Torrero
reviewed
Jun 24, 2026
    from transformers.models.gemma4.modeling_gemma4 import BaseModelOutputWithPast

    # Create padding mask from pixel_position_ids
    padding_positions = (pixel_position_ids == -1).all(dim=-1)
Contributor
Maybe here we can precompute padding_positions for static input image?
Contributor
Author
👍 Good point, thank you! Implemented.
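The reviewer's suggestion can be sketched as follows. This is a minimal illustration, not the code merged in the PR: `StaticPaddingMask` is a hypothetical helper, and it assumes the export path always sees the same `pixel_position_ids`, so the mask can be computed once and stored as a buffer.

```python
import torch
from torch import nn


class StaticPaddingMask(nn.Module):
    """Hypothetical helper: caches the padding mask for one fixed pixel_position_ids."""

    def __init__(self, pixel_position_ids: torch.Tensor):
        super().__init__()
        # Same rule as the reviewed line: a patch is padding when all coordinates are -1.
        padding_positions = (pixel_position_ids == -1).all(dim=-1)
        # Stored as a buffer so forward() reads a constant instead of recomputing the mask.
        self.register_buffer("padding_positions", padding_positions, persistent=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Zero out padded patches using the cached mask.
        return hidden_states.masked_fill(self.padding_positions.unsqueeze(-1), 0.0)


# Toy input (assumed shapes): batch=1, patches=3, 2 coordinates per patch.
pixel_position_ids = torch.tensor([[[0, 0], [0, 1], [-1, -1]]])
mask_module = StaticPaddingMask(pixel_position_ids)
out = mask_module(torch.ones(1, 3, 4))
```

With the mask computed in `__init__`, the comparison and reduction no longer appear in the forward pass for a fixed input layout.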
…c export support
Replace the skeleton Gemma4VisionModel wrapper with a complete implementation
TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
What
This PR replaces the skeleton `QuantGemma4VisionModel` wrapper with a full PTQ implementation that decomposes the forward pass into individual submodules (patch embedder, encoder, pooler), adds a static-shape `forward_export()` path for `torch.export`, and activates the wrapper in the registry.
Why
The previous `QuantGemma4VisionModel` was a skeleton that delegated the entire forward pass to `self.module()` (the original Hugging Face model). This meant:
The Gemma4 E2B static runtime requires the vision model to be fully quantized with per-submodule observers and a `forward_export()` path that avoids dynamic operations (conditional branching, dynamic shapes) incompatible with `torch.export` and Circle conversion.
Key Design Decisions
- Separate `forward()` and `forward_export()` methods: The runtime `forward()` supports dynamic shapes and conditional `config.standardize` branching. The export `forward_export()` assumes `config.standardize=True` and uses precomputed static tensors. This follows the same pattern as the existing Llama and Qwen text decoder wrappers.
- Separate export adapter attributes: `as_export_module()` stores submodule export adapters as `self.patch_embedder_export` and `self.pooler_export` rather than mutating the original `self.patch_embedder` and `self.pooler` wrappers. This preserves the original wrappers for potential re-export with different parameters.
- `pixel_position_ids` required for `as_export_module()`: The pooler's export adapter needs `pixel_position_ids` at construction time to precompute static pooling weights (replacing dynamic `F.one_hot` and `torch.div` with a static `matmul`). This is enforced via an assertion.
- `keep_mask == False` instead of `~keep_mask`: Replaced `aten::bitwise_not` with an `== False` comparison in both `quant_vision_attention.py` and `forward_export()` to avoid an operator not supported by the Circle conversion pipeline.
- `register_fake_quant_meta_kernels_for_dynamic_export()`: Called during `as_export_module()` to register the fake-quantize meta kernels needed for `torch.export` with dynamic shapes in the encoder path.
- Encoder returns plain tensor: The `QuantGemma4VisionEncoder` wrapper returns a plain tensor rather than `BaseModelOutputWithPast`. Both `forward()` and `forward_export()` handle this with `isinstance(output, torch.Tensor)` checks.
Changes
- `tico/quantization/wrapq/wrappers/gemma4/quant_vision_model.py`: Replaced skeleton with full implementation: decomposed `forward()` into a patch_embedder → encoder → pooler → strip_padding → standardization pipeline; added `forward_export()` for static-shape export; added `as_export_module()` with recursive submodule export adapter conversion; registered std_bias/std_scale as buffers; added observers for minus_bias, strip_padding, std_bias, std_scale; added `enable_calibration()` to collect std_bias/std_scale statistics
- `tico/quantization/wrapq/wrappers/gemma4/export_adapters.py`: Added `Gemma4VisionModelPrefillExportAdapter` that wraps a `QuantGemma4VisionModel` and delegates `forward()` to `wrapped_model.forward_export()`
- `tico/quantization/wrapq/wrappers/gemma4/quant_vision_attention.py`: Fixed `~keep_mask` → `keep_mask == False` to avoid `aten::bitwise_not`, which is unsupported by the Circle conversion pipeline
- `tico/quantization/wrapq/wrappers/registry.py`: Activated `quant_vision_encoder` and `quant_vision_model` entries (uncommented from `_CORE_MODULES`)
- `test/quantization/wrapq/wrappers/gemma4/test_quant_vision_model.py`: Added 13 unit tests covering: prepare wrapping, no-quant forward parity, mode transitions, observer collection, quant mode output finiteness, config attribute storage, standardize buffer registration, standardize=False path, as_export_module preconditions, forward_export via as_export_module, export adapter attribute creation, submodule wrapping
- `test/quantization/wrapq/wrappers/gemma4/test_quantize_vision_model.py`: Added 4 smoke tests covering: no-quant reference parity, prepare-convert flow, as_export_module flow with `Gemma4VisionModelPrefillExportAdapter`, standardize=False path
- `tico/quantization/recipes/debug/wrapper_smoke/cases/gemma4.py`: Added `Gemma4VisionModelCase` to `GEMMA4_CASES` with `build()`, `calibration_inputs()`, `eval_input()`, `export_module()`, and `export_input()` methods
- `tico/quantization/wrapq/examples/gemma4/quantize_vision_model.py`: Added example script demonstrating full PTQ flow: create model, prepare, calibrate, convert, compare FP vs quantized, export to Circle
Tests
- Unit Tests (`test_quant_vision_model.py`): 13 tests covering wrapper lifecycle, forward parity, mode transitions, observer collection, export module preconditions, and export adapter attribute creation. All pass.
- Internal Tests (`test_quantize_vision_model.py`): 4 tests covering end-to-end prepare-calibrate-convert flow, reference parity, and as_export_module flow. Gated behind `RUN_INTERNAL_TESTS=1`.
- Smoke Tests (`Gemma4VisionModelCase`): Registered in `GEMMA4_CASES` for the wrapper smoke test framework.
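For readers unfamiliar with environment-gated tests, the `RUN_INTERNAL_TESTS=1` gate might look roughly like the sketch below. The exact mechanism used in the repository is not shown in this PR, so the equality check and the `InternalSmokeTest` class are assumptions for illustration only.

```python
import os
import unittest

# Assumption: the gate is a plain equality check on the environment variable.
RUN_INTERNAL_TESTS = os.environ.get("RUN_INTERNAL_TESTS") == "1"


@unittest.skipUnless(RUN_INTERNAL_TESTS, "set RUN_INTERNAL_TESTS=1 to run internal tests")
class InternalSmokeTest(unittest.TestCase):  # hypothetical test case
    def test_placeholder(self):
        # Stand-in for an expensive end-to-end check.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run the gated test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(InternalSmokeTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
```

Without the variable set, the test is reported as skipped rather than failed, which keeps the default test run fast while leaving the internal suite available on demand.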
Example Script
`tico/quantization/wrapq/examples/gemma4/quantize_vision_model.py`: Demonstrates the complete PTQ pipeline for `Gemma4VisionModel`:
- `Gemma4VisionModel` with `standardize=True` and `pooling_kernel_size=2`
- `PTQConfig()`
- `as_export_module()` with `pixel_position_ids` for pooler precomputation
- `gemma4_vision_model.q.circle`
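The pooler precomputation that `as_export_module()` performs from `pixel_position_ids` can be illustrated with a generic sketch. It is not the TICO implementation: `build_pooling_weights` and `StaticPooler` are hypothetical names, and the layout (x/y patch coordinates on a regular grid, average pooling over `kernel × kernel` cells) is an assumption. The point is only that `F.one_hot` and `torch.div` run once at construction, so the forward pass reduces to a single `matmul`.

```python
import torch
import torch.nn.functional as F


def build_pooling_weights(pixel_position_ids, kernel, grid_w, grid_h):
    """Return a (num_cells, num_patches) averaging matrix for a fixed patch layout."""
    x, y = pixel_position_ids[:, 0], pixel_position_ids[:, 1]
    padding = (pixel_position_ids == -1).all(dim=-1)
    cells_per_row = grid_w // kernel
    num_cells = cells_per_row * (grid_h // kernel)
    # Map each patch to the pooled cell that contains it.
    cell_idx = (
        torch.div(x, kernel, rounding_mode="floor")
        + cells_per_row * torch.div(y, kernel, rounding_mode="floor")
    ).clamp(min=0)  # padded patches get a dummy index and are zeroed below
    one_hot = F.one_hot(cell_idx, num_classes=num_cells).to(torch.float32)
    one_hot = one_hot.masked_fill(padding.unsqueeze(-1), 0.0)
    return one_hot.t() / float(kernel * kernel)


class StaticPooler(torch.nn.Module):
    """Hypothetical export-time pooler: all index math is done in __init__."""

    def __init__(self, pixel_position_ids, kernel, grid_w, grid_h):
        super().__init__()
        weights = build_pooling_weights(pixel_position_ids, kernel, grid_w, grid_h)
        self.register_buffer("pool_weights", weights)

    def forward(self, hidden_states):
        # (cells, patches) @ (patches, dim) -> (cells, dim); no dynamic ops.
        return self.pool_weights @ hidden_states


# Toy 4x4 patch grid pooled with a 2x2 kernel.
ys, xs = torch.meshgrid(torch.arange(4), torch.arange(4), indexing="ij")
ids = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
hidden = torch.arange(16, dtype=torch.float32).unsqueeze(-1)
pooled = StaticPooler(ids, kernel=2, grid_w=4, grid_h=4)(hidden)
reference = F.avg_pool2d(hidden.view(1, 1, 4, 4), 2).flatten()
```

On this toy grid the static matrix reproduces `F.avg_pool2d`, which is a quick way to sanity-check a precomputed weight matrix before export.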