Hugging Face Model integration in Superbench #803
Aishwarya-Tonpe wants to merge 5 commits into main from
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds HuggingFace Hub as a first-class model source across SuperBench training benchmarks and ORT/TensorRT inference micro-benchmarks, enabling users to benchmark arbitrary HF models via CLI flags (including gated models via HF_TOKEN).
Changes:
- Introduces `ModelSourceConfig` and `HuggingFaceModelLoader` for unified HF model configuration/loading and memory-fit checks.
- Extends PyTorch model benchmarks to optionally load HF backbones and wrap them with task-specific heads.
- Adds HF→ONNX export support and integrates HF flows into ORT and TensorRT inference micro-benchmarks, plus new tests and examples.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/benchmarks/micro_benchmarks/test_model_source_config.py | Adds unit tests for ModelSourceConfig validation/defaulting. |
| tests/benchmarks/micro_benchmarks/test_huggingface_loader.py | Adds unit tests for HF loader dtype handling, load flow, and size estimation. |
| tests/benchmarks/micro_benchmarks/test_huggingface_e2e.py | Adds integration tests that download real HF models and validate basic forward pass. |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Adds HF config customization + wrapper and HF-loading branch for Mixtral benchmark. |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Adds HF-loading path + wrapper for HF vision backbones, keeps in-house torchvision path. |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Adds shared HF model loading flow, memory estimation, and CLI args for model source/identifier. |
| superbench/benchmarks/micro_benchmarks/tensorrt_inference_performance.py | Adds HF model preprocessing: config-only memory check, HF load, ONNX export, TRT build command. |
| superbench/benchmarks/micro_benchmarks/ort_inference_performance.py | Adds HF preprocessing (config memory check, HF load, ONNX export/quantize) + dynamic input handling. |
| superbench/benchmarks/micro_benchmarks/model_source_config.py | New dataclass encapsulating model source, identifier, dtype, token, and loader kwargs. |
| superbench/benchmarks/micro_benchmarks/huggingface_model_loader.py | New loader for HF Hub with tokenizer support, size/memory estimation utilities, and pre-checks. |
| superbench/benchmarks/micro_benchmarks/_export_torch_to_onnx.py | Adds HF model ONNX export with vision/NLP detection, dynamic axes, and optional external data output. |
| examples/benchmarks/tensorrt_inference_performance.py | Updates example script to show in-house vs HF usage via CLI. |
| examples/benchmarks/pytorch_huggingface_models.py | New example demonstrating HF-backed training benchmarks, incl. distributed option. |
| examples/benchmarks/ort_inference_performance.py | Updates ORT example script to show in-house vs HF usage via CLI. |
```python
logger.info(f'Loading HuggingFace model: {model_config.identifier}')

# Step 1: Download config only (few KB) to estimate memory
hf_token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGING_FACE_HUB_TOKEN')
load_kwargs = {}
if hf_token:
    load_kwargs['token'] = hf_token
```
`os` is used in `_create_huggingface_model()` but is not imported in this file (based on the shown diff). This will raise a `NameError` at runtime. Add a module-level `import os` in `pytorch_base.py`.
```python
def test_missing_identifier(self):
    """Test missing identifier raises error."""
    with pytest.raises(ValueError, match='identifier must be provided'):
```
This test's regex does not match the actual error message raised by `ModelSourceConfig.__post_init__()` (`'Model identifier must be provided.'`). Update the `match=` pattern (e.g., to `'Model identifier must be provided'` or a case-insensitive regex) so the test reflects the real behavior.
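For reference, `pytest.raises(match=...)` documents its check as `re.search` against the string representation of the exception, so an unanchored substring pattern does in fact match inside a longer message; the fuller pattern suggested below is still clearer about intent. A quick illustration of the underlying semantics:

```python
import re

# pytest.raises(ValueError, match=pattern) applies re.search to
# str(excinfo.value), so a substring pattern matches anywhere in the message.
message = 'Model identifier must be provided.'
assert re.search('identifier must be provided', message) is not None

# An anchored pattern, by contrast, would fail against this message.
assert re.search('^identifier must be provided', message) is None
```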
```diff
-with pytest.raises(ValueError, match='identifier must be provided'):
+with pytest.raises(ValueError, match='Model identifier must be provided'):
```
```python
do_constant_folding=True,
input_names=input_names,
output_names=['output'],
dynamic_axes=dynamic_axes,
```
For models >2GB, exporting without enabling external data at export time can fail due to protobuf size limits (the subsequent `convert_model_to_external_data()` may never run). Pass the appropriate export-time option (e.g., `use_external_data_format=use_external_data`) so large-model exports succeed reliably.
```diff
 dynamic_axes=dynamic_axes,
+use_external_data_format=use_external_data,
```
```python
# Get input names from the ONNX session to determine input format
input_names = [input.name for input in ort_sess.get_inputs()]

# Determine input format based on what the model expects
if 'pixel_values' in input_names:
    # Vision model: use pixel_values (batch_size, 3, 224, 224)
    pixel_values = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision)
    inputs = {'pixel_values': pixel_values}
elif 'input_ids' in input_names:
    # NLP model: use input_ids and attention_mask
    seq_len = getattr(self._args, 'seq_length', 512)
    input_ids = np.random.randint(0, 30000, (self._args.batch_size, seq_len)).astype(np.int64)
    attention_mask = np.ones((self._args.batch_size, seq_len), dtype=np.int64)
    inputs = {
        'input_ids': input_ids,
        'attention_mask': attention_mask
    }
else:
    # Default for in-house torchvision models: use 'input' (batch_size, 3, 224, 224)
    input_tensor = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision)
    inputs = {'input': input_tensor}
```
For many HF-exported NLP models, the ONNX graph may require additional inputs beyond `input_ids` and `attention_mask` (e.g., `token_type_ids`, `position_ids`, sometimes past key values). As written, `ort_sess.run()` will fail with a missing-input error for those models. Build the inputs dict by iterating `ort_sess.get_inputs()` and generating a tensor for every required input name (using name/type heuristics), rather than hardcoding only two inputs.
Suggested replacement for the quoted block:

```python
batch_size = self._args.batch_size
seq_len = getattr(self._args, 'seq_length', 512)

def _onnx_type_to_numpy_dtype(onnx_type):
    dtype_map = {
        'tensor(float16)': np.float16,
        'tensor(float)': np.float32,
        'tensor(double)': np.float64,
        'tensor(int64)': np.int64,
        'tensor(int32)': np.int32,
        'tensor(int16)': np.int16,
        'tensor(int8)': np.int8,
        'tensor(uint64)': np.uint64,
        'tensor(uint32)': np.uint32,
        'tensor(uint16)': np.uint16,
        'tensor(uint8)': np.uint8,
        'tensor(bool)': np.bool_,
    }
    return dtype_map.get(onnx_type, precision)

def _resolve_shape(name, shape):
    if not shape:
        return ()
    resolved_shape = []
    rank = len(shape)
    lower_name = name.lower()
    for axis, dim in enumerate(shape):
        if isinstance(dim, int) and dim > 0:
            resolved_shape.append(dim)
            continue
        if axis == 0:
            resolved_shape.append(batch_size)
        elif 'pixel_values' in lower_name or (lower_name == 'input' and rank == 4):
            if axis == 1:
                resolved_shape.append(3)
            else:
                resolved_shape.append(224)
        elif 'past' in lower_name or 'key_values' in lower_name:
            resolved_shape.append(seq_len if axis >= rank - 2 else 1)
        elif axis == 1:
            resolved_shape.append(seq_len)
        else:
            resolved_shape.append(1)
    return tuple(resolved_shape)

def _generate_input_tensor(ort_input):
    name = ort_input.name
    lower_name = name.lower()
    dtype = _onnx_type_to_numpy_dtype(ort_input.type)
    shape = _resolve_shape(name, ort_input.shape)
    rank = len(shape)
    if lower_name == 'input_ids':
        return np.random.randint(0, 30000, size=shape, dtype=np.int64)
    if lower_name == 'attention_mask':
        return np.ones(shape, dtype=np.int64)
    if lower_name == 'token_type_ids':
        return np.zeros(shape, dtype=np.int64)
    if lower_name == 'position_ids':
        if rank >= 2:
            positions = np.arange(shape[1], dtype=np.int64)
            return np.broadcast_to(positions, shape).copy()
        return np.arange(shape[0], dtype=np.int64)
    if 'pixel_values' in lower_name or (lower_name == 'input' and rank == 4):
        return np.random.randn(*shape).astype(dtype=dtype)
    if 'past' in lower_name or 'key_values' in lower_name:
        return np.zeros(shape, dtype=dtype)
    if dtype == np.bool_:
        return np.ones(shape, dtype=np.bool_)
    if np.issubdtype(dtype, np.integer):
        return np.zeros(shape, dtype=dtype)
    return np.random.randn(*shape).astype(dtype=dtype)

inputs = {}
for ort_input in ort_sess.get_inputs():
    inputs[ort_input.name] = _generate_input_tensor(ort_input)
```
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 13 comments.
```python
dynamic_axes = {
    'input_ids': {
        0: 'batch_size',
        1: 'seq_length'
    },
    'attention_mask': {
        0: 'batch_size',
        1: 'seq_length'
    },
    'output': {
        0: 'batch_size'
    },
}
```
For many NLP models the exported output shape is sequence-dependent (e.g., logits/hidden states often have a `seq_length` dimension). Currently only the batch dimension is marked dynamic for `output`, which can lock the exported ONNX to a fixed `seq_length` and break dynamic-shape inference/engine building. Consider adding the sequence dimension to `output`'s `dynamic_axes` when the model output is 3D (batch, seq, hidden/vocab).
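A minimal sketch of what that change could look like, reusing the names from the quoted snippet (whether the wrapped model's output is actually 3D would need to be checked before applying it):

```python
# Hypothetical sketch: also mark the sequence dimension of 'output' dynamic,
# so NLP exports are not locked to a fixed seq_length.
dynamic_axes = {
    'input_ids': {0: 'batch_size', 1: 'seq_length'},
    'attention_mask': {0: 'batch_size', 1: 'seq_length'},
    # For 3D outputs (batch, seq, hidden/vocab), axis 1 is sequence-dependent.
    'output': {0: 'batch_size', 1: 'seq_length'},
}
```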
```python
# Get GPU rank to create unique file paths and avoid race conditions
# when multiple processes export the same model simultaneously
gpu_rank = os.getenv('CUDA_VISIBLE_DEVICES', '0')
proc_rank = os.getenv('PROC_RANK', gpu_rank)
```
`CUDA_VISIBLE_DEVICES` is not a stable per-process rank (it can be a comma-separated list like `0,1`), so using it for per-process output directories can cause collisions or odd directory names. Prefer `LOCAL_RANK`/`RANK` (torchrun) or MPI local-rank env vars when available; fall back to PID if no rank is set.
```diff
-# Get GPU rank to create unique file paths and avoid race conditions
-# when multiple processes export the same model simultaneously
-gpu_rank = os.getenv('CUDA_VISIBLE_DEVICES', '0')
-proc_rank = os.getenv('PROC_RANK', gpu_rank)
+# Get a stable per-process rank to create unique file paths and avoid
+# race conditions when multiple processes export the same model
+# simultaneously. Do not use CUDA_VISIBLE_DEVICES here because it may
+# be a comma-separated device list (for example, "0,1") rather than a
+# unique per-process rank.
+proc_rank = next(
+    (
+        os.getenv(env_name) for env_name in (
+            'PROC_RANK',
+            'LOCAL_RANK',
+            'OMPI_COMM_WORLD_LOCAL_RANK',
+            'MPI_LOCALRANKID',
+            'SLURM_LOCALID',
+            'RANK',
+        ) if os.getenv(env_name) is not None
+    ),
+    str(os.getpid()),
+)
```
```python
# Get the first input to determine shape and name
input_name = onnx_model.graph.input[0].name

# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(onnx_model.graph.input[0].type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
    # NLP model: batch x sequence - need to specify all inputs with same batch and seq length
    seq_len = getattr(self._args, 'seq_length', 512)
    shapes_list = []
    for inp in onnx_model.graph.input:
```
ONNX `graph.input` may include initializers/weights (depending on how the model was saved), so `graph.input[0]` is not guaranteed to be a real runtime input tensor. This can cause incorrect shape detection and invalid `--optShapes`. Consider filtering out inputs whose names appear in `graph.initializer` (and/or using the exporter's known input names) before selecting the first real input.
Suggested replacement for the quoted block:

```python
# Filter out initializer-backed graph inputs; ONNX graph.input may include weights/constants.
initializer_names = {initializer.name for initializer in onnx_model.graph.initializer}
runtime_inputs = [inp for inp in onnx_model.graph.input if inp.name not in initializer_names]
if not runtime_inputs:
    logger.error(f'No runtime inputs found in exported ONNX model: {onnx_path}')
    return False

# Get the first real runtime input to determine shape and name
first_input = runtime_inputs[0]
input_name = first_input.name

# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(first_input.type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
    # NLP model: batch x sequence - need to specify all inputs with same batch and seq length
    seq_len = getattr(self._args, 'seq_length', 512)
    shapes_list = []
    for inp in runtime_inputs:
```
```python
choices=['in-house', 'huggingface'],
default='in-house',
required=False,
help='Source of the model: inhouse (default) or huggingface.',
```
The help text says `inhouse` but the CLI choice/value is `in-house`. Align the help text with the actual accepted value to avoid confusing users.
```diff
-help='Source of the model: inhouse (default) or huggingface.',
+help='Source of the model: in-house (default) or huggingface.',
```
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
```python
device_map: Optional[str] = None,
config: Optional[PretrainedConfig] = None,
**kwargs
) -> Tuple[PreTrainedModel, PretrainedConfig, AutoTokenizer]:
```
`tokenizer` can be `None` when tokenizer loading fails, but the return type annotation claims it is always `AutoTokenizer`. Update the return type to reflect optionality (e.g., `Optional[AutoTokenizer]`) so callers and type-checking/tests don't rely on a tokenizer always being present.
```diff
-) -> Tuple[PreTrainedModel, PretrainedConfig, AutoTokenizer]:
+) -> Tuple[PreTrainedModel, PretrainedConfig, Optional[AutoTokenizer]]:
```
```python
tokenizer = None
try:
    logger.info('Loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(model_identifier, trust_remote_code=True, **load_kwargs)
except Exception as e:
    logger.warning(f'Could not load tokenizer: {e}. Continuing without tokenizer.')
```
```python
    f'({self._get_model_size(model):.2f}M parameters)'
)

return model, config, tokenizer
```
```python
if not self.identifier:
    raise ValueError('Model identifier must be provided.')
```
This error message likely breaks the newly added unit test that expects the message to match `identifier must be provided` (case-sensitive substring match). Either adjust the test expectation or change the raised message to match the intended contract; keeping the message stable and consistent is preferable since it becomes a public-ish validation surface.
```python
raise ValueError(f"Invalid model source '{self.source}'.Must be 'in-house' or 'huggingface'.")

# Validate torch_dtype
valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
if self.torch_dtype not in valid_dtypes:
    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'.Must be one of {valid_dtypes}.")
```
Both error strings are missing a space after the period (`'.Must'`). Add the missing space so the messages are readable and consistent with other validation errors.
```diff
-raise ValueError(f"Invalid model source '{self.source}'.Must be 'in-house' or 'huggingface'.")
+raise ValueError(f"Invalid model source '{self.source}'. Must be 'in-house' or 'huggingface'.")
 # Validate torch_dtype
 valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
 if self.torch_dtype not in valid_dtypes:
-    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'.Must be one of {valid_dtypes}.")
+    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'. Must be one of {valid_dtypes}.")
```
```python
choices=['in-house', 'huggingface'],
default='in-house',
required=False,
help='Source of the model: inhouse (default) or huggingface.',
```
Same as ORT: the help text references `inhouse` while the actual choice is `in-house`. Align the wording with the parser choices.
```diff
-help='Source of the model: inhouse (default) or huggingface.',
+help='Source of the model: in-house (default) or huggingface.',
```
```python
output_dir = f'/tmp/tensorrt_onnx_rank_{proc_rank}'
os.makedirs(output_dir, exist_ok=True)
```
Hard-coding exports to `/tmp` can be problematic in containerized/locked-down environments (noexec, limited disk, or different temp roots) and makes cleanup harder. Prefer using an existing benchmark cache/output directory if available in this benchmark (similar to ORT's `__model_cache_path`) or `tempfile.mkdtemp()` under a configurable base directory.
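A minimal sketch of the `tempfile.mkdtemp()` approach the comment suggests, assuming a `proc_rank` string is already computed elsewhere; the `SB_ONNX_EXPORT_DIR` env var name is hypothetical, not part of the PR:

```python
import os
import tempfile

# Hypothetical override point: let operators choose the export root;
# fall back to the system temp directory instead of hard-coding /tmp.
base_dir = os.getenv('SB_ONNX_EXPORT_DIR', tempfile.gettempdir())
os.makedirs(base_dir, exist_ok=True)

proc_rank = '0'  # stand-in for the per-process rank computed elsewhere

# mkdtemp creates a unique directory per call, so concurrent processes
# exporting the same model cannot collide even with identical prefixes.
output_dir = tempfile.mkdtemp(prefix=f'tensorrt_onnx_rank_{proc_rank}_', dir=base_dir)
print(output_dir)
```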
```python
del dummy_input
torch.cuda.empty_cache()
return file_name
```
The new `export_huggingface_model()` introduces multiple branches (vision vs NLP, dynamic-axes behavior, and external-data conversion for >2GB) but there is no targeted unit test coverage shown for this method. Consider adding mocked unit tests that validate: (1) correct input/output names for the vision and NLP paths, and (2) that external-data conversion is invoked when the size threshold is exceeded (this can be done by mocking parameter sizing and the ONNX helpers).
Suggested addition:

```python
_ONNX_EXTERNAL_DATA_THRESHOLD_BYTES = 2 * 1024 * 1024 * 1024

def _get_model_parameter_size_bytes(self, model):
    """Return the total serialized parameter size in bytes for a model.

    This helper is intentionally isolated so unit tests can mock parameter
    shapes/sizes without performing a real ONNX export.

    Args:
        model: Model instance exposing ``parameters()``.

    Returns:
        int: Total parameter size in bytes.
    """
    total_size = 0
    for parameter in model.parameters():
        total_size += parameter.nelement() * parameter.element_size()
    return total_size

def _should_use_external_data_format(self, model):
    """Return whether ONNX external-data format should be used.

    Args:
        model: Model instance exposing ``parameters()``.

    Returns:
        bool: True when the model size exceeds the ONNX 2GB threshold.
    """
    return self._get_model_parameter_size_bytes(model) > self._ONNX_EXTERNAL_DATA_THRESHOLD_BYTES

def _build_huggingface_export_config(self, model, batch_size=1, seq_length=512):
    """Build dummy input and ONNX I/O metadata for HuggingFace export.

    This helper extracts the vision-vs-NLP branch logic into a directly
    testable unit so mocked tests can validate input/output names and
    dynamic axes without requiring a full export.

    Args:
        model: HuggingFace model instance to export.
        batch_size (int): Batch size of input. Defaults to 1.
        seq_length (int): Sequence length of input. Defaults to 512.

    Returns:
        tuple: (dummy_input, input_names, output_names, dynamic_axes)
    """
    config = getattr(model, 'config', None)
    model_type = getattr(config, 'model_type', '')
    # Vision models typically consume pixel_values with NCHW layout.
    if model_type in ('vit', 'swin', 'convnext', 'beit', 'deit', 'resnet', 'detr'):
        dummy_input = torch.randn((batch_size, 3, 224, 224), device='cuda')
        input_names = ['pixel_values']
        output_names = ['output']
        dynamic_axes = {
            'pixel_values': {0: 'batch_size'},
            'output': {0: 'batch_size'},
        }
        return dummy_input, input_names, output_names, dynamic_axes

    # Default HuggingFace NLP-style export.
    dummy_input = torch.ones((batch_size, seq_length), dtype=torch.int64, device='cuda')
    input_names = ['input_ids']
    output_names = ['output']
    dynamic_axes = {
        'input_ids': {0: 'batch_size', 1: 'seq_length'},
        'output': {0: 'batch_size'},
    }
    return dummy_input, input_names, output_names, dynamic_axes
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
```python
# Handle device mapping for large models
if device_map:
    model_kwargs['device_map'] = device_map
elif device == 'cuda' and torch.cuda.is_available():
    # Don't set device_map if device is explicitly cuda
    pass
elif device != 'cpu':
    model_kwargs['device_map'] = device

# Pass pre-downloaded config to from_pretrained so any overrides take effect
if config is not None:
    model_kwargs['config'] = config

try:
    model = AutoModel.from_pretrained(model_identifier, **model_kwargs)
except ValueError:
    logger.info('AutoModel failed, trying AutoModelForCausalLM...')
    model = AutoModelForCausalLM.from_pretrained(model_identifier, **model_kwargs)

# Move to device if not using device_map
if not device_map and device != 'auto':
    model = model.to(device)
```
The decision to call `model.to(device)` is based on the argument `device_map`, but `model_kwargs['device_map']` can be set even when `device_map` (the arg) is `None` (e.g., when `device != 'cpu'` and CUDA is unavailable). In that case, `from_pretrained(..., device_map=...)` returns a dispatched model and calling `.to(...)` can error. Track the effective device map used (e.g., `effective_device_map = model_kwargs.get('device_map')`) and only call `.to(device)` when no device map was actually passed.
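A minimal sketch of the fix this comment describes, with the placement decision factored into a pure function so it can be checked without `transformers` (the function name `resolve_placement` and the `cuda_available` parameter are illustrative, not part of the PR):

```python
def resolve_placement(device_map, device, cuda_available):
    """Decide the effective device_map kwargs and whether .to(device) should run."""
    model_kwargs = {}
    if device_map:
        model_kwargs['device_map'] = device_map
    elif device == 'cuda' and cuda_available:
        pass  # load normally, then move with .to('cuda')
    elif device != 'cpu':
        # e.g. device='cuda' without CUDA available, or device='auto'
        model_kwargs['device_map'] = device
    # Key point: inspect the kwargs actually passed to from_pretrained,
    # not the original device_map argument.
    effective_device_map = model_kwargs.get('device_map')
    should_move = effective_device_map is None and device != 'auto'
    return model_kwargs, should_move

# device='cuda' but CUDA unavailable: a device_map IS passed, so .to() must be skipped,
# even though the device_map *argument* was None.
kwargs, move = resolve_placement(None, 'cuda', cuda_available=False)
print(kwargs, move)  # {'device_map': 'cuda'} False
```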
```python
    ValueError: If dtype string is invalid.
"""
dtype_map = {
    'float32': torch.float32,
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'int8': torch.int8,
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16,
}

if dtype_str.lower() not in dtype_map:
    raise ValueError(f"Invalid dtype '{dtype_str}'.Must be one of {list(dtype_map.keys())}")

return dtype_map[dtype_str.lower()]
```
Allowing `torch_dtype='int8'` and mapping it to `torch.int8` is misleading: `from_pretrained(..., torch_dtype=torch.int8)` generally isn't a supported way to load int8 weights in Transformers (int8 inference typically requires dedicated quantization flows/backends). Consider rejecting int8 in `_get_torch_dtype` (or in `ModelSourceConfig`) and reserving int8 for post-export quantization (as you already do for ORT), or implement a supported HF quantization path explicitly.
Suggested replacement:

```python
    ValueError: If dtype string is invalid or unsupported for standard HF loading.
"""
normalized_dtype = dtype_str.lower()
if normalized_dtype == 'int8':
    raise ValueError(
        "Unsupported dtype 'int8' for Hugging Face model loading via torch_dtype. "
        'Use a dedicated quantization/loading path for int8 models or apply int8 quantization '
        'after export.'
    )
dtype_map = {
    'float32': torch.float32,
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16,
}
if normalized_dtype not in dtype_map:
    raise ValueError(f"Invalid dtype '{dtype_str}'.Must be one of {list(dtype_map.keys())}")
return dtype_map[normalized_dtype]
```
```python
# Export to ONNX; for large models (>2GB), use external data format
model_size_gb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024**3)
use_external_data = model_size_gb > 2.0

if use_external_data:
    logger.info(f'Model size is {model_size_gb:.2f}GB, using external data format for ONNX export')

torch.onnx.export(
    wrapped_model,
    export_args,
    file_name,
    opset_version=14,
    do_constant_folding=True,
    input_names=input_names,
    output_names=['output'],
    dynamic_axes=dynamic_axes,
)
```
For models larger than ~2GB, `torch.onnx.export(...)` may fail before the later `convert_model_to_external_data(...)` step due to protobuf size limits, because the export itself still attempts to serialize initializers into the main ONNX file. For large-model support to be reliable, enable PyTorch's large/external-data export mode at export time (e.g., using the appropriate `large_model`/`use_external_data_format` option supported by your PyTorch version) rather than only converting after the fact.
```python
# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(onnx_model.graph.input[0].type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
```
Hard-coding `3x224x224` will produce incorrect shapes for many vision models (e.g., models trained/evaluated at 384px, grayscale, or non-3-channel inputs). Since you're already inspecting the ONNX graph, prefer deriving H/W/C from the declared input shape when static, or (when dynamic/unknown) using model/config metadata (e.g., `image_size`, `num_channels`) with sensible defaults.
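A minimal sketch of the fallback chain this comment suggests, using a plain object in place of a real `transformers` config; the attribute names `image_size`/`num_channels` follow HF convention, and the defaults and function name are illustrative assumptions:

```python
from types import SimpleNamespace

def resolve_vision_input_shape(batch_size, config=None, default_size=224, default_channels=3):
    """Derive an NxCxHxW trtexec shape string from config metadata, with common defaults."""
    size = getattr(config, 'image_size', None) or default_size
    channels = getattr(config, 'num_channels', None) or default_channels
    # Some configs store image_size as (height, width); normalize to a pair.
    if isinstance(size, int):
        height, width = size, size
    else:
        height, width = size
    return f'pixel_values:{batch_size}x{channels}x{height}x{width}'

print(resolve_vision_input_shape(8))  # pixel_values:8x3x224x224
# A 384px model declared in its config overrides the default:
print(resolve_vision_input_shape(8, SimpleNamespace(image_size=384, num_channels=3)))  # pixel_values:8x3x384x384
```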
```python
@pytest.fixture
def loader(self):
    """Create a loader instance for testing."""
    return HuggingFaceModelLoader(cache_dir='/tmp/test_cache', token=None)
```
Using a hard-coded `/tmp/...` path makes the test non-portable (e.g., Windows runners) and can cause interference across parallel test runs. Prefer `tmp_path`/`tmp_path_factory` to generate an isolated cache directory per test.
Adds support for loading and benchmarking models from HuggingFace Hub across the ORT and TensorRT inference micro-benchmarks. Users can run any compatible HF-hosted model through the existing benchmark harness using `--model_source huggingface --model_identifier <org/model>`.
SuperBench previously supported only in-house model definitions with hardcoded architectures, so adding new models required code changes. This PR allows benchmarking any compatible HuggingFace model with a CLI flag change, including gated models via `HF_TOKEN`.
Key Changes
New modules:
- `HuggingFaceModelLoader` — downloads, caches, and loads models from HF Hub. Estimates parameter count from the model config (a few KB) and checks GPU memory before downloading full weights, avoiding failed multi-GB downloads.
- `ModelSourceConfig` — dataclass for model source configuration (in-house / huggingface), dtype, revision, auth token, and device mapping.
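Based on the validation behavior discussed in the review comments above, the dataclass can be sketched roughly as follows; field names beyond `source`, `identifier`, and `torch_dtype`, and the exact field defaults, are assumptions rather than the PR's actual code:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ModelSourceConfig:
    """Sketch of the model-source configuration discussed in this PR."""
    source: str = 'in-house'
    identifier: Optional[str] = None
    torch_dtype: str = 'float32'
    token: Optional[str] = None
    loader_kwargs: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Error messages mirror the ones quoted in the review (with the
        # missing-space typo fixed, as one comment requests).
        if self.source not in ('in-house', 'huggingface'):
            raise ValueError(f"Invalid model source '{self.source}'. Must be 'in-house' or 'huggingface'.")
        if not self.identifier:
            raise ValueError('Model identifier must be provided.')
        valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
        if self.torch_dtype not in valid_dtypes:
            raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'. Must be one of {valid_dtypes}.")

cfg = ModelSourceConfig(source='huggingface', identifier='bert-base-uncased', torch_dtype='float16')
print(cfg.torch_dtype)  # float16
```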
Micro-benchmarks (inference):
ORT inference — Downloads HF model → exports to ONNX → runs ORT inference. Handles both vision (pixel_values) and NLP (input_ids) inputs
automatically.
TensorRT inference — Same flow: download → ONNX export → trtexec engine build → inference. Includes dynamic input shape detection from the
exported ONNX graph.
ONNX exporter — New export_huggingface_model() method with vision/NLP auto-detection, dynamic axes, and external data support for large models
(>2GB).
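The two size checks described above (the config-based memory-fit estimate and the >2GB external-data threshold) reduce to the same arithmetic. A rough sketch, where the helper names and the illustrative parameter counts are assumptions, not SuperBench's actual code:

```python
DTYPE_BYTES = {'float32': 4, 'float16': 2, 'bfloat16': 2, 'int8': 1}

def estimated_size_gb(num_parameters, torch_dtype='float32'):
    """Approximate serialized weight size from parameter count and dtype width."""
    return num_parameters * DTYPE_BYTES[torch_dtype] / (1024 ** 3)

def needs_external_data(num_parameters, torch_dtype='float32', threshold_gb=2.0):
    """ONNX protobufs cap out near 2GB, so larger models need external data files."""
    return estimated_size_gb(num_parameters, torch_dtype) > threshold_gb

# bert-base-uncased has roughly 110M parameters: ~0.41GB in fp32, well under 2GB.
print(needs_external_data(110_000_000, 'float32'))  # False
# A 7B-parameter model in fp16 is ~13GB and must use the external data format.
print(needs_external_data(7_000_000_000, 'float16'))  # True
```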
Testing
Usage
Training benchmark
ORT inference

```shell
python examples/benchmarks/ort_inference_performance.py \
    --model_source huggingface --model_identifier bert-base-uncased
```

TensorRT inference

```shell
python examples/benchmarks/tensorrt_inference_performance.py \
    --model_source huggingface --model_identifier microsoft/resnet-50
```

Gated models

```shell
export HF_TOKEN=hf_xxxxx
```