Hugging Face Model integration in Superbench #803
Aishwarya-Tonpe wants to merge 5 commits into main from
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds HuggingFace Hub as a first-class model source across SuperBench training benchmarks and ORT/TensorRT inference micro-benchmarks, enabling users to benchmark arbitrary HF models via CLI flags (including gated models via HF_TOKEN).
Changes:
- Introduces `ModelSourceConfig` and `HuggingFaceModelLoader` for unified HF model configuration/loading and memory-fit checks.
- Extends PyTorch model benchmarks to optionally load HF backbones and wrap them with task-specific heads.
- Adds HF→ONNX export support and integrates HF flows into ORT and TensorRT inference micro-benchmarks, plus new tests and examples.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/benchmarks/micro_benchmarks/test_model_source_config.py | Adds unit tests for ModelSourceConfig validation/defaulting. |
| tests/benchmarks/micro_benchmarks/test_huggingface_loader.py | Adds unit tests for HF loader dtype handling, load flow, and size estimation. |
| tests/benchmarks/micro_benchmarks/test_huggingface_e2e.py | Adds integration tests that download real HF models and validate basic forward pass. |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Adds HF config customization + wrapper and HF-loading branch for Mixtral benchmark. |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Adds HF-loading path + wrapper for HF vision backbones, keeps in-house torchvision path. |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Adds shared HF model loading flow, memory estimation, and CLI args for model source/identifier. |
| superbench/benchmarks/micro_benchmarks/tensorrt_inference_performance.py | Adds HF model preprocessing: config-only memory check, HF load, ONNX export, TRT build command. |
| superbench/benchmarks/micro_benchmarks/ort_inference_performance.py | Adds HF preprocessing (config memory check, HF load, ONNX export/quantize) + dynamic input handling. |
| superbench/benchmarks/micro_benchmarks/model_source_config.py | New dataclass encapsulating model source, identifier, dtype, token, and loader kwargs. |
| superbench/benchmarks/micro_benchmarks/huggingface_model_loader.py | New loader for HF Hub with tokenizer support, size/memory estimation utilities, and pre-checks. |
| superbench/benchmarks/micro_benchmarks/_export_torch_to_onnx.py | Adds HF model ONNX export with vision/NLP detection, dynamic axes, and optional external data output. |
| examples/benchmarks/tensorrt_inference_performance.py | Updates example script to show in-house vs HF usage via CLI. |
| examples/benchmarks/pytorch_huggingface_models.py | New example demonstrating HF-backed training benchmarks, incl. distributed option. |
| examples/benchmarks/ort_inference_performance.py | Updates ORT example script to show in-house vs HF usage via CLI. |
```python
logger.info(f'Loading HuggingFace model: {model_config.identifier}')

# Step 1: Download config only (few KB) to estimate memory
hf_token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGING_FACE_HUB_TOKEN')
load_kwargs = {}
if hf_token:
    load_kwargs['token'] = hf_token
```
`os` is used in `_create_huggingface_model()` but is not imported in this file (based on the shown diff). This will raise a `NameError` at runtime. Add a module-level `import os` in `pytorch_base.py`.
```python
def test_missing_identifier(self):
    """Test missing identifier raises error."""
    with pytest.raises(ValueError, match='identifier must be provided'):
```
This test's regex does not match the actual error message raised by `ModelSourceConfig.__post_init__()` (`'Model identifier must be provided.'`). Update the `match=` pattern (e.g., to `'Model identifier must be provided'` or a case-insensitive regex) so the test reflects the real behavior.
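For reference, `pytest.raises(match=...)` documents its check as `re.search` against the string representation of the exception, so an unanchored substring pattern does in fact match inside a longer message; the fuller pattern suggested below is still clearer about intent. A quick illustration of the underlying semantics:

```python
import re

# pytest.raises(ValueError, match=pattern) applies re.search to
# str(excinfo.value), so a substring pattern matches anywhere in the message.
message = 'Model identifier must be provided.'
assert re.search('identifier must be provided', message) is not None

# An anchored pattern, by contrast, would fail against this message.
assert re.search('^identifier must be provided', message) is None
```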
```diff
-with pytest.raises(ValueError, match='identifier must be provided'):
+with pytest.raises(ValueError, match='Model identifier must be provided'):
```
```python
do_constant_folding=True,
input_names=input_names,
output_names=['output'],
dynamic_axes=dynamic_axes,
```
For models >2GB, exporting without enabling external data at export time can fail due to protobuf size limits (the subsequent `convert_model_to_external_data()` may never run). Pass the appropriate export-time option (e.g., `use_external_data_format=use_external_data`) so large-model exports succeed reliably.
```diff
 dynamic_axes=dynamic_axes,
+use_external_data_format=use_external_data,
```
```python
# Get input names from the ONNX session to determine input format
input_names = [input.name for input in ort_sess.get_inputs()]

# Determine input format based on what the model expects
if 'pixel_values' in input_names:
    # Vision model: use pixel_values (batch_size, 3, 224, 224)
    pixel_values = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision)
    inputs = {'pixel_values': pixel_values}
elif 'input_ids' in input_names:
    # NLP model: use input_ids and attention_mask
    seq_len = getattr(self._args, 'seq_length', 512)
    input_ids = np.random.randint(0, 30000, (self._args.batch_size, seq_len)).astype(np.int64)
    attention_mask = np.ones((self._args.batch_size, seq_len), dtype=np.int64)
    inputs = {
        'input_ids': input_ids,
        'attention_mask': attention_mask
    }
else:
    # Default for in-house torchvision models: use 'input' (batch_size, 3, 224, 224)
    input_tensor = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision)
    inputs = {'input': input_tensor}
```
For many HF-exported NLP models, the ONNX graph may require additional inputs beyond `input_ids` and `attention_mask` (e.g., `token_type_ids`, `position_ids`, sometimes past key values). As written, `ort_sess.run()` will fail with a missing-input error for those models. Build the inputs dict by iterating `ort_sess.get_inputs()` and generating a tensor for every required input name (using name/type heuristics), rather than hardcoding only two inputs.
Suggested replacement for the quoted block:

```python
batch_size = self._args.batch_size
seq_len = getattr(self._args, 'seq_length', 512)

def _onnx_type_to_numpy_dtype(onnx_type):
    dtype_map = {
        'tensor(float16)': np.float16,
        'tensor(float)': np.float32,
        'tensor(double)': np.float64,
        'tensor(int64)': np.int64,
        'tensor(int32)': np.int32,
        'tensor(int16)': np.int16,
        'tensor(int8)': np.int8,
        'tensor(uint64)': np.uint64,
        'tensor(uint32)': np.uint32,
        'tensor(uint16)': np.uint16,
        'tensor(uint8)': np.uint8,
        'tensor(bool)': np.bool_,
    }
    return dtype_map.get(onnx_type, precision)

def _resolve_shape(name, shape):
    if not shape:
        return ()
    resolved_shape = []
    rank = len(shape)
    lower_name = name.lower()
    for axis, dim in enumerate(shape):
        if isinstance(dim, int) and dim > 0:
            resolved_shape.append(dim)
            continue
        if axis == 0:
            resolved_shape.append(batch_size)
        elif 'pixel_values' in lower_name or (lower_name == 'input' and rank == 4):
            if axis == 1:
                resolved_shape.append(3)
            else:
                resolved_shape.append(224)
        elif 'past' in lower_name or 'key_values' in lower_name:
            resolved_shape.append(seq_len if axis >= rank - 2 else 1)
        elif axis == 1:
            resolved_shape.append(seq_len)
        else:
            resolved_shape.append(1)
    return tuple(resolved_shape)

def _generate_input_tensor(ort_input):
    name = ort_input.name
    lower_name = name.lower()
    dtype = _onnx_type_to_numpy_dtype(ort_input.type)
    shape = _resolve_shape(name, ort_input.shape)
    rank = len(shape)
    if lower_name == 'input_ids':
        return np.random.randint(0, 30000, size=shape, dtype=np.int64)
    if lower_name == 'attention_mask':
        return np.ones(shape, dtype=np.int64)
    if lower_name == 'token_type_ids':
        return np.zeros(shape, dtype=np.int64)
    if lower_name == 'position_ids':
        if rank >= 2:
            positions = np.arange(shape[1], dtype=np.int64)
            return np.broadcast_to(positions, shape).copy()
        return np.arange(shape[0], dtype=np.int64)
    if 'pixel_values' in lower_name or (lower_name == 'input' and rank == 4):
        return np.random.randn(*shape).astype(dtype=dtype)
    if 'past' in lower_name or 'key_values' in lower_name:
        return np.zeros(shape, dtype=dtype)
    if dtype == np.bool_:
        return np.ones(shape, dtype=np.bool_)
    if np.issubdtype(dtype, np.integer):
        return np.zeros(shape, dtype=dtype)
    return np.random.randn(*shape).astype(dtype=dtype)

inputs = {}
for ort_input in ort_sess.get_inputs():
    inputs[ort_input.name] = _generate_input_tensor(ort_input)
```
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 13 comments.
```python
dynamic_axes = {
    'input_ids': {
        0: 'batch_size',
        1: 'seq_length'
    },
    'attention_mask': {
        0: 'batch_size',
        1: 'seq_length'
    },
    'output': {
        0: 'batch_size'
    },
}
```
For many NLP models the exported output shape is sequence-dependent (e.g., logits/hidden states often have a `seq_length` dimension). Currently only the batch dimension is marked dynamic for `output`, which can lock the exported ONNX to a fixed `seq_length` and break dynamic-shape inference/engine building. Consider adding the sequence dimension to `output`'s `dynamic_axes` when the model output is 3D (batch, seq, hidden/vocab).
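A minimal sketch of what that change could look like, reusing the names from the quoted snippet (whether the wrapped model's output is actually 3D would need to be checked before applying it):

```python
# Hypothetical sketch: also mark the sequence dimension of 'output' dynamic,
# so NLP exports are not locked to a fixed seq_length.
dynamic_axes = {
    'input_ids': {0: 'batch_size', 1: 'seq_length'},
    'attention_mask': {0: 'batch_size', 1: 'seq_length'},
    # For 3D outputs (batch, seq, hidden/vocab), axis 1 is sequence-dependent.
    'output': {0: 'batch_size', 1: 'seq_length'},
}
```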
```python
# Get GPU rank to create unique file paths and avoid race conditions
# when multiple processes export the same model simultaneously
gpu_rank = os.getenv('CUDA_VISIBLE_DEVICES', '0')
proc_rank = os.getenv('PROC_RANK', gpu_rank)
```
`CUDA_VISIBLE_DEVICES` is not a stable per-process rank (it can be a comma-separated list like `0,1`), so using it for per-process output directories can cause collisions or odd directory names. Prefer `LOCAL_RANK`/`RANK` (torchrun) or MPI local-rank env vars when available; fall back to PID if no rank is set.
```diff
-# Get GPU rank to create unique file paths and avoid race conditions
-# when multiple processes export the same model simultaneously
-gpu_rank = os.getenv('CUDA_VISIBLE_DEVICES', '0')
-proc_rank = os.getenv('PROC_RANK', gpu_rank)
+# Get a stable per-process rank to create unique file paths and avoid
+# race conditions when multiple processes export the same model
+# simultaneously. Do not use CUDA_VISIBLE_DEVICES here because it may
+# be a comma-separated device list (for example, "0,1") rather than a
+# unique per-process rank.
+proc_rank = next(
+    (
+        os.getenv(env_name) for env_name in (
+            'PROC_RANK',
+            'LOCAL_RANK',
+            'OMPI_COMM_WORLD_LOCAL_RANK',
+            'MPI_LOCALRANKID',
+            'SLURM_LOCALID',
+            'RANK',
+        ) if os.getenv(env_name) is not None
+    ),
+    str(os.getpid()),
+)
```
```python
# Get the first input to determine shape and name
input_name = onnx_model.graph.input[0].name

# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(onnx_model.graph.input[0].type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
    # NLP model: batch x sequence - need to specify all inputs with same batch and seq length
    seq_len = getattr(self._args, 'seq_length', 512)
    shapes_list = []
    for inp in onnx_model.graph.input:
```
ONNX `graph.input` may include initializers/weights (depending on how the model was saved), so `graph.input[0]` is not guaranteed to be a real runtime input tensor. This can cause incorrect shape detection and invalid `--optShapes`. Consider filtering out inputs whose names appear in `graph.initializer` (and/or using the exporter's known input names) before selecting the first real input.
Suggested replacement for the quoted block:

```python
# Filter out initializer-backed graph inputs; ONNX graph.input may include weights/constants.
initializer_names = {initializer.name for initializer in onnx_model.graph.initializer}
runtime_inputs = [inp for inp in onnx_model.graph.input if inp.name not in initializer_names]
if not runtime_inputs:
    logger.error(f'No runtime inputs found in exported ONNX model: {onnx_path}')
    return False

# Get the first real runtime input to determine shape and name
first_input = runtime_inputs[0]
input_name = first_input.name

# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(first_input.type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
    # NLP model: batch x sequence - need to specify all inputs with same batch and seq length
    seq_len = getattr(self._args, 'seq_length', 512)
    shapes_list = []
    for inp in runtime_inputs:
```
```python
choices=['in-house', 'huggingface'],
default='in-house',
required=False,
help='Source of the model: inhouse (default) or huggingface.',
```
The help text says `inhouse` but the CLI choice/value is `in-house`. Align the help text with the actual accepted value to avoid confusing users.
```diff
-help='Source of the model: inhouse (default) or huggingface.',
+help='Source of the model: in-house (default) or huggingface.',
```
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
```python
device_map: Optional[str] = None,
config: Optional[PretrainedConfig] = None,
**kwargs
) -> Tuple[PreTrainedModel, PretrainedConfig, AutoTokenizer]:
```
`tokenizer` can be `None` when tokenizer loading fails, but the return type annotation claims it is always `AutoTokenizer`. Update the return type to reflect optionality (e.g., `Optional[AutoTokenizer]`) so callers and type-checking/tests don't rely on a tokenizer always being present.
```diff
-) -> Tuple[PreTrainedModel, PretrainedConfig, AutoTokenizer]:
+) -> Tuple[PreTrainedModel, PretrainedConfig, Optional[AutoTokenizer]]:
```
```python
tokenizer = None
try:
    logger.info('Loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(model_identifier, trust_remote_code=True, **load_kwargs)
except Exception as e:
    logger.warning(f'Could not load tokenizer: {e}. Continuing without tokenizer.')
```
```python
    f'({self._get_model_size(model):.2f}M parameters)'
)

return model, config, tokenizer
```
```python
if not self.identifier:
    raise ValueError('Model identifier must be provided.')
```
This error message likely breaks the newly added unit test that expects the message to match `identifier must be provided` (case-sensitive substring match). Either adjust the test expectation or change the raised message to match the intended contract; keeping the message stable and consistent is preferable since it becomes a public-ish validation surface.
```python
raise ValueError(f"Invalid model source '{self.source}'.Must be 'in-house' or 'huggingface'.")

# Validate torch_dtype
valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
if self.torch_dtype not in valid_dtypes:
    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'.Must be one of {valid_dtypes}.")
```
Both error strings are missing a space after the period (`'.Must'`). Add the missing space so the messages are readable and consistent with other validation errors.
```diff
-raise ValueError(f"Invalid model source '{self.source}'.Must be 'in-house' or 'huggingface'.")
+raise ValueError(f"Invalid model source '{self.source}'. Must be 'in-house' or 'huggingface'.")
 # Validate torch_dtype
 valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
 if self.torch_dtype not in valid_dtypes:
-    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'.Must be one of {valid_dtypes}.")
+    raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'. Must be one of {valid_dtypes}.")
```
```python
choices=['in-house', 'huggingface'],
default='in-house',
required=False,
help='Source of the model: inhouse (default) or huggingface.',
```
Same as ORT: the help text references `inhouse` while the actual choice is `in-house`. Align the wording with the parser choices.
```diff
-help='Source of the model: inhouse (default) or huggingface.',
+help='Source of the model: in-house (default) or huggingface.',
```
```python
output_dir = f'/tmp/tensorrt_onnx_rank_{proc_rank}'
os.makedirs(output_dir, exist_ok=True)
```
Hard-coding exports to `/tmp` can be problematic in containerized/locked-down environments (noexec, limited disk, or different temp roots) and makes cleanup harder. Prefer using an existing benchmark cache/output directory if available in this benchmark (similar to ORT's `__model_cache_path`) or `tempfile.mkdtemp()` under a configurable base directory.
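A minimal sketch of the `tempfile.mkdtemp()` approach the comment suggests, assuming a `proc_rank` string is already computed elsewhere; the `SB_ONNX_EXPORT_DIR` env var name is hypothetical, not part of the PR:

```python
import os
import tempfile

# Hypothetical override point: let operators choose the export root;
# fall back to the system temp directory instead of hard-coding /tmp.
base_dir = os.getenv('SB_ONNX_EXPORT_DIR', tempfile.gettempdir())
os.makedirs(base_dir, exist_ok=True)

proc_rank = '0'  # stand-in for the per-process rank computed elsewhere

# mkdtemp creates a unique directory per call, so concurrent processes
# exporting the same model cannot collide even with identical prefixes.
output_dir = tempfile.mkdtemp(prefix=f'tensorrt_onnx_rank_{proc_rank}_', dir=base_dir)
print(output_dir)
```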
```python
del dummy_input
torch.cuda.empty_cache()
return file_name
```
The new `export_huggingface_model()` introduces multiple branches (vision vs NLP, dynamic-axes behavior, and external-data conversion for >2GB) but there is no targeted unit test coverage shown for this method. Consider adding mocked unit tests that validate: (1) correct input/output names for the vision and NLP paths, and (2) that external-data conversion is invoked when the size threshold is exceeded (this can be done by mocking parameter sizing and the ONNX helpers).
Suggested addition:

```python
_ONNX_EXTERNAL_DATA_THRESHOLD_BYTES = 2 * 1024 * 1024 * 1024

def _get_model_parameter_size_bytes(self, model):
    """Return the total serialized parameter size in bytes for a model.

    This helper is intentionally isolated so unit tests can mock parameter
    shapes/sizes without performing a real ONNX export.

    Args:
        model: Model instance exposing ``parameters()``.

    Returns:
        int: Total parameter size in bytes.
    """
    total_size = 0
    for parameter in model.parameters():
        total_size += parameter.nelement() * parameter.element_size()
    return total_size

def _should_use_external_data_format(self, model):
    """Return whether ONNX external-data format should be used.

    Args:
        model: Model instance exposing ``parameters()``.

    Returns:
        bool: True when the model size exceeds the ONNX 2GB threshold.
    """
    return self._get_model_parameter_size_bytes(model) > self._ONNX_EXTERNAL_DATA_THRESHOLD_BYTES

def _build_huggingface_export_config(self, model, batch_size=1, seq_length=512):
    """Build dummy input and ONNX I/O metadata for HuggingFace export.

    This helper extracts the vision-vs-NLP branch logic into a directly
    testable unit so mocked tests can validate input/output names and
    dynamic axes without requiring a full export.

    Args:
        model: HuggingFace model instance to export.
        batch_size (int): Batch size of input. Defaults to 1.
        seq_length (int): Sequence length of input. Defaults to 512.

    Returns:
        tuple: (dummy_input, input_names, output_names, dynamic_axes)
    """
    config = getattr(model, 'config', None)
    model_type = getattr(config, 'model_type', '')
    # Vision models typically consume pixel_values with NCHW layout.
    if model_type in ('vit', 'swin', 'convnext', 'beit', 'deit', 'resnet', 'detr'):
        dummy_input = torch.randn((batch_size, 3, 224, 224), device='cuda')
        input_names = ['pixel_values']
        output_names = ['output']
        dynamic_axes = {
            'pixel_values': {0: 'batch_size'},
            'output': {0: 'batch_size'},
        }
        return dummy_input, input_names, output_names, dynamic_axes

    # Default HuggingFace NLP-style export.
    dummy_input = torch.ones((batch_size, seq_length), dtype=torch.int64, device='cuda')
    input_names = ['input_ids']
    output_names = ['output']
    dynamic_axes = {
        'input_ids': {0: 'batch_size', 1: 'seq_length'},
        'output': {0: 'batch_size'},
    }
    return dummy_input, input_names, output_names, dynamic_axes
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
```python
# Handle device mapping for large models
if device_map:
    model_kwargs['device_map'] = device_map
elif device == 'cuda' and torch.cuda.is_available():
    # Don't set device_map if device is explicitly cuda
    pass
elif device != 'cpu':
    model_kwargs['device_map'] = device

# Pass pre-downloaded config to from_pretrained so any overrides take effect
if config is not None:
    model_kwargs['config'] = config

try:
    model = AutoModel.from_pretrained(model_identifier, **model_kwargs)
except ValueError:
    logger.info('AutoModel failed, trying AutoModelForCausalLM...')
    model = AutoModelForCausalLM.from_pretrained(model_identifier, **model_kwargs)

# Move to device if not using device_map
if not device_map and device != 'auto':
    model = model.to(device)
```
The decision to call `model.to(device)` is based on the argument `device_map`, but `model_kwargs['device_map']` can be set even when `device_map` (the arg) is `None` (e.g., when `device != 'cpu'` and CUDA is unavailable). In that case, `from_pretrained(..., device_map=...)` returns a dispatched model and calling `.to(...)` can error. Track the effective device map used (e.g., `effective_device_map = model_kwargs.get('device_map')`) and only call `.to(device)` when no device map was actually passed.
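A minimal sketch of the fix this comment describes, with the placement decision factored into a pure function so it can be checked without `transformers` (the function name `resolve_placement` and the `cuda_available` parameter are illustrative, not part of the PR):

```python
def resolve_placement(device_map, device, cuda_available):
    """Decide the effective device_map kwargs and whether .to(device) should run."""
    model_kwargs = {}
    if device_map:
        model_kwargs['device_map'] = device_map
    elif device == 'cuda' and cuda_available:
        pass  # load normally, then move with .to('cuda')
    elif device != 'cpu':
        # e.g. device='cuda' without CUDA available, or device='auto'
        model_kwargs['device_map'] = device
    # Key point: inspect the kwargs actually passed to from_pretrained,
    # not the original device_map argument.
    effective_device_map = model_kwargs.get('device_map')
    should_move = effective_device_map is None and device != 'auto'
    return model_kwargs, should_move

# device='cuda' but CUDA unavailable: a device_map IS passed, so .to() must be skipped,
# even though the device_map *argument* was None.
kwargs, move = resolve_placement(None, 'cuda', cuda_available=False)
print(kwargs, move)  # {'device_map': 'cuda'} False
```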
```python
    ValueError: If dtype string is invalid.
"""
dtype_map = {
    'float32': torch.float32,
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'int8': torch.int8,
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16,
}

if dtype_str.lower() not in dtype_map:
    raise ValueError(f"Invalid dtype '{dtype_str}'.Must be one of {list(dtype_map.keys())}")

return dtype_map[dtype_str.lower()]
```
Allowing `torch_dtype='int8'` and mapping it to `torch.int8` is misleading: `from_pretrained(..., torch_dtype=torch.int8)` generally isn't a supported way to load int8 weights in Transformers (int8 inference typically requires dedicated quantization flows/backends). Consider rejecting int8 in `_get_torch_dtype` (or in `ModelSourceConfig`) and reserving int8 for post-export quantization (as you already do for ORT), or implement a supported HF quantization path explicitly.
Suggested replacement:

```python
    ValueError: If dtype string is invalid or unsupported for standard HF loading.
"""
normalized_dtype = dtype_str.lower()
if normalized_dtype == 'int8':
    raise ValueError(
        "Unsupported dtype 'int8' for Hugging Face model loading via torch_dtype. "
        'Use a dedicated quantization/loading path for int8 models or apply int8 quantization '
        'after export.'
    )
dtype_map = {
    'float32': torch.float32,
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16,
}
if normalized_dtype not in dtype_map:
    raise ValueError(f"Invalid dtype '{dtype_str}'.Must be one of {list(dtype_map.keys())}")
return dtype_map[normalized_dtype]
```
```python
# Export to ONNX; for large models (>2GB), use external data format
model_size_gb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024**3)
use_external_data = model_size_gb > 2.0

if use_external_data:
    logger.info(f'Model size is {model_size_gb:.2f}GB, using external data format for ONNX export')

torch.onnx.export(
    wrapped_model,
    export_args,
    file_name,
    opset_version=14,
    do_constant_folding=True,
    input_names=input_names,
    output_names=['output'],
    dynamic_axes=dynamic_axes,
)
```
For models larger than ~2GB, `torch.onnx.export(...)` may fail before the later `convert_model_to_external_data(...)` step due to protobuf size limits, because the export itself still attempts to serialize initializers into the main ONNX file. For large-model support to be reliable, enable PyTorch's large/external-data export mode at export time (e.g., using the appropriate `large_model`/`use_external_data_format` option supported by your PyTorch version) rather than only converting after the fact.
```python
# Vision models typically have 4D input (batch, channels, height, width)
# NLP models typically have 2D input (batch, sequence)
if input_name == 'pixel_values' or len(onnx_model.graph.input[0].type.tensor_type.shape.dim) == 4:
    # Vision model: batch x channels x height x width
    input_shapes = f'{input_name}:{self._args.batch_size}x3x224x224'
else:
```
Hard-coding `3x224x224` will produce incorrect shapes for many vision models (e.g., models trained/evaluated at 384px, grayscale, or non-3-channel inputs). Since you're already inspecting the ONNX graph, prefer deriving H/W/C from the declared input shape when static, or (when dynamic/unknown) using model/config metadata (e.g., `image_size`, `num_channels`) with sensible defaults.
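A minimal sketch of the fallback chain this comment suggests, using a plain object in place of a real `transformers` config; the attribute names `image_size`/`num_channels` follow HF convention, and the defaults and function name are illustrative assumptions:

```python
from types import SimpleNamespace

def resolve_vision_input_shape(batch_size, config=None, default_size=224, default_channels=3):
    """Derive an NxCxHxW trtexec shape string from config metadata, with common defaults."""
    size = getattr(config, 'image_size', None) or default_size
    channels = getattr(config, 'num_channels', None) or default_channels
    # Some configs store image_size as (height, width); normalize to a pair.
    if isinstance(size, int):
        height, width = size, size
    else:
        height, width = size
    return f'pixel_values:{batch_size}x{channels}x{height}x{width}'

print(resolve_vision_input_shape(8))  # pixel_values:8x3x224x224
# A 384px model declared in its config overrides the default:
print(resolve_vision_input_shape(8, SimpleNamespace(image_size=384, num_channels=3)))  # pixel_values:8x3x384x384
```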
```python
@pytest.fixture
def loader(self):
    """Create a loader instance for testing."""
    return HuggingFaceModelLoader(cache_dir='/tmp/test_cache', token=None)
```
Using a hard-coded `/tmp/...` path makes the test non-portable (e.g., Windows runners) and can cause interference across parallel test runs. Prefer `tmp_path`/`tmp_path_factory` to generate an isolated cache directory per test.
Adds support for loading and benchmarking models from HuggingFace Hub across the ORT and TensorRT inference micro-benchmarks. Users can run any compatible HF-hosted model through the existing benchmark harness using `--model_source huggingface --model_identifier <org/model>`.
SuperBench previously supported only in-house model definitions with hardcoded architectures, so adding new models required code changes. This PR allows benchmarking any compatible HuggingFace model with a CLI flag change, including gated models via `HF_TOKEN`.
Key Changes
New modules:
- `HuggingFaceModelLoader` — downloads, caches, and loads models from HF Hub. Estimates parameter count from the model config (a few KB) and checks GPU memory before downloading full weights, avoiding failed multi-GB downloads.
- `ModelSourceConfig` — dataclass for model source configuration (in-house / huggingface), dtype, revision, auth token, and device mapping.
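Based on the validation behavior discussed in the review comments above, the dataclass can be sketched roughly as follows; field names beyond `source`, `identifier`, and `torch_dtype`, and the exact field defaults, are assumptions rather than the PR's actual code:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ModelSourceConfig:
    """Sketch of the model-source configuration discussed in this PR."""
    source: str = 'in-house'
    identifier: Optional[str] = None
    torch_dtype: str = 'float32'
    token: Optional[str] = None
    loader_kwargs: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Error messages mirror the ones quoted in the review (with the
        # missing-space typo fixed, as one comment requests).
        if self.source not in ('in-house', 'huggingface'):
            raise ValueError(f"Invalid model source '{self.source}'. Must be 'in-house' or 'huggingface'.")
        if not self.identifier:
            raise ValueError('Model identifier must be provided.')
        valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8']
        if self.torch_dtype not in valid_dtypes:
            raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'. Must be one of {valid_dtypes}.")

cfg = ModelSourceConfig(source='huggingface', identifier='bert-base-uncased', torch_dtype='float16')
print(cfg.torch_dtype)  # float16
```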
Micro-benchmarks (inference):
ORT inference — Downloads HF model → exports to ONNX → runs ORT inference. Handles both vision (pixel_values) and NLP (input_ids) inputs
automatically.
TensorRT inference — Same flow: download → ONNX export → trtexec engine build → inference. Includes dynamic input shape detection from the
exported ONNX graph.
ONNX exporter — New export_huggingface_model() method with vision/NLP auto-detection, dynamic axes, and external data support for large models
(>2GB).
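The two size checks described above (the config-based memory-fit estimate and the >2GB external-data threshold) reduce to the same arithmetic. A rough sketch, where the helper names and the illustrative parameter counts are assumptions, not SuperBench's actual code:

```python
DTYPE_BYTES = {'float32': 4, 'float16': 2, 'bfloat16': 2, 'int8': 1}

def estimated_size_gb(num_parameters, torch_dtype='float32'):
    """Approximate serialized weight size from parameter count and dtype width."""
    return num_parameters * DTYPE_BYTES[torch_dtype] / (1024 ** 3)

def needs_external_data(num_parameters, torch_dtype='float32', threshold_gb=2.0):
    """ONNX protobufs cap out near 2GB, so larger models need external data files."""
    return estimated_size_gb(num_parameters, torch_dtype) > threshold_gb

# bert-base-uncased has roughly 110M parameters: ~0.41GB in fp32, well under 2GB.
print(needs_external_data(110_000_000, 'float32'))  # False
# A 7B-parameter model in fp16 is ~13GB and must use the external data format.
print(needs_external_data(7_000_000_000, 'float16'))  # True
```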
Testing
Usage
Training benchmark
ORT inference

```shell
python examples/benchmarks/ort_inference_performance.py \
    --model_source huggingface --model_identifier bert-base-uncased
```

TensorRT inference

```shell
python examples/benchmarks/tensorrt_inference_performance.py \
    --model_source huggingface --model_identifier microsoft/resnet-50
```

Gated models

```shell
export HF_TOKEN=hf_xxxxx
```