id	model-benchmarks

Model Benchmarks

PyTorch Model Benchmarks

`model-benchmarks`

Introduction

Run training or inference tasks with single or half precision for deep learning models, including the following categories:

GPT: gpt2-small, gpt2-medium, gpt2-large and gpt2-xl
LLAMA: llama2-7b, llama2-13b, llama2-70b
MoE: mixtral-8x7b, mixtral-8x22b
BERT: bert-base and bert-large
LSTM
CNN, listed in torchvision.models, including:
- resnet: resnet18, resnet34, resnet50, resnet101, resnet152
- resnext: resnext50_32x4d, resnext101_32x8d
- wide_resnet: wide_resnet50_2, wide_resnet101_2
- densenet: densenet121, densenet169, densenet201, densenet161
- vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
- mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
- mobilenet: mobilenet_v2
- shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
- squeezenet: squeezenet1_0, squeezenet1_1
- others: alexnet, googlenet, inception_v3

For inference, supported percentiles include 50^th, 90^th, 95^th, 99^th, and 99.9^th.

New: Support fp8_hybrid and fp8_e4m3 precision for BERT models.

New: Deterministic Training Support SuperBench now supports deterministic training to ensure reproducibility across runs. This includes fixed seeds and deterministic algorithms. To enable deterministic training, use the following flags:

Flags:
- --enable_determinism: Enables deterministic computation for reproducible results.
- --deterministic_seed <seed>: Sets the seed for reproducibility (default: 42).
- --check_frequency <steps>: How often to record deterministic metrics (default: 100).
Environment Variables (set automatically by SuperBench when --enable_determinism is used):
- CUBLAS_WORKSPACE_CONFIG=:4096:8: Ensures deterministic behavior in cuBLAS. This can be overridden by setting it manually before running SuperBench.

Comparing Deterministic Results

To compare deterministic results between runs, use the standard result analysis workflow:

Run benchmark with --enable_determinism flag
Generate baseline: sb result generate-baseline --data-file results.jsonl --summary-rule-file rules.yaml
Compare future runs: sb result diagnosis --data-file new-results.jsonl --rule-file diagnosis-rule.yaml --baseline-file baseline.json

This allows configurable tolerance for floating-point differences via YAML rules.

Configuration Parameter Validation

When determinism is enabled, benchmark configuration parameters (batch_size, num_steps, deterministic_seed, etc.) are automatically recorded in the results file as deterministic_config_* metrics. The diagnosis rules enforce exact matching of these parameters between runs to ensure valid comparisons:

If any configuration parameter differs between runs, the diagnosis will flag it as a failure, ensuring you only compare runs with identical configurations.

Summary Rule Snippet for Determinism

Include the following rule in your summary rule file (used with sb result summary or sb result generate-baseline --summary-rule-file) to surface deterministic metrics in the results summary:

superbench:
  rules:
    model-benchmarks-deterministic:
      statistics:
        - mean
      categories: Deterministic
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*
        - model-benchmarks:.*/deterministic_step.*
        - model-benchmarks:.*/deterministic_config_.*
        - model-benchmarks:.*/return_code.*

This groups all deterministic outputs — loss fingerprints, activation means, check counts, step numbers, configuration parameters, and return codes — under the Deterministic category.

Diagnosis Rule Snippet for Determinism

Include the following rules in your diagnosis rule file (used with sb result diagnosis or sb result generate-baseline --diagnosis-rule-file) to detect Silent Data Corruption (SDC) and validate configuration consistency:

superbench:
  rules:
    deterministic_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Fingerprint
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*

    deterministic_config_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Config
      metrics:
        - model-benchmarks:.*/deterministic_config_batch_size.*
        - model-benchmarks:.*/deterministic_config_num_steps.*
        - model-benchmarks:.*/deterministic_config_num_warmup.*
        - model-benchmarks:.*/deterministic_config_deterministic_seed.*
        - model-benchmarks:.*/deterministic_config_check_frequency.*
        - model-benchmarks:.*/deterministic_config_seq_len.*
        - model-benchmarks:.*/deterministic_config_hidden_size.*
        - model-benchmarks:.*/deterministic_config_num_classes.*
        - model-benchmarks:.*/deterministic_config_input_size.*
        - model-benchmarks:.*/deterministic_config_num_layers.*
        - model-benchmarks:.*/deterministic_config_num_hidden_layers.*
        - model-benchmarks:.*/deterministic_config_num_attention_heads.*
        - model-benchmarks:.*/deterministic_config_intermediate_size.*

    deterministic_failure_rule:
      function: failure_check
      criteria: "lambda x: x != 0"
      categories: SDC-Failed
      metrics:
        - model-benchmarks:.*/return_code

SDC-Fingerprint (deterministic_rule): Flags any node where loss, activation mean, or check count has any variance from baseline (x != 0), indicating a potential SDC issue.
SDC-Config (deterministic_config_rule): Ensures all determinism configuration parameters (seed, batch size, sequence length, hidden size, etc.) are identical across nodes — any mismatch means the comparison is invalid.
SDC-Failed (deterministic_failure_rule): Uses failure_check to catch nodes where the determinism benchmark failed to run or returned a non-zero exit code.

For complete rule files covering all benchmark categories (micro-benchmarks, NCCL, GPU copy bandwidth, NVBandwidth, etc.), refer to the rule file documentation in Result Summary and Data Diagnosis.

Metrics

Name	Unit	Description
model-benchmarks/pytorch-${model_name}/${precision}_train_step_time	time (ms)	The average training step time with fp32/fp16 precision.
model-benchmarks/pytorch-${model_name}/${precision}_train_throughput	throughput (samples/s)	The average training throughput with fp32/fp16 precision per GPU.
model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time	time (ms)	The average inference step time with fp32/fp16 precision.
model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput	throughput (samples/s)	The average inference throughput with fp32/fp16 precision.
model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time_${percentile}	time (ms)	The n^th percentile inference step time with fp32/fp16 precision.
model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput_${percentile}	throughput (samples/s)	The n^th percentile inference throughput with fp32/fp16 precision.

Megatron Model benchmarks

`megatron-gpt`

Introduction

Run GPT pretrain tasks with float32, float16, bfloat16 precisions with Megatron-LM or Megatron-DeepSpeed.

tips: batch_size in this benchmark represents global batch size, the batch size on each GPU instance is micro_batch_size.

Metrics

Name	Unit	Description
megatron-gpt/${precision}_train_step_time	time (ms)	The average training step time per iteration.
megatron-gpt/${precision}_train_throughput	throughput (samples/s)	The average training throughput per iteration.
megatron-gpt/${precision}_train_tflops	tflops/s	The average training tflops per second per iteration.
megatron-gpt/${precision}_train_mem_allocated	GB	The average GPU memory allocated per iteration.
megatron-gpt/${precision}_train_max_mem_allocated	GB	The average maximum GPU memory allocated per iteration.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model Benchmarks

PyTorch Model Benchmarks