test: Regression tests show increased use of memory causing 1 additional experiment to OOM over last tests #128

@kmehant

OOM experiments

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |

Failed experiments - number of GPUs not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |

Failed experiments - number of experts not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 4 | 2564.5 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
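
The two failure categories above can be sketched as a pre-flight check that flags invalid configurations before a run starts. This is a minimal sketch, not the benchmark's actual validation logic. The helper names `ep_degree` and `ep_failure_reason` are hypothetical. It assumes the ep degree can be read from the `-epN` suffix in `framework_config`, and that the GPU check runs before the expert check. The expert counts in the usage lines are placeholders, since the issue does not state them.

```python
import re
from typing import Optional


def ep_degree(framework_config: str) -> Optional[int]:
    """Read the ep degree from a name like 'moe-scattermoe-granite-ep4-padding-free'.

    Returns None when the name carries no '-epN' component (e.g. 'none').
    """
    match = re.search(r"-ep(\d+)(?:-|$)", framework_config)
    return int(match.group(1)) if match else None


def ep_failure_reason(
    framework_config: str, num_gpus: int, num_experts: int
) -> Optional[str]:
    """Return why a configuration would fail, or None if both checks pass.

    The order of the two checks is an assumption made for this sketch.
    """
    ep = ep_degree(framework_config)
    if ep is None:
        return None  # no expert parallelism requested
    if num_gpus % ep != 0:
        return "number of GPUs not divisible by ep degree"
    if num_experts % ep != 0:
        return "number of experts not divisible by ep degree"
    return None


# Placeholder expert counts (40 and 30) are illustrative only.
print(ep_failure_reason("moe-scattermoe-granite-ep4", num_gpus=2, num_experts=40))
print(ep_failure_reason("moe-scattermoe-granite-ep4", num_gpus=4, num_experts=30))
```

Running such a check while generating the experiment matrix would let these rows be skipped up front instead of being reported as failures.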

Delta from the previous run: experiments that newly OOM

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |

The regression tests were run as part of PR #126. The change in metrics may be attributable to transformers==4.49, but this needs investigation.
