test: regression tests show higher memory, throughput and loss values for some settings and models #147

@kmehant

This regression run was done as part of adding kernels for GraniteMoeHybridForCausalLM; see #143.

Final regression plots, including past models

Memory (increased): [plot: compare-mem_nvidia_mem_reserved]

Loss: [plot: compare-train_loss]

Throughput (increased): [plot: compare-train_tokens_per_second]

Outliers

Three classes of outliers can be identified:

  1. increased throughput
  2. increased memory consumption
  3. increased loss
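To make the three classes concrete, here is a minimal sketch of how a single run could be labelled when comparing the previous and current benchmark results. The metric names are borrowed from the plot titles above and are assumed to match the benchmark's column names. The 5% relative threshold and the function name `classify_outliers` are hypothetical and are not taken from the benchmark tooling.

```python
# Hypothetical metric names, borrowed from the plot titles in this issue.
# Keys are assumed column names; values are the outlier classes listed above.
METRICS = {
    "train_tokens_per_second": "increased throughput",
    "mem_nvidia_mem_reserved": "increased memory consumption",
    "train_loss": "increased loss",
}


def classify_outliers(previous, current, rel_tol=0.05):
    """Return the outlier classes that apply to one run.

    previous, current: dicts mapping metric name -> value for the same run.
    rel_tol: assumed relative threshold (5%), not the benchmark's own value.
    """
    classes = []
    for metric, label in METRICS.items():
        old, new = previous.get(metric), current.get(metric)
        if old is None or new is None or old == 0:
            continue  # metric missing (e.g. a failed run) or no usable baseline
        if (new - old) / abs(old) > rel_tol:
            classes.append(label)
    return classes
```

A run can fall into more than one class, so the function returns a list rather than a single label.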

Loss regression

For the models ibm-granite/granite-3.0-3b-a800m-instruct and ibm-research/moe-7b-1b-active-shared-experts, all padding-free runs regressed, showing larger losses than in the previous benchmark. However, it's not clear whether padding-free is the cause, since other models in the benchmark set did not regress with padding-free enabled.
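One way to check whether the loss regression tracks padding-free or the model is to count regressed runs per (model, padding-free) group. The sketch below assumes each run is a dict with the hypothetical keys `model`, `padding_free`, `prev_loss` and `new_loss`. The 5% threshold is likewise an assumption, and none of these names come from the benchmark scripts.

```python
from collections import defaultdict


def loss_regression_by_group(runs, rel_tol=0.05):
    """Count regressed and total runs per (model, padding_free) group.

    runs: iterable of dicts with the hypothetical keys
          "model", "padding_free", "prev_loss", "new_loss".
    Returns a dict mapping (model, padding_free) -> (regressed, total).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [regressed, total]
    for run in runs:
        key = (run["model"], run["padding_free"])
        counts[key][1] += 1
        # A run counts as regressed if its loss grew by more than rel_tol.
        if run["new_loss"] > run["prev_loss"] * (1 + rel_tol):
            counts[key][0] += 1
    return {key: tuple(value) for key, value in counts.items()}
```

If only the padding-free groups of the two models above show regressions while the padding-free groups of other models do not, the cause is more likely specific to those models than to padding-free itself.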

Screenshot 2025-06-13 at 12 17 34 PM

All outliers

Screenshot 2025-06-13 at 12 21 36 PM

Additional failed runs compared to previous benchmark

Reason: out of memory (OOM)

Screenshot 2025-06-13 at 2 23 20 PM
