Failure on A100 Card #5

@karthik86248

Running the GPUStress tool on an A100 card reports the error below. However, the card appears to be healthy and working correctly according to the hardware tests performed by our hardware vendor.

Command Executed: ./gst -T=1
Output:
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite
TYPE=2
GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe

math_type 10

args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344

args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904

loop=1
***** TEST INT8 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 24 seconds
***** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe

math_type 0

args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032

args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928

loop=1
***** TEST FP16 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 17 seconds
***** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe

math_type 0

args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064

std::exception: out of memory
testing cublasLt fail
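
As a rough sanity check on the numbers in the log, the sketch below estimates how much device memory each test would need. It rests on two assumptions that are not confirmed by the tool's source: that the `matrixSizeA/B/C` values are element counts (they do equal m*k, k*n, and m*n for the INT8 and FP16 runs above), and that elements occupy 1, 2, and 4 bytes for INT8, FP16, and TF32 respectively. The names `TESTS`, `BYTES_PER_ELEMENT`, and `footprint_gib` are made up for this illustration.

```python
# Element counts copied from the "args: matrixSizeA ..." lines in the log above.
TESTS = {
    "INT8": (34878833064, 16672535724, 28662757344),
    "FP16": (13000629632, 13114567936, 13557084032),
    "TF32": (18964675584, 6192059904, 14359400064),
}

# ASSUMPTION: storage width per element for each test. This is a guess,
# not something taken from the GPUStress source.
BYTES_PER_ELEMENT = {"INT8": 1, "FP16": 2, "TF32": 4}

GIB = 1024 ** 3


def footprint_gib(name):
    """Estimated combined size of matrices A, B, and C in GiB."""
    return sum(TESTS[name]) * BYTES_PER_ELEMENT[name] / GIB


if __name__ == "__main__":
    for name in TESTS:
        print(f"{name}: {footprint_gib(name):.1f} GiB")
```

Under these assumptions, the INT8 and FP16 tests come to roughly 75 and 74 GiB, which fits in the 79 GB the tool reports, while the TF32 test comes to roughly 147 GiB. If the 4-byte assumption holds, the out-of-memory exception may point to how the TF32 matrices are sized for this card rather than to a hardware fault.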
