aborting stress test leaves GPU at 100% and in P0 #14

@sherwoac

Hi.

I compiled gst for sm_120a and ran it on Ubuntu 24.

I started gst and pressed Ctrl+C after a while (output below).
Afterwards the GPU stayed at 100% utilization in the P0 state, with no running processes listed by nvidia-smi. I had to reboot to clear it. This is repeatable.

adam@z10:~/CODE/GPUStressTest/build$ ./gst 1
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 1 CUDA Capable device(s)
./gst Done.
Device 0: "NVIDIA GeForce RTX 5090"
./gst done capturing GPU information.
DEBUG_MATRIX_SIZES: Checking matrix size only (no CUDA execution) for: T4
Initilizing T4 based test suite
GPU Memory: 31, memgb: 16


Device 0: "NVIDIA GeForce RTX 5090", PCIe: a
stress_tests[0].test_name FP16
P hsh
m 31864
n 38648
k 88304
ta 0
tb 1
B 0

***** STARTING TEST 0: FP16 On Device 0 NVIDIA GeForce RTX 5090
testing cublasLt
Allocate matrixSize Total Bytes A + B + C:  14915943040 
#### args: ta=N tb=T m=31864 n=38648 k=88304  alpha = (0x3f800000, 1) beta= (0x00000000, 0)
#### args: lda=31864 ldb=38648 ldc=31864 ldd=31864 loop=10
^^^^ CUDA : elapsed = 15.22 sec,  Gflops = 142896.701 
testing cublasLt pass
***** TEST FP16 On Device 0 NVIDIA GeForce RTX 5090
stress_tests[1].test_name C32
P ccc
m 11432
n 16424
k 61000
ta 0
tb 1
B 0

***** STARTING TEST 1: C32 On Device 0 NVIDIA GeForce RTX 5090
testing cublasLt
Allocate matrixSize Total Bytes A + B + C:  15095801344 
#### args: ta=N tb=T m=11432 n=16424 k=61000  alpha = (0x3f800000 1), (0x00000000 0) beta= (0x00000000 0), (0x00000000 0)
#### args: lda=11432 ldb=16424 ldc=11432 ldd=11432 loop=10
^C
adam@z10:~/CODE/GPUStressTest/build$ nvidia-smi
Thu Jul  3 14:19:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:0A:00.0 Off |                  N/A |
| 41%   57C    P0            129W /  575W |       2MiB /  32607MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
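
For anyone trying to reproduce this, here is a rough sketch of how the stuck state could be checked from a script instead of reading the nvidia-smi table by eye. It is not part of gst. The helper names (`parse_gpu_state`, `looks_stuck`, `query_gpu_state`) and the thresholds are my own assumptions for illustration. It only relies on the standard `nvidia-smi --query-gpu` CSV output.

```python
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu properties.
QUERY_FIELDS = "pstate,utilization.gpu,memory.used"


def parse_gpu_state(csv_line):
    """Parse one line of `--format=csv,noheader,nounits` output, e.g. 'P0, 100, 2'."""
    pstate, util, mem_used = [field.strip() for field in csv_line.split(",")]
    return {
        "pstate": pstate,
        "util_percent": int(util),
        "mem_used_mib": int(mem_used),
    }


def looks_stuck(state, util_threshold=90, mem_threshold_mib=100):
    """Heuristic (assumed thresholds): GPU is busy in P0 while almost no memory is in use."""
    return (
        state["pstate"] == "P0"
        and state["util_percent"] >= util_threshold
        and state["mem_used_mib"] <= mem_threshold_mib
    )


def query_gpu_state(index=0):
    """Ask nvidia-smi for the current state of one GPU (requires the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_state(result.stdout.strip().splitlines()[0])
```

Calling `looks_stuck(query_gpu_state())` a few seconds after the Ctrl+C should return `True` for the state shown in the table above (P0, 100%, 2 MiB used) and `False` once the GPU has returned to idle.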
