GPU-Health-eXpert
Updated Oct 30, 2025 (C)
Enhanced GPU throttle diagnostic for DGX Spark (GB10): NVML direct telemetry, throttle cause decoder, PCIe link monitoring, baseline drift detection, timeline capture.
gpu thrashing
NVIDIA GPU Unified Memory diagnostic tool: architecture-aware, measurement-based, with PCIe/coherent transport detection.
Universal GPU setup for PyTorch and JAX — DirectML, ROCm, CUDA, MPS, CPU. Auto-detects hardware, fixes HSA_OVERRIDE_GFX_VERSION for AMD RX 5700 XT. Beats devicetorch + torchruntime combined.
NVIDIA GPU validation: PCIe transport, Unified Memory prefetch, SGEMM compute, drift detection.
Detect, install, and troubleshoot GPU support for PyTorch and JAX across NVIDIA, AMD, and Apple Silicon with one command
ML research control plane — experiment lifecycle, model registry, cloud training launcher
Cycle-accurate UMA fault latency and bandwidth measurement for NVIDIA GPUs. C and PTX. No Python. Pascal (SM 6.0) through Blackwell GB10 (SM 12.1).