liurui-software/gpu-observe

Using OpenTelemetry to Monitor NVIDIA GPUs

If you want to monitor NVIDIA GPUs with OpenTelemetry, the common approach is to use DCGM Exporter. In my experience, however, DCGM Exporter has compatibility issues: it places rather strict requirements on hardware, software, and driver versions, and it does not run properly on platforms such as WSL2 on Windows.

In contrast, lightweight tools such as nvitop have consistently worked well. nvitop is built on NVML (NVIDIA Management Library). Following the same approach, I wrote a small application that collects GPU metrics through NVML and exports them via OpenTelemetry. The utility is deliberately lightweight: at present it consists of a single source file (agent.py) that can be run directly.
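To make the NVML-based approach concrete, here is a minimal sketch of reading one sample per GPU. It is not taken from agent.py: the `GpuSample` class and the `collect_gpu_samples` and `main` names are hypothetical, and the code assumes the `pynvml` module (from the nvidia-ml-py package) is the NVML binding in use. The NVML handle is passed in as a parameter so the function can be tried with a stub on a machine without an NVIDIA driver.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical container, not from agent.py)."""
    index: int
    name: str
    utilization_percent: int
    memory_used_bytes: int
    memory_total_bytes: int
    temperature_celsius: int
    power_watts: float


def collect_gpu_samples(nvml):
    """Read one sample per GPU through an NVML binding such as pynvml."""
    samples = []
    for index in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        name = nvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        util = nvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
        mem = nvml.nvmlDeviceGetMemoryInfo(handle)  # .used and .total are bytes
        temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
        power_mw = nvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
        samples.append(
            GpuSample(index, name, util.gpu, mem.used, mem.total,
                      temp, power_mw / 1000.0)
        )
    return samples


def main():
    # Assumption: nvidia-ml-py is installed and an NVIDIA driver is present.
    import pynvml

    pynvml.nvmlInit()
    try:
        for sample in collect_gpu_samples(pynvml):
            print(sample)
    finally:
        pynvml.nvmlShutdown()
```

Calling `main()` on a machine with a working driver should print one `GpuSample` per GPU. Some devices do not support every query (power readings in particular), so a real collector would likely catch NVML errors per metric.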

Before running the application, install the required Python packages:

pip install -r requirements.txt

You can configure environment variables to send monitoring data via the OTLP/gRPC protocol to any backend that supports OpenTelemetry, or to an OpenTelemetry Collector.

By default, metrics are sent to port 4317 (the standard OTLP/gRPC port) on the local machine. Simply run:

python agent.py

Or specify the address of your backend server (or OpenTelemetry Collector) before running:

export OTEL_EXPORTER_OTLP_ENDPOINT="http://<OTel Backend>:4317"
python agent.py
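The endpoint selection described above can be sketched as follows. The OpenTelemetry OTLP exporter normally reads `OTEL_EXPORTER_OTLP_ENDPOINT` on its own, so agent.py may not contain anything like this. The `resolve_otlp_endpoint` helper is a hypothetical pre-flight check that shows the fallback to the local default and rejects values that lack an `http://` or `https://` scheme.

```python
import os
from urllib.parse import urlsplit

# Default OTLP/gRPC endpoint: port 4317 on the local machine.
DEFAULT_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(environ=None):
    """Return the OTLP endpoint from the environment, or the local default.

    Hypothetical helper for illustration; not part of agent.py.
    """
    environ = os.environ if environ is None else environ
    endpoint = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        return DEFAULT_ENDPOINT

    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(
            f"OTEL_EXPORTER_OTLP_ENDPOINT must look like http://host:4317, "
            f"got {endpoint!r}"
        )
    return endpoint


if __name__ == "__main__":
    print("Sending metrics to", resolve_otlp_endpoint())
```

A check like this mainly catches the common mistake of setting the variable to `host:4317` without a scheme.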

I have also prepared a Grafana Dashboard configuration (see dashboard.json) that includes the following charts:

  • GPU Usage %
  • GPU Memory Usage
  • GPU Temperature
  • GPU Power Consumption
  • Per Process GPU Memory Usage
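As an illustration of what could feed the per-process chart, the sketch below aggregates per-process GPU memory into MiB keyed by PID. It assumes process records shaped like those returned by pynvml's `nvmlDeviceGetComputeRunningProcesses`, which expose `pid` and `usedGpuMemory` (in bytes, or `None` when the driver cannot report it). The `per_process_memory_mib` name is hypothetical, and the metric names and units actually used by agent.py and dashboard.json may differ.

```python
BYTES_PER_MIB = 1024 * 1024


def per_process_memory_mib(processes):
    """Sum GPU memory per PID, in MiB, skipping entries with no reading.

    `processes` is any iterable of objects with `pid` and `usedGpuMemory`
    attributes (hypothetical helper, not taken from agent.py).
    """
    usage = {}
    for proc in processes:
        used = proc.usedGpuMemory
        if used is None:  # memory usage not available for this process
            continue
        usage[proc.pid] = usage.get(proc.pid, 0.0) + used / BYTES_PER_MIB
    return usage
```

Summing by PID matters on multi-GPU machines, where the same process can appear once per device it uses.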

For users who do not already have an OpenTelemetry backend, I also provide a Docker Compose file that launches a local OpenTelemetry Collector container, a Prometheus container, and a Grafana container for immediate use.
