GitHub - liurui-software/gpu-observe: Using OpenTelemetry to Monitor NVIDIA GPUs

Using OpenTelemetry to Monitor NVIDIA GPUs

The common way to monitor NVIDIA GPUs with OpenTelemetry is DCGM Exporter. However, DCGM Exporter appears to have compatibility issues: it places rather strict requirements on hardware, software, and driver versions, and it does not run properly on platforms such as Windows WSL2.

In contrast, lightweight tools such as nvitop have consistently worked well. nvitop is built on NVML (NVIDIA Management Library). Following nvitop's approach, I wrote a small application that collects GPU metrics and sends them to OpenTelemetry. It is deliberately lightweight: at present it consists of a single source file (agent.py) and can be run directly.
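
To make the NVML-based approach concrete, here is a minimal sketch of reading per-GPU values through the pynvml bindings (distributed as the nvidia-ml-py package). It is an illustration only and not the code in agent.py: the GpuSample class, the sample_gpus function, and the optional nvml parameter are hypothetical names, and it assumes pynvml is the binding in use.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical container, not from agent.py)."""
    index: int
    utilization_pct: int      # GPU utilization in percent
    memory_used_bytes: int
    memory_total_bytes: int
    temperature_c: int
    power_watts: float


def sample_gpus(nvml=None):
    """Read one sample per GPU through NVML.

    `nvml` is any object exposing the pynvml function names. It defaults to
    the real pynvml module, so a stub can be passed in on machines without a GPU.
    """
    if nvml is None:
        import pynvml as nvml  # assumption: provided by the nvidia-ml-py package

    nvml.nvmlInit()
    try:
        samples = []
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            power_mw = nvml.nvmlDeviceGetPowerUsage(handle)  # NVML reports milliwatts
            samples.append(
                GpuSample(i, util.gpu, mem.used, mem.total, temp, power_mw / 1000.0)
            )
        return samples
    finally:
        # Always release NVML, even if one of the reads fails.
        nvml.nvmlShutdown()
```

On a machine with an NVIDIA driver installed, calling sample_gpus() with no arguments would return one GpuSample per visible GPU.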

Before running the application, install the required Python packages:

pip install -r requirements.txt

You can configure environment variables to send monitoring data via the OTLP/gRPC protocol to any backend that supports OpenTelemetry, or to an OpenTelemetry Collector.

By default, the metrics are sent to an application listening on port 4317 on the local machine. Simply run:

python agent.py

Or specify the address of your backend server (or OpenTelemetry Collector) before running:

export OTEL_EXPORTER_OTLP_ENDPOINT="http://<OTel Backend>:4317"
python agent.py
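
For readers curious how the endpoint setting is typically consumed, the sketch below shows one way a Python program could resolve OTEL_EXPORTER_OTLP_ENDPOINT and build an OTLP/gRPC metric pipeline with the OpenTelemetry Python SDK. It is a sketch under assumptions, not the code in agent.py: resolve_endpoint and build_meter_provider are hypothetical helper names, and the 10-second export interval is an arbitrary choice. The SDK exporter also reads the same environment variable by itself when no endpoint is passed.

```python
import os

# Default OTLP/gRPC endpoint defined by the OpenTelemetry specification.
DEFAULT_OTLP_GRPC_ENDPOINT = "http://localhost:4317"


def resolve_endpoint(environ=None):
    """Return the OTLP endpoint from the environment, or the local default."""
    env = os.environ if environ is None else environ
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_OTLP_GRPC_ENDPOINT


def build_meter_provider(endpoint, interval_ms=10_000):
    """Create a MeterProvider that periodically exports metrics over OTLP/gRPC.

    Imports are local so that resolve_endpoint stays usable without the SDK.
    Requires opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc.
    """
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

    # Plain http:// endpoints are treated as unencrypted gRPC connections.
    exporter = OTLPMetricExporter(endpoint=endpoint, insecure=endpoint.startswith("http://"))
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=interval_ms)
    return MeterProvider(metric_readers=[reader])
```

With this shape, build_meter_provider(resolve_endpoint()) would target the local collector unless the environment variable points elsewhere.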

I have also prepared a Grafana Dashboard configuration (see dashboard.json) that includes the following charts:

GPU Usage %
GPU Memory Usage
GPU Temperature
GPU Power Consumption
Per Process GPU Memory Usage
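
The per-process chart relies on data that NVML exposes per device. As a hedged illustration, assuming the pynvml bindings, per-process GPU memory could be gathered roughly as follows. The function name gpu_memory_by_process and its return shape are hypothetical, only compute processes are listed, and agent.py may collect this differently.

```python
def gpu_memory_by_process(nvml=None):
    """Return {gpu_index: {pid: used_bytes}} for compute processes on each GPU.

    `nvml` defaults to the real pynvml module; a stub exposing the same
    function names can be passed in for testing without a GPU.
    """
    if nvml is None:
        import pynvml as nvml  # assumption: provided by the nvidia-ml-py package

    nvml.nvmlInit()
    try:
        usage = {}
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            procs = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
            # usedGpuMemory can be None when the driver does not report it.
            usage[i] = {p.pid: (p.usedGpuMemory or 0) for p in procs}
        return usage
    finally:
        nvml.nvmlShutdown()
```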

For users who do not already have an OpenTelemetry backend, I also provide a Docker Compose configuration (see docker-compose) that launches a local OpenTelemetry Collector container, a Prometheus container, and a Grafana container for immediate use.

Repository files:

docker-compose
LICENSE
README.md
agent.py
dashboard.json
requirements.txt
