The common way to monitor NVIDIA GPUs with OpenTelemetry is DCGM Exporter. In my experience, however, DCGM Exporter has compatibility issues: it places rather strict requirements on hardware, software, and driver versions, and it does not run properly on platforms such as Windows WSL2.
In contrast, lightweight tools such as nvitop have consistently worked well. nvitop is built on NVML (NVIDIA Management Library), so I followed the same approach and wrote a small application that simply collects metrics through NVML and sends them to OpenTelemetry. The utility is very lightweight: at present it consists of a single source file (agent.py) that can be run directly.
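To give a rough idea of what collecting metrics through NVML looks like, here is a minimal sketch. It is not the contents of agent.py. The names `GpuSample` and `read_gpu_samples` are made up for illustration, and the function assumes it is handed an NVML binding that exposes the usual `nvml*` calls, such as the `pynvml` module.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical container, for illustration only)."""
    index: int
    utilization_pct: float
    memory_used_bytes: int
    memory_total_bytes: int
    temperature_c: float
    power_w: float


def read_gpu_samples(nvml):
    """Take one sample per GPU.

    `nvml` is any object exposing the NVML binding functions used below,
    for example the pynvml module (assumed to be installed separately).
    """
    samples = []
    nvml.nvmlInit()
    try:
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            power_mw = nvml.nvmlDeviceGetPowerUsage(handle)  # NVML reports milliwatts
            samples.append(
                GpuSample(
                    index=i,
                    utilization_pct=float(util.gpu),
                    memory_used_bytes=mem.used,
                    memory_total_bytes=mem.total,
                    temperature_c=float(temp),
                    power_w=power_mw / 1000.0,
                )
            )
    finally:
        # Always release NVML, even if one of the queries raised.
        nvml.nvmlShutdown()
    return samples
```

With pynvml installed, `import pynvml` followed by `read_gpu_samples(pynvml)` would return one `GpuSample` per GPU. Passing the binding in as an argument keeps the sketch testable on machines without a GPU.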
Before running the application, install the required Python packages:
pip install -r requirements.txt
The agent exports metrics over the OTLP/gRPC protocol, and the destination is configured through environment variables. It can be any backend that supports OpenTelemetry, or an OpenTelemetry Collector.
By default, metrics are sent to port 4317 on the local machine. If your backend or Collector is listening there, simply run:
python agent.py
To send metrics elsewhere, set the address of your backend server (or OpenTelemetry Collector) before running:
export OTEL_EXPORTER_OTLP_ENDPOINT="http://<OTel Backend>:4317"
python agent.py
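The endpoint selection described above can be sketched in a few lines. This is only an illustration of the behavior, not code taken from agent.py: `resolve_otlp_endpoint` and `DEFAULT_ENDPOINT` are hypothetical names, and rejecting values without an `http://` or `https://` scheme is a strictness choice of this sketch. In practice the OpenTelemetry SDK's OTLP exporter reads `OTEL_EXPORTER_OTLP_ENDPOINT` on its own, so an agent may not need any such helper.

```python
import os
from urllib.parse import urlparse

# Port 4317 is the conventional OTLP/gRPC port.
DEFAULT_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(env=None):
    """Return the OTLP endpoint from the environment, or the local default.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip() or DEFAULT_ENDPOINT

    parsed = urlparse(endpoint)
    # Assumption of this sketch: insist on an explicit http/https scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"expected a URL such as {DEFAULT_ENDPOINT}, got {endpoint!r}")
    return endpoint
```

An unset or empty variable falls back to the local default, which matches running `python agent.py` without any configuration.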
I have also prepared a Grafana dashboard configuration (see dashboard.json) that includes the following charts:
- GPU Usage %
- GPU Memory Usage
- GPU Temperature
- GPU Power Consumption
- Per Process GPU Memory Usage
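The per-process chart is the least obvious of these, so here is a hedged sketch of how per-process memory could be read through NVML. It is an assumption about the approach, not the code in agent.py. `per_process_memory` is a hypothetical name, and `nvml` is expected to be a binding such as the `pynvml` module, whose process queries return entries with `pid` and `usedGpuMemory` fields.

```python
def per_process_memory(nvml, handle):
    """Return {pid: used_bytes} for processes using the GPU behind `handle`.

    `nvml` is an NVML binding such as the pynvml module, and `handle` is a
    device handle obtained from nvmlDeviceGetHandleByIndex.
    """
    usage = {}
    processes = list(nvml.nvmlDeviceGetComputeRunningProcesses(handle))
    processes += list(nvml.nvmlDeviceGetGraphicsRunningProcesses(handle))
    for proc in processes:
        used = proc.usedGpuMemory
        if used is None:
            # Some platforms do not report per-process memory; skip those entries.
            continue
        # A process may appear in both lists; keep the larger figure
        # instead of adding the two together (a choice made by this sketch).
        usage[proc.pid] = max(usage.get(proc.pid, 0), used)
    return usage
```

Each entry in the returned mapping could then be reported as one data point with the PID attached as an attribute, which is what lets a dashboard break the chart down by process.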
For users who do not already have an OpenTelemetry backend, I also provide a Docker Compose file that launches a local OpenTelemetry Collector container, a Prometheus container, and a Grafana container for immediate use.