https://github.com/beuth-erdelt/prometheus_nvlink_exporter
This script collects information about NVLink and PCI bus traffic of NVIDIA GPUs. Results are published as Prometheus metrics over HTTP.
- Host: GitHub
- URL: https://github.com/beuth-erdelt/prometheus_nvlink_exporter
- Owner: Beuth-Erdelt
- License: mit
- Created: 2019-05-23T09:15:57.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-07-29T09:31:28.000Z (almost 6 years ago)
- Last Synced: 2025-03-28T05:12:21.381Z (3 months ago)
- Topics: gpu, nvidia-cuda, nvidia-docker, nvidia-gpu, nvlink, prometheus, prometheus-exporter, python
- Language: Python
- Size: 10.7 KB
- Stars: 6
- Watchers: 3
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# prometheus_nvlink_exporter
This script collects information about NVLink and PCI bus traffic of NVIDIA GPUs.
Results are published as Prometheus metrics over HTTP.

## Usage
We also provide a Dockerfile.
It is based on NVIDIA's CUDA container, adds a Python installation, and runs the exporter script.

The basic usage is `docker run -d ...`.
The metrics can be scraped from port 8001.
The Docker image is compatible with Kubernetes environments.
## Prerequisites
The Docker image requires Docker, NVLink-capable NVIDIA GPUs, and the basic NVIDIA drivers to be installed.
The script expects the NVLink counters of the GPUs to be configured via
```
nvidia-smi nvlink -sc 0bz
nvidia-smi nvlink -sc 1pz
```
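If the counters should be configured from Python rather than from a shell, the two commands above could be wrapped as in the following sketch. The names `COUNTER_SETUP` and `configure_counters` are hypothetical and not part of the exporter; the sketch assumes `nvidia-smi` is on the `PATH` and that the caller has sufficient privileges.

```python
import subprocess

# The two setup commands from the README, as argument lists.
COUNTER_SETUP = [
    ["nvidia-smi", "nvlink", "-sc", "0bz"],
    ["nvidia-smi", "nvlink", "-sc", "1pz"],
]


def configure_counters(run=subprocess.run):
    """Run each setup command in order and raise if one of them fails.

    `run` is injectable so the function can be exercised without a GPU.
    """
    for cmd in COUNTER_SETUP:
        run(cmd, check=True)
```

Calling `configure_counters()` once before starting the exporter would then apply both settings.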
The script uses `nvidia-smi` and some Python libraries, in particular https://github.com/prometheus/client_python

## Working examples
In essence, the script runs `nvidia-smi` commands and transforms their output into a format that Prometheus can scrape.
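As a rough illustration of that transformation, the sketch below parses text in the two `nvidia-smi` output formats shown in the subsections that follow. The function names `parse_nvlink` and `parse_pci` and the regular expressions are assumptions made for this example, not the exporter's actual code.

```python
import re

# Hypothetical helpers; the names and patterns are illustrative assumptions.
GPU_RE = re.compile(r"^GPU (\d+):")
LINK_RE = re.compile(r"^\s*Link (\d+): Rx0: (\d+) KBytes, Tx0: (\d+) KBytes")


def parse_nvlink(text):
    """Map (gpu_id, link_id) -> (rx_kbytes, tx_kbytes)."""
    counters = {}
    gpu_id = None
    for line in text.splitlines():
        gpu = GPU_RE.match(line)
        if gpu:
            # Remember which GPU the following "Link" lines belong to.
            gpu_id = gpu.group(1)
            continue
        link = LINK_RE.match(line)
        if link and gpu_id is not None:
            link_id, rx, tx = link.groups()
            counters[(gpu_id, link_id)] = (int(rx), int(tx))
    return counters


def parse_pci(text):
    """Map gpu_id -> (rx_mb_per_s, tx_mb_per_s)."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#'-prefixed header lines.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        gpu_id, rx, tx = fields
        # Skip rows whose values are not plain integers.
        if not (rx.isdigit() and tx.isdigit()):
            continue
        rates[gpu_id] = (int(rx), int(tx))
    return rates


NVLINK_SAMPLE = """\
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-8dfc570f-9ee4-bdf1-abcd-192837465abc)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 100 KBytes, Tx0: 0 KBytes
"""

PCI_SAMPLE = """\
# gpu rxpci txpci
# Idx MB/s MB/s
1 0 0
2 0 0
"""

print(parse_nvlink(NVLINK_SAMPLE))
print(parse_pci(PCI_SAMPLE))
```

Keeping the IDs as strings is a deliberate choice here, because they end up as label values in the published metrics.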
### Collecting NVLink Information
This automatically runs `nvidia-smi nvlink -g 0`:
```
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-8dfc570f-9ee4-bdf1-abcd-192837465abc)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 100 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-29123255-8aab-d30e-abcd-192837465abc)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 50 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-7db3a1e6-6150-9c24-abcd-192837465abc)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-22ea33c7-5a76-9747-abcd-192837465abc)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
```

### Collecting PCI Information
This automatically runs `nvidia-smi dmon -s t -c 1`:
```
# gpu rxpci txpci
# Idx MB/s MB/s
1 0 0
2 0 0
```

### Publishing Metrics
The output is similar to:
```
# HELP gpu_nvlink_tx_kbytes Transmitted KBytes via NVLink
# TYPE gpu_nvlink_tx_kbytes gauge
gpu_nvlink_tx_kbytes{GPUID="0",LinkID="2"} 27598895329.0
gpu_nvlink_tx_kbytes{GPUID="0",LinkID="1"} 31602715771.0
gpu_nvlink_tx_kbytes{GPUID="4",LinkID="2"} 0.0
gpu_nvlink_tx_kbytes{GPUID="7",LinkID="0"} 0.0
gpu_nvlink_tx_kbytes{GPUID="4",LinkID="3"} 0.0
gpu_nvlink_tx_kbytes{GPUID="5",LinkID="1"} 0.0
gpu_nvlink_tx_kbytes{GPUID="0",LinkID="3"} 31602715771.0
gpu_nvlink_tx_kbytes{GPUID="5",LinkID="0"} 0.0
gpu_nvlink_tx_kbytes{GPUID="7",LinkID="2"} 0.0
gpu_nvlink_tx_kbytes{GPUID="2",LinkID="3"} 1019788145.0
gpu_nvlink_tx_kbytes{GPUID="7",LinkID="1"} 0.0
gpu_nvlink_tx_kbytes{GPUID="3",LinkID="2"} 1017047660.0
gpu_nvlink_tx_kbytes{GPUID="2",LinkID="0"} 1014424036.0
gpu_nvlink_tx_kbytes{GPUID="2",LinkID="1"} 1017028693.0
gpu_nvlink_tx_kbytes{GPUID="1",LinkID="2"} 1017047660.0
gpu_nvlink_tx_kbytes{GPUID="6",LinkID="2"} 49.0
gpu_nvlink_tx_kbytes{GPUID="5",LinkID="3"} 2986639.0
gpu_nvlink_tx_kbytes{GPUID="0",LinkID="0"} 0.0
gpu_nvlink_tx_kbytes{GPUID="3",LinkID="3"} 1017028657.0
gpu_nvlink_tx_kbytes{GPUID="6",LinkID="1"} 0.0
gpu_nvlink_tx_kbytes{GPUID="5",LinkID="2"} 0.0
gpu_nvlink_tx_kbytes{GPUID="6",LinkID="0"} 2555441.0
gpu_nvlink_tx_kbytes{GPUID="3",LinkID="0"} 1014357462.0
gpu_nvlink_tx_kbytes{GPUID="6",LinkID="3"} 0.0
gpu_nvlink_tx_kbytes{GPUID="1",LinkID="3"} 0.0
gpu_nvlink_tx_kbytes{GPUID="3",LinkID="1"} 0.0
gpu_nvlink_tx_kbytes{GPUID="1",LinkID="0"} 1014341346.0
gpu_nvlink_tx_kbytes{GPUID="1",LinkID="1"} 5022027981.0
gpu_nvlink_tx_kbytes{GPUID="4",LinkID="0"} 0.0
gpu_nvlink_tx_kbytes{GPUID="4",LinkID="1"} 0.0
gpu_nvlink_tx_kbytes{GPUID="2",LinkID="2"} 4007720847.0
gpu_nvlink_tx_kbytes{GPUID="7",LinkID="3"} 0.0
# HELP gpu_pci_rx_mb_per_s Received MBytes per second via PCI
# TYPE gpu_pci_rx_mb_per_s gauge
gpu_pci_rx_mb_per_s{GPUID="2"} 0.0
gpu_pci_rx_mb_per_s{GPUID="5"} 0.0
gpu_pci_rx_mb_per_s{GPUID="7"} 0.0
gpu_pci_rx_mb_per_s{GPUID="3"} 0.0
gpu_pci_rx_mb_per_s{GPUID="6"} 0.0
gpu_pci_rx_mb_per_s{GPUID="4"} 0.0
gpu_pci_rx_mb_per_s{GPUID="0"} 0.0
gpu_pci_rx_mb_per_s{GPUID="1"} 0.0
```
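To make the structure of this output explicit, the sketch below renders a gauge in the Prometheus text exposition format by hand. The helper `render_gauge` is hypothetical and exists only for illustration; as noted under Prerequisites, the exporter itself relies on `client_python` for this step.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict.
    Label values are assumed not to need escaping, which holds for numeric IDs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


print(render_gauge(
    "gpu_pci_rx_mb_per_s",
    "Received MBytes per second via PCI",
    [({"GPUID": "0"}, 0), ({"GPUID": "1"}, 0)],
), end="")
```

Each sample line consists of the metric name, the label set in braces, and the value as a float, which matches the shape of the listing above.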