Nvidia SMI Exporter
---------------------
Yet another nvidia-smi data exporter for Prometheus.

```bash
> curl localhost:9454/ping
OK

> curl localhost:9454/metrics
nvidia_smi_attached_gpus 2
nvidia_smi_display_mode{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1
nvidia_smi_display_active{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_persistence_mode{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1
nvidia_smi_accounting_mode{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_accounting_mode_buffer_size{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 4000
nvidia_smi_minor_number{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_multigpu_board{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_inforom_version_oem_object{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1.1
nvidia_smi_pci_pci_sub_system_id{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 37021458
nvidia_smi_pci_pci_gpu_link_info_pcie_gen_max_link_gen{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 3
nvidia_smi_pci_pci_gpu_link_info_pcie_gen_current_link_gen{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 3
nvidia_smi_pci_pci_gpu_link_info_link_widths_max_link_width{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 16
nvidia_smi_pci_pci_gpu_link_info_link_widths_current_link_width{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1
nvidia_smi_pci_replay_counter{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 65535
nvidia_smi_pci_replay_rollover_counter{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_pci_tx_util_bytes_per_second{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 11264000
nvidia_smi_pci_rx_util_bytes_per_second{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 31744000
nvidia_smi_fan_speed_ratio{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0.53
nvidia_smi_performance_state{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 2
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_gpu_idle{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_applications_clocks_setting{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_sw_power_cap{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_hw_slowdown{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_hw_thermal_slowdown{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_hw_power_brake_slowdown{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_sync_boost{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_sw_thermal_slowdown{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_clocks_throttle_reasons_clocks_throttle_reason_display_clocks_setting{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_fb_memory_usage_total_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 8510242816
nvidia_smi_fb_memory_usage_used_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 3040870400
nvidia_smi_fb_memory_usage_free_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 5469372416
nvidia_smi_bar1_memory_usage_total_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 268435456
nvidia_smi_bar1_memory_usage_used_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 5242880
nvidia_smi_bar1_memory_usage_free_bytes{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 263192576
nvidia_smi_compute_mode{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_utilization_gpu_util_ratio{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1.0
nvidia_smi_utilization_memory_util_ratio{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1.0
nvidia_smi_utilization_encoder_util_ratio{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0.0
nvidia_smi_utilization_decoder_util_ratio{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0.0
nvidia_smi_encoder_stats_session_count{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_encoder_stats_average_fps{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_encoder_stats_average_latency{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_fbc_stats_session_count{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_fbc_stats_average_fps{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_fbc_stats_average_latency{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
nvidia_smi_temperature_gpu_temp_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 67.0
nvidia_smi_temperature_gpu_temp_max_threshold_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 96.0
nvidia_smi_temperature_gpu_temp_slow_threshold_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 93.0
nvidia_smi_temperature_gpu_target_temperature_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 83.0
nvidia_smi_supported_gpu_target_temp_gpu_target_temp_min_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 60.0
nvidia_smi_supported_gpu_target_temp_gpu_target_temp_max_celsius{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 92.0
nvidia_smi_power_readings_power_state{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 2
nvidia_smi_power_readings_power_management{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1
nvidia_smi_power_readings_power_draw_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 211.86
nvidia_smi_power_readings_power_limit_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 216.0
nvidia_smi_power_readings_default_power_limit_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 200.0
nvidia_smi_power_readings_enforced_power_limit_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 216.0
nvidia_smi_power_readings_min_power_limit_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 100.0
nvidia_smi_power_readings_max_power_limit_watts{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 216.0
nvidia_smi_clocks_graphics_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1784000000.0
nvidia_smi_clocks_sm_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1784000000.0
nvidia_smi_clocks_mem_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 5514000000.0
nvidia_smi_clocks_video_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1594000000.0
nvidia_smi_max_clocks_graphics_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1999000000.0
nvidia_smi_max_clocks_sm_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1999000000.0
nvidia_smi_max_clocks_mem_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 5005000000.0
nvidia_smi_max_clocks_video_clock_hz{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 1708000000.0
```
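
The exporter can then be scraped like any other Prometheus target. A minimal scrape configuration fragment might look like the following (the job name, file location, and `localhost:9454` target are assumptions; adjust them to your deployment):

```shell
# hypothetical prometheus.yml fragment; the target assumes the exporter
# runs next to Prometheus on the default port 9454
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: 'nvidia_smi'
    static_configs:
      - targets: ['localhost:9454']
EOF
```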

Caveats
=======

In XML mode (see `NVIDIA_SMI_EXPORTER_SOURCE` below) the exporter skips implicit arrays like the following:

```xml
<supported_clocks>
  <supported_mem_clock>
    <value>3504 MHz</value>
    <supported_graphics_clock>1721 MHz</supported_graphics_clock> <== will skip all supported_graphics_clock
    <supported_graphics_clock>1708 MHz</supported_graphics_clock>
    <supported_graphics_clock>1695 MHz</supported_graphics_clock>
    <supported_graphics_clock>1683 MHz</supported_graphics_clock>
    <supported_graphics_clock>1670 MHz</supported_graphics_clock>
    <supported_graphics_clock>1657 MHz</supported_graphics_clock>
    <supported_graphics_clock>1645 MHz</supported_graphics_clock>
    <supported_graphics_clock>1632 MHz</supported_graphics_clock>
  </supported_mem_clock>
</supported_clocks>
```

It also skips all empty and `N/A` values (along with their keys).

Prometheus [does not support](https://github.com/prometheus/prometheus/issues/2227) string values, so the exporter converts some strings to numbers:
- `Disabled`, `Not Active`, `No`, `Not Supported` and `Unsupported` are converted to 0
- `Enabled`, `Active`, `Yes`, `Supported` are converted to 1
- `P0`..`P15` (performance states) are converted to 0..15
- `1x`, `2x`, `4x`, `8x`, `16x`, `32x` (PCIe link widths) are converted to 1, 2, 4, 8, 16, 32
- `Default`, `Exclusive_Thread`, `Prohibited`, `Exclusive_Process` (`compute_mode`) are converted to 0..3

Any other string that is not an integer or a float is dropped.
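
For example, where `nvidia-smi` reports textual values such as a `P2` performance state or a `Default` compute mode, the exporter exposes the corresponding numbers (the output below is reproduced from the sample metrics above):

```shell
> curl -s localhost:9454/metrics | grep -E 'performance_state|compute_mode'
nvidia_smi_performance_state{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 2
nvidia_smi_compute_mode{uuid="683c6e09-3969-31a5-b7ca-cc88252b7fce"} 0
```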

Run in Docker
=============

`nvidia-smi` requires the same versions of the driver packages (`libnvidia-compute-460` and `nvidia-utils-460`) inside the container as on the host.

1) Get the driver version on the host:

```shell
> dpkg --list | grep nvidia-driver-460
ii nvidia-driver-460 460.27.04-0ubuntu1 amd64 NVIDIA driver metapackage
```

2) Build the container with the matching driver version:

```shell
> docker build . --tag nvidia-smi-exporter --build-arg NVIDIA_VERSION=460.27.04-0ubuntu1
```

All available versions can be found [in the CUDA repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/).

3) Run with the `--privileged` flag (not recommended for security reasons):
```shell
> docker run --rm --privileged -p 9454:9454 nvidia-smi-exporter
```

or map all devices into the container explicitly:

```shell
> docker run --rm \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
-p 9454:9454 nvidia-smi-exporter
```
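
As a quick sanity check (assuming the image tag from the build step and that `nvidia-smi` is available inside the image), you can verify that the driver version reported in the container matches the host:

```shell
# assumes nvidia-smi is installed in the image built above
> docker run --rm --privileged --entrypoint nvidia-smi nvidia-smi-exporter \
    --query-gpu=driver_version --format=csv,noheader
460.27.04
```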

Parameters
==================

Several environment variables control the runtime parameters (see the example run after this list):

- `NVIDIA_SMI_EXPORTER_BINARY` - path to the `nvidia-smi` binary, default: `$(which nvidia-smi)`

- `NVIDIA_SMI_EXPORTER_PORT` - port the server listens on, default: `9454`

- `NVIDIA_SMI_EXPORTER_HOST` - host (interface) the server binds to, default: `0.0.0.0`

- `NVIDIA_SMI_EXPORTER_NAME_PREFIX` - prefix for every metric name in the output, default: `nvidia_smi_`

- `NVIDIA_SMI_EXPORTER_SOURCE` - data source to process, `xml` or `csv`, default: `xml`

When `NVIDIA_SMI_EXPORTER_SOURCE` == `csv`:

- `NVIDIA_SMI_EXPORTER_QUERY` - comma-separated list of parameters to query from `nvidia-smi`, default: `clocks.current.graphics,clocks.current.memory,clocks.current.sm,clocks.current.video,clocks.max.graphics,clocks.max.memory,clocks.max.sm,fan.speed,memory.total,memory.used,power.draw,power.limit,temperature.gpu,utilization.gpu,utilization.memory`
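
For example (hypothetical values), the exporter can be switched to CSV mode with a reduced query list and a different port:

```shell
# hypothetical values; reuse the device mappings from the Docker section as needed
> docker run --rm --privileged \
    -e NVIDIA_SMI_EXPORTER_SOURCE=csv \
    -e NVIDIA_SMI_EXPORTER_PORT=9500 \
    -e NVIDIA_SMI_EXPORTER_QUERY=temperature.gpu,utilization.gpu,power.draw \
    -p 9500:9500 nvidia-smi-exporter
```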