https://github.com/imcf/nvsmi-prometheus-textfile

A zero-dependencies metrics collector for Prometheus based on "nvidia-smi" written in Python.

# Prometheus textfile collector for `nvidia-smi`

![Python: 2.7](https://img.shields.io/badge/python-2.7-yellow) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![License: GPL](https://img.shields.io/badge/license-GPL-blue)](https://github.com/imcf/nvsmi-prometheus-textfile/blob/main/LICENSE)

This is a zero-dependencies (see below for details) standalone tool collecting metrics
using the [`nvidia-smi`][1] (NVIDIA System Management Interface) command and formatting
them in a [Prometheus][2] compatible style that can be used through the
[node_exporter][3]'s `textfile` collector.
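
To illustrate what "Prometheus compatible style" means here, the following is a minimal sketch of rendering a single sample in the Prometheus text exposition format. It is not taken from this repository: the function `format_metric` and the metric name `nvsmi_temperature_celsius` are hypothetical and only serve the example.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one sample (plus HELP / TYPE lines) in the text exposition format.

    All names passed in are chosen by the caller; nothing here is the
    actual naming scheme used by the collector.
    """
    lines = []
    if help_text:
        lines.append("# HELP %s %s" % (name, help_text))
    lines.append("# TYPE %s %s" % (name, metric_type))

    label_str = ""
    if labels:
        # Sort the labels so the output is stable between runs.
        pairs = ['%s="%s"' % (key, escape_label_value(labels[key]))
                 for key in sorted(labels)]
        label_str = "{" + ",".join(pairs) + "}"

    lines.append("%s%s %s" % (name, label_str, value))
    return "\n".join(lines) + "\n"
```

Calling `format_metric("nvsmi_temperature_celsius", 54, {"gpu": "0"}, "GPU temperature.")` yields a `# HELP` line, a `# TYPE` line and one sample line, which is the shape the `textfile` collector expects to find in a `*.prom` file.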

The example below shows a visualization generated by [Grafana][9] from the metrics of
two GPUs:

* temperature as *solid lines*
* power draw as *dotted lines*
* the (intended) fan speed as *dashed lines*

![Example using Grafana to visualize GPU metrics](/resources/nvsmi-grafana.png)

## Zero Dependencies

Or: **"Why not use the official Prometheus Python Client?"**

The tool is intended to work on minimal installations; for example, we use it on our
[Xen][4] / [Citrix Hypervisor][5] instances. Those setups come with very basic installs
(currently based on [CentOS][6]), and installing additional tools like `pip` (which
would be required for the Python Client) is not always possible or desirable.

Therefore the only *actual* dependencies of this collector are already satisfied on the
relevant systems:

* Python 2.7 - comes with the base OS installation
* `nvidia-smi` - available as soon as the NVIDIA driver package is installed
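
With only the standard library and `nvidia-smi` available, collecting values could look roughly like the sketch below. It assumes the CSV query mode of `nvidia-smi` (`--query-gpu` with `--format=csv,noheader,nounits`); the actual script in this repository may query the tool differently. The helper names `query_nvidia_smi` and `parse_csv` and the selection of fields are illustrative.

```python
import subprocess

# Fields requested from nvidia-smi (an illustrative selection).
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "fan.speed"]


def query_nvidia_smi(fields=QUERY_FIELDS):
    """Run nvidia-smi and return its raw CSV output as text."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    # check_output is part of the standard library since Python 2.7.
    return subprocess.check_output(cmd).decode("utf-8")


def parse_csv(text, fields=QUERY_FIELDS):
    """Turn the CSV output into one dict per GPU, keyed by field name."""
    gpus = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(fields):
            continue  # skip lines that do not match the requested fields
        gpus.append(dict(zip(fields, values)))
    return gpus
```

On a machine with the NVIDIA driver installed, `parse_csv(query_nvidia_smi())` returns a list with one dictionary per GPU. All values stay strings, so unsupported readings reported by the tool can be handled explicitly before converting to numbers.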

## Permissions

No *root permissions* are required to collect the metrics through `nvidia-smi`. A user
with write permissions to the textfile collector directory of `node_exporter` (or, to be
precise, just to a single file therein) is sufficient.

One simple solution is to run the script under the same account that is also used for
the `node_exporter`. A possible setup could look like this:

```bash
adduser \
    --home-dir /var/lib/node_exporter \
    --comment "Prometheus Node Exporter daemon" \
    --system \
    node_exporter

mkdir -pv /var/lib/node_exporter/textfile_collector
chown -R node_exporter:node_exporter /var/lib/node_exporter
```
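
Given write access to that directory, the metrics file is best replaced atomically so that `node_exporter` never reads a half-written file. The following is a minimal sketch of that pattern, not the code used by this collector; the function name `write_textfile` and the file name `nvsmi.prom` in the usage note are assumptions.

```python
import os
import tempfile


def write_textfile(content, target):
    """Write 'content' to 'target' by creating a temp file and renaming it.

    The temp file lives in the same directory (so the rename stays on one
    filesystem) and does not end in '.prom', so the textfile collector
    ignores it while it is being written.
    """
    directory = os.path.dirname(target) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file with mode 0600; relax it so that a
        # node_exporter running under a different account could read it.
        os.chmod(tmp_path, 0o644)
        # On POSIX systems, rename replaces the target atomically.
        os.rename(tmp_path, target)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as `write_textfile(metrics_text, "/var/lib/node_exporter/textfile_collector/nvsmi.prom")` would then only need write permission on the collector directory.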

## Installation

Assuming you have followed the strategy for the user account outlined above, you can
simply clone this repo to `/opt/nvsmi-prometheus-textfile/` and use the *service* file
provided in the `resources` directory to run metrics collection via *systemd*:

```bash
cd /opt/
git clone https://github.com/imcf/nvsmi-prometheus-textfile
cd nvsmi-prometheus-textfile/resources
cp -v nvsmi-prometheus-textfile.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now nvsmi-prometheus-textfile.service
```

## Seriously, Python 2.7? In 2021??

Well, that's what is available on the Citrix Hypervisor default installation that we're
running. Let's re-evaluate the situation with the next version.

## Metric and Label Naming

See the official Prometheus instructions on [writing exporters][7] and [metric and
label naming][8] for more information.
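
As a small illustration of those conventions, the sketch below checks names against the character rules of the Prometheus data model and converts a percentage into a ratio, since base units are recommended over percentages. The function names are hypothetical and the code is not part of this collector.

```python
import re

# Character rules from the Prometheus data model.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*\Z")
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*\Z")


def is_valid_metric_name(name):
    """Return True if 'name' is a syntactically valid metric name."""
    return bool(METRIC_NAME_RE.match(name))


def is_valid_label_name(name):
    """Return True if 'name' is a valid, non-reserved label name."""
    # Label names starting with '__' are reserved for internal use.
    return bool(LABEL_NAME_RE.match(name)) and not name.startswith("__")


def percent_to_ratio(percent):
    """Convert a percentage (e.g. a fan speed of '35') into a 0..1 ratio."""
    return float(percent) / 100.0
```

For example, a raw `nvidia-smi` field name like `temperature.gpu` is not a valid metric name because of the dot, while a name such as `nvsmi_temperature_celsius` (an illustrative choice) passes the check and carries its unit as a suffix.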

[1]: https://developer.nvidia.com/nvidia-system-management-interface
[2]: https://prometheus.io/
[3]: https://github.com/prometheus/node_exporter
[4]: https://xenproject.org/
[5]: https://docs.citrix.com/en-us/citrix-hypervisor.html
[6]: https://centos.org/
[7]: https://prometheus.io/docs/instrumenting/writing_exporters/
[8]: https://prometheus.io/docs/practices/naming/
[9]: https://grafana.com/