Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/instadeepai/gcp-gpu-metrics
📈 Tiny Go binary that aims to export Nvidia GPU metrics to GCP monitoring, based on nvidia-smi.
- Host: GitHub
- URL: https://github.com/instadeepai/gcp-gpu-metrics
- Owner: instadeepai
- License: MIT
- Created: 2020-12-07T16:27:15.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-01-11T11:53:38.000Z (about 4 years ago)
- Last Synced: 2024-06-19T06:49:17.818Z (8 months ago)
- Topics: gcp, golang, google-cloud-platform, gpu, metrics, metrics-exporter, monitoring, nvidia, nvidia-smi
- Language: Go
- Homepage:
- Size: 2.55 MB
- Stars: 11
- Watchers: 4
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# gcp-gpu-metrics
![CI](https://github.com/instadeepai/gcp-gpu-metrics/workflows/CI/badge.svg?branch=master)
Tiny Go binary that aims to export Nvidia GPU metrics to GCP monitoring, based on nvidia-smi.
## Requirements ⚓
* Your machine must be a GCE (Google Compute Engine) instance.
* The `Cloud API access scopes` of the instance or `Service Account` must have the `Monitoring Metric Writer` permission.
* You need the `nvidia-smi` binary installed on your GCE instance; a quick sanity check is sketched below. Pro tip: you can use a [machine learning image](https://cloud.google.com/ai-platform/deep-learning-vm/docs/images) provided by GCP as the base image.
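One way to sanity-check the first and third requirements from a shell on the instance (the metadata-server paths below are standard GCE endpoints, not part of this project):

```bash
# List the GPUs nvidia-smi can see
$ nvidia-smi -L

# Confirm you're on GCE and inspect the instance's API access scopes
# (look for a monitoring scope, e.g. https://www.googleapis.com/auth/monitoring.write)
$ curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
```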
## Install ⏬
If you're root, you can install the latest binary version using the following script:
```bash
$ bash < <(curl -sSL https://raw.githubusercontent.com/instadeepai/gcp-gpu-metrics/master/install-latest.sh)
```

Or, you can download a [release/binary from this page](https://github.com/instadeepai/gcp-gpu-metrics/releases) and install it manually.
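A manual install might look like the following; `<tag>` and `<asset>` are placeholders, so check the releases page for the actual file names:

```bash
# <tag> and <asset> are placeholders - see the releases page
$ curl -sSL -o gcp-gpu-metrics \
    "https://github.com/instadeepai/gcp-gpu-metrics/releases/download/<tag>/<asset>"
$ chmod +x gcp-gpu-metrics
$ sudo mv gcp-gpu-metrics /usr/local/bin/
```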
## Usage 💻
gcp-gpu-metrics is a simple, UNIX-compliant CLI; just run it as usual:
```bash
$ gcp-gpu-metrics
```

Available flags:
* `--service-account-path string` | GCP service account path. (default "")
* `--metrics-interval uint` | Fetch metrics interval in seconds. (default 10)
* `--enable-nvidiasmi-pm` | Enable persistence mode for nvidia-smi. (default false)
* `--version` | Display the current version/release and commit hash.

Available env variables:
* `GGM_SERVICE_ACCOUNT_PATH=./service-account.json` linked to `--service-account-path` flag.
* `GGM_METRICS_INTERVAL=10` linked to `--metrics-interval` flag.
* `GGM_ENABLE_NVIDIASMI_PM=true` linked to `--enable-nvidiasmi-pm` flag.

Priority order is `binary flag` ➡️ `env var` ➡️ `default value`.
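For example, combining an env variable with a flag (the flag wins per the priority order above):

```bash
# The flag takes precedence: metrics are fetched every 5 seconds
$ GGM_METRICS_INTERVAL=30 gcp-gpu-metrics --metrics-interval 5

# Env var alone: metrics are fetched every 30 seconds
$ GGM_METRICS_INTERVAL=30 gcp-gpu-metrics
```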
The nvidia-smi persistence mode is very useful: it runs `nvidia-smi` as a background daemon, which prevents the GPU from spiking to 100% load on every request. Enabling this option requires root.
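Since the option needs root, a typical invocation looks like this (the commented-out `nvidia-smi -pm 1` is the standard driver-level way to enable persistence mode, independent of this tool):

```bash
# Run the exporter with persistence mode enabled (requires root)
$ sudo gcp-gpu-metrics --enable-nvidiasmi-pm

# Equivalent driver-level toggle, outside of gcp-gpu-metrics:
# sudo nvidia-smi -pm 1
```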
All logs are written to syslog.
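Assuming the binary logs under its own name (an assumption; adjust to your syslog setup), you can inspect them with:

```bash
# Classic syslog file
$ grep gcp-gpu-metrics /var/log/syslog

# Or, on systemd machines, filter by syslog identifier
$ journalctl -t gcp-gpu-metrics
```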
## Metrics 📈
Six different metrics are fetched; this number will grow in the future.
* `temperature.gpu` as `custom.googleapis.com/gpu/temperature_gpu` | Core GPU temperature, in degrees C.
* `utilization.gpu` as `custom.googleapis.com/gpu/utilization_gpu` | Percent of time over the past sample period during which one or more kernels were executed on the GPU.
* `utilization.memory` as `custom.googleapis.com/gpu/utilization_memory` | Percent of time over the past sample period during which global (device) memory was being read or written.
* `memory.total` as `custom.googleapis.com/gpu/memory_total` | Total installed GPU memory.
* `memory.free` as `custom.googleapis.com/gpu/memory_free` | Total GPU free memory.
* `memory.used` as `custom.googleapis.com/gpu/memory_used` | Total memory allocated by active contexts.
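The metric names above are standard `nvidia-smi` query properties, so you can inspect the raw values (presumably the same query the exporter runs) directly:

```bash
$ nvidia-smi \
    --query-gpu=temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used \
    --format=csv,noheader,nounits
```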
For each metric, the exporter creates one time series per GPU, labeled with `gpu_id`, plus a GPU average. Each series also carries the following labels:
* `bus_id` | Identify your GPUs at hardware level.
* `instance_name` | Identify the instance name.

For example, for 2 GPUs with a `temperature.gpu` query, it will create:
| gpu_id | bus_id | instance_name | Value |
|---|---|---|---|
| gpu_0 | 00000000:00:04.0 | gcp-gpu-instance | 50 |
| gpu_1 | 00000000:00:05.0 | gcp-gpu-instance | 60 |
| gpu_avg | null | gcp-gpu-instance | 55 |

## Compile gcp-gpu-metrics ⚙
There is a `re` target in the Makefile:
```bash
$ make re
```

gcp-gpu-metrics has been tested with `go1.15` and uses Go modules.
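If you'd rather not use the Makefile, a plain module-aware build should also work:

```bash
# Build the binary from the repository root
$ go build -o gcp-gpu-metrics .
```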
## Report an issue 📢
Feel free to open a GitHub issue on this project 🚀
## License 🔑
See [LICENSE](LICENSE).