https://github.com/letmutx/nomad-nvidia-vgpu-plugin
Nomad plugin for sharing Nvidia GPU across multiple jobs
- Host: GitHub
- URL: https://github.com/letmutx/nomad-nvidia-vgpu-plugin
- Owner: letmutx
- License: mpl-2.0
- Created: 2022-05-21T11:38:50.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-12-22T02:32:32.000Z (over 1 year ago)
- Last Synced: 2025-05-01T12:17:15.350Z (about 1 year ago)
- Topics: gpu, nomad, nvidia
- Language: Go
- Homepage:
- Size: 118 KB
- Stars: 5
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Nomad Nvidia Virtual Device Plugin
==================
This repo contains a device plugin for [Nomad](https://www.nomadproject.io/) that exposes a configurable number of virtual GPUs for each physical NVIDIA GPU present on the machine. This lets several workloads that each need only part of a GPU share one physical device.
Installation requirements
-----------------------
This plugin needs the following dependencies to function:
* [Nomad](https://www.nomadproject.io/downloads.html) 0.9+
* GNU/Linux x86_64 with kernel version > 3.10
* NVIDIA GPU with Architecture > Fermi (2.1)
* NVIDIA drivers >= 340.29 with binary nvidia-smi
* Docker v19.03+
Copy the plugin binary to the [plugins directory](https://www.nomadproject.io/docs/configuration/index.html#plugin_dir) and [configure the plugin](https://www.nomadproject.io/docs/configuration/plugin.html) in the client config. Also, see the requirements for the official [nvidia-plugin](https://www.nomadproject.io/plugins/devices/nvidia#installation-requirements).
```hcl
plugin "nvidia-vgpu" {
  config {
    ignored_gpu_ids    = ["uuid1", "uuid2"]
    fingerprint_period = "5s"
    vgpus              = 16
  }
}
```
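To make the effect of `vgpus` concrete, here is a minimal Go sketch of how one physical GPU could be fanned out into several schedulable device IDs. It is an illustration only: the `expandVirtualIDs` helper, the `<uuid>-<index>` naming scheme, and the sample UUIDs are assumptions for this example and are not taken from the plugin's source.

```go
package main

import "fmt"

// expandVirtualIDs returns n virtual device IDs for one physical GPU UUID.
// ASSUMPTION: the "<uuid>-<index>" naming is hypothetical and chosen only
// for this sketch; the plugin may name its virtual devices differently.
func expandVirtualIDs(physicalUUID string, n int) []string {
	if n <= 0 {
		return nil
	}
	ids := make([]string, 0, n)
	for i := 0; i < n; i++ {
		ids = append(ids, fmt.Sprintf("%s-%d", physicalUUID, i))
	}
	return ids
}

func main() {
	// Two made-up physical GPUs, each split into 16 virtual devices.
	total := 0
	for _, uuid := range []string{"GPU-aaaa", "GPU-bbbb"} {
		ids := expandVirtualIDs(uuid, 16)
		total += len(ids)
		fmt.Println(uuid, "->", ids[0], "...", ids[len(ids)-1])
	}
	fmt.Println("schedulable devices:", total)
}
```

Under these assumptions, a machine with two physical GPUs and `vgpus = 16` would offer 32 devices to the scheduler, and each `count = 1` request in a job would consume one of them.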
Usage
--------------
Use the [device stanza](https://www.nomadproject.io/docs/job-specification/device.html) in the job file to schedule with device support.
```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "letmutx/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```
Notes
-------
* GPU memory allocation and usage are handled cooperatively; the plugin does not enforce limits. A single misbehaving GPU process that uses more memory than it was assigned can starve the other processes sharing that GPU.
* Managing memory isolation per task is left to the user. It depends on many factors, such as [MPS](https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_3) and the GPU architecture. [This doc](https://drops.dagstuhl.de/opus/volltexte/2018/8984/pdf/LIPIcs-ECRTS-2018-20.pdf) has some information.
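
Because nothing enforces a per-task limit, it can help to work out what an even share of GPU memory would be and to check tasks against it yourself. The Go sketch below shows that arithmetic. The helper names (`perTaskBudgetMiB`, `overBudget`), the even-split policy, and the sample figures are all assumptions for illustration; the plugin itself does not provide them.

```go
package main

import (
	"fmt"
	"sort"
)

// perTaskBudgetMiB splits a GPU's total memory evenly across its virtual GPUs.
// ASSUMPTION: an even split is just one possible policy, chosen for this sketch.
func perTaskBudgetMiB(totalMiB, vgpus int) (int, error) {
	if vgpus <= 0 {
		return 0, fmt.Errorf("vgpus must be positive, got %d", vgpus)
	}
	return totalMiB / vgpus, nil
}

// overBudget returns the names of tasks whose reported usage exceeds the
// budget, sorted alphabetically so the result is deterministic.
func overBudget(usageMiB map[string]int, budgetMiB int) []string {
	var names []string
	for name, used := range usageMiB {
		if used > budgetMiB {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Hypothetical 16384 MiB card shared by 16 virtual GPUs.
	budget, err := perTaskBudgetMiB(16384, 16)
	if err != nil {
		panic(err)
	}
	fmt.Println("budget per vGPU (MiB):", budget)

	// Made-up usage numbers, e.g. gathered from nvidia-smi by the operator.
	usage := map[string]int{"train": 3000, "infer": 512, "smi": 10}
	fmt.Println("tasks over budget:", overBudget(usage, budget))
}
```

A check like this only detects overuse after the fact. Actual isolation still has to come from mechanisms such as MPS or from limits inside the workload itself.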
Testing
---------
The best way to test the plugin is to run it on a target machine that has an NVIDIA GPU, using Nomad's [plugin launcher](https://github.com/hashicorp/nomad/blob/main/plugins/shared/cmd/launcher/README.md):
```shell
make eval
```
Inspired by
--------------
* https://github.com/awslabs/aws-virtual-gpu-device-plugin
* https://github.com/kubernetes/kubernetes/issues/52757#issuecomment-402772200