Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
https://github.com/NVIDIA/k8s-dra-driver

Host: GitHub
URL: https://github.com/NVIDIA/k8s-dra-driver
Owner: NVIDIA
License: apache-2.0
Created: 2023-04-17T19:12:49.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-10-24T10:20:42.000Z (4 months ago)
Last Synced: 2024-10-25T08:37:41.068Z (4 months ago)
Language: Go
Size: 15.9 MB
Stars: 251
Watchers: 17
Forks: 47
Open Issues: 31
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

This DRA resource driver is currently under active development and not yet
designed for production use.
We may (at times) decide to push commits over `main` until we have something more stable.
Use at your own risk.

A document and demo of the DRA support for GPUs provided by this repo can be found below:

| Document | Demo |
|:--------:|:----:|
| [Document](https://docs.google.com/document/d/1BNWqgx_SmZDi-va_V31v3DnuVwYnF2EmN7D-O_fB6Oo) | [Demo](https://drive.google.com/file/d/1iLg2FEAEilb1dcI27TnB19VYtbcvgKhS/view?usp=sharing "Demo of Dynamic Resource Allocation (DRA) for GPUs in Kubernetes") |

## Demo

This section describes using `kind` to demo the functionality of the NVIDIA GPU DRA Driver.

First, since we'll launch `kind` with GPU support, ensure that the following prerequisites are met:

1. `kind` is installed. See the official documentation [here](https://kind.sigs.k8s.io/docs/user/quick-start/#installation).
1. The NVIDIA Container Toolkit is installed on your system. This
   can be done by following the instructions
   [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
1. Configure the NVIDIA Container Runtime as the **default** Docker runtime:
   ```console
   sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
   ```
1. Restart Docker to apply the changes:
   ```console
   sudo systemctl restart docker
   ```
1. Set the `accept-nvidia-visible-devices-as-volume-mounts` option to `true` in
   the `/etc/nvidia-container-runtime/config.toml` file to configure the NVIDIA
   Container Runtime to use volume mounts to select devices to inject into a
   container:
   ```console
   sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
   ```
1. Show the current set of GPUs on the machine:
   ```console
   nvidia-smi -L
   ```
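
Before creating the cluster, it can help to confirm that the tools named above are on the `PATH`. The following is a minimal sketch and not part of this repository: `missing_tools` is a hypothetical helper name, and it only checks that each command exists, not that it is configured correctly.

```shell
# Hypothetical helper (not part of this repo): print each command name
# from the argument list that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: list any demo prerequisites that are not installed.
# missing_tools kind docker nvidia-ctk nvidia-smi
```

An empty result means every listed command was found.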

Start by cloning this repository and changing into its directory.
All of the scripts and example Pod specs used in this demo are in the `demo`
subdirectory, so take a moment to browse through the various files and see
what's available:

```console
git clone https://github.com/NVIDIA/k8s-dra-driver.git
```
```console
cd k8s-dra-driver
```

### Setting up the infrastructure

The following steps show how to install and configure DRA and run a pod in a `kind` cluster on a Linux workstation.

First, create a `kind` cluster to run the demo:
```bash
./demo/clusters/kind/create-cluster.sh
```

Next, build the image for the NVIDIA GPU DRA driver:
```console
./demo/clusters/kind/build-dra-driver.sh
```

This also makes the built images available to the `kind` cluster.

We now install the NVIDIA GPU DRA driver:
```console
./demo/clusters/kind/install-dra-driver.sh
```

This should show the driver's pods running in the `nvidia-dra-driver` namespace (a single kubelet plugin pod in the sample output below):
```console
kubectl get pods -n nvidia-dra-driver
```
```
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-kubelet-plugin-t5qgz   1/1     Running   0          44s
```
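
To check this from a script rather than by eye, the `STATUS` column can be filtered. A minimal sketch, assuming the default `kubectl get pods` column layout shown above (`STATUS` in the third column); `not_running` is a hypothetical helper name, not something this repository provides.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the names of pods whose STATUS column (third field) is not "Running".
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
not_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example:
# kubectl get pods -n nvidia-dra-driver | not_running
```

No output means every listed pod reports `Running`.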

### Run the examples by following the steps in the demo script
Finally, you can run the various examples contained in the `demo/specs/quickstart` folder.
With the most recent updates for Kubernetes v1.31, only the first 3 examples in
this folder are currently functional.

You can run them as follows:
```console
kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
```
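
The `{1,2,3}` in the command above relies on brace expansion, which shells such as bash and zsh perform but plain POSIX `sh` does not. A minimal sketch of an equivalent that avoids brace expansion follows; `quickstart_specs` is a hypothetical helper name, and the paths are the ones used in the command above.

```shell
# Hypothetical helper: print the paths of the first three quickstart specs,
# one per line, without relying on brace expansion.
quickstart_specs() {
    for i in 1 2 3; do
        printf 'demo/specs/quickstart/gpu-test%s.yaml\n' "$i"
    done
}

# Example: apply each spec in turn.
# quickstart_specs | while read -r spec; do kubectl apply --filename="$spec"; done
```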

Check the status of the pods. Depending on which GPUs are available, running the first three examples will produce output similar to the following:
```console
kubectl get pod -A -l app=pod
```
```
NAMESPACE   NAME   READY   STATUS    RESTARTS   AGE
gpu-test1   pod1   1/1     Running   0          34s
gpu-test1   pod2   1/1     Running   0          34s
gpu-test2   pod    2/2     Running   0          34s
gpu-test3   pod1   1/1     Running   0          34s
gpu-test3   pod2   1/1     Running   0          34s
```

**Note:** There is a [known issue with kind](https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files). You may see an error while tailing the logs of a running pod in the kind cluster: `failed to create fsnotify watcher: too many open files.` Increasing the value of `fs.inotify.max_user_watches` may resolve it.

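
If tailing logs fails with the `too many open files` error from the kind known issue noted above, the current limit can be inspected before raising it. A minimal sketch, assuming a Linux host: `inotify_below` is a hypothetical helper, and the `524288` value and the `sysctl` command in the comment are the ones suggested on the kind known-issues page, not a requirement of this driver.

```shell
# Hypothetical helper: succeed (exit 0) if the number stored in the given
# file is below the given minimum. The file defaults to the Linux procfs
# entry for fs.inotify.max_user_watches.
inotify_below() {
    min=$1
    file=${2:-/proc/sys/fs/inotify/max_user_watches}
    [ "$(cat "$file")" -lt "$min" ]
}

# Example: raise the limit only when it is lower than the suggested value.
# inotify_below 524288 && sudo sysctl fs.inotify.max_user_watches=524288
```

A value set with `sysctl` in this way does not persist across reboots.
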
```console
kubectl logs -n gpu-test1 -l app=pod
```
```
GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
```
```console
kubectl logs -n gpu-test2 pod --all-containers
```
```
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
```

```console
kubectl logs -n gpu-test3 -l app=pod
```
```
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
```
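
In the sample output above, the two lines for `gpu-test1` carry different UUIDs, while the two lines for `gpu-test2` and the two for `gpu-test3` each carry the same UUID. A minimal sketch for checking this on your own output, assuming log lines in the `nvidia-smi -L` format shown above; `distinct_gpus` is a hypothetical helper name.

```shell
# Hypothetical helper: read log lines in `nvidia-smi -L` format on stdin
# and print how many distinct GPU UUIDs they mention.
distinct_gpus() {
    sed -n 's/.*(UUID: \(GPU-[0-9a-fA-F-]*\)).*/\1/p' | sort -u | wc -l | tr -d ' '
}

# Example:
# kubectl logs -n gpu-test3 -l app=pod | distinct_gpus
```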

### Cleaning up the environment

Remove the cluster created in the preceding steps:
```console
./demo/clusters/kind/delete-cluster.sh
```