https://github.com/marcorentap/kokkos-docker-cluster
Deploy Docker containers with Kokkos, OpenMP, OpenMPI and CUDA as a Docker swarm.
cuda docker hpc kokkos
- Host: GitHub
- URL: https://github.com/marcorentap/kokkos-docker-cluster
- Owner: marcorentap
- Created: 2024-02-17T12:53:20.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-03-07T05:38:44.000Z (over 1 year ago)
- Last Synced: 2024-07-29T19:07:43.130Z (about 1 year ago)
- Topics: cuda, docker, hpc, kokkos
- Language: Python
- Homepage:
- Size: 30.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Prerequisites
Ensure you have [HPCCM](https://github.com/NVIDIA/hpc-container-maker) installed.

## CUDA
If you need CUDA, install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html). Then modify `/etc/docker/daemon.json` to set the NVIDIA runtime as the default:
```
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    },
    "default-runtime": "nvidia"
}
```
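If `/etc/docker/daemon.json` already contains other settings, the NVIDIA entries need to be merged in rather than overwriting the file. The following is a minimal sketch of that merge, not part of this repository; the helper name `with_nvidia_default` is hypothetical, and it only builds the resulting configuration in memory without touching the file.

```python
import json

# Runtime entry taken from the daemon.json snippet above.
NVIDIA_RUNTIME = {"args": [], "path": "/usr/bin/nvidia-container-runtime"}


def with_nvidia_default(config: dict) -> dict:
    """Return a copy of a daemon.json config with NVIDIA as the default runtime.

    Hypothetical helper for illustration; existing keys are preserved.
    """
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = dict(NVIDIA_RUNTIME)
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


if __name__ == "__main__":
    # Example: an existing config that only sets a log driver.
    existing = {"log-driver": "json-file"}
    print(json.dumps(with_nvidia_default(existing), indent=4))
```

Writing the result back to `/etc/docker/daemon.json` requires root privileges and is left out here on purpose.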
Then restart Docker:
```
sudo systemctl restart docker
```

Ensure you have GPU access with the default runtime:
```
docker run -it nvcr.io/nvidia/cuda:12.3.1-devel-ubuntu22.04 nvidia-smi
```

---

# Usage

## Building Images
Dockerfiles are generated using HPCCM from recipes located in `recipes/`. These images are based on `nvcr.io/nvidia/cuda:12.3.1-devel-ubuntu22.04`.
There are two images: `kokkos-compute` and `kokkos-sherlock`, both sharing SSH keys found in `ssh/`. To specify the architecture for building Kokkos, set the environment variable `KOKKOS_CLUSTER_ARCH=<arch>`. The scripts will then use the `Kokkos_ARCH_<arch>` compile flag when building Kokkos. For example, to generate SSH keys and build the images for `Kokkos_ARCH_VOLTA70`, execute:
```
export KOKKOS_CLUSTER_ARCH=VOLTA70
./make_keys && ./make_compute.sh && ./make_sherlock.sh
```

Both images contain users `root` with password `kokkosroot` and `compute` with password `kokkoscompute`.
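To illustrate how the `KOKKOS_CLUSTER_ARCH` value relates to the Kokkos compile flag, here is a small sketch. It assumes the flag is passed to CMake in the usual `-DKokkos_ARCH_<arch>=ON` form; the function name `kokkos_arch_flag` is hypothetical, and the actual build scripts in this repository may construct the flag differently.

```python
import os


def kokkos_arch_flag(env=None) -> str:
    """Build a CMake option from KOKKOS_CLUSTER_ARCH.

    Hypothetical helper: assumes the standard Kokkos CMake option
    form -DKokkos_ARCH_<arch>=ON.
    """
    env = os.environ if env is None else env
    arch = env.get("KOKKOS_CLUSTER_ARCH", "").strip()
    if not arch:
        raise ValueError("KOKKOS_CLUSTER_ARCH is not set")
    return f"-DKokkos_ARCH_{arch.upper()}=ON"


if __name__ == "__main__":
    # Mirrors `export KOKKOS_CLUSTER_ARCH=VOLTA70` from the example above.
    print(kokkos_arch_flag({"KOKKOS_CLUSTER_ARCH": "VOLTA70"}))
```

Raising an error when the variable is unset is a design choice for this sketch; it avoids silently building for an unintended architecture.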
## Starting Containers
`kokkos-compute` containers are intended to run continuously in the background, while `kokkos-sherlock` containers can be started as needed to launch jobs. Additionally, `shared/` is mounted to `/shared` in both images.
Start by building the `kokkos-overlay` network:
```
# Initialize Docker swarm if not already done
docker swarm init
./make_network.sh
```

The file `compose.yaml` is configured to launch 100 `kokkos-compute` containers. To deploy it, run:
```
docker stack deploy --compose-file=compose.yaml kokkos
```
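After deployment, the replica count reported by Docker can also be checked in a script. The sketch below assumes the replica column has the usual `running/desired` form (for example `100/100`), as printed by `docker service ls` or by `docker service ls --format "{{.Replicas}}"`. The function name `replicas_ready` is hypothetical and the parsing is deliberately simple.

```python
import re


def replicas_ready(replicas: str) -> bool:
    """Return True when a 'running/desired' replica string is fully up.

    Hypothetical helper; expects text such as '100/100' or '37/100'.
    """
    match = re.match(r"\s*(\d+)/(\d+)", replicas)
    if match is None:
        raise ValueError(f"unrecognized replica string: {replicas!r}")
    running, desired = int(match.group(1)), int(match.group(2))
    return desired > 0 and running == desired


if __name__ == "__main__":
    for sample in ("37/100", "100/100"):
        print(sample, replicas_ready(sample))
```

A wrapper could call this in a loop with a short sleep until the `kokkos_compute` service reports all replicas running.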
Ensure that all 100 containers have started with `docker service ls`. Then, start a `kokkos-sherlock` container using:
```
./start_sherlock.sh
```

## Try mpirun
`/shared/hostfile` contains an `mpirun` hostfile with 100 hosts and 4 slots each.
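For reference, an Open MPI hostfile lists one host per line followed by `slots=<n>`. The sketch below shows that format and the slot arithmetic (100 hosts with 4 slots each gives 400 processes). The host names are placeholders, and `render_hostfile` and `total_slots` are hypothetical helpers, not the repository's `gen_hostfile.py`.

```python
def render_hostfile(hosts, slots):
    """Render an Open MPI style hostfile: one '<host> slots=<n>' line per host."""
    return "".join(f"{host} slots={slots}\n" for host in hosts)


def total_slots(hostfile_text):
    """Sum the slots= values in a hostfile, i.e. the largest sensible --np."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for field in line.split()[1:]:
            if field.startswith("slots="):
                total += int(field.split("=", 1)[1])
    return total


if __name__ == "__main__":
    # Placeholder host names; real names depend on the running containers.
    hosts = [f"compute-{i}" for i in range(1, 101)]
    text = render_hostfile(hosts, 4)
    print(text.splitlines()[0])
    print(total_slots(text))
```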
`/shared/hello.sh` is a test program to print hostnames:
```
mpirun --np 400 --hostfile /shared/hostfile /shared/hello.sh
```

## Scaling
If 100 containers are insufficient and you need, for example, 150, scale the service and regenerate the hostfile:
```
docker service scale kokkos_compute=150
./gen_hostfile.py 4 > shared/hostfile
```
`./gen_hostfile.py` generates a hostfile based on the currently running `kokkos-compute` containers.

## Stopping
To stop the setup, execute:
```
docker stack rm kokkos
```