# MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters
## Published at 2022 ACM Symposium on Cloud Computing (SoCC '22)

Presentation slides available at: [https://baolin-li.netlify.app/uploads/SoCC22_MISO.pdf](https://baolin-li.netlify.app/uploads/SoCC22_MISO.pdf)

This repository requires access to NVIDIA A100 GPUs and sudo privileges to control the GPUs.

Multi-Instance GPU user guide: [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)

## Preparation

#### Environment

Below are the software environment specifications:

- OS: CentOS 7
- Virtual environment manager: Anaconda 4.10.3
- CUDA: 11.4
- NVIDIA driver: 470.82.01

Use the ``environment.yml`` file to create the virtual environment in Anaconda:

```
conda env create -f environment.yml
```

then activate the environment on every node.
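
For example, assuming the environment defined in ``environment.yml`` is named ``miso`` (check the ``name:`` field of that file for the actual name):

```
conda activate miso
```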

Make sure this repo is cloned to ``/home/${USER}/GIT/socc22-miso``, where `${USER}` is your username. Also make a scratch directory available for temporary storage. Currently the scratch directory is `/scratch/${USER}`; if you need to use another scratch directory, replace all `/scratch/${USER}` instances in this repo, for example with a recursive search-and-replace as sketched below.
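
A minimal sketch of that replacement, assuming GNU ``sed`` is available and that ``/my/scratch`` stands in for your actual scratch location (placeholder path; adapt the pattern to how the path appears in the repo's files):

```
cd /home/${USER}/GIT/socc22-miso
# Find files that reference the default scratch path
grep -rl "/scratch/${USER}" .
# Replace it in place with your own scratch location
grep -rl "/scratch/${USER}" . | xargs sed -i "s|/scratch/${USER}|/my/scratch|g"
```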

#### GPU Node Setup
First download the necessary files (e.g., datasets) needed for the workloads: go to this Google Drive link, [download](https://drive.google.com/file/d/1pcPcPNdDRSYTMnwuibjBSeobm1tGFmxE/view?usp=sharing) the file, and unzip it:
`unzip MISO_Workload.zip`
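
If you are working on a headless node, a command-line download is also possible with the third-party ``gdown`` tool (not part of this repo; the file ID below is taken from the link above):

```
pip install gdown
gdown 1pcPcPNdDRSYTMnwuibjBSeobm1tGFmxE
unzip MISO_Workload.zip
```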

On each GPU node, first copy the necessary files into memory by modifying the files ``workloads/copy_memory.sh`` and ``workloads/clear_memory.sh``. Replace `/dev/shm/tmp` with your system's shared-memory location if you are not on Linux, and replace `/work/li.baol/MISO_Workload/` with the path where you extracted the ``.zip`` file. Then run

```
./workloads/clear_memory.sh
./workloads/copy_memory.sh
```
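
For orientation, a minimal sketch of what the two edited scripts might boil down to; the actual scripts in ``workloads/`` are authoritative, and both paths below are the defaults mentioned above, not guaranteed to match your system:

```
# copy_memory.sh (sketch): stage the workload data into shared memory
mkdir -p /dev/shm/tmp
cp -r /work/li.baol/MISO_Workload/* /dev/shm/tmp/

# clear_memory.sh (sketch): remove the staged data
rm -rf /dev/shm/tmp
```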

On each GPU node, do the following to set up MIG:

Run the following command to enable MIG:

```
python mig_helper.py --init
```
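
To confirm that MIG mode is now enabled, you can query the driver directly (standard ``nvidia-smi`` usage, independent of this repo):

```
nvidia-smi --query-gpu=mig.mode.current --format=csv
```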

Record the MIG slice UUIDs as lookup tables:

```
python export_cuda_device_auto.py
```
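
As a cross-check, the MIG device UUIDs that this step records can also be listed manually with ``nvidia-smi`` (output format shown for illustration; the UUIDs on your system will differ):

```
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-...)
#   MIG 3g.20gb Device 0: (UUID: MIG-xxxxxxxx-...)
```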

Wait for it to finish, then repeat the steps above on the next GPU node. At this point, all GPUs are set up and ready to go.

On each GPU node, run the following command:

```
python gpu_server.py
```
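
``gpu_server.py`` must keep running while the scheduler submits jobs, so you may want to launch it in the background or inside a terminal multiplexer, for example:

```
nohup python gpu_server.py > gpu_server.log 2>&1 &
```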

## Start running

Allocate a CPU node as the scheduler; it must be able to communicate with the GPU nodes over TCP.

Use 4 A100 GPUs to verify that the code works on your system. In the ``run.py`` script, find the variable ``physical_nodes``. In the current version, it contains two items, each the hostname of one node, i.e., two nodes with two GPUs each. Modify this variable to match your system (see the sketch below).
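
A minimal sketch of that edit, assuming your two GPU nodes are reachable under the placeholder hostnames ``gpu-node-01`` and ``gpu-node-02`` (the exact layout in ``run.py`` is authoritative):

```
# Locate the variable in the scheduler script
grep -n "physical_nodes" run.py
# Edit the entries so they are your GPU nodes' hostnames, e.g.:
#   physical_nodes = ['gpu-node-01', 'gpu-node-02']
```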

On the CPU (scheduler) node, run the following script:

```
python run.py --arrival 100 --num_gpu 4 --num_job 30 --random_trace
```

It will take several hours to finish these shortened experiments. If they complete successfully, the repository has been set up correctly.

## Clean up

Disable MIG and MPS, and clear the shared memory:

```
python mig_helper.py --disable
./disable_mps.sh
./workloads/clear_memory.sh
```

## Note

You can reach me at my email: [email protected]