https://github.com/sjtu-ipads/reef
REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.
- Host: GitHub
- URL: https://github.com/sjtu-ipads/reef
- Owner: SJTU-IPADS
- License: apache-2.0
- Created: 2022-05-27T03:06:39.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-24T14:49:00.000Z (almost 3 years ago)
- Last Synced: 2025-04-30T05:33:55.502Z (5 months ago)
- Language: Cuda
- Homepage:
- Size: 67 MB
- Stars: 94
- Watchers: 9
- Forks: 11
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# REEF - Real-time GPU-accelerated DNN Inference Scheduling System
REEF is a real-time GPU-accelerated DNN inference scheduling system that supports instant kernel preemption and controlled concurrent execution in GPU scheduling.
## Table of Contents
- [Introduction](#introduction)
- [Paper](#paper)
- [REEF Example](#reef-example)
- [Project Structure](#project-structure)
- [Hardware Requirement](#hardware-requirement)
- [Installation](#installation)
- [Artifact Evaluation](#artifact-evaluation)

## Introduction
REEF is a real-time GPU-accelerated DNN inference scheduling system.
REEF divides DNN inference tasks into two priorities: *real-time tasks (RT tasks)* and *best-effort tasks (BE tasks)*.
The scheduling goal of REEF is to minimize the latency of RT tasks while improving the overall throughput as much as possible. REEF achieves this goal with two key techniques:
* *Reset-based Preemption:* BE tasks can be preempted within a few microseconds once an RT task arrives. Preemption is achieved by simply killing the running BE kernels and clearing the queued BE kernels, which is based on the *idempotence* of DNN inference kernels.
* *Dynamic Kernel Padding (DKP):* BE tasks can be co-executed with the RT task by using only the CUs left over by the RT kernels. This approach improves throughput and avoids starvation of BE tasks with minimal latency overhead on RT tasks. (An illustrative sketch of both ideas follows this list.)
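The snippet below is a minimal, hypothetical CUDA sketch of these two ideas, not REEF's actual code (REEF targets AMD GPUs and generates its transformed kernels automatically; the kernel names and bodies here are illustrative assumptions): a BE kernel guarded by a preemption flag, which is safe to kill or skip precisely because an idempotent inference kernel can simply be re-run from its unchanged inputs, and a single "padded" launch whose leading thread blocks run RT work while the remaining blocks run BE work.

```cuda
// Illustrative sketch only: hypothetical kernels, not REEF's generated
// *.be.cu / *.trans.cu code. The "inference" bodies are stand-in math.
#include <cuda_runtime.h>

// Reset-based preemption: a BE kernel that aborts as soon as the scheduler
// sets a preemption flag. Because the kernel is idempotent (it only reads
// its inputs and overwrites its outputs), a killed or skipped instance can
// later be re-executed from scratch without corrupting any state.
__global__ void be_kernel(const float* in, float* out, int n,
                          const volatile int* preempt_flag) {
    if (*preempt_flag) return;                     // quit quickly when an RT task arrives
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;              // stand-in for real inference work
}

// Dynamic kernel padding: one launch whose first `rt_blocks` thread blocks
// execute the RT workload and whose remaining ("padded") blocks execute BE
// work, so the BE task only uses compute units left over by the RT kernel.
__global__ void padded_kernel(const float* rt_in, float* rt_out, int rt_n,
                              const float* be_in, float* be_out, int be_n,
                              int rt_blocks) {
    if (blockIdx.x < rt_blocks) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rt_n) rt_out[i] = rt_in[i] + 1.0f;                 // RT "kernel" body
    } else {
        int i = (blockIdx.x - rt_blocks) * blockDim.x + threadIdx.x;
        if (i < be_n) be_out[i] = be_in[i] * 2.0f;                 // BE "kernel" body
    }
}
```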
## REEF Example
After [building REEF](INSTALL.md), the example below shows how REEF works when there are concurrent tasks (one RT task and multiple BE tasks).
First, start a REEF server.
```bash
# in ./build
$ ./reef_server
```

Then, start multiple BE clients.
```bash
# in ./build
$ for i in {1..4}; do ./reef_client_cont ../resource/resnet152 resnet152 0 0 & done
```
You can see that the 4 BE clients submit BE tasks concurrently; each client echoes the inference latency of every task, e.g.:
```
client 3 inference latency: 16.567 ms
client 2 inference latency: 29.347 ms
client 1 inference latency: 32.506 ms
client 0 inference latency: 24.848 ms
```

Then, start an RT client, which submits requests without pause.
```bash
# in ./build
$ ./reef_client_cont ../resource/resnet152 resnet152 1 0
```

You can see that the RT client has the lowest inference latency.
```
...
client 4 inference latency: 12.743 ms
client 4 inference latency: 12.608 ms
client 4 inference latency: 12.944 ms
client 4 inference latency: 12.637 ms
```

Meanwhile, the BE tasks can still execute concurrently with the RT task without affecting the performance of the RT task.
```
...
client 2 inference latency: 48.183 ms
client 1 inference latency: 68.599 ms
client 0 inference latency: 34.857 ms
client 3 inference latency: 43.565 ms
```

## Project Structure
```
> tree .
├── cmake
├── resource                    # DNN model resources for the evaluations
│   ├── resnet                  # DNN model for ResNet
│   │   ├── resnet.json         # The schedule graph (metadata) of the DNN model
│   │   ├── resnet.cu           # The raw GPU device code (GPU kernels) for the DNN model
│   │   ├── resnet.trans.cu     # The transformed GPU device code which supports dynamic kernel padding
│   │   ├── resnet.be.cu        # The transformed GPU device code which supports reset-based preemption
│   │   └── resnet.profile.json # The profile of the kernel execution time
│   ├── densenet
│   └── inception
├── script                      # Utility scripts
├── src                         # Source code
│   ├── example                 # REEF examples
│   └── reef                    # REEF source code
└── env.sh                      # Environment variables
```

## Hardware Requirement
Currently, REEF only supports **AMD Radeon Instinct MI50 GPU**.
## Installation
see [INSTALL](INSTALL.md).
## Artifact Evaluation
For OSDI'22 artifact evaluation, see [reef-artifacts](https://github.com/SJTU-IPADS/reef-artifacts).
## Paper
If you use REEF in your research, please cite our paper:
```bibtex
@inproceedings {osdi2022reef,
author = {Mingcong Han and Hanze Zhang and Rong Chen and Haibo Chen},
title = {Microsecond-scale Preemption for Concurrent {GPU-accelerated} {DNN} Inferences},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
year = {2022},
isbn = {978-1-939133-28-1},
address = {Carlsbad, CA},
pages = {539--558},
url = {https://www.usenix.org/conference/osdi22/presentation/han},
publisher = {USENIX Association},
month = jul,
}
```

## The Team
REEF is developed and maintained by members from [IPADS@SJTU](https://github.com/SJTU-IPADS) and Shanghai AI Laboratory. See [Contributors](CONTRIBUTORS.md).
## License
REEF uses [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0.html).