https://github.com/ryujaehun/one-shot-tuner
one-shot-tuner
one-shot-tuner
- Host: GitHub
- URL: https://github.com/ryujaehun/one-shot-tuner
- Owner: ryujaehun
- License: apache-2.0
- Created: 2022-03-06T05:24:35.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-04T01:25:16.000Z (about 2 years ago)
- Last Synced: 2023-07-31T15:11:41.783Z (over 1 year ago)
- Language: Python
- Size: 86.1 MB
- Stars: 8
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
To cite One-Shot Tuner:
``` bibtex
@inproceedings{10.1145/3497776.3517774,
author = {Ryu, Jaehun and Park, Eunhyeok and Sung, Hyojin},
title = {One-Shot Tuner for Deep Learning Compilers},
year = {2022},
isbn = {9781450391832},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3497776.3517774},
doi = {10.1145/3497776.3517774},
abstract = {Auto-tuning DL compilers are gaining ground as an optimizing back-end for DL frameworks. While existing work can generate deep learning models that exceed the performance of hand-tuned libraries, they still suffer from prohibitively long auto-tuning time due to repeated hardware measurements in large search spaces. In this paper, we take a neural-predictor inspired approach to reduce the auto-tuning overhead and show that a performance predictor model trained prior to compilation can produce optimized tensor operation codes without repeated search and hardware measurements. To generate a sample-efficient training dataset, we extend input representation to include task-specific information and to guide data sampling methods to focus on learning high-performing codes. We evaluated the resulting predictor model, One-Shot Tuner, against AutoTVM and other prior work, and the results show that One-Shot Tuner speeds up compilation by 2.81x to 67.7x compared to prior work while providing comparable or improved inference time for CNN and Transformer models.},
booktitle = {Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction},
pages = {89–103},
numpages = {15},
keywords = {deep neural networks, autotuning, performance models, optimizing compilers},
location = {Seoul, South Korea},
series = {CC 2022}
}
```

# Install TVM
```
git clone --recursive https://github.com/ryujaehun/one-shot-tuner.git
cd one-shot-tuner
```

## Install the minimal prerequisites
```
sudo apt-get update
sudo apt-get install -y python3 python3-dev python3-setuptools python3-pip gcc libtinfo-dev zlib1g-dev build-essential libedit-dev libxml2-dev libjpeg-dev llvm llvm-10 llvm-10-dev clang-10 git
pip3 install cmake
```

Copy the default configuration into `build/`, then edit `build/config.cmake` to customize the compilation options:
```
mkdir build
cp cmake/config.cmake build
```
Change `set(USE_CUDA OFF)` to `set(USE_CUDA ON)` to enable the CUDA backend
(e.g. https://gist.github.com/ryujaehun/5c841d3f5a7f720a14a3a7eb05326176).

## Build TVM
```
cd build
cmake ..
make -j $(($(nproc) + 1))
cd ..
```

## Set the environment variables
Append the following lines to `~/.bashrc`:
```
export TVM_HOME=/path/to/one-shot-tuner
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
```

## Python dependencies
```
pip3 install tornado psutil xgboost cloudpickle decorator pytest
pip3 install -r requirements.txt
```

# Prior-Guided Task Sampling (PBS) and Exploration-Based Code Sampling (EBS)
The extracted dataset for CUDA is included.

- `-p` activates Prior-Guided Task Sampling.
- `-e` activates Exploration-Based Code Sampling.

```
python3 dataset_generate/sampling.py -p -e
```

# Training a cost model
```
python3 train_model/train.py --dataset_dir --layout NCHW --batch 1
```

# Evaluating and collecting results
You can run all main experiments (CUDA device, NCHW layout, batch size 1) using the `main.sh` script.
```
./main.sh
```

The script creates a folder from the save path and the parameters specified in the script, and saves the results there (seconds, FLOPS/s, and end-to-end time).
__Example path__
- `/eval_tuner/save_path/resnet-18/NCHW/1/sa/flops.npy`
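When post-processing many result files, it can help to recover the experiment parameters from the file path. The following is a minimal sketch, assuming the path segments follow the order `<model>/<layout>/<batch>/<tuner>/<metric>.npy` as the example path suggests. The field names and the `parse_result_path` helper are hypothetical and are not part of this repository.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ResultPath(NamedTuple):
    model: str   # e.g. "resnet-18"
    layout: str  # e.g. "NCHW"
    batch: int   # e.g. 1
    tuner: str   # e.g. "sa" (taken verbatim from the path; meaning assumed)
    metric: str  # file stem, e.g. "flops"


def parse_result_path(path: str) -> ResultPath:
    """Split .../<model>/<layout>/<batch>/<tuner>/<metric>.npy into fields."""
    p = PurePosixPath(path)
    if p.suffix != ".npy":
        raise ValueError(f"expected a .npy file, got {path!r}")
    # The four directories directly above the file carry the parameters.
    model, layout, batch, tuner = p.parts[-5:-1]
    return ResultPath(model, layout, int(batch), tuner, p.stem)


example = parse_result_path("/eval_tuner/save_path/resnet-18/NCHW/1/sa/flops.npy")
print(example)
```

The saved arrays themselves can then be read with `numpy.load`, provided NumPy is installed.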
__How to collect results__
```
python3 get_result.py
```

## Docker guide
If setting up the environment is difficult, try using the Docker container:
```
docker run -it --rm --gpus 1 --name test jaehun/ost:v2 bash # start the container
cd /root/tvm
./main.sh # start experiment
python3 get_result.py # get results
```

## Hardware dependencies
We recommend systems with an NVIDIA GeForce RTX 2080 Ti GPU and an Intel Xeon E5-2666 v3 CPU (AWS c4.4xlarge instances) for verifying GPU and CPU results, respectively.

## Software dependencies
Our code is implemented and tested on an Ubuntu 18.04 x86-64 system with CUDA 10.2 and cuDNN 7. Additional software dependencies include the minimal prerequisites on Ubuntu for TVM and a deep learning framework for model implementations, i.e., PyTorch v1.6.0. We highly recommend using the Docker image `jaehun/ost:v2`.
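
For a hand-built environment, a quick check before running the experiments may save time. The following is a minimal sketch using only the standard library. The module list is an assumption, taken from the `pip3 install` line plus `torch` and `tvm`, and the helper names are hypothetical. It only checks that modules can be located and that `TVM_HOME` is set; it does not verify versions.

```python
import importlib.util
import os

# Assumed module names: the pip3 install line above, plus torch and tvm.
REQUIRED_MODULES = [
    "tornado", "psutil", "xgboost", "cloudpickle",
    "decorator", "pytest", "torch", "tvm",
]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_environment(names=REQUIRED_MODULES, env=os.environ):
    """Collect human-readable problems; an empty list means nothing was found."""
    problems = [f"module not found: {n}" for n in missing_modules(names)]
    if not env.get("TVM_HOME"):
        problems.append("TVM_HOME is not set")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks complete"]:
        print(line)
```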