Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MachineLearningSystem/Lucid
Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- Host: GitHub
- URL: https://github.com/MachineLearningSystem/Lucid
- Owner: MachineLearningSystem
- License: other
- Fork: true (S-Lab-System-Group/Lucid)
- Created: 2022-10-26T03:23:54.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2022-10-17T04:12:08.000Z (about 2 years ago)
- Last Synced: 2024-08-02T19:35:47.268Z (4 months ago)
- Size: 3.15 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-AI-system - Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs ASPLOS'23
README
# Artifact for Lucid
This repository contains the artifact for our ASPLOS '23 paper "*Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs*". It includes the following parts:
+ `simulation`: Code and data for reproducing the key results in our paper.
+ `workloads`: PyTorch implementations of the 14 workloads used in our experiments.
+ `profile`: Code to collect traces of each training job type.
# Getting Started
### Results Reproduction (for ASPLOS '23 Artifact Evaluation)
`simulation` (adapted from [Helios](https://github.com/S-Lab-System-Group/HeliosArtifact)) contains instructions for reproducing the `Venus` cluster experiments shown in Section 4. These scripts have been tested on Ubuntu 20.04 with Python 3.9.
#### 0. Structure
The contents of the `simulation` folder are summarized as follows:
- **data/** contains `Venus` cluster job trace and cluster configuration used for evaluation.
- **analyzer/** contains the *Packing Analyze Model* and profiled workloads information used in our experiment.
- **estimator/** contains the *Workload Estimate Model* and job duration estimation for both Lucid and QSSF.
- **plot/** contains a notebook for visualizing the experiment results.
- **policy/** contains implementations of the Lucid scheduling policy and baseline policies, including FIFO, SJF, QSSF, and Tiresias.
- **predictor/** contains the *Throughput Predict Model* and the cluster throughput estimation for `Venus` in September.
- **profiler/** contains the Least-GPU-First and Auto-Scaling Profiler implementation for Lucid.
- **cluster.py**, **job.py** and **updater.py** contain implementations of the GPU cluster and workload logic.
- **simulator.py** is the main entry point of the simulator (a conceptual scheduling-loop sketch follows this list).
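For orientation, here is a toy sketch of the kind of discrete-time scheduling loop such a simulator runs. The `Job` fields, the FIFO policy, and all names below are illustrative assumptions; they do not reflect the actual classes in `cluster.py`, `job.py`, or `simulator.py`.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Job:
    # Hypothetical fields, for illustration only.
    name: str
    submit_time: int
    duration: int
    gpus: int
    start_time: Optional[int] = None

def fifo_simulate(jobs: List[Job], total_gpus: int, tick: int = 10) -> List[Job]:
    """Toy discrete-time FIFO simulation: advance time, start queued jobs
    when enough GPUs are free, and retire jobs whose duration has elapsed."""
    time, free = 0, total_gpus
    pending = sorted(jobs, key=lambda j: j.submit_time)
    running, finished = [], []
    while pending or running:
        # Retire jobs that have run for their full duration.
        for j in [j for j in running if time >= j.start_time + j.duration]:
            running.remove(j)
            finished.append(j)
            free += j.gpus
        # Launch queued jobs in FIFO order if resources allow.
        for j in list(pending):
            if j.submit_time <= time and j.gpus <= free:
                pending.remove(j)
                j.start_time = time
                running.append(j)
                free -= j.gpus
        time += tick
    return finished

jobs = [Job("a", 0, 30, 2), Job("b", 0, 20, 2), Job("c", 10, 10, 4)]
done = fifo_simulate(jobs, total_gpus=4)
print([(j.name, j.start_time) for j in done])
```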
#### 1. Environment Preparation
We suggest using a conda environment to install the dependencies:
```bash
conda create -n lucid python=3.9
conda activate lucid
cd simulation
pip install -r requirements.txt
```
Besides, we recommend executing Jupyter notebook (`.ipynb`) files with **VSCode** or **JupyterLab** (`conda install jupyterlab`).
#### 2. Lucid Model Training and Interpretation
We train the *Throughput Predict Model* as a reproduction example. Please follow the steps below:
+ Enter the `predictor` folder and open the `predictor.ipynb` file.
+ Run all cells inside the notebook. It contains the interpretable model (Primo EBM) used in Lucid and other ML baselines (LightGBM, XGBoost, Random Forest, DNN).
+ **Table 7: Interpretable Model Performance**: Check the `Result Comparison` cell; the MAE scores of all baselines are listed.
+ **Figure 13 (a): Throughput Predict Performance**: Check the `Prediction Visualization` cell (or the `Venus_throughput.pdf` output file); both the real and predicted throughput are plotted. The generated figure should show patterns similar to the paper's. Differences arise because we release the *Venus Job* throughput prediction code, whereas the paper plots the *Saturn Job* throughput prediction.
+ **Figure 7 (a)(b): Global Model Interpretation and Learned Shape Function**: Check the `Model Interpretation` cell (or the `interpret_Venus_throughput.pdf` & `interpret_Venus_shapefunc.pdf` output files). The generated figures should show patterns similar to the paper's. Differences arise because we release the *Venus Job* throughput prediction code, whereas the paper plots the *Saturn GPU* throughput prediction.
Additional model training code is provided in `estimator/estimator_lucid.ipynb` and `analyzer/analyzer.py`.
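As a rough reference for what the notebook does, here is a minimal, self-contained sketch of training an Explainable Boosting Machine regressor (the glassbox model family Lucid's Primo EBM belongs to) alongside a LightGBM baseline and comparing MAE. The synthetic data and feature semantics are placeholders, not the repository's profiled traces.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from interpret.glassbox import ExplainableBoostingRegressor  # pip install interpret
from lightgbm import LGBMRegressor                            # pip install lightgbm

# Synthetic placeholder data standing in for profiled job features
# (e.g. batch size, model scale) and the measured throughput target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 4))
y = 50 * X[:, 0] + 20 * np.sin(3 * X[:, 1]) + 5 * X[:, 2] * X[:, 3] + rng.normal(0, 1, 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ebm = ExplainableBoostingRegressor(random_state=0).fit(X_tr, y_tr)
gbm = LGBMRegressor(random_state=0).fit(X_tr, y_tr)

print("EBM MAE:     ", mean_absolute_error(y_te, ebm.predict(X_te)))
print("LightGBM MAE:", mean_absolute_error(y_te, gbm.predict(X_te)))

# The EBM's per-feature shape functions are what Figure 7 visualizes;
# ebm.explain_global() returns them for plotting.
```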
#### 3. Reproduce Baseline Results
Use the following command to run all baselines simultaneously:
```bash
cd simulation
python simulator.py --sweep
```
The output of this script looks like this:
```
2022 Oct 08 14:32:57 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13220000 | Total Job: 7603 | End job: 13 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13220000 | Total Job: 2826 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13230000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13230000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13240000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13220000 | Total Job: 2654 | End job: 1 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13240000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13250000 | Total Job: 7603 | End job: 121 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13220000 | Total Job: 1452 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13250000 | Total Job: 2826 | End job: 0 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13230000 | Total Job: 2654 | End job: 2 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13260000 | Total Job: 7603 | End job: 162 | Running job: 9 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13220000 | Total Job: 710 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13230000 | Total Job: 1452 | End job: 1 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13230000 | Total Job: 710 | End job: 0 | Running job: 1 | Pending job: 0
```
#### 4. Reproduce Lucid Results
Similarly, use the following command to run the Lucid policy:
```bash
python simulator.py -s lucid
```
The output of this script looks like this:
```
2022 Oct 08 14:45:07 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13220000 | Total Job: 23859 | End job: 17 | Running job: 1 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13230000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13240000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13250000 | Total Job: 23859 | End job: 136 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13270000 | Total Job: 23859 | End job: 385 | Running job: 3 | Pending job: 2 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13280000 | Total Job: 23859 | End job: 589 | Running job: 2 | Pending job: 0 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13290000 | Total Job: 23859 | End job: 780 | Running job: 2 | Pending job: 0 | Avail Nodes: 2
```
After the program has finished, you can check the results in the `log` folder. The job log and time sequence of each VC are provided separately.
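As an illustration of how such logs can be post-processed, below is a small pandas sketch that computes the average JCT and queuing delay from a per-VC job log. The file path and column names (`submit_time`, `start_time`, `end_time`) are assumptions, not necessarily the exact schema the simulator writes.

```python
import pandas as pd

# Hypothetical log path and column names; adjust to the actual files in `log/`.
df = pd.read_csv("log/vcEwI/job_log.csv")

jct = df["end_time"] - df["submit_time"]      # job completion time
queue = df["start_time"] - df["submit_time"]  # queuing delay

print(f"Average JCT:           {jct.mean():.1f}")
print(f"Average queuing delay: {queue.mean():.1f}")
print(f"Queuing delay p99.9:   {queue.quantile(0.999):.1f}")
```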
#### 5. Visualize the Key Results
We provide simulation analysis and plotting scripts to generate the figures shown in our paper. Please follow the steps below:
+ Enter the `plot` folder and open the `result_plot.ipynb` file.
+ Run all cells inside the notebook.
+ **Table 4: Scheduling Performance**: Check the `Table 4: Result Summary` cell (or the `result_summary.csv` output file); the Average JCT, Average Queuing Delay, and 99.9th-percentile Queuing Delay of all policies are listed.
+ **Table 5: Scheduling Performance (workload analysis)**: Check the `Table 5: Result Summary of Different Scales of Workloads` cell; the Average JCT and Average Queuing Delay of large and small jobs are listed.
+ **Figure 8: CDF of JCT**: Check the `Plot Result 8: JCT` cell (or the `result_cdf_jct.pdf` output file); the JCT CDF of each policy is plotted (a standalone plotting sketch follows this list).
+ **Figure 9: Queue Time in each VC**: Check the `Plot Result 9: Queue Time in each VC` cell (or the `result_bar_queue.pdf` output file); the queuing delay of each policy is plotted.
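If you prefer to plot the JCT CDF outside the notebook, a minimal matplotlib sketch looks like the following; the log paths and column names are assumptions, as above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-policy job logs; adjust paths and column names as needed.
for policy in ["fifo", "sjf", "qssf", "tiresias", "lucid"]:
    df = pd.read_csv(f"log/{policy}/job_log.csv")
    jct = np.sort(df["end_time"] - df["submit_time"])
    cdf = np.arange(1, len(jct) + 1) / len(jct)
    plt.plot(jct, cdf, label=policy)

plt.xscale("log")
plt.xlabel("JCT (s)")
plt.ylabel("CDF")
plt.legend()
plt.savefig("result_cdf_jct.pdf")
```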
# Workloads Profiling
The `profile` directory contains code for profiling metrics of multiple workloads.
## Directory
Note that `./result/` will be created when `main_co.py` or `main_single.py` is launched.
## Basic Usage
Running `main_co.py` generates the colocated jobs' metrics under `./result/colocate`. Running `main_single.py` generates single jobs' metrics under `./result/`. Workload-specific settings can be changed in each workload's profiling file, e.g. `profile_cifar.py`. The output will look like this (a minimal profiling sketch follows the sample output):
```
imagenet + imagenet
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
imagenet + cifar10
co-locate:
Files already downloaded and verified
==> Training ResNet18 model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
...
```
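For a sense of what per-job profiling measures, here is a small PyTorch sketch that times training iterations and reports throughput (samples/sec) for a single model. It is a simplified stand-in for the per-workload `profile_*.py` scripts; the model, batch size, and iteration count are chosen arbitrarily for illustration.

```python
import time
import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.mobilenet_v3_small(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size, iters = 32, 50
x = torch.randn(batch_size, 3, 224, 224, device=device)     # synthetic images
y = torch.randint(0, 10, (batch_size,), device=device)      # synthetic labels

model.train()
start = time.time()
for _ in range(iters):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()  # make sure all GPU work is finished before timing
elapsed = time.time() - start
print(f"throughput: {batch_size * iters / elapsed:.1f} samples/sec")
```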
## Datasets
The path storing all datasets is specified in `./workloads/settings.py` as `data_dir`. You can also specify the total runtime of some workloads by changing `total_runtime`.
- CIFAR-10: The CIFAR-10 dataset will be downloaded automatically (if it does not already exist) when `./workloads/cifar/profile_cifar.py` is run.
- ImageNet: The dataset is generated automatically in `./workloads/imagenet/profile_imagenet.py`.
- LSUN: The dataset is generated automatically in `./workloads/dcgan/profile_dcgan.py`. You can change the custom image size of generated data via `--imageSize`. The default value is 64.
- ShapeNet: Use the following commands to download the dataset under the `data_dir/shapenetcore/` directory:
```bash
wget https://shapenet.cs.stanford.edu/ericyi/shapenetcore_partanno_segmentation_benchmark_v0.zip --no-check-certificate
unzip shapenetcore_partanno_segmentation_benchmark_v0.zip
```
- SQuAD: The data can be downloaded from the following link and should be saved under the `data_dir/SQUAD_DIR/` directory:
[train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
- Wikitext2: The dataset can be downloaded from [wikitext-2](https://github.com/pytorch/examples/tree/main/word_language_model/data/wikitext-2). The files `test.txt`, `train.txt`, and `valid.txt` should be saved in the `data_dir/wikitext-2/` directory.
- Multi30k: First download the Moses tokenizer (http://www.statmt.org/moses/) for data preparation:
```bash
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
```
These files should be saved in `./workloads/translation/`. Then download the data into `data_dir/multi30k/`:
```bash
mkdir -p data/multi30k
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz && tar -xf mmt16_task1_test.tar.gz -C data/multi30k && rm mmt16_task1_test.tar.gz
```
Preprocess the data:
```bash
for l in en de; do for f in ~/data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
for l in en de; do for f in ~/data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
python preprocess.py -train_src ~/data/multi30k/train.en.atok -train_tgt ~/data/multi30k/train.de.atok -valid_src ~/data/multi30k/val.en.atok -valid_tgt ~/data/multi30k/val.de.atok -save_data ~/data/multi30k.atok.low.pt
```
Referenced from: https://github.com/Eathoublu/attention-is-all-you-need-pytorch.
- MovieLens: Use the following commands to download the dataset into `data_dir/ml-1m/`:
```bash
wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.negative
wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.rating
wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.train.rating
```