{"id":13604720,"url":"https://github.com/MachineLearningSystem/Lucid","last_synced_at":"2025-04-12T02:31:25.925Z","repository":{"id":185461845,"uuid":"557647517","full_name":"MachineLearningSystem/Lucid","owner":"MachineLearningSystem","description":"Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs","archived":false,"fork":true,"pushed_at":"2022-10-17T04:12:08.000Z","size":3305,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-08-02T19:35:47.268Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"S-Lab-System-Group/Lucid","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-10-26T03:23:54.000Z","updated_at":"2022-10-16T16:05:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"ded24bbe-adbe-4076-8a8c-5640f5f039b8","html_url":"https://github.com/MachineLearningSystem/Lucid","commit_stats":null,"previous_names":["machinelearningsystem/lucid"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FLucid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FLucid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FLucid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FLucid/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/Lucid/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489627,"owners_count":17153790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:50.498Z","updated_at":"2024-11-07T09:30:48.047Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# Artifact for Lucid\n\nThis repository contains the artifact for our ASPLOS '23 paper \"*Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs*\". It includes the following parts:\n\n+ `simulation`: Code and data for reproducing the key results in our paper.\n\n+ `workloads`: The PyTorch implementations of the 14 workloads used in our experiments.\n\n+ `profile`: Code to collect traces of each training job type.\n\n# Getting Started\n\n### Results Reproduction (for ASPLOS '23 Artifact Evaluation)\n`simulation` (adapted from [Helios](https://github.com/S-Lab-System-Group/HeliosArtifact)) contains instructions for reproducing the `Venus` cluster experiments shown in Section 4. These scripts have been tested on Ubuntu 20.04 with Python 3.9.\n\n#### 0. 
Structure\n\nThe contents of the `simulation` folder are summarized as follows:\n\n- **data/** contains the `Venus` cluster job trace and the cluster configuration used for evaluation.\n- **analyzer/** contains the *Packing Analyze Model* and the profiled workload information used in our experiments.\n- **estimator/** contains the *Workload Estimate Model* and job duration estimation for both Lucid and QSSF.\n- **plot/** contains the notebook for visualizing experiment results.\n- **policy/** contains implementations of the Lucid scheduling policy and baseline policies, including FIFO, SJF, QSSF, and Tiresias.\n- **predictor/** contains the *Throughput Predict Model* and cluster throughput estimation for `Venus` in September.\n- **cluster.py**, **job.py** and **updater.py** contain implementations of the GPU cluster and workload logic.\n- **profiler/** contains the Least-GPU-First and Auto-Scaling Profiler implementations for Lucid.\n- **simulator.py** is the main entry point of the simulator.\n\n\n#### 1. Environment Preparation\n\nWe suggest using a conda environment to install the dependencies:\n\n```bash\nconda create -n lucid python=3.9\nconda activate lucid\ncd simulation\npip install -r requirements.txt\n```\n\nIn addition, we recommend running Jupyter notebook (`.ipynb`) files with **VSCode** or **JupyterLab** (`conda install jupyterlab`).\n\n#### 2. Lucid Model Training and Interpretation\n\nWe train the *Throughput Predict Model* as a reproduction example. Please follow the steps below:\n\n+ Enter the `predictor` folder and open the `predictor.ipynb` file\n\n+ Run all cells inside the notebook. 
It contains the interpretable model (Primo EBM) used in Lucid and other ML baselines (LightGBM, XGBoost, Random Forest, DNN).\n\n+ **Table 7: Interpretable Model Performance**: Check the `Result Comparison` cell; the MAE scores of all baselines are listed there.\n\n+ **Figure 13 (a): Throughput Predict Performance**: Check the `Prediction Visualization` cell (or the `Venus_throughput.pdf` output file); both the real and predicted throughput are plotted. Generated figures should show similar patterns to those in the paper; the difference arises because we release the *Venus Job* throughput prediction code, while the paper plots the *Saturn Job* throughput prediction.\n\n+ **Figure 7 (a)(b): Global Model Interpretation and Learned Shape Function**: Check the `Model Interpretation` cell (or the `interpret_Venus_throughput.pdf` \u0026 `interpret_Venus_shapefunc.pdf` output files). Generated figures should show similar patterns to those in the paper; the difference arises because we release the *Venus Job* throughput prediction code, while the paper plots the *Saturn GPU* throughput prediction.\n\n\nMore model training code is also provided (`estimator/estimator_lucid.ipynb` and `analyzer/analyzer.py`).\n\n\n#### 3. 
Reproduce Baseline Results\n\nUse the following command to run all baselines simultaneously\n\n```bash\ncd simulation\npython simulator.py --sweep \n```\n\nThe output of this script looks like this:\n```\n2022 Oct 08 14:32:57 | MainProcess | Total Job Number in Cluster Training: 23859\n2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13220000 | Total Job: 7603 | End job: 13 | Running job: 2 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13220000 | Total Job: 2826 | End job: 0 | Running job: 0 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13230000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13230000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13240000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13220000 | Total Job: 2654 | End job: 1 | Running job: 1 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13240000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13250000 | Total Job: 7603 | End job: 121 | Running job: 4 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13220000 | Total Job: 1452 | End job: 0 | Running job: 0 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13250000 | Total Job: 2826 | End job: 0 | Running job: 2 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13230000 | Total Job: 2654 | End job: 2 | Running job: 0 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13260000 | Total Job: 7603 | End job: 162 | Running job: 9 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13220000 | Total Job: 710 | End job: 0 | Running job: 0 | Pending job: 
0\n2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13230000 | Total Job: 1452 | End job: 1 | Running job: 2 | Pending job: 0\n2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13230000 | Total Job: 710 | End job: 0 | Running job: 1 | Pending job: 0\n```\n\n#### 4. Reproduce Lucid Results\n\nSimilarly, use the following command to run the Lucid scheduler:\n\n```bash\npython simulator.py -s lucid\n```\n\nThe output of this script looks like this:\n```\n2022 Oct 08 14:45:07 | MainProcess | Total Job Number in Cluster Training: 23859\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13220000 | Total Job: 23859 | End job: 17 | Running job: 1 | Pending job: 0 | Avail Nodes: 2\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13230000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13240000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13250000 | Total Job: 23859 | End job: 136 | Running job: 0 | Pending job: 0 | Avail Nodes: 2\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13270000 | Total Job: 23859 | End job: 385 | Running job: 3 | Pending job: 2 | Avail Nodes: 1\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13280000 | Total Job: 23859 | End job: 589 | Running job: 2 | Pending job: 0 | Avail Nodes: 1\n2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13290000 | Total Job: 23859 | End job: 780 | Running job: 2 | Pending job: 0 | Avail Nodes: 2\n```\n\nAfter the program finishes, you can check the results in the `log` folder. The job log and time sequence of each VC are provided separately.\n\n\n#### 5. 
Visualize the Key Results\n\nWe provide simulation analysis and plot scripts to generate the figures shown in our paper. Please follow the steps below:\n\n+ Enter the `plot` folder and open the `result_plot.ipynb` file\n\n+ Run all cells inside the notebook.\n\n+ **Table 4: Scheduling Performance**: Check the `Table 4: Result Summary` cell (or the `result_summary.csv` output file); the Average JCT, Average Queuing Delay, and Queuing Delay 99.9 Quantile of all policies are listed.\n\n+ **Table 5: Scheduling Performance (workload analysis)**: Check the `Table 5: Result Summary of Different Scales of Workloads` cell; the Average JCT and Average Queuing Delay of large and small jobs are listed.\n\n\n+ **Figure 8: CDF of JCT**: Check the `Plot Result 8: JCT` cell (or the `result_cdf_jct.pdf` output file); the JCT CDFs of all policies are plotted.\n\n+ **Figure 9: Queue Time in each VC**: Check the `Plot Result 9: Queue Time in each VC` cell (or the `result_bar_queue.pdf` output file); the queuing delays of all policies are plotted.\n\n\n\n# Workloads Profiling\n\nThe `profile` directory contains code for profiling metrics of multiple workloads.\n\n## Directory\nNote that `./result/` will be created when `main_co.py` or `main_single.py` is launched.\n\n## Basic Usage\nRunning `main_co.py` generates the colocated jobs' metrics under `./result/colocate`. Running `main_single.py` generates single jobs' metrics under `./result/`. Some specific settings can be adjusted in each workload's profiling file, e.g. `profile_cifar.py`. 
The output will be like this:\n```\nimagenet + imagenet\nco-locate:\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 0 mp..\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 0 mp..\nco-locate:\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 0 mp..\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 1 mp..\nco-locate:\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 1 mp..\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 1 mp..\nimagenet + cifar10\nco-locate:\nFiles already downloaded and verified\n==\u003e Training ResNet18 model with 32 batchsize, 0 mp..\n==\u003e Training mobilenet_v3_small model with 32 batchsize, 0 mp..\n...\n```\n\n## Datasets\nThe data path storing all datasets is specified in `./workloads/settings.py` as `data_dir`. You can also specify the total runtime of some workloads by changing `total_runtime`.\n\n\n- CIFAR-10: The CIFAR-10 dataset will be downloaded automatically (if it does not already exist) when `./workloads/cifar/profile_cifar.py` is run.\n\n- ImageNet: The dataset is generated automatically in `./workloads/imagenet/profile_imagenet.py`.\n\n- LSUN: The dataset is generated automatically in `./workloads/dcgan/profile_dcgan.py`. You can change the image size of the generated data via `--imageSize`. 
The default value is 64.\n\n- ShapeNet: Use the following command to download the dataset under the `data_dir/shapenetcore/` directory:\n\n    ```bash\n    wget https://shapenet.cs.stanford.edu/ericyi/shapenetcore_partanno_segmentation_benchmark_v0.zip --no-check-certificate\n    unzip shapenetcore_partanno_segmentation_benchmark_v0.zip\n    ```\n\n- SQuAD: The data can be downloaded from the following link and should be saved under the `data_dir/SQUAD_DIR/` directory.\n\n    [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)\n\n- Wikitext2: The dataset can be downloaded from \n\n    [wikitext-2](https://github.com/pytorch/examples/tree/main/word_language_model/data/wikitext-2)\n\n    The files `test.txt`, `train.txt` and `valid.txt` should be saved in the `data_dir/wikitext-2/` directory.\n\n- Multi30k: First download the Moses tokenizer (http://www.statmt.org/moses/) for data preparation:\n    ```bash\n    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl\n    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de\n    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en\n    sed -i \"s/$RealBin\\/..\\/share\\/nonbreaking_prefixes//\" tokenizer.perl\n    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl\n    ```\n    These files should be saved in `./workloads/translation/`.\n\n    Then download the data into `data_dir/multi30k/`:\n    ```bash\n    mkdir -p data/multi30k\n    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz \u0026\u0026  tar -xf training.tar.gz -C data/multi30k \u0026\u0026 rm training.tar.gz\n    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz \u0026\u0026 tar -xf validation.tar.gz -C data/multi30k \u0026\u0026 rm 
validation.tar.gz\n    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz \u0026\u0026 tar -xf mmt16_task1_test.tar.gz -C data/multi30k \u0026\u0026 rm mmt16_task1_test.tar.gz\n    ```\n    Preprocess the data:\n    ```bash\n    for l in en de; do for f in ~/data/multi30k/*.$l; do if [[ \"$f\" != *\"test\"* ]]; then sed -i \"$ d\" $f; fi;  done; done\n    for l in en de; do for f in ~/data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q  \u003c $f \u003e $f.atok; done; done\n    python preprocess.py -train_src ~/data/multi30k/train.en.atok -train_tgt ~/data/multi30k/train.de.atok -valid_src ~/data/multi30k/val.en.atok -valid_tgt ~/data/multi30k/val.de.atok -save_data ~/data/multi30k.atok.low.pt\n    ```\n    Referenced from: https://github.com/Eathoublu/attention-is-all-you-need-pytorch.\n\n- MovieLens: Use the following command to download the dataset in `data_dir/ml-1m/`:\n    ```bash\n    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.negative\n    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.rating\n    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.train.rating\n    ```","funding_links":[],"categories":["Paper-Code"],"sub_categories":["GPU Cluster Management"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FLucid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FLucid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FLucid/lists"}