# ***Driple***


- [***Driple***](#driple)
  - [**Overview**](#overview)
  - [***Driple* inspector training**](#driple-inspector-training)
    - [**Environment setup**](#environment-setup)
    - [**Execute training**](#execute-training)
  - [**Training dataset generation**](#training-dataset-generation)
    - [**Input and output feature records**](#input-and-output-feature-records)
    - [**Training dataset generation**](#training-dataset-generation-1)
  - [**Reference**](#reference)


<img src="https://raw.githubusercontent.com/gsyang33/driple/master/others/structure.jpg" alt="*Driple* structure" width="700"/>


---
## **Overview**
*Driple* was introduced at ACM SIGMETRICS 2022. Please refer to the following papers for more details.
 - [Full paper](https://doi.org/10.1145/3530895)
 - [2-page abstract](https://doi.org/10.1145/3489048.3530962)

*Driple* trains a machine learning model, called the *Driple* inspector, to predict 12 metrics of resource consumption.
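Concretely, the 12 prediction targets are the cross product of three burst/idle statistics and four resources. A minimal sketch of that layout (the metric names here are illustrative, not the repository's actual label names):

```python
from itertools import product

# Illustrative names only -- the repository's dataset uses its own label format.
statistics = ["burst_duration", "idle_duration", "burst_consumption"]
resources = ["gpu_util", "gpu_mem_util", "net_tx", "net_rx"]

# 3 burst/idle statistics x 4 resources = 12 prediction targets.
targets = [f"{res}/{stat}" for res, stat in product(resources, statistics)]
```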
In particular, *Driple* predicts the 1) burst duration, 2) idle duration, and 3) burst consumption for each of 1) GPU utilization, 2) GPU memory utilization, 3) network TX throughput, and 4) network RX throughput.

*Driple* applies two key designs:
  - **Graph neural network (GNN)**: Machine learning libraries, such as TensorFlow and PyTorch, convert the given training code into a computational graph. The *Driple* inspector takes this graph as input, which lets it cover a broad spectrum of training workloads.
  - **Transfer learning**: A *Driple* inspector is built for each DT setting (i.e., the number of PSs and workers, the network interconnect type, and the GPU type) of a specific machine learning library (e.g., TensorFlow). We leverage transfer learning to reduce the training time and dataset size required for training.

---
## ***Driple* inspector training**

The implementation for training the *Driple* inspector is in `training`, which contains the following directories:
- `/training/driple`: Python code for training inspectors.
- `/training/models`: Python implementations of GNN models, such as the graph convolutional network (GCN), graph isomorphism network (GIN), message passing neural network (MPNN), and graph attention network (GAT).

The *Driple* inspector can be trained with or without transfer learning. We provide the pre-trained model we use for transfer learning; note that it is available only for the GCN algorithm.

### **Environment setup**
We implement and test the training part of the *Driple* inspector in a conda environment.
The dependencies and requirements of our conda setup are listed in `driple_training_requirement.txt`. You can create a similar conda environment with the following command.
```
conda install -n <env_name> --file driple_training_requirement.txt
```

### **Execute training**
To execute training, follow the commands below.
- Command for training **without TL**
  - The command below trains a model with the design choices (hyperparameters) that work best for *Driple*. You can easily change the hyperparameters through the command-line arguments.
  - Pass the training dataset via `--data`.
```
python3 -m driple.train.gcn --variable --gru --epochs=100000 --patience=1000 --variable_conv_layers=Nover2 --only_graph --hidden=64 --mlp_layers=3 --data=[Dataset].pkl
```


- Command for training **with TL**
  - To enable TL, add `--transfer` and specify the pre-trained model through `--pre_trained`.
  - For TL, the hyperparameters must be identical to those of the pre-trained model.
```
python3 -m driple.train.gcn --variable --gru --epochs=100000 --patience=1000 --variable_conv_layers=Nover2 --only_graph --hidden=64 --mlp_layers=3 --pre_trained=training/pre-train.pkl --data=[Dataset].pkl --transfer
```


---
## **Training dataset generation**

We first provide the 14 datasets used in the paper (`/dataset/examples`).
See "Details of the dataset" below for the detailed DT setting with which each dataset was built.

<details><summary><b>Details of the dataset</b></summary>

|          Name           |      GPU       | DP <br>topology |  Network   | # of GPU<br>machines |         Name          |     GPU      | DP <br>topology |  Network  | # of GPU<br>machines |
|:-----------------------:|:--------------:|:---------------:|:----------:|:--------------------:|:---------------------:|:------------:|:---------------:|:---------:|:--------------------:|
|    V100-P1w2/ho-PCIe    |      V100      |   PS1/w2/homo   | Co-located |          1           |  2080Ti-P4w4/he-40G   |    2080Ti    |  PS4/w4/hetero  |  40 GbE   |          2           |
|    V100-P2w2/ho-PCIe    |      V100      |   PS2/w2/homo   | Co-located |          1           | TitanRTX-P4w4/he-40G  | Titan<br>RTX |  PS4/w4/hetero  |  40 GbE   |          2           |
|   2080Ti-P1w2/ho-PCIe   |     2080Ti     |   PS1/w2/homo   | Co-located |          1           |    V100-P5w5/he-1G    |     V100     |  PS5/w5/hetero  |   1 GbE   |          5           |
|   2080Ti-P1w3/ho-PCIe   |     2080Ti     |   PS1/w3/homo   | Co-located |          1           |   2080Ti-P5w5/he-1G   |    2080Ti    |  PS5/w5/hetero  |   1 GbE   |          5           |
|   2080Ti-P2w2/he-PCIe   |     2080Ti     |  PS2/w2/hetero  | Co-located |          1           |   V100-P5w10/he-1G    |     V100     | PS5/w10/hetero  |   1 GbE   |          5           |
|  TitanRTX-P2w2/he-PCIe  |  Titan<br>RTX  |  PS2/w2/hetero  | Co-located |          1           |  2080Ti-P5w10/he-1G   |    2080Ti    | PS5/w10/hetero  |   1 GbE   |          5           |
|   2080Ti-P2w2/he-40G    |     2080Ti     |  PS2/w2/hetero  |   40 GbE   |          2           |                       |              |                 |           |                      |
|  TitanRTX-P2w2/he-40G   |  Titan<br>RTX  |  PS2/w2/hetero  |   40 GbE   |          2           |                       |              |                 |           |                      |

</details>


The datasets consist of representative image classification and natural language processing models. We use tf_cnn_benchmark and OpenNMT to run the models.

For developers who want to create their own datasets, we provide an example of dataset generation below.


### **Input and output feature records**
To be updated soon.

### **Training dataset generation**

We convert computational graphs into adjacency and feature matrices, and then produce the training dataset composed of the converted matrices and output features.


- Command
  - Give the path of the resource consumption measurement with the `--perf_result` parameter.
  - Enter the path where the dataset will be saved with `--save_path`, and the file name with `--dataset_name`.
  - For more information about the parameters, use the `--help` option.
```
python3 dataset_builder/generate_dataset.py --perf_result=[Result].csv --batch_size=32 --num_of_groups=100 --num_of_graphs=320 --save_path=[Path] --dataset_name=[Dataset].pkl
```


---
## **Reference**

 - Gyeongsik Yang, Changyong Shin, Jeunghwan Lee, Yeonho Yoo, and Chuck Yoo. 2022. Prediction of the Resource Consumption of Distributed Deep Learning Systems. <i>Proc. ACM Meas. Anal. Comput. Syst.</i> 6, 2, Article 29 (June 2022), 25 pages. https://doi.org/10.1145/3530895
 - Gyeongsik Yang, Changyong Shin, Jeunghwan Lee, Yeonho Yoo, and Chuck Yoo. 2022. Prediction of the Resource Consumption of Distributed Deep Learning Systems. In <i>Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems</i> (<i>SIGMETRICS/PERFORMANCE '22</i>). Association for Computing Machinery, New York, NY, USA, 69–70. https://doi.org/10.1145/3489048.3530962
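As a rough illustration of the dataset-generation step described above, which converts a computational graph into adjacency and feature matrices, here is a minimal sketch for a toy graph. The single out-degree feature and the matrix layout are assumptions for illustration only, not the repository's actual format:

```python
# Toy computational graph: node id -> list of successor node ids.
# (Illustrative only; the real graphs are extracted from the
# machine learning library, e.g., TensorFlow.)
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
n = len(graph)

# Adjacency matrix as nested lists: adj[i][j] = 1 for an edge i -> j.
adj = [[0] * n for _ in range(n)]
for src, dsts in graph.items():
    for dst in dsts:
        adj[src][dst] = 1

# Feature matrix: one row per node with a single assumed feature,
# the node's out-degree; the real pipeline extracts richer
# per-operation features.
feats = [[len(graph[i])] for i in range(n)]
```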