Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pengyanghua/dl2
A deep learning-driven scheduler for elastic training in deep learning clusters.
- Host: GitHub
- URL: https://github.com/pengyanghua/dl2
- Owner: pengyanghua
- Created: 2021-01-12T05:03:22.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2021-01-14T12:31:56.000Z (about 4 years ago)
- Last Synced: 2024-11-29T22:11:44.186Z (about 2 months ago)
- Topics: cluster-scheduler, deep-neural-networks
- Language: Python
- Homepage:
- Size: 146 KB
- Stars: 27
- Watchers: 2
- Forks: 4
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# DL2
DL2 is a deep learning-driven scheduler for elastic training in deep learning clusters. DL2 advocates a joint supervised learning and reinforcement learning approach: a neural network is first warmed up via offline supervised learning on job traces produced by the existing cluster scheduler; the network is then plugged into the live DL cluster, fine-tuned by reinforcement learning throughout the training progress of the DL jobs, and used to decide job resource allocations in an online fashion. See [this figure](./workflow.pdf) for an illustration of the overall workflow.
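The two-phase idea can be sketched as follows. This is a minimal illustration, not the repository's actual network: the state and action dimensions, layer sizes, and learning rates are hypothetical, and a simple REINFORCE-style policy gradient stands in for the paper's RL algorithm. It uses the TensorFlow 1.x API noted under Prerequisites below.
```python
import tensorflow as tf  # 1.x API, e.g. tensorflow-gpu==1.13.1

STATE_DIM, NUM_ACTIONS = 64, 8  # hypothetical sizes

states = tf.placeholder(tf.float32, [None, STATE_DIM])
hidden = tf.layers.dense(states, 128, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, NUM_ACTIONS)
policy = tf.nn.softmax(logits)  # per-job resource-allocation action probabilities

# Phase 1: supervised warm-up against actions taken by the existing scheduler.
expert_actions = tf.placeholder(tf.int32, [None])
sl_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=expert_actions, logits=logits))
sl_step = tf.train.AdamOptimizer(1e-3).minimize(sl_loss)

# Phase 2: policy-gradient fine-tuning with rewards observed on the live cluster.
taken_actions = tf.placeholder(tf.int32, [None])
advantages = tf.placeholder(tf.float32, [None])
log_prob = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=taken_actions, logits=logits)
rl_loss = -tf.reduce_mean(log_prob * advantages)
rl_step = tf.train.AdamOptimizer(1e-4).minimize(rl_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Phase 1: feed (state, expert action) pairs from historical job traces.
    # Phase 2: feed (state, taken action, advantage) tuples from live rollouts.
```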
## Prerequisites
We use TensorFlow to train the model. Make sure you have installed a 1.x version:
```
pip install tensorflow-gpu==1.13.1
```
## Training
To train the model, run the following command. It will start multiple processes that train a centralized model.
```
python train.py
```
Check [parameters.py](./parameters.py) if you want to change any hyper-parameters. For ease of comparison, we also provide a script, [experiment.py](./experiment.py), that lets you run different configurations.
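As a rough illustration of "multiple processes training a centralized model", the sketch below (not the repo's `train.py`) has worker processes push gradients through a queue to a single parameter process that applies them to shared weights. All names and sizes are hypothetical, and a random vector stands in for a real gradient.
```python
import multiprocessing as mp
import numpy as np

STEPS_PER_WORKER = 100

def worker(grad_queue, dim):
    # Hypothetical worker: compute a gradient from its own experience, push it.
    for _ in range(STEPS_PER_WORKER):
        grad = np.random.randn(dim) * 0.01  # placeholder for a real gradient
        grad_queue.put(grad)

def parameter_server(grad_queue, weights, num_updates, lr=1e-3):
    # Apply every worker gradient to the single centralized weight vector.
    for _ in range(num_updates):
        grad = grad_queue.get()
        with weights.get_lock():
            w = np.frombuffer(weights.get_obj())
            w -= lr * grad

if __name__ == "__main__":
    DIM, NUM_WORKERS = 16, 4  # hypothetical sizes
    weights = mp.Array("d", DIM)  # shared central model parameters
    grads = mp.Queue()
    procs = [mp.Process(target=worker, args=(grads, DIM))
             for _ in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    parameter_server(grads, weights, num_updates=STEPS_PER_WORKER * NUM_WORKERS)
    for p in procs:
        p.join()
```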
## Trace
We put some traces collected from our testbed in [config_speed.txt](./config_speed.txt). You may need to collect your own traces if running on a different setup. For a k8s setup, please check [Optimus](https://github.com/pengyanghua/optimus).
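If you collect your own traces, a loader along these lines may be a starting point. It is purely illustrative: the actual schema of `config_speed.txt` is defined by the repo and not assumed here, so the per-line parsing is left to the real field layout.
```python
def load_trace(path):
    # Read one trace record per non-empty, non-comment line.
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            records.append(line.split())  # split into fields per the actual schema
    return records

if __name__ == "__main__":
    for record in load_trace("config_speed.txt"):
        print(record)
```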
## Elastic Scaling
Please check the [MXNet repo](https://github.com/pengyanghua/mxnet) for the implementation of elastic resource scaling. We have modified the communication library, including KVStore and ps-lite.
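For context, the snippet below shows standard, unmodified MXNet KVStore usage, i.e. the push/pull interface that the modified KVStore and ps-lite extend for elastic scaling. It is a generic MXNet example, not code from the modified repo.
```python
import mxnet as mx

kv = mx.kv.create('local')          # e.g. 'dist_sync' for distributed training
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))       # initialize key 3 on the (parameter) store
kv.push(3, mx.nd.ones(shape) * 2)   # workers push updates for key 3
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                 # pull the aggregated value back
print(out.asnumpy())
```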
## Publication
To Add.