https://github.com/MachineLearningSystem/Tiresias
A GPU Cluster Manager for Distributed Deep Learning Training
- Host: GitHub
- URL: https://github.com/MachineLearningSystem/Tiresias
- Owner: MachineLearningSystem
- License: apache-2.0
- Fork: true (SymbioticLab/Tiresias)
- Created: 2022-05-24T01:08:41.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2020-05-07T01:45:03.000Z (over 4 years ago)
- Last Synced: 2024-08-02T19:36:14.639Z (4 months ago)
- Homepage:
- Size: 74.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-AI-system - Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information NSDI'19
README
Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information
====

Tiresias is a GPU cluster resource manager that aims at minimizing distributed deep learning (DDL) jobs’ completion times with partial or no a priori knowledge. It does not rely on any intermediate DL algorithm states (e.g., training loss values) or framework specifics (e.g., tensors-to-parameter server mapping).

DDL training jobs bring some unique challenges to the cluster manager:
1. unpredictable training time
2. over-aggressive job consolidation
3. all-or-nothing resource allocation
4. inflexibility in GPU sharing (job preemption and resumption)

Tiresias tackles those challenges with the **Discretized-2DAS** (two-dimensional age/attained-service based) scheduler and the model profile-based job placement scheme.
The *2DAS* scheduler, which considers both the spatial (GPU requirements) and temporal (job's executed time) aspects of DDL jobs, has two scheduling algorithms (*Discretized 2D-LAS* and *Discretized 2D-Gittins Index*). They can minimize the average JCT with no and partial job knowledge, respectively.
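
As a rough illustration of the Discretized 2D-LAS idea, the sketch below measures a job's attained service as GPUs multiplied by executed time and maps it onto a small set of priority queues. It is a minimal sketch under stated assumptions, not the code from this repository: the names `Job`, `queue_index`, and `schedule`, the threshold values, the FIFO tie-break inside a queue, and the choice to skip jobs that do not fit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    num_gpus: int          # spatial dimension: GPUs the job requests
    executed_time: float   # temporal dimension: seconds the job has run so far
    arrival_time: float    # assumed tie-breaker inside a queue (FIFO)

    @property
    def attained_service(self) -> float:
        # Two-dimensional attained service in GPU-seconds.
        return self.num_gpus * self.executed_time


def queue_index(job, thresholds):
    """Return the priority queue of a job (0 is the highest priority).

    `thresholds` is an ascending list of attained-service limits; the
    values used by a real deployment are not given in this README.
    """
    for i, limit in enumerate(thresholds):
        if job.attained_service < limit:
            return i
    return len(thresholds)


def schedule(jobs, free_gpus, thresholds):
    """Pick jobs to run: lowest queue first, FIFO within a queue.

    Allocation is all-or-nothing: a job either gets every GPU it asks
    for or it is skipped (skipping is an assumption of this sketch).
    """
    ordered = sorted(jobs, key=lambda j: (queue_index(j, thresholds), j.arrival_time))
    chosen = []
    for job in ordered:
        if job.num_gpus <= free_gpus:
            chosen.append(job)
            free_gpus -= job.num_gpus
    return chosen
```

Discretizing the priority into queues means a job only changes priority when it crosses a threshold, which limits how often running jobs are preempted compared with ordering by raw attained service.
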
The profile-based job placement scheme can appropriately relax the consolidation constraints and maintain the resource (GPU) utilization of the cluster without hurting jobs’ performance.

Our testbed experiments and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5x (2x) over current production solutions (state-of-the-art DDL cluster scheduler), and it performs comparably to the solution using perfect knowledge of all job characteristics.

Detailed design and performance are available in our [NSDI'19 paper](https://www.usenix.org/conference/nsdi19/presentation/gu).

What's in this repository?
-----------
1. Discrete-time simulator of GPU cluster manager for DL training jobs (with both the job scheduler and placement scheme)
**Coming soon ...**
2. Network(RDMA)-level message profiler for DL models
3. ...
Others
-----------
1. What's **LAS** (Least-Attained Service) algorithm?
Nuyens, Misja, and Adam Wierman. "The foreground–background queue: a survey." Performance evaluation 65.3-4 (2008): 286-307.
2. What's **Gittins Index** policy?
Gittins, John, Kevin Glazebrook, and Richard Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
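
For readers who want a concrete feel for the Gittins Index policy, the sketch below estimates the index from historical job sizes for one fixed service quantum: the probability that a job finishes within the next `delta` units of service, divided by the service it is expected to consume in that window. It is a hedged illustration, not this repository's implementation. The function name `gittins_index`, the use of an empirical sample list, and the fixed `delta` are assumptions; the full Gittins index takes the maximum of this ratio over all quanta.

```python
def gittins_index(attained, delta, samples):
    """Empirical Gittins-style index for a job that has received `attained` service.

    `samples` are total service demands of past jobs (hypothetical input),
    and `delta` (> 0) is the service quantum considered.
    """
    # Condition on jobs that need more service than already attained.
    remaining = [s - attained for s in samples if s > attained]
    if not remaining:
        return 0.0
    # Probability the job completes within the next `delta` units of service.
    p_finish = sum(1 for r in remaining if r <= delta) / len(remaining)
    # Expected service consumed in that window (capped at `delta`).
    expected_cost = sum(min(r, delta) for r in remaining) / len(remaining)
    return p_finish / expected_cost
```

A scheduler built on this would run the jobs with the largest index first, so jobs that are likely to finish soon for little additional service are favored.
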