https://github.com/MachineLearningSystem/Tiresias

A GPU Cluster Manager for Distributed Deep Learning Training

Host: GitHub
URL: https://github.com/MachineLearningSystem/Tiresias
Owner: MachineLearningSystem
License: apache-2.0
Fork: true (SymbioticLab/Tiresias)
Created: 2022-05-24T01:08:41.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2020-05-07T01:45:03.000Z (over 5 years ago)
Last Synced: 2024-08-02T19:36:14.639Z (over 1 year ago)
Homepage:
Size: 74.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-AI-system - Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information NSDI'19

README

Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information
====

Tiresias is a GPU cluster resource manager that aims at minimizing distributed deep learning (DDL) jobs’ completion times with partial or no a priori knowledge. It does not rely on any intermediate DL algorithm states (e.g., training loss values) or framework specifics (e.g., tensors-to-parameter server mapping).

DDL training jobs bring some unique challenges to the cluster manager:
1. unpredictable training time
2. over-aggressive job consolidation
3. all-or-nothing resource allocation
4. inflexibility in GPU sharing (job preemption and resumption)

Tiresias tackles these challenges with the **Discretized-2DAS** (two-dimensional age/attained-service based) scheduler and a model profile-based job placement scheme.
The *2DAS* scheduler considers both the spatial (GPU requirements) and temporal (executed time) aspects of DDL jobs. It has two scheduling algorithms, *Discretized 2D-LAS* and *Discretized 2D-Gittins Index*, which minimize the average job completion time (JCT) with no job knowledge and with partial job knowledge, respectively.
The profile-based job placement scheme relaxes the consolidation constraints where appropriate and maintains the GPU utilization of the cluster without hurting jobs’ performance.
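
The idea behind the 2D-LAS variant can be illustrated with a minimal sketch. It is not the repository's simulator code: the `Job` fields, the `thresholds` values, the FIFO tie-break inside a queue, and the backfilling of smaller jobs are all assumptions made for illustration. The only parts taken from the description above are that priority depends on both GPU count and executed time, and that a job receives all of its GPUs or none.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    num_gpus: int         # spatial dimension: GPUs the job requires
    executed_time: float  # temporal dimension: time the job has run so far
    arrival: float        # used here only to break ties inside a queue

    @property
    def attained_service(self) -> float:
        # Two-dimensional attained service: GPUs multiplied by executed time.
        return self.num_gpus * self.executed_time


def queue_index(job: Job, thresholds: list[float]) -> int:
    """Map attained service to a discrete priority queue (0 is highest)."""
    for i, limit in enumerate(thresholds):
        if job.attained_service < limit:
            return i
    return len(thresholds)


def schedule(jobs: list[Job], free_gpus: int, thresholds: list[float]) -> list[str]:
    """Pick the jobs to run, least-attained-service queue first."""
    ordered = sorted(jobs, key=lambda j: (queue_index(j, thresholds), j.arrival))
    running = []
    for job in ordered:
        # All-or-nothing allocation: a job runs only if every GPU it needs is free.
        if job.num_gpus <= free_gpus:
            free_gpus -= job.num_gpus
            running.append(job.name)
    return running


jobs = [
    Job("A", num_gpus=4, executed_time=100.0, arrival=0.0),
    Job("B", num_gpus=1, executed_time=50.0, arrival=1.0),
    Job("C", num_gpus=2, executed_time=0.0, arrival=2.0),
]
print(schedule(jobs, free_gpus=4, thresholds=[100.0, 1000.0]))
```

In this example job A has already attained 400 GPU-seconds and falls into the lower-priority queue, so B and C are scheduled first and A waits until four GPUs are free. Discretizing into queues means a job's priority changes only when it crosses a threshold, not continuously as it runs.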

Our testbed experiments and large-scale trace-driven simulations show
that Tiresias improves the average JCT by up to 5.5x over current production solutions and by up to 2x over a state-of-the-art DDL cluster scheduler,
and that it performs comparably to solutions using perfect knowledge of all job characteristics.

Detailed design and performance are available in our [NSDI'19 paper](https://www.usenix.org/conference/nsdi19/presentation/gu).

What's in this repository?
-----------

1. Discrete-time simulator of a GPU cluster manager for DL training jobs (with both the job scheduler and the placement scheme)

**Coming soon ...**

2. Network (RDMA)-level message profiler for DL models

3. ...

Others
-----------
1. What's the **LAS** (Least-Attained Service) algorithm?
Nuyens, Misja, and Adam Wierman. "The foreground–background queue: a survey." Performance Evaluation 65.3-4 (2008): 286-307.

2. What's the **Gittins Index** policy?
Gittins, John, Kevin Glazebrook, and Richard Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
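
For readers new to the Gittins index, the sketch below computes it for a single job from an empirical sample of total service demands. It uses the standard single-server scheduling form of the index: the largest ratio, over all candidate service quanta, between the probability that the job finishes within the quantum and the expected service spent in it. The function name, the `service_samples` input, and the use of raw samples in place of a fitted distribution are assumptions for illustration. This is not the discretized 2D variant used by Tiresias, where service would be measured in GPU-time.

```python
def gittins_index(attained: float, service_samples: list[float]) -> float:
    """Empirical Gittins index of a job that has received `attained` service.

    `service_samples` holds total service demands of past jobs (hypothetical
    input). A higher index means the job should be served sooner.
    """
    # Condition on the job not having finished yet.
    remaining = [s - attained for s in service_samples if s > attained]
    if not remaining:
        return 0.0
    n = len(remaining)
    best = 0.0
    # With finitely many samples the maximum is reached at a sample point.
    for delta in sorted(set(remaining)):
        finish_prob = sum(1 for r in remaining if r <= delta) / n
        expected_cost = sum(min(r, delta) for r in remaining) / n
        best = max(best, finish_prob / expected_cost)
    return best


samples = [1.0, 10.0]
print(gittins_index(0.0, samples))  # fresh job
print(gittins_index(1.0, samples))  # job that has outlived the short case
```

With these two samples a fresh job has an index of 0.5, because one unit of service finishes it half of the time. Once it has received one unit without finishing, it must be the long job, and its index drops to 1/9.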
