{"id":47189314,"url":"https://github.com/heyfey/vodascheduler","last_synced_at":"2026-03-13T10:34:37.296Z","repository":{"id":37854785,"uuid":"357297024","full_name":"heyfey/vodascheduler","owner":"heyfey","description":"GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23)","archived":false,"fork":false,"pushed_at":"2023-11-11T17:49:52.000Z","size":22970,"stargazers_count":31,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-15T07:52:20.808Z","etag":null,"topics":["deep-learning","distributed-computing","horovod","kubeflow","kubernetes","machine-learning","mlops","pytorch","scheduling","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/heyfey.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-12T18:23:17.000Z","updated_at":"2024-06-20T10:07:37.000Z","dependencies_parsed_at":"2024-06-20T00:03:04.651Z","dependency_job_id":"e5b1bc53-7e89-4e2e-b253-3cf6608ad37d","html_url":"https://github.com/heyfey/vodascheduler","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/heyfey/vodascheduler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heyfey%2Fvodascheduler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heyfey%2Fvodascheduler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heyfey%2Fvodascheduler/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heyfey%2Fvodascheduler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/heyfey","download_url":"https://codeload.github.com/heyfey/vodascheduler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heyfey%2Fvodascheduler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30465461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T06:34:02.089Z","status":"ssl_error","status_checked_at":"2026-03-13T06:33:49.182Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","distributed-computing","horovod","kubeflow","kubernetes","machine-learning","mlops","pytorch","scheduling","tensorflow"],"created_at":"2026-03-13T10:34:36.701Z","updated_at":"2026-03-13T10:34:37.277Z","avatar_url":"https://github.com/heyfey.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntags: voda-scheduler\n---\n\n# Voda Scheduler\n\n\u003e Note that everything is experimental and may change significantly at any time.\n\nVoda scheduler is a GPU scheduler for elastic deep learning workloads based on [Kubernetes](https://github.com/kubernetes/kubernetes), [Kubeflow Training 
Operator](https://github.com/kubeflow/training-operator) and [Horovod](https://github.com/horovod/horovod).\n\n\nVoda Scheduler is designed to be easily deployed in any Kubernetes cluster. For more architectural details, see [design](https://github.com/heyfey/vodascheduler/blob/main/doc/design/voda-scheduler-design.md).\n\n---\n\nContents\n- [Why Elastic Training?](#Why-Elastic-Training)\n- [Why Voda Scheduler?](#Why-Voda-Scheduler)\n- [Demo](#Demo)\n- [Get Started](#Get-Started)\n- [Scheduling Algorithms](#Scheduling-Algorithms)\n- [Docker Images](#Docker-Images)\n- [Prometheus Metrics Exposed](#Prometheus-Metrics-Exposed)\n- [Related Projects](#Related-Projects)\n- [Reference](#Reference)\n\n## Why Elastic Training?\n\nElastic training lets distributed training jobs scale up and down dynamically at runtime without interrupting the training process.\n\nWith elastic training, the scheduler can let training jobs use idle resources when any are available and allocate resources as efficiently as possible when the cluster is heavily loaded, increasing cluster throughput and reducing overall training time.\n\nFor more information about elastic training, see [Elastic Horovod](https://horovod.readthedocs.io/en/stable/elastic_include.html), [Torch Distributed Elastic](https://pytorch.org/docs/stable/distributed.elastic.html), or [Elastic Training](https://github.com/skai-x/elastic-training).\n\n## Why Voda Scheduler?\n\nVoda Scheduler provides several critical features for elastic deep learning workloads:\n\n- Rich [Scheduling Algorithms](#Scheduling-Algorithms) (with resource elasticity) to choose from\n- [Topology-Aware Scheduling \u0026 Worker Migration](https://github.com/heyfey/vodascheduler/blob/main/doc/design/placement-management.md)\n    - Actively consolidates resources to maximize cluster throughput\n    - Particularly important for elastic training, since resource allocations can be adjusted dynamically\n- Node Addition/Deletion 
Awareness\n    - Works alongside an existing autoscaler\n    - Makes the best use of spot instances that may come and go with little warning\n    - Tolerates failing nodes\n- Fault Tolerance\n\n## Demo\n\nCheck out the [demo](https://youtu.be/M1sUd_-0LnQ) to see how resource allocations are dynamically adjusted (and how worker pods are migrated) to maximize cluster throughput.\n\n## Prerequisite\n\nA Kubernetes cluster, in the cloud or on-premises, that can [schedule GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/). Voda Scheduler is tested with `v1.20`.\n\n## [Get Started](https://github.com/heyfey/vodascheduler/blob/main/doc/get-started.md)\n\n1. [Config Scheduler](https://github.com/heyfey/vodascheduler/blob/main/doc/get-started.md#Config-Scheduler)\n2. [Deploy Scheduler](https://github.com/heyfey/vodascheduler/blob/main/doc/get-started.md#Deploy-Scheduler)\n3. [Submit Training Job to Scheduler](https://github.com/heyfey/vodascheduler/blob/main/doc/get-started.md#Submit-Training-Job-to-Scheduler)\n4. [API Endpoints](https://github.com/heyfey/vodascheduler/blob/main/doc/apis.md)\n\n\n## Scheduling Algorithms\n\n\n| Algorithm | Elastic | Reference |\n| -------- | -------- | -------- |\n| FIFO   |     |      |\n| Elastic-FIFO (default)     | :heavy_check_mark:    |      |\n| SRJF             |     |      |\n| Elastic-SRJF     | :heavy_check_mark:    |      |\n| Tiresias         |     | Gu, Juncheng, et al. \"Tiresias: A GPU cluster manager for distributed deep learning.\" 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 2019. https://www.usenix.org/conference/nsdi19/presentation/gu     |\n| Elastic-Tiresias | :heavy_check_mark:    | Wu, Yidi, et al. \"Elastic Deep Learning in Multi-Tenant GPU Clusters.\" IEEE Transactions on Parallel and Distributed Systems (2021). https://ieeexplore.ieee.org/abstract/document/9373916     |\n| FfDL Optimizer   | :heavy_check_mark:    | Saxena, Vaibhav, et al. 
\"Effective elastic scaling of deep learning workloads.\" 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2020. https://ieeexplore.ieee.org/abstract/document/9285954     |\n| AFS-L            | :heavy_check_mark:    | Hwang, Changho, et al. \"Elastic Resource Sharing for Distributed Deep Learning.\" 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 2021. https://www.usenix.org/system/files/nsdi21-hwang.pdf     |\n\n\n## Docker Images\n\n- [Voda Scheduler Docker Images](https://github.com/heyfey/vodascheduler/tree/main/docker)\n\n## Prometheus Metrics Exposed\n\n- [Prometheus Metrics Exposed](https://github.com/heyfey/vodascheduler/tree/main/doc/prometheus-metrics-exposed.md)\n\n## Related Projects\n\n- [kubeflow/training-operator](https://github.com/kubeflow/training-operator): Training operators on Kubernetes.\n- [kubeflow/mpi-operator](https://github.com/kubeflow/mpi-operator): Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.).\n- [horovod/horovod](https://github.com/horovod/horovod): Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.\n- [heyfey/munkres](https://github.com/heyfey/munkres): Hungarian algorithm used in the placement algorithm.\n- [heyfey/nvidia_smi_exporter](https://github.com/heyfey/nvidia_smi_exporter): nvidia-smi exporter for Prometheus, used to monitor GPUs in the cluster.\n\n## Reference\n\nT.-T. Hsieh and C.-R. Lee, \"Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters,\" 2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, MA, USA, 2023, pp. 131-140, doi: 10.1109/IC2E59103.2023.00023. 
[https://ieeexplore.ieee.org/document/10305838](https://ieeexplore.ieee.org/document/10305838)\n\n```\n@INPROCEEDINGS{10305838,\n  author={Hsieh, Tsung-Tso and Lee, Che-Rung},\n  booktitle={2023 IEEE International Conference on Cloud Engineering (IC2E)}, \n  title={Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters}, \n  year={2023},\n  volume={},\n  number={},\n  pages={131-140},\n  doi={10.1109/IC2E59103.2023.00023}}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheyfey%2Fvodascheduler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheyfey%2Fvodascheduler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheyfey%2Fvodascheduler/lists"}