{"id":13604736,"url":"https://github.com/MachineLearningSystem/shockwave","last_synced_at":"2025-04-12T02:31:35.659Z","repository":{"id":185461946,"uuid":"545407476","full_name":"MachineLearningSystem/shockwave","owner":"MachineLearningSystem","description":"Code for \"Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning\" [NSDI '23]","archived":false,"fork":true,"pushed_at":"2022-10-04T00:17:53.000Z","size":6050,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-08-02T19:35:47.220Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"uw-mad-dash/shockwave","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-10-04T10:09:03.000Z","updated_at":"2022-10-04T08:30:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"cf559d6c-6763-4a70-b266-bfc19d2b1a26","html_url":"https://github.com/MachineLearningSystem/shockwave","commit_stats":null,"previous_names":["machinelearningsystem/shockwave"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fshockwave","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fshockwave/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fshockwave/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fshockwave/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/shockwave/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489629,"owners_count":17153791,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:50.671Z","updated_at":"2024-11-07T09:30:48.917Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"funding_links":[],"categories":["Paper-Code"],"sub_categories":["GPU Cluster Management"],"readme":"# Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning\n\nThis repository contains the source code implementation of the [NSDI '23](https://www.usenix.org/conference/nsdi23) paper [Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning](https://arxiv.org/abs/2210.00093).\n\nWe built our implementation atop [Gavel](https://github.com/stanford-futuredata/gavel), the open-sourced codebase of the [OSDI '20](https://www.usenix.org/conference/osdi20) paper [Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads](https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak). We would like to thank the Gavel authors for open-sourcing their implementation!\n\n## Release notes\n\nSep 2022: We have released the first version of Shockwave! Please see the documentation below to get started. In the upcoming months, we will gradually make the following updates:\n\n* Add shell scripts and documentation for deploying Shockwave on a physical cluster\n* Add shell scripts and documentation for more simulation experiments\n* Add bibtex information and hyperlinks for the arXiv release\n* Make cleanups to the Shockwave codebase for better readability\n* Add plotting scripts\n\n## Directory Structure\n\n### `scheduler`\nCode for the scheduler, including the scheduling mechanism and simulator (`scheduler.py`), implementations of scheduling policies (`policies/`), `GavelIterator` as a Python module, and a communication stack between the scheduler and workers that uses [gRPC](https://grpc.io/) (`runtime/`).\n\n### `workloads`\nImplementations of target workloads in PyTorch, including changes needed to integrate with the `GavelIterator`.\n\n### `accordion_workloads` and `gns_workloads`\nWorkload scripts built on top of those in [`workloads`](workloads), with respective dynamic adaptation optimizations implemented, namely [Accordion](https://github.com/uw-mad-dash/Accordion) and [Gradient Noise Scale (GNS)](https://openai.com/blog/science-of-ai/).\n\n\n## Setting up the Software Dependencies\n\nShockwave/Gavel is implemented in Python. We have tested Shockwave/Gavel on Ubuntu 18.04 with Python 3.6.9.\nPython can be installed using [Miniconda](https://docs.conda.io/en/latest/miniconda.html).\n\nRequired software dependencies can be installed using:\n\n```bash\napt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev\npip install -r scheduler/requirements.txt\ncd scheduler; make\n```\n\nIn addition to the software dependencies required to run [Gavel](https://github.com/stanford-futuredata/gavel), running Shockwave also requires the [Gurobi Optimizer](https://www.gurobi.com/). An academic license can be requested [here](https://www.gurobi.com/downloads/end-user-license-agreement-academic/). Note that you might need to connect to your university's network or use a VPN to download Gurobi. Please see the Gurobi website for more details.\n\n## Getting Started\n\nGavel's policies (including Shockwave) and scheduling mechanism can be evaluated either in simulation or on a physical cluster.\n\nTo reproduce our canonical results in simulation in ~10 minutes, run [`scheduler/reproduce/tacc_32gpus.sh`](scheduler/reproduce/tacc_32gpus.sh). For detailed instructions on how to reproduce more results from the NSDI paper, see [EXPERIMENTS.md](EXPERIMENTS.md).\n\n\n## References\n\n```\n@misc{zheng2022shockwave,\n      title={Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning}, \n      author={Pengfei Zheng and Rui Pan and Tarannum Khan and Shivaram Venkataraman and Aditya Akella},\n      year={2022},\n      eprint={2210.00093},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fshockwave","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fshockwave","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fshockwave/lists"}