{"id":13671406,"url":"https://github.com/uw-mad-dash/shockwave","last_synced_at":"2025-04-27T18:31:23.261Z","repository":{"id":87774243,"uuid":"524695597","full_name":"uw-mad-dash/shockwave","owner":"uw-mad-dash","description":"Code for \"Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning\" [NSDI '23]","archived":false,"fork":false,"pushed_at":"2022-11-24T02:49:27.000Z","size":6052,"stargazers_count":38,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-11T09:43:43.457Z","etag":null,"topics":["cloud-computing","cluster-scheduler","deep-learning","distributed-systems","distributed-training","machine-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uw-mad-dash.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-08-14T14:34:02.000Z","updated_at":"2024-07-11T01:29:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"4bf405b9-48cf-448f-9860-8c8d3295008f","html_url":"https://github.com/uw-mad-dash/shockwave","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uw-mad-dash%2Fshockwave","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uw-mad-dash%2Fshockwave/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uw-mad-dash%2Fshockwave/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uw
-mad-dash%2Fshockwave/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uw-mad-dash","download_url":"https://codeload.github.com/uw-mad-dash/shockwave/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251187140,"owners_count":21549593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-computing","cluster-scheduler","deep-learning","distributed-systems","distributed-training","machine-learning","pytorch"],"created_at":"2024-08-02T09:01:08.823Z","updated_at":"2025-04-27T18:31:18.250Z","avatar_url":"https://github.com/uw-mad-dash.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning\n\nThis repository contains the source code implementation of the [NSDI '23](https://www.usenix.org/conference/nsdi23) paper [Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning](https://arxiv.org/abs/2210.00093).\n\nWe built our implementation atop [Gavel](https://github.com/stanford-futuredata/gavel), the open-sourced codebase of the [OSDI '20](https://www.usenix.org/conference/osdi20) paper [Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads](https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak). We would like to thank the Gavel authors for open-sourcing their implementation!\n\n## Release notes\n\nSep 2022: We have released the first version of Shockwave! 
Please see the documentation below to get started. Over the coming months, we will gradually make the following updates:\n\n* Add shell scripts and documentation for deploying Shockwave on a physical cluster\n* Add shell scripts and documentation for more simulation experiments\n* Add BibTeX information and hyperlinks for the arXiv release\n* Clean up the Shockwave codebase for better readability\n* Add plotting scripts\n\n## Directory Structure\n\n### `scheduler`\nCode for the scheduler, including the scheduling mechanism and simulator (`scheduler.py`), implementations of scheduling policies (`policies/`), `GavelIterator` as a Python module, and a communication stack between the scheduler and workers that uses [gRPC](https://grpc.io/) (`runtime/`).\n\n### `workloads`\nImplementations of target workloads in PyTorch, including changes needed to integrate with the `GavelIterator`.\n\n### `accordion_workloads` and `gns_workloads`\nWorkload scripts built on top of those in [`workloads`](workloads) that implement the dynamic adaptation optimizations [Accordion](https://github.com/uw-mad-dash/Accordion) and [Gradient Noise Scale (GNS)](https://openai.com/blog/science-of-ai/), respectively.\n\n\n## Setting up the Software Dependencies\n\nShockwave/Gavel is implemented in Python. We have tested Shockwave/Gavel on Ubuntu 18.04 with Python 3.6.9.\nPython can be installed using [Miniconda](https://docs.conda.io/en/latest/miniconda.html).\n\nRequired software dependencies can be installed using:\n\n```bash\napt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev\npip install -r scheduler/requirements.txt\ncd scheduler; make\n```\n\nIn addition to the software dependencies required to run [Gavel](https://github.com/stanford-futuredata/gavel), running Shockwave also requires the [Gurobi Optimizer](https://www.gurobi.com/). An academic license can be requested [here](https://www.gurobi.com/features/academic-named-user-license/). 
Note that you might need to connect to your university's network or use a VPN to download Gurobi. Please see the Gurobi website for more details.\n\n## Getting Started\n\nGavel's policies (including Shockwave) and scheduling mechanism can be evaluated either in simulation or on a physical cluster.\n\nTo reproduce our canonical results in simulation in ~10 minutes, run [`scheduler/reproduce/tacc_32gpus.sh`](scheduler/reproduce/tacc_32gpus.sh). For detailed instructions on how to reproduce more results from the NSDI paper, see [EXPERIMENTS.md](EXPERIMENTS.md).\n\n\n## References\n\n```\n@misc{zheng2022shockwave,\n      title={Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning}, \n      author={Pengfei Zheng and Rui Pan and Tarannum Khan and Shivaram Venkataraman and Aditya Akella},\n      year={2022},\n      eprint={2210.00093},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuw-mad-dash%2Fshockwave","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuw-mad-dash%2Fshockwave","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuw-mad-dash%2Fshockwave/lists"}