{"id":13604781,"url":"https://github.com/MachineLearningSystem/gavel","last_synced_at":"2025-04-12T02:31:44.658Z","repository":{"id":185461781,"uuid":"437827901","full_name":"MachineLearningSystem/gavel","owner":"MachineLearningSystem","description":"Code for \"Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads\", which appeared at OSDI 2020","archived":false,"fork":true,"pushed_at":"2021-05-05T06:47:15.000Z","size":109301,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-08-02T19:36:08.858Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"stanford-futuredata/gavel","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-12-13T10:24:04.000Z","updated_at":"2021-12-13T10:24:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"8b1021f4-f3c6-4a69-8085-6e2c55cc6aec","html_url":"https://github.com/MachineLearningSystem/gavel","commit_stats":null,"previous_names":["machinelearningsystem/gavel"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fgavel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fgavel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fgavel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fgavel/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/gavel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489654,"owners_count":17153795,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:51.201Z","updated_at":"2024-11-07T09:30:56.983Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads\n\nThis repository contains the source code implementation of the OSDI paper\n\"Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads\".\n\n## Directory Structure\n\n### `scheduler`\nCode for the scheduler, including the scheduling mechanism and simulator\n(`scheduler.py`), implementations of performance-aware policies (`policies/`),\n`GavelIterator` as a Python module, and a communication stack between the scheduler\nand workers that uses [gRPC](https://grpc.io/) (`runtime/`).\n\n`scheduler/notebooks` contains parsing and plotting code to analyze experiment\nruns.\n\n### `workloads`\nImplementations of target workloads in PyTorch, including changes needed to\nintegrate with the `GavelIterator`.\n\n\n## Setup\n\n### Software Dependencies\n\nGavel is implemented in Python. 
We have tested Gavel on Ubuntu 16.04 with Python 3.8.\nPython 3.8 can be installed using [Miniconda](https://docs.conda.io/en/latest/miniconda.html).\n\nRequired software dependencies can be installed using:\n\n```bash\napt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev\npip install -r scheduler/requirements.txt\ncd scheduler; make\n```\n\nThese software dependencies have already been installed on the following\nAMI on Amazon EC2:\n\n| Field  | Value |\n| -------------  | ------------- |\n| Cloud Provider | AWS |\n| Region         | us-east-1  |\n| AMI ID         | ami-03e41a79bb745ce18  |\n| AMI Name       | gavel |\n\nSee [this link](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html)\nfor how to find and launch a public AMI (this assumes you have a valid billable AWS account set up).\n\n## Getting Started\n\nGavel's heterogeneity-aware policies and scheduling mechanism can be evaluated\neither in simulation or on a physical cluster.\n\nTo evaluate variants of the LAS policy (`max_min_fairness*`) in\nsimulation, one can use the following command line (this sweep script runs\nthe different policies for multiple _continuous_ traces, generated using\ndifferent seeds and Poisson arrival rates):\n\n```bash\npython -u scripts/sweeps/run_sweep_continuous.py -s 4000 -e 5000 -l /path/to/log/directory -j 6 -p max_min_fairness max_min_fairness_perf --seeds 0 1 2 -c 36:36:36 -a 0.0 -b 1.0 -n 5\n```\n\nOther arguments for the `run_sweep_continuous.py` script are\nshown using the `-h` option:\n\n```bash\nusage: run_sweep_continuous.py [-h] [-l LOG_DIR] [-s WINDOW_START] [-e WINDOW_END] [-t TIMEOUT] [-j PROCESSES] [-p POLICIES [POLICIES ...]] [-c CLUSTER_SPEC [CLUSTER_SPEC ...]]\n                               [--num_gpus_per_server NUM_GPUS_PER_SERVER] [--seeds SEEDS [SEEDS ...]] [-i INTERVAL] [-f FIXED_JOB_DURATION]\n                               [--cutoff-throughputs-file CUTOFF_THROUGHPUTS_FILE] [--throughputs-file THROUGHPUTS_FILE] 
[-m] [--generate-multi-priority-jobs]\n                               [--simulate-steady-state] [--solver {ECOS,GUROBI,SCS}] [-v] [--checkpoint-threshold CHECKPOINT_THRESHOLD]\n                               [--profiling_percentages PROFILING_PERCENTAGES [PROFILING_PERCENTAGES ...]] [--num_reference_models NUM_REFERENCE_MODELS [NUM_REFERENCE_MODELS ...]]\n                               [--ideal] [-a THROUGHPUT_LOWER_BOUND] [-b THROUGHPUT_UPPER_BOUND] [-n NUM_DATA_POINTS] [-u UTILIZATION_THRESHOLD]\n\nSweep through lambda values\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -l LOG_DIR, --log-dir LOG_DIR\n                        Log directory\n  -s WINDOW_START, --window-start WINDOW_START\n                        Measurement window start (job ID)\n  -e WINDOW_END, --window-end WINDOW_END\n                        Measurement window end (job ID)\n  -t TIMEOUT, --timeout TIMEOUT\n                        Timeout (in seconds) for each run\n  -j PROCESSES, --processes PROCESSES\n                        Number of processes to use in pool (use as many as available if not specified)\n  -p POLICIES [POLICIES ...], --policies POLICIES [POLICIES ...]\n                        List of policies to sweep\n  -c CLUSTER_SPEC [CLUSTER_SPEC ...], --cluster-spec CLUSTER_SPEC [CLUSTER_SPEC ...]\n                        Cluster specification in the form of #v100s:#p100s:#k80s\n  --num_gpus_per_server NUM_GPUS_PER_SERVER\n                        Cluster specification in the form of #v100s:#p100s:#k80s\n  --seeds SEEDS [SEEDS ...]\n                        List of random seeds\n  -i INTERVAL, --interval INTERVAL\n                        Interval length (in seconds)\n  -f FIXED_JOB_DURATION, --fixed-job-duration FIXED_JOB_DURATION\n                        If set, fixes the duration of all jobs to the specified value (in seconds)\n  --cutoff-throughputs-file CUTOFF_THROUGHPUTS_FILE\n                        If set, uses the attached cutoff_throughputs JSON 
file in sweep to limit args run\n  --throughputs-file THROUGHPUTS_FILE\n                        Oracle throughputs file\n  -m, --generate-multi-gpu-jobs\n                        If set, generates multi-GPU jobs according to a pre-defined distribution\n  --generate-multi-priority-jobs\n                        If set, generates some jobs with higher priority\n  --simulate-steady-state\n                        If set, adds as many jobs as there are workers before beginning the simulation.\n  --solver {ECOS,GUROBI,SCS}\n                        CVXPY solver\n  -v, --verbose         Verbose\n  --checkpoint-threshold CHECKPOINT_THRESHOLD\n                        Checkpoint threshold, None if checkpointing is disabled. Checkpoint is created after this job ID is added.\n  --profiling_percentages PROFILING_PERCENTAGES [PROFILING_PERCENTAGES ...]\n                        Percentages of machines dedicated to profiling co-located job pairs\n  --num_reference_models NUM_REFERENCE_MODELS [NUM_REFERENCE_MODELS ...]\n                        Number of reference models to use when estimating throughputs\n  --ideal               Run allocations 100% ideally\n\nAutomatic sweep:\n  -u UTILIZATION_THRESHOLD, --utilization-threshold UTILIZATION_THRESHOLD\n                        Utilization threshold to use when automatically sweeping lambdas\n\nSweep over fixed range:\n  -a THROUGHPUT_LOWER_BOUND, --throughput-lower-bound THROUGHPUT_LOWER_BOUND\n                        Lower bound for throughput interval to sweep\n  -b THROUGHPUT_UPPER_BOUND, --throughput-upper-bound THROUGHPUT_UPPER_BOUND\n                        Upper bound for throughput interval to sweep\n  -n NUM_DATA_POINTS, --num-data-points NUM_DATA_POINTS\n                        Number of data points to sweep through\n```\n\nTo evaluate policies on static traces (jobs only added to the cluster at the start\nof the trace), one can use the `scripts/sweeps/run_sweep_static.py` script, which\nruns different policies on multiple 
_static_ traces, generated using different\nseeds and numbers of jobs.\n\nFor more detailed instructions on how to reproduce results from the OSDI paper,\nsee [EXPERIMENTS.md](EXPERIMENTS.md).\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["GPU Cluster Management"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fgavel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fgavel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fgavel/lists"}