{"id":13737770,"url":"https://github.com/ucbrise/hypersched","last_synced_at":"2025-04-11T12:31:50.382Z","repository":{"id":69968638,"uuid":"221598413","full_name":"ucbrise/hypersched","owner":"ucbrise","description":"Deadline-based hyperparameter tuning on RayTune.","archived":false,"fork":false,"pushed_at":"2020-01-16T19:13:58.000Z","size":13918,"stargazers_count":31,"open_issues_count":3,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-08T22:24:46.006Z","etag":null,"topics":["distributed","hyperparameter-optimization","python","pytorch","ray"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ucbrise.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-11-14T02:55:07.000Z","updated_at":"2024-01-04T16:39:34.000Z","dependencies_parsed_at":"2023-02-21T21:15:48.581Z","dependency_job_id":null,"html_url":"https://github.com/ucbrise/hypersched","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fhypersched","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fhypersched/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fhypersched/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fhypersched/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ucbrise","download_url":"https://codeload.github.com/ucbrise/hypersched/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248401973,"owners_count":21097328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed","hyperparameter-optimization","python","pytorch","ray"],"created_at":"2024-08-03T03:02:00.370Z","updated_at":"2025-04-11T12:31:48.307Z","avatar_url":"https://github.com/ucbrise.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n   \u003cp align=\"center\"\u003e \u003cimg src=\"figs/hypersched-logo.png\" height=240p weight=320px\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n# HyperSched\n\nAn experimental scheduler for accelerated hyperparameter tuning.\n\n**People**: Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. 
Gonzalez, Ion Stoica, Alexey Tumanov\n\nFor questions, open an issue or email rliaw [at] berkeley.edu\n\n**Please open an issue if you run into errors running the code!**\n\n\n## Overview\n\nHyperSched is a dynamic application-level resource scheduler that tracks, identifies, and preferentially allocates resources to the best-performing trials to maximize accuracy by the deadline.\n\nHyperSched is implemented as a `TrialScheduler` of [Ray Tune](http://tune.io/).\n\n\u003cdiv align=\"center\"\u003e\n   \u003cp align=\"center\"\u003e \u003cimg src=\"figs/scheduler.png\" height=240p\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n## Terminology:\n\n**Trial**: One training run of a (randomly sampled) hyperparameter configuration.\n\n**Experiment**: A collection of trials.\n\n\n## Results:\n\nHyperSched allocates resources to the top-performing trial.\n\u003cdiv align=\"center\"\u003e\n   \u003cp align=\"center\"\u003e \u003cimg src=\"figs/allocation.png\" height=240p\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\nHyperSched can perform better than ASHA under time pressure.\n\n\u003cdiv align=\"center\"\u003e\n   \u003cp align=\"center\"\u003e \u003cimg src=\"figs/results-deadline.png\" height=240p\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n## Quick Start\n\nThis code has been tested with PyTorch 1.13 and Ray 0.7.6.\n\nIt is suggested that you install this on a cluster (and not your laptop).  
You can easily spin up a Ray cluster using the [Ray cluster Launcher](https://ray.readthedocs.io/en/latest/autoscaling.html).\n\nInstall with:\n\n```bash\npip install ray==0.7.6\ngit clone https://github.com/ucbrise/hypersched \u0026\u0026 cd hypersched\npip install -e .\n```\n\nThen, you can run CIFAR with an 1800-second deadline, as below:\n\n```bash\n\npython scripts/evaluate_dynamic_asha.py \\\n    --num-atoms=8 \\\n    --num-jobs=100 \\\n    --seed=1 \\\n    --sched hyper \\\n    --result-file=\"some-test.log\" \\\n    --max-t=200 \\\n    --global-deadline=1800 \\\n    --trainable-id pytorch \\\n    --model-string resnet18 \\\n    --data cifar\n```\nSee `scripts` for more usage examples.\n\nExample Ray cluster configurations are provided in `scripts/cluster_cfg`.\n\n## Advanced Usage\n\n#### Configuring HyperSched\n\n```python\n# trainable.metric = \"mean_accuracy\"\nsched = HyperSched(\n    num_atoms,\n    scaling_dict=get_scaling(\n        args.trainable_id, args.model_string, args.data\n    ),  # optional model for scaling\n    deadline=args.global_deadline,\n    resource_policy=\"UNIFORM\",\n    time_attr=multijob_config[\"time_attr\"],\n    mode=\"max\",\n    metric=trainable.metric,\n    grace_period=config[\"min_allocation\"],\n    max_t=config[\"max_allocation\"],\n)\n\nsummary = Summary(trainable.metric)\n\nanalysis = tune.run(\n  trainable,\n  name=f\"{uuid.uuid4().hex[:8]}\",\n  num_samples=args.num_jobs,\n  config=config,\n  verbose=1,\n  local_dir=args.result_path\n  if args.result_path and os.path.exists(args.result_path)\n  else None,\n  global_checkpoint_period=600,  # avoid checkpointing completely.\n  scheduler=sched,\n  resources_per_trial=trainable.to_resources(1)._asdict(),  # initial resources\n  trial_executor=ResourceExecutor(\n      deadline_s=args.global_deadline, hooks=[summary]\n  )\n)\n```\n#### Viewing Results\nThe `hypersched.tune.Summary` object will log both a text file and a CSV for \"experiment-level\" 
statistics.\n\n#### HyperSched ImageNet Training on AWS\n\n1. Create an EBS volume with ImageNet (https://github.com/pytorch/examples/tree/master/imagenet)\n2. Set the EBS volume for all nodes of your cluster. For example, as seen in `scripts/imagenet.yaml`:\n\n```yaml\nhead_node:\n    InstanceType: p3.16xlarge\n    ImageId: ami-0d96d570269578cd7\n    BlockDeviceMappings:\n      - DeviceName: \"/dev/sdm\"\n        Ebs:\n          VolumeType: \"io1\"\n          Iops: 10000\n          DeleteOnTermination: True\n          VolumeSize: 250\n          SnapshotId: \"snap-01838dca0cbffad5c\"\n\n```\n\n3. Launch the cluster. After adjusting the YAML, you can launch a cluster using `ray up scripts/imagenet.yaml`. Beware, this will cost some money. If you use the YAML, the cluster launcher will then set up a Ray cluster among the launched nodes.\n\n4. Run the following command:\n\n```bash\npython ~/sosp2019/scripts/evaluate_dynamic_asha.py \\\n    --redis-address=\"localhost:6379\" \\\n    --num-atoms=16 \\\n    --num-jobs=200 \\\n    --seed=0 \\\n    --sched hyper \\\n    --result-file=\"~/MY_LOG_FILE.log\" \\\n    --max-t=500 \\\n    --global-deadline=7200 \\\n    --trainable-id pytorch \\\n    --model-string resnet50 \\\n    --data imagenet\n```\n\nYou can use the autoscaler to launch the experiment.\n\n```\nray exec [CLUSTER.YAML] \"\u003cyour python command here\u003e\"\n```\n\n**Note**: You may see that for ImageNet, HyperSched does not isolate trials effectively (2 trials running by deadline). 
This is because we set the following parameters:\n\n```python\n    if args.data == \"imagenet\":\n        worker_config = {}\n        worker_config.update(\n            data_loader_pin=True,\n            data_loader_workers=4,\n            max_train_steps=100,\n            max_val_steps=20,\n            decay=True,\n        )\n        config.update(worker_config=worker_config)\n```\n\nThis indicates that for the ImageNet experiment, 1 \"Trainable iteration\" is defined as 100 SGD updates. HyperSched depends on the ASHA adaptive allocation to terminate trials, and a particular setup of ImageNet will not trigger the ASHA termination. Feel free to push a patch for this (or raise an issue if you want me to fix it :).\n\n## TODOs\n\n- [ ] Move PyTorch Trainable onto `ray.experimental.sgd`\n\n## Talks\n\n[Slides presented at SOCC](assets/hypersched-socc-presentation.pdf)\n\n## Cite\n\nThe proper citation for this work is:\n```\n@inproceedings{Liaw:2019:HDR:3357223.3362719,\n author = {Liaw, Richard and Bhardwaj, Romil and Dunlap, Lisa and Zou, Yitian and Gonzalez, Joseph E. and Stoica, Ion and Tumanov, Alexey},\n title = {HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline},\n booktitle = {Proceedings of the ACM Symposium on Cloud Computing},\n series = {SoCC '19},\n year = {2019},\n isbn = {978-1-4503-6973-2},\n location = {Santa Cruz, CA, USA},\n pages = {61--73},\n numpages = {13},\n url = {http://doi.acm.org/10.1145/3357223.3362719},\n doi = {10.1145/3357223.3362719},\n acmid = {3362719},\n publisher = {ACM},\n address = {New York, NY, USA},\n keywords = {Distributed Machine Learning, Hyperparameter Optimization, Machine Learning Scheduling},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucbrise%2Fhypersched","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fucbrise%2Fhypersched","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucbrise%2Fhypersched/lists"}