Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ucbrise/hypersched
Deadline-based hyperparameter tuning on RayTune.
- Host: GitHub
- URL: https://github.com/ucbrise/hypersched
- Owner: ucbrise
- Created: 2019-11-14T02:55:07.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-01-16T19:13:58.000Z (over 4 years ago)
- Last Synced: 2024-01-22T03:45:42.980Z (6 months ago)
- Topics: distributed, hyperparameter-optimization, python, pytorch, ray
- Language: Python
- Homepage:
- Size: 13.3 MB
- Stars: 31
- Watchers: 6
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.md
Lists
- awesome-stars - ucbrise/hypersched - Deadline-based hyperparameter tuning on RayTune. (Python)
README
# HyperSched
An experimental scheduler for accelerated hyperparameter tuning.
**People**: Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov
For questions, open an issue or email rliaw [at] berkeley.edu
**Please open an issue if you run into errors running the code!**
## Overview
HyperSched is a dynamic application-level resource scheduler that tracks, identifies, and preferentially allocates resources to the best-performing trials in order to maximize accuracy by the deadline.
HyperSched is implemented as a `TrialScheduler` of [Ray Tune](http://tune.io/).
## Terminology:
**Trial**: One training run of a (randomly sampled) hyperparameter configuration.
**Experiment**: A collection of trials.
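As a minimal illustration of these terms in plain Ray Tune (using the Ray 0.7-era function API; the toy `train_fn` below is a placeholder, not part of HyperSched): one `tune.run` call is an experiment, and each of the `num_samples` randomly sampled configurations is a trial.
```python
import random

from ray import tune

def train_fn(config, reporter):
    # Stand-in for real training: report a fake accuracy each iteration.
    acc = 0.0
    for _ in range(10):
        acc += config["lr"] * random.random()
        reporter(mean_accuracy=acc)

# One experiment consisting of 4 trials, each with a randomly sampled learning rate.
tune.run(
    train_fn,
    num_samples=4,
    config={"lr": tune.sample_from(lambda spec: random.uniform(1e-3, 1e-1))},
)
```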
## Results:
HyperSched will allocate resources to the top-performing trial.
HyperSched can perform better than ASHA under time pressure.
## Quick Start
This code has been tested with PyTorch 1.3 and Ray 0.7.6.
It is suggested that you install this on a cluster (and not your laptop). You can easily spin up a Ray cluster using the [Ray cluster Launcher](https://ray.readthedocs.io/en/latest/autoscaling.html).
Install with:
```bash
pip install ray==0.7.6
git clone https://github.com/ucbrise/hypersched && cd hypersched
pip install -e .
```

Then, you can run CIFAR with an 1800-second deadline, as below:
```bash
python scripts/evaluate_dynamic_asha.py \
--num-atoms=8 \
--num-jobs=100 \
--seed=1 \
--sched hyper \
--result-file="some-test.log" \
--max-t=200 \
--global-deadline=1800 \
--trainable-id pytorch \
--model-string resnet18 \
--data cifar
```
See `scripts` for more usage examples. Example Ray cluster configurations are provided in `scripts/cluster_cfg`.
## Advanced Usage
#### Configuring HyperSched
```python
# trainable.metric = "mean_accuracy"
sched = HyperSched(
    num_atoms,
    scaling_dict=get_scaling(
        args.trainable_id, args.model_string, args.data
    ),  # optional model for scaling
    deadline=args.global_deadline,
    resource_policy="UNIFORM",
    time_attr=multijob_config["time_attr"],
    mode="max",
    metric=trainable.metric,
    grace_period=config["min_allocation"],
    max_t=config["max_allocation"],
)

summary = Summary(trainable.metric)

analysis = tune.run(
    trainable,
    name=f"{uuid.uuid4().hex[:8]}",
    num_samples=args.num_jobs,
    config=config,
    verbose=1,
    local_dir=args.result_path
    if args.result_path and os.path.exists(args.result_path)
    else None,
    global_checkpoint_period=600,  # avoid checkpointing completely
    scheduler=sched,
    resources_per_trial=trainable.to_resources(1)._asdict(),  # initial resources
    trial_executor=ResourceExecutor(
        deadline_s=args.global_deadline, hooks=[summary]
    ),
)
```
#### Viewing Results
The `hypersched.tune.Summary` object will log both a text file and a CSV of "experiment-level" statistics.
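As a rough way to inspect that CSV, here is a minimal sketch using pandas. The file path and column names are assumptions (they depend on your `--result-file`/`--result-path` settings and on the `Summary` implementation), so check the actual header first.
```python
import pandas as pd

# Hypothetical path: point this at the CSV that Summary wrote for your run.
df = pd.read_csv("some-test.log.csv")

print(df.columns.tolist())  # discover which experiment-level fields were logged
print(df.tail())            # the last few rows, e.g. the best accuracy near the deadline
```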
#### HyperSched ImageNet Training on AWS
1. Create an EBS volume with ImageNet (https://github.com/pytorch/examples/tree/master/imagenet).
2. Set the EBS volume for all nodes of your cluster, for example as in `scripts/imagenet.yaml`:

```yaml
head_node:
    InstanceType: p3.16xlarge
    ImageId: ami-0d96d570269578cd7
    BlockDeviceMappings:
        - DeviceName: "/dev/sdm"
          Ebs:
              VolumeType: "io1"
              Iops: 10000
              DeleteOnTermination: True
              VolumeSize: 250
              SnapshotId: "snap-01838dca0cbffad5c"
```
3. Launch the cluster with `ray up scripts/imagenet.yaml` (adjusting the path if you modified the YAML). Beware, this will cost some money. Ray will then set up a cluster among the launched nodes.
4. Run the following command:
```bash
python ~/sosp2019/scripts/evaluate_dynamic_asha.py \
--redis-address="localhost:6379" \
--num-atoms=16 \
--num-jobs=200 \
--seed=0 \
--sched hyper \
--result-file="~/MY_LOG_FILE.log" \
--max-t=500 \
--global-deadline=7200 \
--trainable-id pytorch \
--model-string resnet50 \
--data imagenet
```

You can use the autoscaler to launch the experiment:
```bash
ray exec [CLUSTER.YAML] ""
```

**Note**: You may see that for ImageNet, HyperSched does not isolate trials effectively (e.g., 2 trials still running at the deadline). This is because we set the following parameters:
```python
if args.data == "imagenet":
    worker_config = {}
    worker_config.update(
        data_loader_pin=True,
        data_loader_workers=4,
        max_train_steps=100,
        max_val_steps=20,
        decay=True,
    )
    config.update(worker_config=worker_config)
```

This means that for the ImageNet experiment, 1 "Trainable iteration" is defined as 100 SGD updates. HyperSched depends on ASHA's adaptive allocation to terminate trials, and this particular ImageNet setup will not trigger the ASHA termination. Feel free to push a patch for this (or raise an issue if you want me to fix it :).
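For a concrete sense of scale, a small arithmetic sketch (the numbers are taken from the ImageNet command and `worker_config` above; nothing here is HyperSched-specific):
```python
# One reported "Trainable iteration" is max_train_steps SGD updates, so milestones
# expressed in iterations (e.g. --max-t) translate into SGD updates by multiplication.
max_train_steps = 100   # SGD updates per iteration (worker_config above)
max_t = 500             # --max-t from the ImageNet command above

print(max_t * max_train_steps)  # 50000 SGD updates for a trial that is never stopped early
```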
## TODOs
- [ ] Move PyTorch Trainable onto `ray.experimental.sgd`
## Talks
[Slides presented at SOCC](assets/hypersched-socc-presentation.pdf)
## Cite
The proper citation for this work is:
```
@inproceedings{Liaw:2019:HDR:3357223.3362719,
author = {Liaw, Richard and Bhardwaj, Romil and Dunlap, Lisa and Zou, Yitian and Gonzalez, Joseph E. and Stoica, Ion and Tumanov, Alexey},
title = {HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline},
booktitle = {Proceedings of the ACM Symposium on Cloud Computing},
series = {SoCC '19},
year = {2019},
isbn = {978-1-4503-6973-2},
location = {Santa Cruz, CA, USA},
pages = {61--73},
numpages = {13},
url = {http://doi.acm.org/10.1145/3357223.3362719},
doi = {10.1145/3357223.3362719},
acmid = {3362719},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Distributed Machine Learning, Hyperparameter Optimization, Machine Learning Scheduling},
}
```