# HyperSched

An experimental scheduler for accelerated hyperparameter tuning.

**People**: Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

For questions, open an issue or email rliaw [at] berkeley.edu

**Please open an issue if you run into errors running the code!**

## Overview

HyperSched is a dynamic, application-level resource scheduler that tracks, identifies, and preferentially allocates resources to the best-performing trials in order to maximize accuracy by the deadline.

HyperSched is implemented as a `TrialScheduler` of [Ray Tune](http://tune.io/).
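For context, a Tune `TrialScheduler` receives a callback each time a trial reports a result and returns a decision (continue, pause, or stop), along with hooks for choosing which trial to run next. Below is a minimal sketch of that interface written against the Ray 0.7-era API this repo pins; the threshold rule is a made-up placeholder for illustration only, not HyperSched's actual policy.

```python
from ray.tune.schedulers import FIFOScheduler, TrialScheduler


class ToyEarlyStopper(FIFOScheduler):
    """Illustrative only -- not HyperSched's policy.

    Stops any trial whose reported accuracy falls below a fixed threshold.
    HyperSched's real logic (deadline-aware resource reallocation) lives in
    the hypersched package.
    """

    def __init__(self, threshold=0.5):
        super(ToyEarlyStopper, self).__init__()
        self._threshold = threshold

    def on_trial_result(self, trial_runner, trial, result):
        # Called each time a trial reports; return CONTINUE, PAUSE, or STOP.
        if result.get("mean_accuracy", 0.0) < self._threshold:
            return TrialScheduler.STOP
        return TrialScheduler.CONTINUE
```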




## Terminology

**Trial**: One training run of a (randomly sampled) hyperparameter configuration.

**Experiment**: A collection of trials.
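In Tune terms, each sampled configuration becomes one trial, and `num_samples` such trials make up the experiment. A small illustration of this (the `my_trainable` function and the search space are hypothetical, written in the `tune.sample_from` style of the Ray 0.7 era):

```python
import numpy as np
from ray import tune


def my_trainable(config, reporter):
    # Hypothetical stand-in for a real training function; reports a fake score.
    reporter(mean_accuracy=np.random.uniform(0, 1))


# Each sampled configuration below is one trial; num_samples=20 makes
# the experiment a collection of 20 trials.
config = {
    "lr": tune.sample_from(lambda spec: 10 ** np.random.uniform(-4, -1)),
    "momentum": tune.sample_from(lambda spec: np.random.uniform(0.8, 0.99)),
}

analysis = tune.run(my_trainable, config=config, num_samples=20)
```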

## Results

HyperSched will allocate resources to the top-performing trial.




HyperSched can perform better than ASHA under time pressure.




## Quick Start

This code has been tested with PyTorch 1.3 and Ray 0.7.6.

It is suggested that you install this on a cluster (and not your laptop). You can easily spin up a Ray cluster using the [Ray cluster Launcher](https://ray.readthedocs.io/en/latest/autoscaling.html).

Install with:

```bash
pip install ray==0.7.6
git clone https://github.com/ucbrise/hypersched && cd hypersched
pip install -e .
```
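To sanity-check the install, a quick import check (this only assumes that `ray` and the locally installed `hypersched` package import cleanly after `pip install -e .`):

```python
# Quick sanity check after installation.
import ray
import hypersched

print("ray", ray.__version__)                  # expect 0.7.6
print("hypersched from", hypersched.__file__)  # expect your local checkout
```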

Then, you can run CIFAR with an 1800-second deadline, as below:

```bash
python scripts/evaluate_dynamic_asha.py \
--num-atoms=8 \
--num-jobs=100 \
--seed=1 \
--sched hyper \
--result-file="some-test.log" \
--max-t=200 \
--global-deadline=1800 \
--trainable-id pytorch \
--model-string resnet18 \
--data cifar
```
See `scripts` for more usage examples.

Example Ray cluster configurations are provided in `scripts/cluster_cfg`.

## Advanced Usage

#### Configuring HyperSched

```python
# trainable.metric = "mean_accuracy"
sched = HyperSched(
    num_atoms,
    scaling_dict=get_scaling(
        args.trainable_id, args.model_string, args.data
    ),  # optional model for scaling
    deadline=args.global_deadline,
    resource_policy="UNIFORM",
    time_attr=multijob_config["time_attr"],
    mode="max",
    metric=trainable.metric,
    grace_period=config["min_allocation"],
    max_t=config["max_allocation"],
)

summary = Summary(trainable.metric)

analysis = tune.run(
    trainable,
    name=f"{uuid.uuid4().hex[:8]}",
    num_samples=args.num_jobs,
    config=config,
    verbose=1,
    local_dir=args.result_path
    if args.result_path and os.path.exists(args.result_path)
    else None,
    global_checkpoint_period=600,  # avoid checkpointing completely
    scheduler=sched,
    resources_per_trial=trainable.to_resources(1)._asdict(),  # initial resources
    trial_executor=ResourceExecutor(
        deadline_s=args.global_deadline, hooks=[summary]
    ),
)
```
#### Viewing Results
The `hypersched.tune.Summary` object will log both a text file and a CSV of "experiment-level" statistics.
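For a quick look, the CSV can be loaded with pandas (a sketch only; the output path and column names below are placeholders, since they depend on how `Summary` and the run were configured):

```python
import pandas as pd

# Placeholder path: point this at the CSV that Summary wrote for your run.
df = pd.read_csv("path/to/summary.csv")
print(df.columns.tolist())  # see which experiment-level stats were recorded
print(df.tail())            # stats near the end of the run / deadline
```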

#### HyperSched Imagenet Training on AWS

1. Create an EBS volume with ImageNet (https://github.com/pytorch/examples/tree/master/imagenet)
2. Attach the EBS volume to all nodes of your cluster, for example as in `scripts/imagenet.yaml`:

```yaml
head_node:
    InstanceType: p3.16xlarge
    ImageId: ami-0d96d570269578cd7
    BlockDeviceMappings:
        - DeviceName: "/dev/sdm"
          Ebs:
              VolumeType: "io1"
              Iops: 10000
              DeleteOnTermination: True
              VolumeSize: 250
              SnapshotId: "snap-01838dca0cbffad5c"
```

3. Launch the cluster with `ray up scripts/imagenet.yaml` (after modifying the YAML as needed). Beware, this will cost some money. The cluster launcher will then set up a Ray cluster across the launched nodes.

4. Run the following command:

```bash
python ~/sosp2019/scripts/evaluate_dynamic_asha.py \
--redis-address="localhost:6379" \
--num-atoms=16 \
--num-jobs=200 \
--seed=0 \
--sched hyper \
--result-file="~/MY_LOG_FILE.log" \
--max-t=500 \
--global-deadline=7200 \
--trainable-id pytorch \
--model-string resnet50 \
--data imagenet
```

You can use the autoscaler's `ray exec` command to launch the experiment on the cluster:

```bash
ray exec [CLUSTER.YAML] "[EXPERIMENT COMMAND]"
```

**Note**: You may see that for ImageNet, HyperSched does not isolate trials effectively (e.g., 2 trials still running at the deadline). This is because we set the following parameters:

```python
if args.data == "imagenet":
    worker_config = {}
    worker_config.update(
        data_loader_pin=True,
        data_loader_workers=4,
        max_train_steps=100,
        max_val_steps=20,
        decay=True,
    )
    config.update(worker_config=worker_config)
```

This means that for the ImageNet experiment, one "Trainable iteration" is defined as 100 SGD updates. HyperSched depends on ASHA's adaptive allocation to terminate trials, and this particular ImageNet setup will not trigger ASHA termination. Feel free to push a patch for this (or raise an issue if you want me to fix it :).
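To see why: ASHA-style schedulers only make termination decisions at geometric milestones of the time attribute (roughly `grace_period * reduction_factor^k`, capped at `max_t`), so if each reported iteration covers 100 SGD updates, a short-deadline ImageNet run may never report past the first milestone and no trial is ever stopped. A rough back-of-the-envelope sketch (the grace period of 10 and reduction factor of 4 below are illustrative values, not the ones hard-coded in this repo):

```python
def asha_milestones(grace_period, max_t, reduction_factor=4):
    """Iterations at which an ASHA-style rule evaluates trials for termination."""
    milestones, rung = [], grace_period
    while rung <= max_t:
        milestones.append(rung)
        rung *= reduction_factor
    return milestones


# With --max-t=500 and an illustrative grace period of 10, decisions would
# only happen at iterations 10, 40, and 160 -- i.e. after 1,000 / 4,000 /
# 16,000 SGD updates when one iteration = 100 updates.
print(asha_milestones(grace_period=10, max_t=500))
```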

## TODOs

- [ ] Move PyTorch Trainable onto `ray.experimental.sgd`

## Talks

[Slides presented at SOCC](assets/hypersched-socc-presentation.pdf)

## Cite

The proper citation for this work is:
```
@inproceedings{Liaw:2019:HDR:3357223.3362719,
    author = {Liaw, Richard and Bhardwaj, Romil and Dunlap, Lisa and Zou, Yitian and Gonzalez, Joseph E. and Stoica, Ion and Tumanov, Alexey},
    title = {HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline},
    booktitle = {Proceedings of the ACM Symposium on Cloud Computing},
    series = {SoCC '19},
    year = {2019},
    isbn = {978-1-4503-6973-2},
    location = {Santa Cruz, CA, USA},
    pages = {61--73},
    numpages = {13},
    url = {http://doi.acm.org/10.1145/3357223.3362719},
    doi = {10.1145/3357223.3362719},
    acmid = {3362719},
    publisher = {ACM},
    address = {New York, NY, USA},
    keywords = {Distributed Machine Learning, Hyperparameter Optimization, Machine Learning Scheduling},
}
```
```