https://github.com/MachineLearningSystem/baechi

Baechi (SoCC '20)
https://github.com/MachineLearningSystem/baechi

Last synced: 2 months ago
JSON representation

Baechi (SoCC '20)

Host: GitHub
URL: https://github.com/MachineLearningSystem/baechi
Owner: MachineLearningSystem
License: other
Fork: true (beomyeol/baechi)
Created: 2022-11-03T03:17:23.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2022-06-06T23:12:03.000Z (about 3 years ago)
Last Synced: 2024-08-02T19:37:01.331Z (11 months ago)
Homepage:
Size: 133 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: License.txt

Awesome Lists containing this project

awesome-AI-system - Baechi: Fast Device Placement on Machine Learning Graphs SOCC'20

README

# Baechi: Fast Device Placement on Machine Learning Graphs (SoCC 2020)

## Install dependencies
* Install dependencies with Anaconda
```
$ conda install -y python=3.6 numpy=1.16 tensorflow-gpu=1.12 bazel=0.20.0 \
networkx future matplotlib cvxopt scikit-learn
```
* Mosek
```
$ pip install -f https://download.mosek.com/stable/wheel/index.html Mosek==8.1.82
```

Our code runs [MOSEK](https://www.mosek.com/) as an LP solver for SCT.
MOSEK provides a free personal academic license.
You can request a license at https://www.mosek.com/products/academic-licenses.
The license file (`mosek.lic`) should be placed at `$HOME/mosek`.

## Example usage
This example generates the placement of 4-layer GNMT v2 with a batch size of 128, a maximum sequence length of 40, and a vocabulary size of 30000.

* Build a Python program to place operators of an ML model.
```
$ bazel build :train
```

* Generate profiles.

```
$ ./bazel-bin/train \
--costgen \
--cost_path=/tmp/cost.pkl \
--optimizer=adam \
--batch_size=128 \
--model_name=gnmt_v2 \
--vocab_size=30000 \
--max_seq_length=40 \
--rnn_unit_type=lstm \
--rnn_units=512 \
--num_layers=4 \
--encoder_type=gnmt \
--num_gpus=4 \
--residual \
--colocate_grads_with_ops \
--only_forward
```
This generates profiles of the forward pass and stores them at `/tmp/cost.pkl`.

* Generate a communication cost function between GPUs through the linear regression.

```
$ bazel build //utils:communication_benchmark
$ ./bazel-bin/utils/communication_benchmark
```

This runs a benchmark that transfers tensors between different GPUs for various tensor sizes.
By default, the benchmark transfers tensors from `GPU:0` to `GPU:1` with tensor sizes in the range [2⁰, 2²⁹].
After the benchmark finishes, it prints out a generated communication cost function that
should be given as the `--comm_cost_coeffs` argument value for the placement.

An example output would be the following.
```
...
Communication cost function: 0.0001754 x + 134
```

* Place operators of GNMT v2 and measure average step times.

```
$ ./bazel-bin/train \
--cost_path=/tmp/cost.pkl \
--optimizer=adam \
--batch_size=128 \
--model_name=gnmt_v2 \
--vocab_size=30000 \
--max_seq_length=40 \
--rnn_unit_type=lstm \
--rnn_units=512 \
--num_layers=4 \
--encoder_type=gnmt \
--num_gpus=4 \
--residual \
--colocate_grads_with_ops \
--only_forward \
--placement_method=m_etf \
--placer_type=fusion \
--grouper=coplace \
--comm_cost_coeffs=0.0001754,134 \
--memory_fraction=1.0
```

This runs the placement of GNMT v2 operators using m-ETF based on the forward operators.
When the placement is done, this measures the average step time of the placement results and prints it out.

## Docker image

A Docker image with all dependencies installed is available.

```
$ docker pull beomyeol/baechi
$ docker run -it --rm --gpus all beomyeol/baechi /bin/bash
```

This gives you direct access to the container with all GPUs enabled.
You can follow the example usage within the container.

## License
University of Illinois/NCSA Open Source License

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/MachineLearningSystem/baechi

Awesome Lists containing this project

README