Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MachineLearningSystem/Chimera
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines.
- Host: GitHub
- URL: https://github.com/MachineLearningSystem/Chimera
- Owner: MachineLearningSystem
- License: gpl-3.0
- Fork: true (Shigangli/Chimera)
- Created: 2022-05-15T12:04:23.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-03-11T20:45:08.000Z (over 2 years ago)
- Last Synced: 2024-08-02T19:33:29.110Z (4 months ago)
- Homepage:
- Size: 722 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-AI-system - Chimera: efficiently training large-scale neural networks with bidirectional pipelines SC'21
README
## Chimera: efficiently training large-scale neural networks with bidirectional pipelines
Chimera is a novel pipeline parallelism approach proposed for efficiently training large-scale neural network models (e.g., BERT, GPT-2/3) on parallel machines (e.g., GPU clusters). The key idea of Chimera is to reduce the number of bubbles in the pipeline **without** introducing staleness into the training process.
Our implementation (SC'21) is based on PyTorch and adapted from PipeDream. We use GLOO as the distributed backend. **A new (concise and also fully-fledged) version of Chimera will be added** in the [Chimera-BERT branch](https://github.com/Shigangli/Chimera/tree/Chimera-BERT).
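To make the bidirectional idea concrete, the sketch below (illustrative only, not code from this repository; the function name and the exact mapping convention are assumptions based on the paper's description) shows how the two pipelines of a Chimera pair place their stages on D devices: the "down" pipeline maps stage i to device i, while the "up" pipeline maps stage i to device D-1-i, so every device hosts one stage of each pipeline and the bubbles of one pipeline can be filled with useful work from the other.

```python
# Illustrative sketch of Chimera's bidirectional stage placement.
# Assumption: D pipeline stages on D devices; the "down" pipeline maps
# stage i -> device i, the "up" pipeline maps stage i -> device D-1-i.
def chimera_stage_mapping(num_devices):
    """Return {device: (down_stage, up_stage)} for a Chimera pipeline pair."""
    mapping = {}
    for device in range(num_devices):
        down_stage = device                     # down pipeline runs front-to-back
        up_stage = num_devices - 1 - device     # up pipeline runs back-to-front
        mapping[device] = (down_stage, up_stage)
    return mapping

if __name__ == "__main__":
    # With 4 devices: device 0 holds down-stage 0 and up-stage 3, and so on.
    for dev, (down, up) in chimera_stage_mapping(4).items():
        print(f"device {dev}: down-pipeline stage {down}, up-pipeline stage {up}")
```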
## Directory Structure
- `chimera/chimera_bert`: BERT in Chimera.
- `chimera/chimera_gpt2`: GPT-2 in Chimera.
- `chimera/chimera_pipes`: Chimera generalized to more than two pipelines.
- `chimera/performance_model`: Performance modelling for communications.

## Run the Experiments
To install the required Python modules:
`conda create --name py37 python=3.7`
`source activate py37`
`pip install -r requirements.txt`
We run experiments on GPU clusters with the SLURM job scheduler. For example, a job can be submitted to the queue with:
`cd ./job_scripts`
`sbatch daint_bert48_32nodes_chimera_4w8d.sh`
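Each SLURM job launches one worker process per allocated task, and the workers then form the GLOO process group mentioned above. The following is a hypothetical sketch of how such a worker could derive its rank from SLURM's environment variables and initialize GLOO; it is not the launch code actually used by the scripts in `./job_scripts`, and the `MASTER_ADDR`/`MASTER_PORT` handling in particular is cluster-specific.

```python
# Hypothetical sketch: initializing the GLOO process group for a worker
# launched by `srun` under SLURM. Not the actual launch code of this repo.
import os
import torch.distributed as dist

def init_gloo_from_slurm():
    rank = int(os.environ["SLURM_PROCID"])        # global rank assigned by SLURM
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of workers
    # On a real cluster, MASTER_ADDR must point to the first node of the
    # allocation (e.g. derived from SLURM_NODELIST); localhost is a placeholder.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    return rank, world_size
```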
## Publication
Chimera was published at SC'21 and was a **Best Paper Finalist**. See the [paper](https://dl.acm.org/doi/abs/10.1145/3458817.3476145) and the [video talk](https://dl.acm.org/doi/abs/10.1145/3458817.3476145#sec-supp) for more details. To cite our work:
```bibtex
@inproceedings{li143,
author = {Li, Shigang and Hoefler, Torsten},
title = {Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines},
year = {2021},
isbn = {9781450384421},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3458817.3476145},
doi = {10.1145/3458817.3476145},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
articleno = {27},
numpages = {14},
location = {St. Louis, Missouri},
series = {SC '21}
}
```
## License
See [LICENSE](LICENSE).