https://github.com/shigangli/eager-sgd

Eager-SGD is a decentralized asynchronous SGD. It uses novel partial collective operations to accumulate gradients across all processes.
distributed-deep-learning gradient-averaging partial-allreduce

Host: GitHub
URL: https://github.com/shigangli/eager-sgd
Owner: Shigangli
License: apache-2.0
Created: 2019-11-30T20:53:07.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-11-18T16:14:30.000Z (over 4 years ago)
Last Synced: 2023-10-20T23:06:23.717Z (over 2 years ago)
Topics: distributed-deep-learning, gradient-averaging, partial-allreduce
Language: Python
Homepage:
Size: 1.31 MB
Stars: 7
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

Eager-SGD
---------

**Eager-SGD** is a **decentralized asynchronous SGD** for distributed deep learning training based on **gradient averaging**. It uses novel partial collective operations (partial allreduce) to accumulate gradients across all processes. Unlike traditional collective operations (such as those provided by MPI or NCCL), a partial collective is an asynchronous operation that a subset of the processes can trigger, contributing their latest data to the collective.
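
The idea of a partial collective can be illustrated with a small single-process toy model. This is a sketch for intuition only, not the implementation in this repository: the class `ToyPartialAllreduce`, its per-rank send buffers, and the rule that a rank which is not ready re-contributes whatever is already in its buffer are all assumptions made for the example. What a late process actually contributes, and how staleness is bounded, is defined by the real solo/majority-allreduce code and the paper.

```python
from typing import Dict, List

Vector = List[float]


class ToyPartialAllreduce:
    """Single-process toy model of a partial allreduce (illustrative only)."""

    def __init__(self, num_ranks: int, dim: int) -> None:
        # One send buffer per simulated rank, initially a zero gradient.
        self.buffers: List[Vector] = [[0.0] * dim for _ in range(num_ranks)]
        # Consecutive rounds in which each rank's buffer was not refreshed.
        self.staleness: List[int] = [0] * num_ranks

    def round(self, fresh: Dict[int, Vector]) -> Vector:
        """Run one collective; `fresh` maps the ready ranks to new gradients."""
        if not fresh:
            raise ValueError("at least one rank must trigger the collective")
        for rank in range(len(self.buffers)):
            if rank in fresh:
                # Ready rank: contribute the latest gradient.
                self.buffers[rank] = list(fresh[rank])
                self.staleness[rank] = 0
            else:
                # Late rank (assumption): its old buffer content is reused.
                self.staleness[rank] += 1
        n = len(self.buffers)
        dim = len(self.buffers[0])
        # Average over ALL ranks, whether their contribution is fresh or not.
        return [sum(buf[i] for buf in self.buffers) / n for i in range(dim)]


pa = ToyPartialAllreduce(num_ranks=4, dim=2)
# Round 1: every rank is ready, so this behaves like a normal allreduce.
avg1 = pa.round({0: [4.0, 8.0], 1: [4.0, 0.0], 2: [0.0, 0.0], 3: [0.0, 0.0]})
# Round 2: only rank 0 is ready; ranks 1-3 contribute stale data.
avg2 = pa.round({0: [8.0, 8.0]})
```

In the second round the collective completes without waiting for ranks 1 to 3, which is the property that distinguishes a partial collective from a blocking one; the `staleness` counters show which contributions were not fresh.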

Eager-SGD may introduce staleness into the gradients. Thanks to our implementation of solo-allreduce and majority-allreduce, the **staleness is bounded**, and eager-SGD is therefore stale-synchronous. Because it is asynchronous, eager-SGD handles deep learning training with load imbalance better. To the best of our knowledge, this is the first work that implements asynchronous and stale-synchronous decentralized SGD where the messages propagate to all nodes in one step.

Demo
---------
A script to run eager-SGD on ResNet-50/ImageNet with the SLURM job scheduler can be found [here](https://github.com/Shigangli/eager-SGD/blob/master/test-models/tf-models-r1.11/official/resnet/test_scripts_imagenet/daint_eagersgd_imagenet.sh).
To evaluate other neural network models with the [customized optimizers](https://github.com/Shigangli/eager-SGD/blob/master/test-models/tf-models-r1.11/official/utils/) (e.g., gradient averaging using solo/majority-allreduce), simply wrap the default optimizer with one of the customized optimizers. See the example for ResNet-50 [here](https://github.com/Shigangli/eager-SGD/blob/master/test-models/tf-models-r1.11/official/resnet/resnet_run_loop_solo_imagenet_300.py#L384).
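
The wrapping pattern itself can be sketched without any framework. The snippet below is a hypothetical, plain-Python illustration and does not use the class names or signatures of the customized optimizers in this repository: `PlainSGD`, `GradientAveragingWrapper`, and `fake_two_rank_average` are names invented for the example, and the reduce function stands in for a real solo/majority-allreduce call.

```python
from typing import Callable, List

Gradient = List[float]
ReduceFn = Callable[[Gradient], Gradient]


class PlainSGD:
    """Stand-in for a framework's default optimizer (hypothetical)."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def apply_gradients(self, params: List[float], grads: Gradient) -> List[float]:
        # Standard SGD step: p <- p - lr * g
        return [p - self.lr * g for p, g in zip(params, grads)]


class GradientAveragingWrapper:
    """Averages gradients with `reduce_fn`, then delegates to the inner optimizer."""

    def __init__(self, inner: PlainSGD, reduce_fn: ReduceFn) -> None:
        self.inner = inner
        self.reduce_fn = reduce_fn

    def apply_gradients(self, params: List[float], grads: Gradient) -> List[float]:
        averaged = self.reduce_fn(grads)  # where an allreduce would be called
        return self.inner.apply_gradients(params, averaged)


def fake_two_rank_average(local: Gradient) -> Gradient:
    # Pretends a second rank contributed a fixed gradient (assumption).
    remote = [1.0, 3.0]
    return [(l + r) / 2 for l, r in zip(local, remote)]


opt = GradientAveragingWrapper(PlainSGD(lr=0.5), fake_two_rank_average)
new_params = opt.apply_gradients([1.0, 1.0], [3.0, 1.0])
```

The training loop keeps calling the same `apply_gradients` method, so swapping the communication scheme only requires changing the wrapper, not the model code. For the actual TensorFlow usage, refer to the linked ResNet-50 example.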

Publication
-----------
Eager-SGD was published at PPoPP'20, where it was a **Best Paper Finalist**. See the [paper](https://shigangli.github.io/files/ppopp20-eager-SGD-paper.pdf) for details. If you use eager-SGD, please cite us:
```bibtex
@inproceedings{li2020taming,
title={Taming unbalanced training workloads in deep learning with partial collective operations},
author={Li, Shigang and Ben-Nun, Tal and Girolamo, Salvatore Di and Alistarh, Dan and Hoefler, Torsten},
booktitle={Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages={45--61},
year={2020}
}
```

License
-------
See [LICENSE](LICENSE).
