Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/microsoft/admin-torch
Understanding the Difficulty of Training Transformers
- Host: GitHub
- URL: https://github.com/microsoft/admin-torch
- Owner: microsoft
- License: MIT
- Created: 2022-03-30T00:57:33.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-10-30T00:40:11.000Z (about 2 years ago)
- Last Synced: 2024-05-09T17:51:36.216Z (6 months ago)
- Language: Python
- Size: 3.91 MB
- Stars: 45
- Watchers: 6
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.rst
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
README
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/very-deep-transformers-for-neural-machine/machine-translation-on-wmt2014-english-french)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-french?p=very-deep-transformers-for-neural-machine)
![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat&logo=PyTorch&logoColor=white)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/admin-torch)
![GitHub](https://img.shields.io/github/license/microsoft/admin-Torch)
[![Maintenance](https://img.shields.io/badge/doc-yes-success.svg)](https://microsoft.github.io/admin-torch/)
![PyPI](https://img.shields.io/pypi/v/admin-torch)

# Admin-Torch
Transformers Training **Stabilized**
What's New? • Key Idea • How To Use • Docs • Examples • Citation • License

Here, we provide a plug-and-play implementation of [Admin](https://arxiv.org/abs/2004.08249),
which stabilizes previously-diverged Transformer training and achieves better performance,
**without introducing additional hyper-parameters**. The design of Admin is half-precision
friendly and can be **reparameterized into the original Transformer**.

______________________________________________________________________

## What's New?

Beyond the [original admin implementation](https://github.com/LiyuanLucasLiu/Transformer-Clinic):
1. `admin-torch` removes the profiling stage and is **plug-and-play**.
2. `admin-torch`'s implementation is **more robust** (see below).

Comparison with the [DeepNet Init](https://arxiv.org/abs/2203.00555) and the [Original Admin Init](https://github.com/LiyuanLucasLiu/Transformer-Clinic) (on WMT'17 En-De):

|               | Regular batch size (8x4096) | Huge batch size (128x4096) |
|---------------|--------------------|------------------|
| [Original Admin](https://github.com/LiyuanLucasLiu/Transformer-Clinic)| ✅ | ❌ |
| [DeepNet](https://arxiv.org/abs/2203.00555) | ❌ | ✅ |
| `admin-torch` | ✅ | ✅ |

More details can be found in [our example](https://github.com/microsoft/admin-torch/tree/main/example).

## Key Idea
What complicates Transformer training?
For a Transformer f, input x, and randomly initialized weights w, we describe its stability (``output_change_scale``) as the expected change of the model output when the weights receive a small update (sketched below).
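In symbols, the quantity can be written roughly as follows (a reconstruction following the paper's setup; writing the weight update as a perturbation δ is our shorthand, not notation taken from this README):

```math
\texttt{output\_change\_scale} \;=\; \mathbb{E}\left[\big\lVert f(x, w) - f(x, w + \delta)\big\rVert_2^2\right]
```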
In [our study](https://arxiv.org/abs/2004.08249), we show that an original n-layer Transformer's
``output_change_scale`` is ``O(n)``, which destabilizes its training. Admin stabilizes the Transformer's
training by regulating this scale to ``O(log n)`` or ``O(1)``.
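As a rough illustration of this growth (a toy sketch, not the paper's experiment): stack post-LN residual blocks, apply the same small random perturbation to every weight, and compare outputs before and after the perturbation as depth increases. The block structure and perturbation size below are arbitrary choices made only for the demo.

```python
import copy
import torch
import torch.nn as nn

# Toy sketch: a stack of post-LN residual blocks, x <- LN(x + f(x)).
def build(n, d=64):
    torch.manual_seed(0)
    return nn.ModuleList(
        nn.ModuleDict({"f": nn.Linear(d, d), "ln": nn.LayerNorm(d)})
        for _ in range(n)
    )

def forward(blocks, x):
    for b in blocks:
        x = b["ln"](x + b["f"](x))
    return x

x = torch.randn(8, 64)
for n in (4, 16, 64):
    blocks = build(n)
    perturbed = copy.deepcopy(blocks)
    with torch.no_grad():
        for p in perturbed.parameters():
            p.add_(0.01 * torch.randn_like(p))  # small parameter update
        change = (forward(blocks, x) - forward(perturbed, x)).pow(2).mean()
    # the output change tends to grow as more residual blocks are stacked
    print(f"depth {n:3d}: output change {change.item():.4f}")
```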
More details can be found in our [paper](https://arxiv.org/abs/2004.08249).

## How to use?
### install

```bash
pip install admin-torch==0.1.0
```

### import

```python
import admin_torch
```

### enjoy

```diff
 def __init__(self, ...):
     ...
+    self.residual = admin_torch.as_module(self, self.number_of_sub_layers)
     ...

 def forward(self, ...):
     ...
-    x = x + self.f(x)
+    x = self.residual(x, self.f(x))
     x = self.LN(x)
     ...
```

An elaborated example can be found at [our doc](https://microsoft.github.io/admin-torch/), and a real working example can be found at [LiyuanLucasLiu/fairseq](https://github.com/LiyuanLucasLiu/fairseq/commit/33ad76ae5dc927bc32b9594f9728a367c45680bb) (training recipe is available at [our example](https://github.com/microsoft/admin-torch/tree/main/example)).
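For a self-contained picture, here is a minimal sketch of a single post-LN attention sub-layer wired with an Admin residual. The class, dimensions, and sub-module names are illustrative, and the `as_module` call simply mirrors the diff above rather than documenting the library's exact signature.

```python
import torch
import torch.nn as nn
import admin_torch


class AttentionSubLayer(nn.Module):
    """Illustrative post-LN attention sub-layer with an Admin residual."""

    def __init__(self, d_model=512, n_heads=8, number_of_sub_layers=12):
        super().__init__()
        self.number_of_sub_layers = number_of_sub_layers
        self.f = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Admin replaces the plain residual connection; the call below
        # follows the snippet above (number_of_sub_layers counts the
        # residual branches in the whole network).
        self.residual = admin_torch.as_module(self, self.number_of_sub_layers)
        self.LN = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.f(x, x, x)
        # was: x = x + attn_out
        x = self.residual(x, attn_out)
        return self.LN(x)


# toy usage
layer = AttentionSubLayer()
out = layer(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```
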
## Citation
Please cite the following papers if you find our model useful. Thanks!

> Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).

```
@inproceedings{liu2020admin,
  title = {Understanding the Difficulty of Training Transformers},
  author = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
  year = {2020}
}
```
> Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. arXiv preprint arXiv:2008.07772 (2020).
```
@article{liu_deep_2020,
  author = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},
  journal = {arXiv preprint arXiv:2008.07772},
  title = {Very Deep Transformers for Neural Machine Translation},
  year = {2020}
}
```