[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/very-deep-transformers-for-neural-machine/machine-translation-on-wmt2014-english-french)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-french?p=very-deep-transformers-for-neural-machine)
![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat&logo=PyTorch&logoColor=white)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/admin-torch)
![GitHub](https://img.shields.io/github/license/microsoft/admin-Torch)
[![Maintenance](https://img.shields.io/badge/doc-yes-success.svg)](https://microsoft.github.io/admin-torch/)
![PyPI](https://img.shields.io/pypi/v/admin-torch)

# Admin-Torch

Transformers Training **Stabilized**


[What's New?](#whats-new) | [Key Idea](#key-idea) | [How To Use](#how-to-use) | [Docs](https://microsoft.github.io/admin-torch/) | [Examples](https://github.com/microsoft/admin-torch/tree/main/example) | [Citation](#citation) | License

Here, we provide a plug-and-play implementation of [Admin](https://arxiv.org/abs/2004.08249),
which stabilizes previously diverged Transformer training and achieves better performance,
**without introducing additional hyper-parameters**. The design of Admin is half-precision
friendly and can be **reparameterized into the original Transformer**.

______________________________________________________________________
## What's New?

Beyond the [original admin implementation](https://github.com/LiyuanLucasLiu/Transformer-Clinic):
1. `admin-torch` removes the profiling stage and is **plug-and-play**.
2. `admin-torch`'s implementation is **more robust** (see below).

Comparison with the [DeepNet init](https://arxiv.org/abs/2203.00555) and the [original Admin init](https://github.com/LiyuanLucasLiu/Transformer-Clinic)
on WMT'17 En-De.

| | Regular batch size (8x4096) | Huge batch size (128x4096) |
|---------------|--------------------|------------------|
| [Original Admin](https://github.com/LiyuanLucasLiu/Transformer-Clinic)| ✅ | ❌ |
| [DeepNet](https://arxiv.org/abs/2203.00555) | ❌ | ✅ |
| `admin-torch` | ✅ | ✅ |

More details can be found in [our example](https://github.com/microsoft/admin-torch/tree/main/example).

## Key Idea

What complicates Transformer training?

For a Transformer f with input x and randomly initialized weights w, we describe its stability by its ``output_change_scale``.
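Roughly, this scale is the expected change of the output caused by a parameter update (a paraphrase of the definition in [our study](https://arxiv.org/abs/2004.08249); the exact notation used there may differ):

$$\texttt{output\_change\_scale} \;=\; \mathbb{E}\Big[\big\lVert f(x,\, w + \delta) - f(x,\, w)\big\rVert_2^2\Big],$$

where $\delta$ is the weight update from one optimization step.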



In [our study](https://arxiv.org/abs/2004.08249), we show that an original n-layer Transformer's
``output_change_scale`` is ``O(n)``, which destabilizes its training. Admin stabilizes Transformer
training by regulating this scale to ``O(log n)`` or ``O(1)``.
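Concretely, Admin rewrites each residual connection with a per-dimension scaling ω on the shortcut. The following is a paraphrase of the construction in the paper, with the initialization stated only roughly:

$$x_i = \mathrm{LN}\big(\omega_i \odot x_{i-1} + f_i(x_{i-1})\big), \qquad \omega_i \approx \sqrt{\textstyle\sum_{j<i} \mathrm{Var}\big[f_j(x_{j-1})\big]},$$

so that each sub-layer's contribution to the overall output change stays bounded. Because ω only rescales the skip connection, the trained model can be reparameterized back into the original Transformer, and no new hyper-parameters are introduced.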



More details can be found in our [paper](https://arxiv.org/abs/2004.08249).

## How to use?

### install
```bash
pip install admin-torch==0.1.0
```

### import
```python
import admin_torch
```

### enjoy

```diff
def __init__(self, ...):
    ...
+   self.residual = admin_torch.as_module(self, self.number_of_sub_layers)
    ...

def forward(self, ...):
    ...
-   x = x + self.f(x)
+   x = self.residual(x, self.f(x))
    x = self.LN(x)
    ...
```
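To make the diff concrete, here is a sketch of a Post-LN encoder layer wired up this way. Everything except the `admin_torch` calls (the class name, dimensions, and the use of `nn.MultiheadAttention`) is an illustrative assumption, and the `as_module` call simply mirrors the diff above; consult [our doc](https://microsoft.github.io/admin-torch/) for the exact signature.

```python
import torch.nn as nn

import admin_torch


class PostLNEncoderLayer(nn.Module):
    """Illustrative Post-LN encoder layer with Admin residual wrappers."""

    def __init__(self, d_model=512, n_heads=8, d_ffn=2048, number_of_sub_layers=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.ReLU(),
            nn.Linear(d_ffn, d_model),
        )
        # One Admin wrapper per residual connection; the call mirrors the diff above.
        self.residual_attn = admin_torch.as_module(self, number_of_sub_layers)
        self.residual_ffn = admin_torch.as_module(self, number_of_sub_layers)
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_ffn = nn.LayerNorm(d_model)

    def forward(self, x):
        # x = LN(residual(x, f(x))) replaces the usual x = LN(x + f(x)).
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln_attn(self.residual_attn(x, attn_out))
        x = self.ln_ffn(self.residual_ffn(x, self.ffn(x)))
        return x
```

In an N-layer encoder, each layer contributes two residual sub-layers (attention and feed-forward), so `number_of_sub_layers` would typically be `2 * N`; treat this as an assumption and confirm against the linked docs and example.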

An elaborated example can be found at [our doc](https://microsoft.github.io/admin-torch/), and a real working example can be found at [LiyuanLucasLiu/fairseq](https://github.com/LiyuanLucasLiu/fairseq/commit/33ad76ae5dc927bc32b9594f9728a367c45680bb) (training recipe is available at [our example](https://github.com/microsoft/admin-torch/tree/main/example)).

## Citation
Please cite the following papers if you find our model useful. Thanks!

>Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).
```bibtex
@inproceedings{liu2020admin,
  title     = {Understanding the Difficulty of Training Transformers},
  author    = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
  year      = {2020}
}
```
> Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. arXiv preprint arXiv:2008.07772 (2020).
```bibtex
@article{liu_deep_2020,
  author  = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},
  journal = {arXiv preprint arXiv:2008.07772},
  title   = {Very Deep Transformers for Neural Machine Translation},
  year    = {2020}
}
```