{"id":13478348,"url":"https://github.com/microsoft/admin-torch","last_synced_at":"2025-09-23T01:09:36.975Z","repository":{"id":38842899,"uuid":"475669650","full_name":"microsoft/admin-torch","owner":"microsoft","description":"Understanding the Difficulty of Training Transformers","archived":false,"fork":false,"pushed_at":"2022-10-30T00:40:11.000Z","size":4103,"stargazers_count":45,"open_issues_count":1,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-09T00:37:58.487Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null}},"created_at":"2022-03-30T00:57:33.000Z","updated_at":"2025-08-07T12:27:15.000Z","dependencies_parsed_at":"2023-01-19T18:46:58.046Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/admin-torch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/admin-torch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fadmin-torch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fadmin-torch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fadmin-torch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fadmin-torch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/admin-torch/tar.gz/ref
s/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fadmin-torch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275213742,"owners_count":25424886,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T16:01:55.837Z","updated_at":"2025-09-23T01:09:36.955Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/very-deep-transformers-for-neural-machine/machine-translation-on-wmt2014-english-french)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-french?p=very-deep-transformers-for-neural-machine)\n![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat\u0026logo=PyTorch\u0026logoColor=white)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/admin-torch) \n![GitHub](https://img.shields.io/github/license/microsoft/admin-Torch) \n[![Maintenance](https://img.shields.io/badge/doc-yes-success.svg)](https://microsoft.github.io/admin-torch/) \n![PyPI](https://img.shields.io/pypi/v/admin-torch) \n\n\u003ch2 align=\"center\"\u003eAdmin-Torch\u003c/h2\u003e\n\u003ch5 
align=\"center\"\u003eTransformers Training **Stabilized**\u003c/h5\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#whats-new\"\u003eWhat's New?\u003c/a\u003e •\n  \u003ca href=\"#key-idea\"\u003eKey Idea\u003c/a\u003e •\n  \u003ca href=\"#how-to-use\"\u003eHow To Use\u003c/a\u003e •\n  \u003ca href=\"https://microsoft.github.io/admin-torch/\"\u003eDocs\u003c/a\u003e •\n  \u003ca href=\"https://github.com/microsoft/admin-torch/tree/main/example\"\u003eExamples\u003c/a\u003e •\n  \u003ca href=\"#citation\"\u003eCitation\u003c/a\u003e •\n  \u003ca href=\"https://github.com/microsoft/admin-torch/tree/main/LICENSE\"\u003eLicense\u003c/a\u003e\n\u003c/p\u003e\n\nHere, we provide a plug-in-and-play implementation of [Admin](https://arxiv.org/abs/2004.08249),\nwhich stabilizes previously-diverged Transformer training and achieves better performance, \n**without introducing additional hyper-parameters**. The design of Admin is half-precision \nfriendly and can be **reparameterized into the original Transformer**. \n\n______________________________________________________________________\n## What's New?\n\nBeyond the [original admin implementation](https://github.com/LiyuanLucasLiu/Transformer-Clinic):\n1.  `admin-torch` removes the profiling stage and is **plug-in-and-play**. \n2.  `admin-torch`'s implementation is **more robust** (see below).\n\nComparison with 
the [DeepNet Init](https://arxiv.org/abs/2203.00555) and the [Original Admin Init](https://github.com/LiyuanLucasLiu/Transformer-Clinic) \n(on WMT'17 En-De).\n\n|               | Regular batch size (8x4096) |  Huge batch size (128x4096) |\n|---------------|--------------------|------------------|\n| [Original Admin](https://github.com/LiyuanLucasLiu/Transformer-Clinic)| ✅ | ❌ |\n| [DeepNet](https://arxiv.org/abs/2203.00555) | ❌ | ✅ |\n| `admin-torch` | ✅ | ✅ |\n\nMore details can be found in [our example](https://github.com/microsoft/admin-torch/tree/main/example).\n\n## Key Idea\n\u003ch5 align=\"center\"\u003e\u003ci\u003eWhat complicates Transformer training?\u003c/i\u003e\u003c/h5\u003e\n\nFor a Transformer f, input x, and randomly initialized weights w, we describe its stability (``output_change_scale``) as \n\n\u003cp align=\"center\"\u003e\n\u003c!-- $E[|f(x, w) - f(x, w + \\delta)|_2^2]$ --\u003e \u003cimg style=\"transform: translateY(0.1em); background: white;\" src=\"https://render.githubusercontent.com/render/math?math=E%5B%7Cf(x%2C%20w)%20-%20f(x%2C%20w%20%2B%20%5Cdelta)%7C_2%5E2%5D\"\u003e\n\u003c/p\u003e\n\nIn [our study](https://arxiv.org/abs/2004.08249), we show that an original n-layer Transformer's \n``output_change_scale`` is ``O(n)``, which destabilizes its training. Admin stabilizes Transformer\ntraining by regulating this scale to ``O(log n)`` or ``O(1)``. 
\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"60%\" src=\"doc/source/_static/output_change.png\"/\u003e\u003c/p\u003e\n \nMore details can be found in our [paper](https://arxiv.org/abs/2004.08249).\n\n\n## How to use?\n\n### install \n```\npip install admin-torch==0.1.0\n```\n\n### import\n```\nimport admin_torch\n```\n\n### enjoy\n\n```diff\ndef __init__(self, ...):\n...\n+(self.residual = admin_torch.as_module(self, self.number_of_sub_layers))+\n...\n\ndef forward(self, ...):\n...\n-!x = x + self.f(x)!-\n+(x = self.residual(x, self.f(x)))+\nx = self.LN(x)\n...\n```\n\nAn elaborated example can be found at [our doc](https://microsoft.github.io/admin-torch/), and a real working example can be found at [LiyuanLucasLiu/fairseq](https://github.com/LiyuanLucasLiu/fairseq/commit/33ad76ae5dc927bc32b9594f9728a367c45680bb) (training recipe is available at [our example](https://github.com/microsoft/admin-torch/tree/main/example)).\n\n## Citation\nPlease cite the following papers if you found our model useful. Thanks!\n\n\u003eLiyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).\n```\n@inproceedings{liu2020admin,\n  title={Understanding the Difficulty of Training Transformers},\n  author = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},\n  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},\n  year={2020}\n}\n```\n\u003e Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. 
arXiv preprint arXiv:2008.07772 (2020).\n```\n@inproceedings{liu_deep_2020,\n author = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},\n booktitle = {arXiv:2008.07772 [cs]},\n title = {Very Deep Transformers for Neural Machine Translation},\n year = {2020}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fadmin-torch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fadmin-torch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fadmin-torch/lists"}