{"id":13442593,"url":"https://github.com/microsoft/torchscale","last_synced_at":"2025-05-14T02:08:05.410Z","repository":{"id":63720048,"uuid":"567185522","full_name":"microsoft/torchscale","owner":"microsoft","description":"Foundation Architecture for (M)LLMs","archived":false,"fork":false,"pushed_at":"2024-04-11T13:58:57.000Z","size":370,"stargazers_count":3074,"open_issues_count":31,"forks_count":217,"subscribers_count":44,"default_branch":"main","last_synced_at":"2025-05-07T23:47:57.800Z","etag":null,"topics":["computer-vision","machine-learning","multimodal","natural-language-processing","pretrained-language-model","speech-processing","transformer","translation"],"latest_commit_sha":null,"homepage":"https://aka.ms/GeneralAI","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-17T08:55:59.000Z","updated_at":"2025-05-07T03:37:30.000Z","dependencies_parsed_at":"2024-01-14T18:10:48.053Z","dependency_job_id":"f53ae3da-00f4-4743-aafb-7777d75bc012","html_url":"https://github.com/microsoft/torchscale","commit_stats":{"total_commits":57,"total_committers":9,"mean_commits":6.333333333333333,"dds":"0.49122807017543857","last_synced_commit":"4ae3b248ee1eb6f75afca7cdbc9f6608c4cef207"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ftorchscale","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/microsoft%2Ftorchscale/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ftorchscale/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ftorchscale/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/torchscale/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254052947,"owners_count":22006717,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","machine-learning","multimodal","natural-language-processing","pretrained-language-model","speech-processing","transformer","translation"],"created_at":"2024-07-31T03:01:47.725Z","updated_at":"2025-05-14T02:08:00.399Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Python","English-centric","Transformer库与优化","文本生成","Repos"],"sub_categories":["ChatGPT"],"readme":"# TorchScale - A Library of Foundation Architectures\n\n\u003cp\u003e\n  \u003ca href=\"https://github.com/microsoft/torchscale/blob/main/LICENSE\"\u003e\u003cimg alt=\"MIT License\" src=\"https://img.shields.io/badge/license-MIT-blue.svg\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/torchscale\"\u003e\u003cimg alt=\"MIT License\" src=\"https://badge.fury.io/py/torchscale.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nTorchScale is a PyTorch library that allows researchers and developers to scale up Transformers efficiently and 
effectively.\n\nFundamental research to develop new architectures for foundation models and A(G)I, focusing on modeling generality and capability, as well as training stability and efficiency.\n- Stability - [**DeepNet**](https://arxiv.org/abs/2203.00555): scaling Transformers to 1,000 Layers and beyond\n- Generality - [**Foundation Transformers (Magneto)**](https://arxiv.org/abs/2210.06423): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)\n- Capability - A [**Length-Extrapolatable**](https://arxiv.org/abs/2212.10554) Transformer\n- Efficiency - [**X-MoE**](https://arxiv.org/abs/2204.09179): scalable \u0026 finetunable sparse Mixture-of-Experts (MoE)\n\n### The Revolution of Model Architecture\n- [**BitNet**](https://arxiv.org/abs/2310.11453): 1-bit Transformers for Large Language Models\n- [**RetNet**](https://arxiv.org/abs/2307.08621): Retentive Network: A Successor to Transformer for Large Language Models\n- [**LongNet**](https://arxiv.org/abs/2307.02486): Scaling Transformers to 1,000,000,000 Tokens\n\n## News\n\n- December, 2023: [LongNet](torchscale/model/LongNet.py) and [LongViT](examples/longvit/README.md) released\n- October, 2023: Update RMSNorm and SwiGLU as the default module in RetNet\n- November, 2022: TorchScale 0.1.1 released [[Paper](https://arxiv.org/abs/2211.13184)] [[PyPI](https://pypi.org/project/torchscale/)]\n\n## Installation\n\nTo install:\n```\npip install torchscale\n```\n\nAlternatively, you can develop it locally:\n```\ngit clone https://github.com/microsoft/torchscale.git\ncd torchscale\npip install -e .\n```\n\nFor faster training install [Flash Attention](https://github.com/Dao-AILab/flash-attention) for Turing, Ampere, Ada, or Hopper GPUs:\n```\npip install flash-attn\n```\nor [xFormers](https://github.com/facebookresearch/xformers) for Volta, Turing, Ampere, Ada, or Hopper GPUs:\n```\n# cuda 11.8 version\npip3 install -U xformers --index-url 
https://download.pytorch.org/whl/cu118\n# cuda 12.1 version\npip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121\n```\n\n## Getting Started\n\nIt takes only a few lines of code to create a model with the above fundamental research features enabled. Here is how to quickly obtain a BERT-like encoder:\n\n```python\n\u003e\u003e\u003e from torchscale.architecture.config import EncoderConfig\n\u003e\u003e\u003e from torchscale.architecture.encoder import Encoder\n\n\u003e\u003e\u003e config = EncoderConfig(vocab_size=64000)\n\u003e\u003e\u003e model = Encoder(config)\n\n\u003e\u003e\u003e print(model)\n```\n\nWe also support the `Decoder` architecture and the `EncoderDecoder` architecture:\n\n```python\n# Creating a decoder model\n\u003e\u003e\u003e from torchscale.architecture.config import DecoderConfig\n\u003e\u003e\u003e from torchscale.architecture.decoder import Decoder\n\n\u003e\u003e\u003e config = DecoderConfig(vocab_size=64000)\n\u003e\u003e\u003e decoder = Decoder(config)\n\u003e\u003e\u003e print(decoder)\n\n# Creating an encoder-decoder model\n\u003e\u003e\u003e from torchscale.architecture.config import EncoderDecoderConfig\n\u003e\u003e\u003e from torchscale.architecture.encoder_decoder import EncoderDecoder\n\n\u003e\u003e\u003e config = EncoderDecoderConfig(vocab_size=64000)\n\u003e\u003e\u003e encdec = EncoderDecoder(config)\n\u003e\u003e\u003e print(encdec)\n```\n\nIt takes only a few lines of code to create a RetNet model:\n\n```python\n# Creating a RetNet model\n\u003e\u003e\u003e import torch\n\u003e\u003e\u003e from torchscale.architecture.config import RetNetConfig\n\u003e\u003e\u003e from torchscale.architecture.retnet import RetNetDecoder\n\n\u003e\u003e\u003e config = RetNetConfig(vocab_size=64000)\n\u003e\u003e\u003e retnet = RetNetDecoder(config)\n\n\u003e\u003e\u003e print(retnet)\n```\n\nFor LongNet models ([Flash Attention](https://github.com/Dao-AILab/flash-attention) required):\n```python\n\u003e\u003e\u003e 
import torch\n\u003e\u003e\u003e from torchscale.architecture.config import EncoderConfig, DecoderConfig\n\u003e\u003e\u003e from torchscale.model.longnet import LongNetEncoder, LongNetDecoder\n\n# Creating a LongNet encoder with the dilated pattern of segment_length=[2048,4096] and dilated_ratio=[1,2]\n\u003e\u003e\u003e config = EncoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)\n\u003e\u003e\u003e longnet = LongNetEncoder(config)\n\n# Creating a LongNet decoder with the dilated pattern of segment_length=[2048,4096] and dilated_ratio=[1,2]\n\u003e\u003e\u003e config = DecoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)\n\u003e\u003e\u003e longnet = LongNetDecoder(config)\n```\n\n## Key Features\n\n- [DeepNorm to improve the training stability of Post-LayerNorm Transformers](https://arxiv.org/abs/2203.00555)\n  * enabled by setting *deepnorm=True* in the `Config` class. \n  * It adjusts both the residual connection and the initialization method according to the model architecture (i.e., encoder, decoder, or encoder-decoder).\n\n- [SubLN for the model generality and the training stability](https://arxiv.org/abs/2210.06423)\n  * enabled by *subln=True*. This is enabled by default. \n  * It introduces another LayerNorm to each sublayer and adjusts the initialization according to the model architecture.\n  * Note that SubLN and DeepNorm cannot be used in one single model.\n\n- [X-MoE: efficient and finetunable sparse MoE modeling](https://arxiv.org/abs/2204.09179)\n  * enabled by *use_xmoe=True*. 
\n  * It replaces every *'moe_freq'*-th `FeedForwardNetwork` layer with an X-MoE layer.\n\n- [Multiway architecture for multimodality](https://arxiv.org/abs/2208.10442)\n  * enabled by *multiway=True*.\n  * It provides a pool of Transformer parameters used for different modalities.\n\n- [Extrapolatable position embedding (Xpos)](https://arxiv.org/abs/2212.10554)\n  * enabled by *xpos_rel_pos=True*.\n\n- [Relative position bias](https://arxiv.org/abs/1910.10683)\n  * enabled by adjusting *rel_pos_buckets* and *max_rel_pos*.\n\n- [SparseClip: improving the gradient clipping for sparse MoE models](https://arxiv.org/abs/2211.13184)\n  * we provide [sample code](examples/fairseq/utils/sparse_clip.py) that can be easily adapted to the FairSeq (or other) repo.\n\n- [Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621)\n  * created by `config = RetNetConfig(vocab_size=64000)` and `retnet = RetNetDecoder(config)`.\n\n- [LongNet: Scaling Transformers to 1,000,000,000 Tokens](https://arxiv.org/abs/2307.02486)\n  \nMost of the features above can be used by simply passing the corresponding parameters to the config. 
For example:\n\n```python\n\u003e\u003e\u003e from torchscale.architecture.config import EncoderConfig\n\u003e\u003e\u003e from torchscale.architecture.encoder import Encoder\n\n\u003e\u003e\u003e config = EncoderConfig(vocab_size=64000, deepnorm=True, multiway=True)\n\u003e\u003e\u003e model = Encoder(config)\n\n\u003e\u003e\u003e print(model)\n```\n\n## Examples\n\nWe have examples of how to use TorchScale in the following scenarios/tasks:\n\n- Language\n\n  * [Decoder/GPT](examples/fairseq/README.md#example-gpt-pretraining)\n\n  * [Encoder-Decoder/Neural Machine Translation](examples/fairseq/README.md#example-machine-translation)\n\n  * [Encoder/BERT](examples/fairseq/README.md#example-bert-pretraining)\n\n- Vision\n\n  * [LongViT](examples/longvit/README.md)\n\n  * ViT/BEiT [In progress]\n\n- Speech\n\n- Multimodal\n\n  * [Multiway Transformers/BEiT-3](https://github.com/microsoft/unilm/tree/master/beit3)\n\nWe plan to provide more examples regarding different tasks (e.g. vision pretraining and speech recognition) and various deep learning toolkits (e.g. [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)). 
Any comments or PRs are welcome!\n\n\n## Acknowledgments\n\nSome implementations in TorchScale are either adapted from or inspired by the [FairSeq](https://github.com/facebookresearch/fairseq) repository and the [UniLM](https://github.com/microsoft/unilm) repository.\n\n## Citations\n\nIf you find this repository useful, please consider citing our work:\n\n```\n@article{torchscale,\n  author    = {Shuming Ma and Hongyu Wang and Shaohan Huang and Wenhui Wang and Zewen Chi and Li Dong and Alon Benhaim and Barun Patra and Vishrav Chaudhary and Xia Song and Furu Wei},\n  title     = {{TorchScale}: {Transformers} at Scale},\n  journal   = {CoRR},\n  volume    = {abs/2211.13184},\n  year      = {2022}\n}\n```\n\n```\n@article{deepnet,\n  author    = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},\n  title     = {{DeepNet}: Scaling {Transformers} to 1,000 Layers},\n  journal   = {CoRR},\n  volume    = {abs/2203.00555},\n  year      = {2022},\n}\n```\n\n```\n@article{magneto,\n  author    = {Hongyu Wang and Shuming Ma and Shaohan Huang and Li Dong and Wenhui Wang and Zhiliang Peng and Yu Wu and Payal Bajaj and Saksham Singhal and Alon Benhaim and Barun Patra and Zhun Liu and Vishrav Chaudhary and Xia Song and Furu Wei},\n  title     = {Foundation {Transformers}},\n  journal   = {CoRR},\n  volume    = {abs/2210.06423},\n  year      = {2022}\n}\n```\n\n```\n@inproceedings{xmoe,\n  title={On the Representation Collapse of Sparse Mixture of Experts},\n  author={Zewen Chi and Li Dong and Shaohan Huang and Damai Dai and Shuming Ma and Barun Patra and Saksham Singhal and Payal Bajaj and Xia Song and Xian-Ling Mao and Heyan Huang and Furu Wei},\n  booktitle={Advances in Neural Information Processing Systems},\n  year={2022},\n  url={https://openreview.net/forum?id=mWaYC6CZf5}\n}\n```\n\n```\n@article{retnet,\n  author={Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},\n  
title     = {Retentive Network: A Successor to {Transformer} for Large Language Models},\n  journal   = {ArXiv},\n  volume    = {abs/2307.08621},\n  year      = {2023}\n}\n```\n\n```\n@article{longnet,\n  author={Jiayu Ding and Shuming Ma and Li Dong and Xingxing Zhang and Shaohan Huang and Wenhui Wang and Nanning Zheng and Furu Wei},\n  title     = {{LongNet}: Scaling Transformers to 1,000,000,000 Tokens},\n  journal   = {ArXiv},\n  volume    = {abs/2307.02486},\n  year      = {2023}\n}\n```\n\n```\n@article{longvit,\n  title     = {When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology},\n  author    = {Wenhui Wang and Shuming Ma and Hanwen Xu and Naoto Usuyama and Jiayu Ding and Hoifung Poon and Furu Wei},\n  journal   = {ArXiv},\n  volume    = {abs/2312.03558},\n  year      = {2023}\n}\n```\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [Furu Wei](mailto:fuwei@microsoft.com) and [Shuming Ma](mailto:shumma@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ftorchscale","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Ftorchscale","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ftorchscale/lists"}