{"id":13512588,"url":"https://github.com/THUDM/SwissArmyTransformer","last_synced_at":"2025-03-31T00:30:40.751Z","repository":{"id":37044369,"uuid":"414203752","full_name":"THUDM/SwissArmyTransformer","owner":"THUDM","description":"SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.","archived":false,"fork":false,"pushed_at":"2024-12-26T13:23:58.000Z","size":22098,"stargazers_count":1071,"open_issues_count":43,"forks_count":100,"subscribers_count":31,"default_branch":"main","last_synced_at":"2025-03-28T11:00:53.667Z","etag":null,"topics":["pretrained-models","pytorch","transformer"],"latest_commit_sha":null,"homepage":"https://THUDM.github.io/SwissArmyTransformer","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGE_LOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-06T12:30:39.000Z","updated_at":"2025-03-26T07:27:52.000Z","dependencies_parsed_at":"2023-09-22T22:14:07.710Z","dependency_job_id":"35ce1eb9-f88f-4cd2-b90b-05c9b18e5e2b","html_url":"https://github.com/THUDM/SwissArmyTransformer","commit_stats":{"total_commits":533,"total_committers":28,"mean_commits":"19.035714285714285","dds":0.6472795497185742,"last_synced_commit":"63dc23aeb40b5b4ee580de00a8961a324a103abf"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FSwissArmyTransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FSwissArmyTrans
former/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FSwissArmyTransformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FSwissArmyTransformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/SwissArmyTransformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246399816,"owners_count":20770907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pretrained-models","pytorch","transformer"],"created_at":"2024-08-01T04:00:23.463Z","updated_at":"2025-03-31T00:30:39.053Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":["Lo nuevo que está dando vuelta..","Python"],"sub_categories":["CogView2"],"readme":"# Introduction\n`sat`(`SwissArmyTransformer`) is a flexible and powerful library to develop your own Transformer variants.\n\n`sat` is named after \"swiss army knife\", meaning that all the models (e.g. BERT, GPT, T5, GLM, CogView, ViT...) **share the same backbone code** and cater for versatile usages with some extra light-weight mixins. \n\n`sat` is powered by `deepspeed-ZeRO` and model parallelism, aiming to provide the best practice for pretraining and finetuning large models (100M\\~20B parameters). \n\n## Install\n```\n    pip install SwissArmyTransformer\n```\n# Features\n\n* **Add model-agnostic components**, e.g. prefix-tuning, in just *ONE* line! 
\n\n    - [Prefix-tuning](https://arxiv.org/pdf/2101.00190) (or [P-tuning](https://arxiv.org/abs/2103.10385)) improves finetuning by adding trainable parameters to each attention layer. Applying it to a [GLM](https://arxiv.org/pdf/2103.10360.pdf) classification model (or any other model) is easy with our library.\n\n    ```python\n        class ClassificationModel(GLMModel): # can also be BertModel, RobertaModel, etc. \n            def __init__(self, args, transformer=None, **kwargs):\n                super().__init__(args, transformer=transformer, **kwargs)\n                self.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))\n                # Arm an arbitrary model with Prefix-tuning with this line!\n                self.add_mixin('prefix-tuning', PrefixTuningMixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.prefix_len))\n    ```\n\n    - GPT and other auto-regressive models act differently during training and inference. During inference, text is generated token by token, and previous states must be cached for efficiency. With our library, you only need to define the training-time behavior (teacher forcing) and convert it to a cached auto-regressive model by adding a mixin:\n\n    ```python\n        model, args = AutoModel.from_pretrained('glm-10b-chinese', args)\n        model.add_mixin('auto-regressive', CachedAutoregressiveMixin())\n        # Generate a sequence with beam search\n        from sat.generation.autoregressive_sampling import filling_sequence\n        from sat.generation.sampling_strategies import BeamSearchStrategy\n        output, *mems = filling_sequence(model, input_seq,\n                        batch_size=args.batch_size,\n                        strategy=BeamSearchStrategy(args.batch_size))\n    ```     \n\n\n* **Build your Transformer-based model with minimal code**. 
We mentioned [GLM](https://arxiv.org/pdf/2103.10360.pdf), which differs from the standard Transformer (called BaseModel) only in its position embedding (and training losses). We only need to focus on the related part when coding.\n\n    \u003cdetails\u003e\u003csummary\u003eExpand the whole definition: \u003c/summary\u003e\u003cp\u003e\n\n    ```python\n    class BlockPositionEmbeddingMixin(BaseMixin):\n        # Here define parameters for the mixin\n        def __init__(self, max_sequence_length, hidden_size, init_method_std=0.02):\n            super(BlockPositionEmbeddingMixin, self).__init__()\n            self.max_sequence_length = max_sequence_length\n            self.hidden_size = hidden_size\n            self.block_position_embeddings = torch.nn.Embedding(max_sequence_length, hidden_size)\n            torch.nn.init.normal_(self.block_position_embeddings.weight, mean=0.0, std=init_method_std)\n        \n        # Here define the method for the mixin\n        def position_embedding_forward(self, position_ids, **kwargs):\n            position_ids, block_position_ids = position_ids[:, 0], position_ids[:, 1]\n            position_embeddings = self.transformer.position_embeddings(position_ids)\n            block_position_embeddings = self.block_position_embeddings(block_position_ids)\n            return position_embeddings + block_position_embeddings\n\n    class GLMModel(BaseModel):\n        def __init__(self, args, transformer=None):\n            super().__init__(args, transformer=transformer)\n            self.add_mixin('block_position_embedding', \n                BlockPositionEmbeddingMixin(args.max_sequence_length, args.hidden_size)\n            ) # Add the mixin for GLM\n    ```\n\n*  **Comprehensive support for training**. 
`sat` aims to provide the best practice for pretraining and finetuning: you only need to implement `forward_step` and `create_dataset_function`, and can adjust useful training configurations through hyperparameters.\n    - Extend the training to multiple GPUs or nodes by specifying `--num_nodes`, `--num_gpus` and a simple `hostfile`. \n    - DeepSpeed and model parallelism.\n    - Better integration of ZeRO-2 and activation checkpointing.\n    - Automatic extension and shuffling of training data, and `memmap`. \n    - Successfully supports the training of [CogView2](http://github.com/THUDM/CogView2) and [CogVideo](https://github.com/THUDM/cogvideo).\n    - Currently the only open-source codebase supporting finetuning of [T5-10B](https://arxiv.org/abs/1910.10683) on GPUs.\n\n\u003c/p\u003e\u003c/details\u003e\n\n\n# Quick Tour\n\nThe most typical Python file for using `Bert` in sat (for inference) is as follows:\n```python\n# @File: inference_bert.py\nfrom sat import get_args, get_tokenizer, AutoModel\n# Parse args, initialize the environment. This is necessary.\nargs = get_args() \n# Automatically download and load model. 
Will also dump model-related hyperparameters to args.\nmodel, args = AutoModel.from_pretrained('bert-base-uncased', args) \n# Get the BertTokenizer according to args.tokenizer_type (automatically set).\ntokenizer = get_tokenizer(args) \n# Use BERT as you want here!\n# ...\n```\nThen we can run the code via\n```bash\n    SAT_HOME=/path/to/download python inference_bert.py --mode inference\n```\nAll officially supported model names are in [urls.py](sat/resources/urls.py).\n\nFinetuning or pretraining a transformer is also extremely easy!\n```python\n# @File: finetune_bert.py\nimport torch\nfrom sat import get_args, get_tokenizer, AutoModel\nfrom sat.model.mixins import MLPHeadMixin\nfrom sat.training.deepspeed_training import training_main\n\ndef create_dataset_function(path, args):\n    # Load the dataset here\n    # ...\n    assert isinstance(dataset, torch.utils.data.Dataset)\n    return dataset\n\ndef forward_step(data_iterator, model, args, timers):\n    inputs = next(data_iterator) # from the dataset of create_dataset_function.\n    loss, *others = model(inputs)\n    return loss\n    \n# Parse args, initialize the environment. This is necessary.\nargs = get_args() \nmodel, args = AutoModel.from_pretrained('bert-base-uncased', args) \ntokenizer = get_tokenizer(args) \n# Replace the pretraining head with a classification head.\nmodel.del_mixin('bert-final')\nmodel.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))\n# ONE LINE to train! \n# args already includes hyperparams such as lr, train-iters, zero-stage ...\ntraining_main(args, \n    model_cls=model, \n    forward_step_function=forward_step, # user-defined\n    create_dataset_function=create_dataset_function # user-defined\n)\n```\nThen we can run the code via\n```shell\ndeepspeed --include localhost:0,1 finetune_bert.py \\\n    --experiment-name ftbert \\\n    --mode finetune --train-iters 1000 --save /path/to/save \\\n    --train-data /path/to/train --valid-data /path/to/valid \\\n    --lr 0.00002 --batch-size 8 --zero-stage 1 --fp16\n```\nHere we use data parallelism on GPUs 0 and 1. 
We can also launch the training on many inter-connected machines via `--hostfile /path/to/hostfile`. See the tutorial for more details.\n\nTo write your own model, you only need to consider how it differs from the standard Transformer. For example, if you have an idea to improve the attention operation:\n```python\nimport torch\nfrom sat.model import BaseMixin\nclass MyAttention(BaseMixin):\n    def __init__(self, hidden_size):\n        super(MyAttention, self).__init__()\n        # MyAttention may need some new params, e.g. a learnable alpha.\n        self.learnable_alpha = torch.nn.Parameter(torch.ones(hidden_size))\n    \n    # This is a hook function, the name `attention_fn` is special.\n    def attention_fn(self, q, k, v, mask, dropout=None, **kwargs):\n        # Code for my attention.\n        # ...\n        return attention_results\n```\nHere `attention_fn` is a hook function, which replaces the default behavior with the new function. All available hooks are in [transformer_defaults.py](/sat/transformer_defaults.py). \nNow we can use `add_mixin` to apply our change to all the transformers, such as BERT, ViT and CogView. See the tutorial for more details. \n\n## Tutorials \n* [How to use pretrained models collected in sat?](tutorials/model_usage)\n* [Why and how to train models in sat?](tutorials/training)\n\n# Citation\nCurrently we don't have a paper, so you don't need to formally cite us!~ \n\nIf this project helps your research or engineering, use `\\footnote{https://github.com/THUDM/SwissArmyTransformer}` to mention us and recommend `SwissArmyTransformer` to others.\n\nThe tutorial for contributing to sat is on the way!\n\nThe project is based on (a user of) DeepSpeed, Megatron-LM and Huggingface transformers. 
Thanks for their awesome work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FSwissArmyTransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTHUDM%2FSwissArmyTransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FSwissArmyTransformer/lists"}