{"id":13752806,"url":"https://github.com/microsoft/fastformers","last_synced_at":"2025-05-14T23:02:12.267Z","repository":{"id":38613767,"uuid":"290981320","full_name":"microsoft/fastformers","owner":"microsoft","description":"FastFormers - highly efficient transformer models for NLU","archived":false,"fork":false,"pushed_at":"2025-03-21T21:56:58.000Z","size":27375,"stargazers_count":706,"open_issues_count":2,"forks_count":53,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-05-07T23:46:49.891Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-28T07:32:22.000Z","updated_at":"2025-04-18T07:51:39.000Z","dependencies_parsed_at":"2024-08-03T09:14:45.477Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/fastformers","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffastformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffastformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffastformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffastformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url
":"https://codeload.github.com/microsoft/fastformers/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254243353,"owners_count":22038044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:11.245Z","updated_at":"2025-05-14T23:02:12.160Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Transformer库与优化"],"sub_categories":[],"readme":"# FastFormers\n\n**FastFormers** provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU) including the demo models showing **233.87x speed-up** (Yes, 233x on CPU with the multi-head self-attentive Transformer architecture. This is not an LSTM or an RNN). The details of the methods and analyses are described in the paper *FastFormers: Highly Efficient Transformer Models for Natural Language Understanding* [paper](https://arxiv.org/abs/2010.13382).\n\n\n### Notes\n\n- **(June 3, 2021) The public onnxruntime (v1.8.0) now supports all FastFormers models.** Special thanks to @yufenglee and onnxruntime team.\n- (Nov. 4, 2020) We are actively working with Hugging Face and onnxruntime team so that you can utilize the features out of the box of huggingface's transformers and onnxruntime. Please stay tuned.\n- With this repository, you can replicate the results presented in the *FastFormers* paper.\n- The demo models of *FastFormers* are implemented with [SuperGLUE](https://super.gluebenchmark.com/) benchmark. 
The data processing pipeline is based on Alex Wang's [reference code](https://github.com/W4ngatang/transformers/tree/superglue) for [SustaiNLP](https://sites.google.com/view/sustainlp2020/home), which is a fork of HuggingFace's [transformers](https://github.com/huggingface/transformers) repository. \n- This repository is built on top of several open source projects including [transformers](https://github.com/huggingface/transformers) from HuggingFace, [onnxruntime](https://github.com/Microsoft/onnxruntime), [transformers](https://github.com/W4ngatang/transformers/tree/superglue) from Alex Wang, [FBGEMM](https://github.com/pytorch/FBGEMM), [TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) and others.\n\n\n## Requirements\n\n- *FastFormers* currently only supports Linux operating systems.\n- CPU requirements:\n  * CPUs equipped with at least one of the `AVX2` and `AVX512` instruction sets are required. To get the full speed improvements and accuracy, the `AVX512` instruction set is required. We have tested our runtime on Intel CPUs.\n- GPU requirements:\n  * To utilize 16-bit floating point speed-up, GPUs with Volta or later architectures are required.\n- onnxruntime v1.8.0+ is required to run *FastFormers* models.\n- This repository is a branch of [transformers](https://github.com/huggingface/transformers), so you need to uninstall any pre-existing transformers package in your Python environment.\n\n\n## Installation\n\nThis repo is tested on Python 3.6 and 3.7, PyTorch 1.5.0+.\n\nYou need to uninstall the pre-existing transformers package, as this repository uses a customized version of it.\n\nYou need to install PyTorch 1.5.0+. Then, execute the following bash commands. 
You need to install onnxruntime 1.8.0+.\n\n\n```bash\npip install onnxruntime==1.8.0 --user --upgrade --no-deps --force-reinstall\npip uninstall transformers -y\ngit clone https://github.com/microsoft/fastformers\ncd fastformers\npip install .\n```\n\n\n## Run the demo systems\n\nAll the models used to benchmark Table 3 in the paper are publicly shared. You can use the commands below to reproduce the results. The Table 3 measurements were taken on an Azure F16s_v2 instance.\n\n![Table3](examples/fastformers/table3.png)\n\nThe [installation step](#installation) needs to be done before proceeding.\n\n0. Download the [SuperGLUE](https://super.gluebenchmark.com/) dataset and decompress it.\n\n1. Download the demo model files and decompress them.\n```bash\nwget https://github.com/microsoft/fastformers/releases/download/v0.1-model/teacher-bert-base.tar.gz\nwget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-4L-312.tar.gz\nwget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-pruned-8h-600.tar.gz\nwget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-pruned-9h-900.tar.gz\n```\n\n2. Run the teacher model (BERT-base) baseline\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --model_type bert --model_name_or_path ${teacher_model} \\\n        --task_name BoolQ --output_dir ${out_dir} --do_eval  \\\n        --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n        --use_fixed_seq_length --do_lower_case --max_seq_length 512 \\\n        --no_cuda\n```\n\n3. Run the teacher model (BERT-base) with dynamic sequence length\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --model_type bert --model_name_or_path ${teacher_model} \\\n        --task_name BoolQ --output_dir ${out_dir} --do_eval  \\\n        --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n        --do_lower_case --max_seq_length 512 --no_cuda\n```\n\n4. 
Run the distilled student model (PyTorch)\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --model_type bert --model_name_or_path ${student_model} \\\n        --task_name BoolQ --output_dir ${out_dir} --do_eval  \\\n        --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n        --do_lower_case --max_seq_length 512 --no_cuda\n```\n\n5. Run the distilled student with 8-bit quantization (onnxruntime)\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --model_type bert --model_name_or_path ${student_model} \\\n        --task_name BoolQ --output_dir ${out_dir} --do_eval \\\n        --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n        --do_lower_case --max_seq_length 512 --use_onnxrt --no_cuda\n```\n\n6. Run the distilled student with 8-bit quantization + multi-instance inference (onnxruntime)\n```bash\nOMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \\\n                          --model_type bert \\\n                          --model_name_or_path ${student_model} \\\n                          --task_name BoolQ --output_dir ${out_dir} --do_eval \\\n                          --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n                          --do_lower_case --max_seq_length 512 --use_onnxrt \\\n                          --threads_per_instance 1 --no_cuda\n```\n\n7. 
Run the distilled + pruned student with 8-bit quantization + multi-instance inference (onnxruntime)\n```bash\nOMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \\\n                          --model_type bert \\\n                          --model_name_or_path ${pruned_student_model} \\\n                          --task_name BoolQ --output_dir ${out_dir} --do_eval \\\n                          --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n                          --do_lower_case --max_seq_length 512 --use_onnxrt \\\n                          --threads_per_instance 1 --no_cuda\n```\n\n\n## How to create FastFormers\n\n### Training models\n\nThis is used for fine-tuning pretrained or general distilled models (*task-agnostic distillation*) on the downstream tasks.\nCurrently, BERT and RoBERTa models are supported.\n\n*Tip 1.* This repository is based on transformers, so you can use huggingface's pre-trained models. (e.g. set `distilroberta-base` for --model_name_or_path to use [distilroberta-base](https://huggingface.co/distilroberta-base))\n\n*Tip 2.* Before fine-tuning models, you can change the activation functions to **ReLU** to get better inference speed. To do this, you can download the config file of your model and manually change it to `relu` (`hidden_act` in the case of BERT and RoBERTa models). 
Then, you can specify the config file by adding the --config_name parameter.\n\n*Tip 3.* Depending on the task and the models used, you can add --do_lower_case if it gives better accuracy.\n\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --data_dir ${data_dir} --task_name ${task} \\\n        --output_dir ${out_dir} --model_type ${model_type} \\\n        --model_name_or_path ${model} \\\n        --use_gpuid ${gpuid} --seed ${seed} \\\n        --do_train --max_seq_length ${seq_len_train} \\\n        --do_eval --eval_and_save_steps ${eval_freq} --save_only_best \\\n        --learning_rate 0.00001 \\\n        --warmup_ratio 0.06 --weight_decay 0.01 \\\n        --per_gpu_train_batch_size 4 \\\n        --gradient_accumulation_steps 1 \\\n        --logging_steps 100 --num_train_epochs 10 \\\n        --overwrite_output_dir --per_instance_eval_batch_size 8\n```\n\n### Distilling models\n\nThis is used for distilling fine-tuned teacher models into smaller student models (*task-specific distillation*) on the downstream tasks. As described in the paper, it is critical to initialize student models with general distilled models such as *distilbert-*, *distilroberta-base* and *TinyBERT*.\n\nThis command is also used to distill non-pruned models into pruned models.\n\nThis command always uses a task-specific logit loss between teacher and student models for the student training. You can add additional losses for hidden states (including token embedding) and attentions between teacher and student. 
To use hidden states and attentions distillation, the number of teacher layers should be a multiple of the number of student layers.\n\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --data_dir ${data_dir} --task_name ${task} \\\n        --output_dir ${out_dir} --teacher_model_type ${teacher_model_type} \\\n        --teacher_model_name_or_path ${teacher_model} \\\n        --model_type ${student_model_type} --model_name_or_path ${student_model} \\\n        --use_gpuid ${gpuid} --seed ${seed} \\\n        --do_train --max_seq_length ${seq_len_train} \\\n        --do_eval --eval_and_save_steps ${eval_freq} --save_only_best \\\n        --learning_rate 0.00001 \\\n        --warmup_ratio 0.06 --weight_decay 0.01 \\\n        --per_gpu_train_batch_size 4 \\\n        --gradient_accumulation_steps 1 \\\n        --logging_steps 100 --num_train_epochs 10 \\\n        --overwrite_output_dir --per_instance_eval_batch_size 8 \\\n        --state_loss_ratio 0.1\n```\n\n### Pruning models\n\nThis command performs structured pruning on the models described in the paper. It reduces the number of heads and the intermediate hidden states of FFN as set in the options. When the pruning is done on GPU, only 1 GPU is utilized (no multi-GPU).\n\nTo get better accuracy, you can do another round of knowledge distillation after the pruning.\n\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --data_dir ${data_dir} --task_name ${task} \\\n        --output_dir ${out_dir} --model_type ${model_type} \\\n        --model_name_or_path ${model} --do_eval \\\n        --do_prune --max_seq_length ${seq_len_train} \\\n        --per_instance_eval_batch_size 1 \\\n        --target_num_heads 8 --target_ffn_dim 600\n```\n\n### Optimizing models on CPU (8-bit integer quantization + onnxruntime)\n\nThis command converts your PyTorch transformers models into an optimized ONNX format with 8-bit quantization. 
The converted ONNX model is saved in the directory where the original PyTorch model is located.\n\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --task_name ${task} \\\n        --model_type ${model_type} \\\n        --model_name_or_path ${model} \\\n        --convert_onnx\n```\n\n### Optimizing models on GPU (16-bit floating point conversion)\n\nThis command converts your PyTorch transformers models into a 16-bit floating point model (PyTorch). This creates a new directory named `fp16` in the directory where the original model is located. Then, the converted fp16 model and all necessary files are saved to the directory.\n\n```bash\npython3 examples/fastformers/run_superglue.py \\\n        --task_name ${task} \\\n        --model_type ${model_type} \\\n        --model_name_or_path ${model} \\\n        --convert_fp16\n```\n\n### Evaluating models\n\nThis command evaluates various models with the PyTorch or onnxruntime engine on the given tasks. For more detailed usage, please refer to the [demo section](#run-the-demo-systems).\n\n```bash\nOMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \\\n                          --model_type bert \\\n                          --model_name_or_path ${pruned_student_model} \\\n                          --task_name BoolQ --output_dir ${out_dir} --do_eval \\\n                          --data_dir ${data_dir} --per_instance_eval_batch_size 1 \\\n                          --do_lower_case --max_seq_length 512 --use_onnxrt \\\n                          --threads_per_instance 1 --no_cuda\n```\n\n## Code of Conduct\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\n\n## License\n\nThis project is licensed under the [MIT 
License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ffastformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Ffastformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ffastformers/lists"}