Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/NetEase-FuXi/EET
Easy and Efficient Transformer : Scalable Inference Solution For Large NLP model
- Host: GitHub
- URL: https://github.com/NetEase-FuXi/EET
- Owner: NetEase-FuXi
- License: apache-2.0
- Created: 2021-03-23T02:20:56.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-08-06T09:21:43.000Z (5 months ago)
- Last Synced: 2024-08-06T11:18:23.166Z (5 months ago)
- Topics: bert, bert-inference-performance, eet, gpt2, gpt2-inference-performance
- Language: Python
- Homepage:
- Size: 43.5 MB
- Stars: 257
- Watchers: 6
- Forks: 46
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - NetEase-FuXi/EET - A high-performance PyTorch inference plugin for Transformer-based large models and long-sequence scenarios. High performance: highly optimized CUDA kernels. Flexible: offers op APIs, model APIs, and pipelines for different needs. Usage: only a few lines of code are needed; it works with mainstream AI frameworks, including Fairseq and Transformers. Overall speedups of 1.2x to 7.x for BERT models and 2.x to 7.x for GPT models. (Transformer libraries and optimization)
README
## Easy and Efficient Transformer
EET (Easy and Efficient Transformer) is a friendly PyTorch inference plugin focused on Transformer-based models, built to make mega-size models affordable.
## Features
- **New**🔥: Support Baichuan, LLaMA and other LLMs.
- **New**🔥: Support int8 quantization.
- Support mega-size models on a single GPU.
- Expertise in inference for multi-modal and NLP tasks (CLIP/GPT-3/Bert/Seq2seq etc.).
- High performance: Transformer-based models run faster and faster through CUDA kernel optimization and quantization/sparsity algorithms.
- Out-of-the-box for Transformers and Fairseq: skip the trivial configuration and get your model working within a few lines.
-----

- [Easy and Efficient Transformer](#easy-and-efficient-transformer)
  - [Features](#features)
  - [Model Matrix](#model-matrix)
  - [Quick Start](#quick-start)
    - [Environment](#environment)
    - [Installation](#installation)
      - [From Source](#from-source)
      - [From Docker](#from-docker)
    - [Run](#run)
      - [Operators APIs](#operators-apis)
      - [Model APIs](#model-apis)
      - [Application APIs](#application-apis)
  - [Performance](#performance)
  - [Cite Us](#cite-us)
  - [Video](#video)
  - [Contact us](#contact-us)

## Model Matrix
| model type     | Transformers | Fairseq | Quantization | SpeedUp | Since version |
|:--------------:|:------------:|:-------:|:------------:|:-------:|:-------------:|
| GPT-3          | ✅           | ✅      | ✅           | 2~8x    | 0.0.1 beta    |
| Bert           | ✅           | ✅      | X            | 1~5x    | 0.0.1 beta    |
| ALBert         | ✅           | ✅      | X            | 1~5x    | 0.0.1 beta    |
| Roberta        | ✅           | X       | X            | 1~5x    | 0.0.1 beta    |
| T5             | ✅           | X       | X            | 4~8x    | 1.0           |
| ViT            | ✅           | X       | X            | 1~5x    | 1.0           |
| CLIP (GPT+ViT) | ✅           | X       | X            | 2~4x    | 1.0           |
| Distillbert    | ✅           | X       | X            | 1~2x    | 1.0           |
| Baichuan       | ✅           | X       | ✅           | 1~2x    | 2.0           |
| LLaMA          | ✅           | X       | ✅           | 1~2x    | 2.0           |
## Quick Start
### Environment
* cuda:>=11.4
* python:>=3.7
* gcc:>= 7.4.0
* torch:>=1.12.0
* numpy:>=1.19.1
* fairseq:==0.10.0
* transformers:>=4.31.0

The above environment is the minimum configuration; it is best to use a newer version.
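If you want to quickly confirm that a machine meets these requirements, a short version check like the one below can help (an illustrative sketch; the thresholds in the comments come from the list above):

```python
# Illustrative check of the minimum versions listed above.
import torch
import transformers
import numpy

print("torch        :", torch.__version__)        # expect >= 1.12.0
print("CUDA (torch) :", torch.version.cuda)       # expect >= 11.4
print("transformers :", transformers.__version__) # expect >= 4.31.0
print("numpy        :", numpy.__version__)        # expect >= 1.19.1
```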
### Installation
We recommend using the Docker images.
#### From Source
If you are installing from source, you will need to install the necessary [environment](#environment). Then proceed as follows:

```bash
$ git clone https://github.com/NetEase-FuXi/EET.git
$ pip install .
```
We recommend using nvcr.io/nvidia/pytorch:23.04-py3 or other images in that series; you can also use the provided Dockerfile.

#### From Docker
```bash
$ git clone https://github.com/NetEase-FuXi/EET.git
$ docker build -t eet_docker:0.1 .
$ nvidia-docker run -it --net=host -v /your/project/directory/:/root/workspace eet_docker:0.1 bash
```
EET and its required environment are already installed in the Docker image.
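Whichever installation path you choose, a minimal smoke test (an illustrative sketch, not an official EET script) is simply importing the package on a GPU machine:

```python
# Minimal post-install check: the EET extension should import without errors
# on a machine with a CUDA-capable GPU.
import torch
import eet

assert torch.cuda.is_available(), "EET requires a CUDA-capable GPU"
print("EET imported from:", eet.__file__)
```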
### Run

We provide three types of APIs:
- **Operators APIs**, such as embedding, masked-multi-head-attention, ffn etc. Enable you to define your custom models.
- **Model APIs**, such as TransformerDecoder, BertEncoder etc. Enable you to integrate EET into your pytorch project.
- **Application APIs**, such as Transformers Pipeline. Enable you to run your model in a few lines.

#### Operators APIs
Operators APIs are the intermediate representation of C++/CUDA and Python. We provide almost all the operators required for Transformer models. You can combine different OPs to build other model structures.
- Operators API table
| operators | python API | Remarks |
| :-------------------------: | :--------------------: | :---------------------------------------: |
| multi_head_attention | EETSelfAttention | self attention |
| masked_multi_head_attention | EETSelfMaskedAttention | causal attention |
| cross_multi_head_attention | EETCrossAttention | cross attention |
| ffn | EETFeedforward | feed forward network |
| embedding                   | EETBertEmbedding       | corresponds to Fairseq and Transformers   |
| layernorm                   | EETLayerNorm           | same as nn.LayerNorm                      |

- How to use
The definition of these OPs is in the file [EET/csrc/py11/eet2py.cpp](./csrc/py11/eet2py.cpp), and usage examples can be found in the files under [python/eet](./python/eet), which show how to combine these OPs into classic models.
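As a quick, illustrative way to see which of these operator classes your installed build actually exposes (just a listing trick, not an official EET API):

```python
# List the EET-prefixed classes exported by the python package (illustrative).
import eet

print([name for name in dir(eet) if name.startswith("EET")])
```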
#### Model APIs

As a plugin, EET provides friendly model APIs ([python/eet](./python/eet)) that integrate into Fairseq and Transformers.
All you need to do is find the corresponding class in the tables below (usually prefixed with 'EET') and initialize an object with the `from_torch` or `from_pretrained` function.
Note: We now only support **pre-padding** for GPT-3.
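For example, loading a base model API typically looks like the sketch below. This is a minimal, hedged sketch: it assumes `EETBertModel.from_pretrained` accepts the same `max_batch` and `data_type` keyword arguments used in the fill-mask example later in this section.

```python
# Hedged sketch: load the EET counterpart of transformers' BertModel.
# The max_batch / data_type keywords mirror the fill-mask example below.
import torch
from eet import EETBertModel

eet_bert = EETBertModel.from_pretrained('bert-base-uncased',
                                        max_batch=1, data_type=torch.float16)
```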
EET and Fairseq class comparison table:

| EET                         | fairseq                          | Remarks                             |
|:---------------------------:|:--------------------------------:|:-----------------------------------:|
| EETTransformerDecoder | TransformerDecoder | |
| EETTransformerDecoderLayer | TransformerDecoderLayer | |
| EETTransformerAttention | MultiheadAttention | |
| EETTransformerFeedforward | TransformerDecoderLayer | fusion of multiple small operators |
| EETTransformerEmbedding | Embedding + PositionalEmbedding | |
| EETTransformerLayerNorm     | nn.LayerNorm                     |                                     |

EET and Transformers class comparison table:
| EET | transformers | Remarks |
|:--------------------:|:------------------------------:|:-------------------------------:|
| EETBertModel | BertModel | |
| EETBertEmbedding | BertEmbeddings | |
| EETGPT2Model | GPT2Model | |
| EETGPT2Decoder | GPT2Model | Transformers has no GPT2Decoder |
| EETGPT2DecoderLayer | Block | |
| EETGPT2Attention | Attention | |
| EETGPT2Feedforward | MLP | |
| EETGPT2Embedding | nn.Embedding | |
| EETLayerNorm         | nn.LayerNorm                   |                                 |

In addition to the basic model types above, we have extended some task-specific APIs to support different tasks. The table below shows part of our task-specific model APIs:
| EET | transformers | Remarks |
|:---------------------------------:|:------------------------------:|:----:|
| EETBertForPreTraining | BertForPreTraining | |
| EETBertLMHeadModel | BertLMHeadModel | |
| EETBertForMaskedLM | BertForMaskedLM | |
| EETBertForNextSentencePrediction | BertForNextSentencePrediction | |
| EETBertForSequenceClassification | BertForSequenceClassification | |
| EETBertForMultipleChoice | BertForMultipleChoice | |
| EETBertForTokenClassification | BertForTokenClassification | |
| EETBertForQuestionAnswering       | BertForQuestionAnswering       |      |

- How to use
Here is a code snippet showing how to use the model APIs. You can build your application directly on the task-specific APIs. Below is a fill-mask example:

```python
import torch
from eet import EETRobertaForMaskedLM
from transformers import RobertaTokenizer

max_batch_size = 1
data_type = torch.float16
input = ["My <mask> is Sarah and I live in London"]
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
eet_roberta_model = EETRobertaForMaskedLM.from_pretrained('roberta-base', max_batch=max_batch_size, data_type=data_type)
# first step: tokenize
model_inputs = tokenizer(input, return_tensors='pt')
masked_index = torch.nonzero(model_inputs['input_ids'][0] == tokenizer.mask_token_id, as_tuple=False).squeeze(-1)
# second step: predict
prediction_scores = eet_roberta_model(model_inputs['input_ids'].cuda(), attention_mask=model_inputs['attention_mask'])
# third step: take the argmax over the masked position
predicted_index = torch.argmax(prediction_scores.logits[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
```

For more examples, please refer to [example/python/models](example/python/models/).
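The other task-specific classes in the table follow the same pattern. Here is an illustrative sketch (it assumes `EETBertForSequenceClassification` accepts the same `max_batch`/`data_type` arguments and returns an object with a `logits` field, like the example above):

```python
# Hedged sketch: sequence classification with a task-specific EET API,
# mirroring the fill-mask example above.
import torch
from transformers import BertTokenizer
from eet import EETBertForSequenceClassification

max_batch_size = 1
data_type = torch.float16

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
eet_bert_cls = EETBertForSequenceClassification.from_pretrained(
    'bert-base-uncased', max_batch=max_batch_size, data_type=data_type)

model_inputs = tokenizer(["EET makes transformer inference fast"], return_tensors='pt')
outputs = eet_bert_cls(model_inputs['input_ids'].cuda(),
                       attention_mask=model_inputs['attention_mask'])
predicted_class = torch.argmax(outputs.logits, dim=-1)
```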
#### Application APIs
EET provides ready-made pipelines to simplify building applications for different tasks, without using the model APIs above.
Here is an example:
```python
import torch
from eet import pipeline
max_batch_size = 1
model_path = 'roberta-base'
data_type = torch.float16
input = ["My is Sarah and I live in London"]
nlp = pipeline("fill-mask",model = model_path,data_type = data_type,max_batch_size = max_batch_size)
out = nlp(input)
```

Now we support these tasks:
| Task | Since version |
|:-------------------------------|:---:|
| text-classification | 1.0 |
| token-classification | 1.0 |
| question-answering | 1.0 |
| fill-mask | 1.0 |
| text-generation | 1.0 |
| image-classification | 1.0 |
| zero_shot_image_classification | 1.0 |

For more examples, please refer to [example/python/pipelines](./example/python/pipelines).
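As another illustrative sketch, a text-generation pipeline can presumably be built the same way as the fill-mask pipeline above (the `gpt2` model id and the call signature here are assumptions, not documented EET behaviour):

```python
# Hedged sketch: text generation through the EET pipeline API,
# using the same arguments as the fill-mask pipeline example above.
import torch
from eet import pipeline

nlp = pipeline("text-generation", model='gpt2',
               data_type=torch.float16, max_batch_size=1)
out = nlp(["My name is Sarah and I"])
```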
## Performance
Detailed performance data of GPT-3 and Bert model inference can be viewed at [link](https://github.com/NetEase-FuXi/EET/blob/main/doc/benchmark.md).
Benchmark charts (see the linked document for details):

* GPT-3 on A100
* Bert on 2080ti
* Llama13B on 3090
## Cite Us
If you use EET in your research, please cite the following paper.
```
@misc{https://doi.org/10.48550/arxiv.2104.12470,
  doi = {10.48550/ARXIV.2104.12470},
  url = {https://arxiv.org/abs/2104.12470},
  author = {Li, Gongzheng and Xi, Yadong and Ding, Jingzhen and Wang, Duan and Liu, Bai and Fan, Changjie and Mao, Xiaoxi and Zhao, Zeng},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model},
  publisher = {arXiv},
  year = {2021}
}
```

## Video
We gave a talk on ZhiYuan LIVE; link: https://event.baai.ac.cn/activities/325.

## Contact us
You can report problems via GitHub issues. You can also contact us by email: