Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/NetEase-FuXi/EET
Easy and Efficient Transformer : Scalable Inference Solution For Large NLP model
- Host: GitHub
- URL: https://github.com/NetEase-FuXi/EET
- Owner: NetEase-FuXi
- License: apache-2.0
- Created: 2021-03-23T02:20:56.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-08-06T09:21:43.000Z (3 months ago)
- Last Synced: 2024-08-06T11:18:23.166Z (3 months ago)
- Topics: bert, bert-inference-performance, eet, gpt2, gpt2-inference-performance
- Language: Python
- Homepage:
- Size: 43.5 MB
- Stars: 257
- Watchers: 6
- Forks: 46
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - NetEase-FuXi/EET - A high-performance PyTorch inference plugin for Transformer-based large models and long-sequence scenarios. High performance: highly optimized CUDA kernels. Flexible: provides op APIs, model APIs, and pipelines to cover different needs. Usage: only a few lines of code are needed. Compatible with mainstream AI frameworks, including Fairseq and Transformers. Overall speedups of 1.2x to 7.x for BERT models and 2.x to 7.x for GPT models. (Transformer libraries and optimization)
README
## Easy and Efficient Transformer
EET (Easy and Efficient Transformer) is a friendly PyTorch inference plugin focused on Transformer-based models, built to make mega-size models affordable.
## Features
- **New**🔥: Support Baichuan, LLaMA and other LLMs.
- **New**🔥: Support int8 quantization.
- Support Mega-size model with single GPU.
- Expertise in inference for multi-modal and NLP tasks (CLIP/GPT-3/Bert/Seq2seq etc.).
- High performance. Makes Transformer-based models faster through CUDA kernel optimization and quantization/sparsity algorithms.
- Out-of-the-box support for Transformers and Fairseq. Saves you the pain of trivial configuration and gets your model working within a few lines.
-----

- [Easy and Efficient Transformer](#easy-and-efficient-transformer)
- [Features](#features)
- [Model Matrix](#model-matrix)
- [Quick Start](#quick-start)
- [Environment](#environment)
- [Installation](#installation)
- [From Source](#from-source)
- [From Docker](#from-docker)
- [Run](#run)
- [Operators APIs](#operators-apis)
- [Model APIs](#model-apis)
- [Application APIs](#application-apis)
- [Performance](#performance)
- [Cite Us](#cite-us)
- [Video](#video)
- [Contact us](#contact-us)

## Model Matrix
| Model type    | Transformers | Fairseq | Quantization | SpeedUp | Since version |
|:-------------:|:------------:|:-------:|:------------:|:-------:|:-------------:|
| GPT-3         | ✅           | ✅      | ✅           | 2~8x    | 0.0.1 beta    |
| Bert          | ✅           | ✅      | X            | 1~5x    | 0.0.1 beta    |
| ALBert        | ✅           | ✅      | X            | 1~5x    | 0.0.1 beta    |
| Roberta       | ✅           | X       | X            | 1~5x    | 0.0.1 beta    |
| T5            | ✅           | X       | X            | 4~8x    | 1.0           |
| ViT           | ✅           | X       | X            | 1~5x    | 1.0           |
| CLIP(GPT+ViT) | ✅           | X       | X            | 2~4x    | 1.0           |
| Distillbert   | ✅           | X       | X            | 1~2x    | 1.0           |
| Baichuan      | ✅           | X       | ✅           | 1~2x    | 2.0           |
| LLaMA         | ✅           | X       | ✅           | 1~2x    | 2.0           |
## Quick Start
### Environment
* cuda:>=11.4
* python:>=3.7
* gcc:>= 7.4.0
* torch:>=1.12.0
* numpy:>=1.19.1
* fairseq:==0.10.0
* transformers:>=4.31.0

The above versions are the minimum configuration; it is best to use newer versions.
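As a quick sanity check (plain version queries, nothing EET-specific), you can print the installed versions and compare them against the minimums above:

```python
# Print installed versions to compare against the minimums above.
import numpy
import torch
import transformers
import fairseq

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("fairseq:", fairseq.__version__)
print("numpy:", numpy.__version__)
```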
### Installation
We recommend using the Docker images.
#### From Source
If you are installing from source, you will need to install the necessary [environment](#environment) first. Then proceed as follows:

```bash
$ git clone https://github.com/NetEase-FuXi/EET.git
$ pip install .
```
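Once the build finishes, a minimal way to verify the installation is to import the package (the import name `eet` matches the examples below):

```python
# Verify the installation; a successful import means the extension built correctly.
import eet
print("EET installed at:", eet.__file__)
```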
#### From Docker

We recommend the nvcr.io/nvidia/pytorch:23.04-py3 series of images; you can also use the provided Dockerfile.
```bash
$ git clone https://github.com/NetEase-FuXi/EET.git
$ docker build -t eet_docker:0.1 .
$ nvidia-docker run -it --net=host -v /your/project/directory/:/root/workspace eet_docker:0.1 bash
```
EET and its required environment are already installed in the Docker image.

### Run
We provide three types of APIs:
- **Operators APIs**, such as embedding, masked-multi-head-attention, ffn etc. Enable you to define your custom models.
- **Model APIs**, such as TransformerDecoder, BertEncoder etc. Enable you to integrate EET into your pytorch project.
- **Application APIs**, such as Transformers Pipeline. Enable you to run your model in a few lines.

#### Operators APIs
Operators APIs are the intermediate representation of C++/CUDA and Python. We provide almost all the operators required for Transformer models. You can combine different OPs to build other model structures.
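Before diving into the table below, a quick way to see which operator classes your installed build actually exports is to list the EET-prefixed names at the package top level (a minimal sketch; the exact set varies by EET version):

```python
import eet

# Enumerate the EET-prefixed classes exported at the package top level.
# The exact set depends on the installed EET version.
print("\n".join(sorted(n for n in dir(eet) if n.startswith("EET"))))
```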
- Operators API table
| operators | python API | Remarks |
| :-------------------------: | :--------------------: | :---------------------------------------: |
| multi_head_attention | EETSelfAttention | self attention |
| masked_multi_head_attention | EETSelfMaskedAttention | causal attention |
| cross_multi_head_attention | EETCrossAttention | cross attention |
| ffn | EETFeedforward | feed forward network |
|          embedding          |    EETBertEmbedding    | correspondence to Fairseq and Transformers |
|          layernorm          |      EETLayerNorm      |            same as nn.LayerNorm            |

- How to use
The definitions of these OPs are in [EET/csrc/py11/eet2py.cpp](./csrc/py11/eet2py.cpp), and usage examples in the files under [python/eet](./python/eet) show how to use those OPs to build classic models.

#### Model APIs
As a plugin, EET provides friendly model APIs ([python/eet](./python/eet)) that integrate into Fairseq and Transformers.
All you need to do is find the corresponding class in the tables below (usually prefixed with 'EET') and initialize an object with the from_torch or from_pretrained function.
Note: We now only support **pre-padding** for GPT-3.
EET and Fairseq class comparison table:

| EET | fairseq | Remarks |
|:---------------------------:|:--------------------------------:|:-----------------------------------:|
| EETTransformerDecoder | TransformerDecoder | |
| EETTransformerDecoderLayer | TransformerDecoderLayer | |
| EETTransformerAttention | MultiheadAttention | |
| EETTransformerFeedforward | TransformerDecoderLayer | fusion of multiple small operators |
| EETTransformerEmbedding | Embedding + PositionalEmbedding | |
| EETTransformerLayerNorm     | nn.LayerNorm                     |                                      |

EET and Transformers class comparison table:
| EET | transformers | Remarks |
|:--------------------:|:------------------------------:|:-------------------------------:|
| EETBertModel | BertModel | |
| EETBertEmbedding | BertEmbeddings | |
| EETGPT2Model | GPT2Model | |
| EETGPT2Decoder | GPT2Model | Transformers has no GPT2Decoder |
| EETGPT2DecoderLayer | Block | |
| EETGPT2Attention | Attention | |
| EETGPT2Feedforward | MLP | |
| EETGPT2Embedding | nn.Embedding | |
| EETLayerNorm         | nn.LayerNorm                   |                                 |

In addition to the basic model types above, we have extended some task-specific APIs to support different tasks. The table below shows part of our task-specific model APIs:
| EET | transformers | Remarks |
|:---------------------------------:|:------------------------------:|:----:|
| EETBertForPreTraining | BertForPreTraining | |
| EETBertLMHeadModel | BertLMHeadModel | |
| EETBertForMaskedLM | BertForMaskedLM | |
| EETBertForNextSentencePrediction | BertForNextSentencePrediction | |
| EETBertForSequenceClassification | BertForSequenceClassification | |
| EETBertForMultipleChoice | BertForMultipleChoice | |
| EETBertForTokenClassification | BertForTokenClassification | |
| EETBertForQuestionAnswering       | BertForQuestionAnswering      |      |

- How to use

This is a code snippet showing how to use the model APIs; you can build your application directly on the task-specific APIs.
Here is a fill-mask example:

```python
import torch
from eet import EETRobertaForMaskedLM
from transformers import RobertaTokenizer

max_batch_size = 1
data_type = torch.float16
# The <mask> token is required for the fill-mask task.
input = ["My <mask> is Sarah and I live in London"]
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
eet_roberta_model = EETRobertaForMaskedLM.from_pretrained('roberta-base', max_batch=max_batch_size, data_type=data_type)
# first step: tokenize
model_inputs = tokenizer(input, return_tensors='pt')
masked_index = torch.nonzero(model_inputs['input_ids'][0] == tokenizer.mask_token_id, as_tuple=False).squeeze(-1)
# second step: predict
prediction_scores = eet_roberta_model(model_inputs['input_ids'].cuda(), attention_mask=model_inputs['attention_mask'])
# third step: argmax over the masked position
predicted_index = torch.argmax(prediction_scores.logits[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
```

For more examples, please refer to [example/python/models](example/python/models/).
#### Application APIs
EET provides ready-made pipelines that simplify application building for different tasks without using the model APIs above.
Here is an example :
```python
import torch
from eet import pipeline
max_batch_size = 1
model_path = 'roberta-base'
data_type = torch.float16
input = ["My is Sarah and I live in London"]
nlp = pipeline("fill-mask",model = model_path,data_type = data_type,max_batch_size = max_batch_size)
out = nlp(input)
```

Now we support these tasks:
| Task | Since version |
|:-------------------------------|:---:|
| text-classification | 1.0 |
| token-classification | 1.0 |
| question-answering | 1.0 |
| fill-mask | 1.0 |
| text-generation | 1.0 |
| image-classification | 1.0 |
| zero_shot_image_classification | 1.0 |

For more examples, please refer to [example/python/pipelines](./example/python/pipelines).
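The other tasks follow the same call pattern as the fill-mask example above. As a minimal sketch, a text-generation pipeline could look like this (using 'gpt2' as the model path is an assumption for illustration):

```python
import torch
from eet import pipeline

# Same pipeline signature as the fill-mask example above;
# the 'gpt2' model path is an assumption for illustration.
nlp = pipeline("text-generation", model='gpt2', data_type=torch.float16, max_batch_size=1)
print(nlp(["My name is Sarah and I live in"]))
```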
## Performance
Detailed performance data of GPT-3 and Bert model inference can be viewed at [link](https://github.com/NetEase-FuXi/EET/blob/main/doc/benchmark.md).
* GPT-3 on A100
* Bert on 2080ti
* Llama13B on 3090
## Cite Us
If you use EET in your research, please cite the following paper.
```
@misc{https://doi.org/10.48550/arxiv.2104.12470,
  doi = {10.48550/ARXIV.2104.12470},
  url = {https://arxiv.org/abs/2104.12470},
  author = {Li, Gongzheng and Xi, Yadong and Ding, Jingzhen and Wang, Duan and Liu, Bai and Fan, Changjie and Mao, Xiaoxi and Zhao, Zeng},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Easy and Efficient Transformer : Scalable Inference Solution For large NLP model},
  publisher = {arXiv},
  year = {2021}
}
```

## Video
We gave a talk on ZhiYuan LIVE: https://event.baai.ac.cn/activities/325.

## Contact us
You can report problems via GitHub issues. You can also contact us by email: