Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
bert-in-production
A collection of resources on using BERT (https://arxiv.org/abs/1810.04805) and related Language Models in production environments.
https://github.com/DomHudson/bert-in-production
Descriptive Resources
- BERT to the rescue!
- BERT Technology introduced in 3-minutes
- BERT, RoBERTa, DistilBERT, XLNet — which one to use?
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
- The Illustrated Transformer
- Sequence-to-Sequence Modeling with `nn.Transformer` and `torchtext`
- Exploring BERT's Vocabulary
- Pre-training BERT from scratch with cloud TPU
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
Implementations
- pytorch/fairseq
- google-research/google-research
- hanxiao/bert-as-service
- microsoft/onnxruntime - Open-sourced by Microsoft; it contains several model-specific optimisations, including one for transformer models. A model's architecture is compiled into the Open Neural Network Exchange (ONNX) standard and optionally optimised for a specific platform's hardware (see the export-and-serve sketch after this list).
- google-research/bert
- huggingface/transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. The transformers library is focused on using publicly available pretrained models and has wide support for many of the most popular varieties (see the usage sketch after this list).
- huggingface/tokenizers - State-of-the-Art Tokenizers optimized for Research and Production.
- spacy-transformers
- codertimo/BERT-pytorch
- CyberZHG/keras-bert
- kaushaltrivedi/fast-bert
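To make the huggingface/transformers entry above concrete, here is a minimal sketch of loading a pretrained BERT and pooling its hidden states into sentence vectors. The checkpoint name (`bert-base-uncased`) and the mean-pooling step are illustrative assumptions, not recommendations from the list, and the attribute-style outputs assume a recent transformers release.

```python
# Minimal sketch: encode sentences with a pretrained BERT via huggingface/transformers.
# The checkpoint name and the mean-pooling step are illustrative choices, not prescriptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer(["BERT in production"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([1, 768])
```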
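Likewise, the microsoft/onnxruntime entry describes compiling a model into the ONNX standard and serving it on optimised hardware. A hedged sketch of that round trip with `torch.onnx.export` and an `onnxruntime.InferenceSession` follows; the file name, opset version, axis names and CPU execution provider are assumptions for illustration, not the only (or necessarily best) configuration.

```python
# Sketch: export a BERT classifier to ONNX, then run it with onnxruntime.
# File name, opset version and dynamic axis names are illustrative assumptions.
import torch
import onnxruntime
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# return_dict=False makes the traced model emit plain tensors, which simplifies export.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", return_dict=False)
model.eval()

dummy = tokenizer("example input", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bert-classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# Inference through onnxruntime on the exported graph.
session = onnxruntime.InferenceSession(
    "bert-classifier.onnx", providers=["CPUExecutionProvider"]
)
feeds = tokenizer("BERT in production", return_tensors="np")
logits = session.run(
    ["logits"],
    {"input_ids": feeds["input_ids"], "attention_mask": feeds["attention_mask"]},
)[0]
print(logits.shape)
```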
Deep Analysis
General Resources
Speed
Compression
- Extreme Language Model Compression with Optimal Subwords and Shared Projections
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Compressing BERT for faster prediction
- Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
- PoWER-BERT: Accelerating BERT inference for Classification Tasks - We develop a scheme, called PoWER-BERT, for improving the inference time of the BERT model without significant loss in accuracy. The method works by eliminating word-vectors (intermediate vector outputs) from the encoder pipeline. We design a strategy for measuring the significance of the word-vectors based on the self-attention mechanism of the encoders, which helps us identify the word-vectors to be eliminated. Experimental evaluation on the standard GLUE benchmark shows that PoWER-BERT achieves up to a 4.5x reduction in inference time over BERT with < 1% loss in accuracy. We show that, compared to prior inference-time reduction methods, PoWER-BERT offers a better trade-off between accuracy and inference time. Lastly, we demonstrate that our scheme can also be used in conjunction with ALBERT (a highly compressed version of BERT) and can attain up to a 6.8x reduction in inference time with < 1% loss in accuracy (see the toy scoring sketch after this list).
- Q8BERT: Quantized 8Bit BERT - Pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend of large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed if it is optimized for 8-bit integer supporting hardware (see the simpler post-training quantization sketch after this list).
- Small and Practical BERT Models for Sequence Labeling - A practical scheme for training a single multilingual sequence-labeling model that is small and fast enough to run on a single CPU while remaining competitive with a state-of-the-art multilingual baseline.
- TinyBERT: Distilling BERT for Natural Language Understanding - Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to execute them effectively on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method specially designed for transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large teacher BERT can be transferred well to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves more than 96% of the performance of its teacher BERT-Base on the GLUE benchmark while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines for BERT distillation, with only about 28% of their parameters and about 31% of their inference time (see the generic distillation-loss sketch after this list).
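To illustrate the idea behind PoWER-BERT, the toy sketch below scores word-vectors by the attention they receive and keeps only the top-k between encoder layers. It is a rough reading of the scoring step only, with made-up tensor shapes and a hypothetical `keep` budget; it is not the paper's training scheme or published implementation.

```python
# Rough illustration of the PoWER-BERT idea: score word-vectors by the attention
# they receive and drop the least significant ones between encoder layers.
# Toy re-implementation of the scoring/pruning step only; not the published method.
import torch

def significance_scores(attention: torch.Tensor) -> torch.Tensor:
    """attention: (batch, heads, seq_len, seq_len) softmax-normalised weights.

    A token's significance here is the total attention mass it receives from all
    query positions, summed over heads (one plausible reading of the paper's metric).
    """
    return attention.sum(dim=1).sum(dim=1)  # -> (batch, seq_len)

def prune_word_vectors(hidden: torch.Tensor, attention: torch.Tensor, keep: int) -> torch.Tensor:
    """hidden: (batch, seq_len, dim). Keep the `keep` most significant word-vectors."""
    scores = significance_scores(attention)
    topk = scores.topk(keep, dim=-1).indices.sort(dim=-1).values  # preserve token order
    batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
    return hidden[batch_idx, topk]

# Toy usage with random tensors standing in for one encoder layer's outputs.
hidden = torch.randn(2, 128, 768)
attention = torch.softmax(torch.randn(2, 12, 128, 128), dim=-1)
print(prune_word_vectors(hidden, attention, keep=64).shape)  # torch.Size([2, 64, 768])
```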
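Q-BERT and Q8BERT both rely on quantization; Q8BERT specifically uses quantization-aware training during fine-tuning. As a simpler, related illustration (explicitly not the papers' methods), stock PyTorch post-training dynamic quantization can shrink a BERT's linear layers to int8; the checkpoint name and on-disk size check below are illustrative assumptions.

```python
# Simpler, related technique shown for illustration: post-training dynamic quantization
# of a BERT classifier's Linear layers to int8. This is NOT the quantization-aware
# training described in Q8BERT; it is a lighter-weight alternative.
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_on_disk_mb(m: torch.nn.Module) -> float:
    """Serialise the state dict to measure an approximate on-disk footprint."""
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"fp32: {size_on_disk_mb(model):.0f} MB, int8: {size_on_disk_mb(quantized):.0f} MB")
```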
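TinyBERT builds on knowledge distillation from a teacher BERT to a small student. The sketch below shows only the generic Hinton-style soft-label loss to make the teacher/student idea concrete; TinyBERT's actual transformer distillation additionally matches embeddings, hidden states and attention maps layer by layer, which is omitted here, and the temperature and weighting values are arbitrary examples.

```python
# Generic soft-label distillation loss, shown only to make the teacher/student idea
# concrete; not TinyBERT's layer-wise transformer distillation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a softened KL term against the teacher with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy check with random logits for a 3-class task.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```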
Knowledge Distillation
Other Resources
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Deploying BERT in production
- Serving Google BERT in Production using Tensorflow and ZeroMQ
- Pruning BERT to accelerate inference
- Improving Neural Machine Translation with Parent-Scaled Self-Attention
- Reducing Transformer Depth on Demand with Structured Dropout
Keywords
bert (7), language-model (5), pytorch (5), nlp (5), natural-language-processing (4), deep-learning (3), machine-learning (3), tensorflow (3), natural-language-understanding (3), openai (2), transformers (2), google (2), onnx (2), transformer (2), pretrained-models (1), nlp-library (1), model-hub (1), python (1), pytorch-transformers (1), language-models (1), jax (1), flax (1), scikit-learn (1), neural-networks (1), hardware-acceleration (1), ai-framework (1), sentence2vec (1), sentence-encoding (1), neural-search (1), multi-modality (1), image2vec (1), cross-modality (1), cross-modal-retrieval (1), clip-model (1), clip-as-service (1), bert-as-service (1), fastai (1), fast-bert (1), keras (1), xlnet (1), transfer-learning (1), spacy-pipeline (1), spacy-extension (1), spacy (1), pytorch-model (1), huggingface (1), gpt-2 (1), gpt (1), speech-recognition (1), seq2seq (1)