Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/huggingface/text-generation-inference
Large Language Model Text Generation Inference
https://github.com/huggingface/text-generation-inference
bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer
Last synced: 2 months ago
JSON representation
Large Language Model Text Generation Inference
- Host: GitHub
- URL: https://github.com/huggingface/text-generation-inference
- Owner: huggingface
- License: other
- Created: 2022-10-08T10:26:28.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-13T21:52:09.000Z (4 months ago)
- Last Synced: 2024-02-13T23:00:39.980Z (4 months ago)
- Topics: bloom, deep-learning, falcon, gpt, inference, nlp, pytorch, starcoder, transformer
- Language: Python
- Homepage: http://hf.co/docs/text-generation-inference
- Size: 4.54 MB
- Stars: 6,938
- Watchers: 91
- Forks: 756
- Open Issues: 212
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-production-machine-learning - text-generation-inference - generation-inference.svg?style=social) - Large Language Model Text Generation Inference under TFOIL license. (Industry Strength NLP)
- Awesome-LLM - Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co/) to power LLMs api-inference widgets, HFOIL Licence. (Deploying Tools / RLHF LLM)
- Awesome-LLM?tab=readme-ov-file - Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co/) to power LLMs api-inference widgets, HFOIL Licence. (Deploying Tools / RLHF LLM)
- awesome-ml-python-packages - Text Generation Inference
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- awesome-llm-list - Hugging Face's Text Generation Inference
- awesome-llama2-resources - text-generation-inference
- awesome-llmops - text-generation-inference - generation-inference.svg?style=flat-square) | (Serving / Large Model Serving)
- awesome-huggingface - TGI - source large language models. It is used as backend for [Chat UI](https://github.com/huggingface/chat-ui), Hugging Face's open alternative to OpenAI's ChatGPT website. (Software / TGI (text-generation-inference))
- Awesome_Multimodel_LLM - Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co/) to power LLMs api-inference widgets. (Tools for deploying LLM)
- my-awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- awesome-LLM-toolkit - huggingface/text-generation-inference
- my-awesome - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- awesome-genai - HF Text Generation Interface
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
- AiTreasureBox - huggingface/text-generation-inference - 06-02_8137_1](https://img.shields.io/github/stars/huggingface/text-generation-inference.svg)|Large Language Model Text Generation Inference| (Repos)
- awesome-local-ai - text-generation-inference - Inference serving toolbox with optimized kernels for each LLM architecture | Safetensors / AWQ / GPTQ | Both | ❌ | Python/Rust | Text-Gen | (Inference Engine)
- awesome-local-llms - text-generation-inference
- Awesome-LLM-Productization - text-generation-inference - A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoint (Models and Tools / LLM Deployment)
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (deep-learning)
- awesome-stars - huggingface/text-generation-inference - Large Language Model Text Generation Inference (pytorch)
README
# Text Generation Inference
A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoint.## Table of contents
- [Get Started](#get-started)
- [API Documentation](#api-documentation)
- [Using a private or gated model](#using-a-private-or-gated-model)
- [A note on Shared Memory](#a-note-on-shared-memory-shm)
- [Distributed Tracing](#distributed-tracing)
- [Local Install](#local-install)
- [CUDA Kernels](#cuda-kernels)
- [Optimized architectures](#optimized-architectures)
- [Run Mistral](#run-a-model)
- [Run](#run)
- [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with :
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [GPT-Q](https://arxiv.org/abs/2210.17323)
- [EETQ](https://github.com/NetEase-FuXi/EETQ)
- [AWQ](https://github.com/casper-hansen/AutoAWQ)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) ~2x latency
- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). Specify output format to speed up inference and make sure the output is valid according to some specs..
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance### Hardware support
- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (-rocm)
- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
- [Gaudi](https://github.com/huggingface/tgi-gaudi)## Get Started
### Docker
For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:
```shell
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every rundocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```And then you can make requests like
```bash
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.
**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id $model` instead of the command above.
To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
```
text-generation-launcher --help
```### API documentation
You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).### Using a private or gated model
You have the option to utilize the `HUGGING_FACE_HUB_TOKEN` environment variable for configuring the token employed by
`text-generation-inference`. This allows you to gain access to protected resources.For example, if you want to serve the gated Llama V2 model variants:
1. Go to https://huggingface.co/settings/tokens
2. Copy your cli READ token
3. Export `HUGGING_FACE_HUB_TOKEN=`or with Docker:
```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```### A note on Shared Memory (shm)
[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` make
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.In order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` on the above command.
If you are running `text-generation-inference` inside `Kubernetes`. You can also add Shared Memory to the container by
creating a volume with:```yaml
- name: shm
emptyDir:
medium: Memory
sizeLimit: 1Gi
```and mounting it to `/dev/shm`.
Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.### Distributed Tracing
`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.### Architecture
![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)
### Local install
You can also opt to install `text-generation-inference` locally.
First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | shconda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```You may also need to install Protoc.
On Linux:
```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```On MacOS, using Homebrew:
```shell
brew install protobuf
```Then run:
```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:
```shell
sudo apt-get install libssl-dev gcc -y
```## Optimized architectures
TGI works out of the box to serve optimized models for all modern models. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).
Other architectures are supported on a best-effort basis using:
`AutoModelForCausalLM.from_pretrained(, device_map="auto")`
or
`AutoModelForSeq2SeqLM.from_pretrained(, device_map="auto")`
## Run locally
### Run
```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```### Quantization
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
```4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
## Develop
```shell
make server-dev
make router-dev
```## Testing
```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```