https://github.com/huggingface/text-generation-inference

Large Language Model Text Generation Inference
https://github.com/huggingface/text-generation-inference

bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer

Last synced: about 1 year ago
JSON representation

Large Language Model Text Generation Inference

Host: GitHub
URL: https://github.com/huggingface/text-generation-inference
Owner: huggingface
License: apache-2.0
Created: 2022-10-08T10:26:28.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2025-05-05T17:59:04.000Z (about 1 year ago)
Last Synced: 2025-05-05T20:51:44.222Z (about 1 year ago)
Topics: bloom, deep-learning, falcon, gpt, inference, nlp, pytorch, starcoder, transformer
Language: Python
Homepage: http://hf.co/docs/text-generation-inference
Size: 13.1 MB
Stars: 10,081
Watchers: 104
Forks: 1,189
Open Issues: 269
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

awesome-llama-resources - text-generation-inference
awesome-mlops - Text Generation Inference
Awesome_Multimodel_LLM - Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co/) to power LLMs api-inference widgets. (Tools for deploying LLM)
awesome-llm-list - Hugging Face's Text Generation Inference
awesome-ml-python-packages - Text Generation Inference
awesome-llmops - text-generation-inference - generation-inference.svg?style=flat-square) | (Serving / Large Model Serving)
StarryDivineSky - huggingface/text-generation-inference
awesome-llm-services - Text Generation Inference
awesome-production-machine-learning - text-generation-inference - generation-inference.svg?style=social) - Large Language Model Text Generation Inference. (Deployment and Serving)
awesome-llm-projects - Text Generation Inference
awesome-open-weight-models - TGI
awesome-open-source-ai-tools - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Writing & Editing)
awesome-mistral - Text Generation Inference
awesome-private-ai - text-generation-inference - Optimized serving stack from Hugging Face. (Inference Runtimes & Backends)
awesome-ai-papers - [text-generation-inference - embeddings-inference](https://github.com/huggingface/text-embeddings-inference)\]\[[quantization](https://huggingface.co/docs/transformers/main/en/quantization)\]\[[optimum-quanto](https://github.com/huggingface/optimum-quanto)\]\[[optimum](https://github.com/huggingface/optimum)\]\[[huggingface-inference-toolkit](https://github.com/huggingface/huggingface-inference-toolkit)\]\[[torchao](https://github.com/pytorch/ao)\] (NLP / 3. Pretraining)
awesome-ai - Text Generation Inference (TGI)
awesome-hugging-face - text-generation-inference
Awesome-LLM-Productization - text-generation-inference - Large Language Model Text Generation Inference (Models and Tools / LLM Deployment)
AiTreasureBox - huggingface/text-generation-inference - 11-03_10616_0](https://img.shields.io/github/stars/huggingface/text-generation-inference.svg)|Large Language Model Text Generation Inference| (Repos)
awesome-manus - Text Generation Inference - Toolkit for serving LLMs (Model Serving Frameworks / LangManus)
Awesome-LLMOps - Text Generation Inference - generation-inference.svg?style=flat&color=green) ![Contributors](https://img.shields.io/github/contributors/huggingface/text-generation-inference?color=green) ![LastCommit](https://img.shields.io/github/last-commit/huggingface/text-generation-inference?color=green) (Inference / Inference Engine)
Awesome-LLM-VLM-Foundation-Models - Text Generation Inference (TGI)
awesome-local-ai - text-generation-inference - Inference serving toolbox with optimized kernels for each LLM architecture | Safetensors / AWQ / GPTQ | Both | ❌ | Python/Rust | Text-Gen | (Inference Engine)
awesome-ai - 🔗
awesome - huggingface/text-generation-inference - Large Language Model Text Generation Inference (Python)
awesome-local-llms - text-generation-inference
awesome-llm-resources - Text-Generation-Inference(TGI)
awesome-llmops - TGI (Text Generation Inference) - performance inference server by Hugging Face. (Serving & Inference)
awesome-llm-prod - text-generation-inference - Serving, Deployment|A Rust, Python and gRPC server for text generation inference with optimized performance and production features| (2. **Production Tools**)
awesome-ai-engineering - text-generation-inference from HuggingFace - "A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint". (Tools / Inference)
awesome-github-projects - text-generation-inference - Large Language Model Text Generation Inference ⭐10,882 `Python` (📦 Legacy & Inactive Projects)
awesome-llm-tools - TGI (Text Generation Inference) - grade Rust/Python server | Flash Attention, tensor parallelism | NVIDIA, AMD, Gaudi | (8. Inference Engines / Server / Production)
awesome-nlp - Text Generation Inference (TGI) - HF production serving for LMs. (Language Models for NLP / Efficient and Small Language Models)
awesome-llm-tools - Text Generation Inference (TGI)
awesome-ai-tools - TGI (Text Generation Inference) - grade inference server. `#opensource` (🛠️ Developer Infrastructure / Model Deployment & Local Inference)
awesome-genai - HF Text Generation Interface
awesome-production-llm - text-generation-inference - generation-inference.svg?style=social) Large Language Model Text Generation Inference (LLM Serving / Inference)
awesome-open-source-llms - Text Generation Inference - Text Generation Inference (TGI) is Hugging Face's production-ready inference framework for deploying large language models at scale with optimized throughput. Features include tensor parallelism, continuous batching, quantization support, and built-in OpenAI-compatible API server. It's the go-to deployment solution for enterprises serving open-source LLMs, offering GPU utilization comparable to proprietary serving systems and native integration with Hugging Face models. ([Read more](/details/text-generation-inference.md)) `Open Source` `Production` `Hugging Face` (Deployment Platforms)
awesome-vision-ai-stack - TGI - Hugging Face serving stack. (Local Inference and Serving / Serve at scale)
awesome-rust-ai-libraries - Text Generation Inference - High-performance text generation server written in Rust by Hugging Face, designed for hosting transformer models with gRPC and HTTP APIs. The production inference backend powering Hugging Face Inference API. ([Read more](/details/text-generation-inference.md)) `Serving` `Transformers` `Production` (Model Inference)

README

# Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

## Table of contents

- [Get Started](#get-started)
- [Docker](#docker)
- [API documentation](#api-documentation)
- [Using a private or gated model](#using-a-private-or-gated-model)
- [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
- [Distributed Tracing](#distributed-tracing)
- [Architecture](#architecture)
- [Local install](#local-install)
- [Local install (Nix)](#local-install-nix)
- [Optimized architectures](#optimized-architectures)
- [Run locally](#run-locally)
- [Run](#run)
- [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) compatible with Open AI Chat Completion API
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with :
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [GPT-Q](https://arxiv.org/abs/2210.17323)
- [EETQ](https://github.com/NetEase-FuXi/EETQ)
- [AWQ](https://github.com/casper-hansen/AutoAWQ)
- [Marlin](https://github.com/IST-DASLab/marlin)
- [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) ~2x latency
- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). Specify output format to speed up inference and make sure the output is valid according to some specs..
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

### Hardware support

- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (-rocm)
- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
- [Gaudi](https://github.com/huggingface/tgi-gaudi)
- [Google TPU](https://huggingface.co/docs/optimum-tpu/howto/serving)

## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id $model
```

And then you can make requests like

```bash
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```

You can also use [TGI's Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) to obtain Open AI Chat Completion API compatible responses.

```bash
curl localhost:8080/v1/chat/completions \
-X POST \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 20
}' \
-H 'Content-Type: application/json'
```

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.

**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.2.3-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
```
text-generation-launcher --help
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

You have the option to utilize the `HF_TOKEN` environment variable for configuring the token employed by
`text-generation-inference`. This allows you to gain access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HF_TOKEN=`

or with Docker:

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id $model
```

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` on the above command.

If you are running `text-generation-inference` inside `Kubernetes`. You can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
emptyDir:
medium: Memory
sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument. The default service name can be
overridden with the `--otlp-service-name` argument

### Architecture

![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)

Detailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)

### Local install

You can also opt to install `text-generation-inference` locally.

First clone the repository and change directory into it:

```shell
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
```

Then [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda` or `python venv`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

#using conda
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

#using python venv
python3 -m venv .venv
source .venv/bin/activate
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### Local install (Nix)

Another option is to install `text-generation-inference` locally using [Nix](https://nixos.org). Currently,
we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can
be pulled from a binary cache, removing the need to build them locally.

First follow the instructions to [install Cachix and enable the TGI cache](https://app.cachix.org/cache/text-generation-inference).
Setting up the cache is important, otherwise Nix will build many of the dependencies
locally, which can take hours.

After that you can run TGI with `nix run`:

```shell
cd text-generation-inference
nix run --extra-experimental-features nix-command --extra-experimental-features flakes . -- --model-id meta-llama/Llama-3.1-8B-Instruct
```

**Note:** when you are using Nix on a non-NixOS system, you have to [make some symlinks](https://danieldk.eu/Nix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library)
to make the CUDA driver libraries visible to Nix packages.

For TGI development, you can use the `impure` dev shell:

```shell
nix develop .#impure

# Only needed the first time the devshell is started or after updating the protobuf.
(
cd server
mkdir text_generation_server/pb || true
python -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \
--grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto
find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^$import.*pb2$/from . \1/g' {} \;
touch text_generation_server/pb/__init__.py
)
```

All development dependencies (cargo, Python, Torch), etc. are available in this
dev shell.

## Optimized architectures

TGI works out of the box to serve optimized models for all modern models. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(, device_map="auto")`

`AutoModelForSeq2SeqLM.from_pretrained(, device_map="auto")`

## Run locally

### Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

### Quantization

You can also run pre-quantized weights (AWQ, GPTQ, Marlin) or on-the-fly quantize weights with bitsandbytes, EETQ, fp8, to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
```

4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Read more about quantization in the [Quantization documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization).

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/huggingface/text-generation-inference

Awesome Lists containing this project

README