{"id":13407478,"url":"https://github.com/huggingface/text-generation-inference","last_synced_at":"2025-05-13T11:03:47.287Z","repository":{"id":62896430,"uuid":"547806116","full_name":"huggingface/text-generation-inference","owner":"huggingface","description":"Large Language Model Text Generation Inference","archived":false,"fork":false,"pushed_at":"2025-05-05T17:59:04.000Z","size":13775,"stargazers_count":10081,"open_issues_count":269,"forks_count":1189,"subscribers_count":104,"default_branch":"main","last_synced_at":"2025-05-05T20:51:44.222Z","etag":null,"topics":["bloom","deep-learning","falcon","gpt","inference","nlp","pytorch","starcoder","transformer"],"latest_commit_sha":null,"homepage":"http://hf.co/docs/text-generation-inference","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-10-08T10:26:28.000Z","updated_at":"2025-05-05T11:30:33.000Z","dependencies_parsed_at":"2023-10-02T23:24:34.524Z","dependency_job_id":"9b6c52d6-444c-4dcd-9e6f-ba0c053e7422","html_url":"https://github.com/huggingface/text-generation-inference","commit_stats":{"total_commits":1098,"total_committers":126,"mean_commits":8.714285714285714,"dds":0.7313296903460837,"last_synced_commit":"0f346a3296486deb79c63f778b9fc4d9107e4a23"},"previous_names":[],"tags_count":59,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Ftext-generation-inference","tag
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Ftext-generation-inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Ftext-generation-inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Ftext-generation-inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/text-generation-inference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253183146,"owners_count":21867368,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bloom","deep-learning","falcon","gpt","inference","nlp","pytorch","starcoder","transformer"],"created_at":"2024-07-30T20:00:40.978Z","updated_at":"2025-05-13T11:03:47.107Z","avatar_url":"https://github.com/huggingface.png","language":"Python","funding_links":[],"categories":["Toolkits","Serving","Python","🎯 Tool Categories","Tools for deploying LLM","Software","INFERENCING FRAMEWORKS","LLM","A01_文本生成_文本对话","\u003cimg src=\"./assets/cpu.svg\" width=\"16\" height=\"16\" style=\"vertical-align: middle;\"\u003e Backends","Deployment and Serving","Projects","Writing \u0026 Editing","Inference \u0026 Deployment","Inference Runtimes \u0026 Backends","deep-learning","NLP","⚡ LLM Inference \u0026 Hosting","Popular Libraries","🔓 Open Source Inference Engines","pytorch","Models and Tools","Repos","Model Serving Frameworks","Inference","🚀 Model Serving \u0026 
Deployment","Inference Engine","🛠️ AI 工具与框架","Open-Source Local LLM Projects","Tools","Deployment","Serving \u0026 Inference","2. **Production Tools**","8. Inference Engines","Language Models for NLP","🖥 Local Deployment Tools","🛠️ Developer Infrastructure","Tools for Deployment","LLM Serving / Inference","Local Inference and Serving","📦 Legacy \u0026 Inactive Projects","Model Inference"],"sub_categories":["Others","🤖 LLMOps \u0026 GenAI (2024-2025)","TGI (text-generation-inference)","Large Model Serving","大语言对话模型及数据","🤯 LLMs Inference and Serving","High-Performance Inference","3. Pretraining","Embedding Models","LLM Deployment","LangManus","Inference Engine","LLM 推理与部署","Inference","Server / Production","Efficient and Small Language Models","Server Deployment \u0026 High-Performance Inference","Model Deployment \u0026 Local Inference","Serve at scale"],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://www.youtube.com/watch?v=jlMAX2Oaht0\"\u003e\n  \u003cimg width=560 alt=\"Making TGI deployment optimal\" src=\"https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png\"\u003e\n\u003c/a\u003e\n\n# Text Generation Inference\n\n\u003ca href=\"https://github.com/huggingface/text-generation-inference\"\u003e\n  \u003cimg alt=\"GitHub Repo stars\" src=\"https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://huggingface.github.io/text-generation-inference\"\u003e\n  \u003cimg alt=\"Swagger API documentation\" src=\"https://img.shields.io/badge/API-Swagger-informational\"\u003e\n\u003c/a\u003e\n\nA Rust, Python and gRPC server for text generation inference. 
Used in production at [Hugging Face](https://huggingface.co)\nto power Hugging Chat, the Inference API and Inference Endpoints.\n\n\u003c/div\u003e\n\n## Table of contents\n\n  - [Get Started](#get-started)\n    - [Docker](#docker)\n    - [API documentation](#api-documentation)\n    - [Using a private or gated model](#using-a-private-or-gated-model)\n    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)\n    - [Distributed Tracing](#distributed-tracing)\n    - [Architecture](#architecture)\n    - [Local install](#local-install)\n    - [Local install (Nix)](#local-install-nix)\n  - [Optimized architectures](#optimized-architectures)\n  - [Run locally](#run-locally)\n    - [Run](#run)\n    - [Quantization](#quantization)\n  - [Develop](#develop)\n  - [Testing](#testing)\n\nText Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). 
TGI implements many features, such as:\n\n- Simple launcher to serve most popular LLMs\n- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)\n- Tensor Parallelism for faster inference on multiple GPUs\n- Token streaming using Server-Sent Events (SSE)\n- Continuous batching of incoming requests for increased total throughput\n- [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) compatible with Open AI Chat Completion API\n- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures\n- Quantization with :\n  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)\n  - [GPT-Q](https://arxiv.org/abs/2210.17323)\n  - [EETQ](https://github.com/NetEase-FuXi/EETQ)\n  - [AWQ](https://github.com/casper-hansen/AutoAWQ)\n  - [Marlin](https://github.com/IST-DASLab/marlin)\n  - [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/)\n- [Safetensors](https://github.com/huggingface/safetensors) weight loading\n- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)\n- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))\n- Stop sequences\n- Log probabilities\n- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) ~2x latency\n- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). 
Specify output format to speed up inference and make sure the output is valid according to a given specification.\n- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output\n- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance\n\n### Hardware support\n\n- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)\n- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (-rocm)\n- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)\n- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)\n- [Gaudi](https://github.com/huggingface/tgi-gaudi)\n- [Google TPU](https://huggingface.co/docs/optimum-tpu/howto/serving)\n\n\n## Get Started\n\n### Docker\n\nFor a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). 
The easiest way of getting started is using the official Docker container:\n\n```shell\nmodel=HuggingFaceH4/zephyr-7b-beta\n# share a volume with the Docker container to avoid downloading weights every run\nvolume=$PWD/data\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\n    ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id $model\n```\n\nAnd then you can make requests like\n\n```bash\ncurl 127.0.0.1:8080/generate_stream \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\",\"parameters\":{\"max_new_tokens\":20}}' \\\n    -H 'Content-Type: application/json'\n```\n\nYou can also use [TGI's Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) to obtain Open AI Chat Completion API compatible responses.\n\n```bash\ncurl localhost:8080/v1/chat/completions \\\n    -X POST \\\n    -d '{\n  \"model\": \"tgi\",\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are a helpful assistant.\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"What is deep learning?\"\n    }\n  ],\n  \"stream\": true,\n  \"max_tokens\": 20\n}' \\\n    -H 'Content-Type: application/json'\n```\n\n**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.\n\n**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). 
To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.2.3-rocm --model-id $model` instead of the command above.\n\nTo see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):\n```\ntext-generation-launcher --help\n```\n\n### API documentation\n\nYou can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.\nThe Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).\n\n### Using a private or gated model\n\nYou have the option to utilize the `HF_TOKEN` environment variable for configuring the token employed by\n`text-generation-inference`. This allows you to gain access to protected resources.\n\nFor example, if you want to serve the gated Llama V2 model variants:\n\n1. Go to https://huggingface.co/settings/tokens\n2. Copy your CLI READ token\n3. Export `HF_TOKEN=\u003cyour CLI READ token\u003e`\n\nor with Docker:\n\n```shell\nmodel=meta-llama/Meta-Llama-3.1-8B-Instruct\nvolume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\ntoken=\u003cyour cli READ token\u003e\n\ndocker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \\\n    ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id $model\n```\n\n### A note on Shared Memory (shm)\n\n[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by\n`PyTorch` to do distributed training/inference. 
`text-generation-inference` makes\nuse of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.\n\nIn order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if\npeer-to-peer using NVLink or PCI is not possible.\n\nTo allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` on the above command.\n\nIf you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by\ncreating a volume with:\n\n```yaml\n- name: shm\n  emptyDir:\n   medium: Memory\n   sizeLimit: 1Gi\n```\n\nand mounting it to `/dev/shm`.\n\nFinally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that\nthis will impact performance.\n\n### Distributed Tracing\n\n`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature\nby setting the address to an OTLP collector with the `--otlp-endpoint` argument. The default service name can be\noverridden with the `--otlp-service-name` argument.\n\n### Architecture\n\n![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)\n\nDetailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)\n\n### Local install\n\nYou can also opt to install `text-generation-inference` locally.\n\nFirst clone the repository and change directory into it:\n\n```shell\ngit clone https://github.com/huggingface/text-generation-inference\ncd text-generation-inference\n```\n\nThen [install Rust](https://rustup.rs/) and create a Python virtual environment with at least\nPython 3.9, e.g. 
using `conda` or `python venv`:\n\n```shell\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\n#using conda\nconda create -n text-generation-inference python=3.11\nconda activate text-generation-inference\n\n#using python venv\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\nYou may also need to install Protoc.\n\nOn Linux:\n\n```shell\nPROTOC_ZIP=protoc-21.12-linux-x86_64.zip\ncurl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP\nsudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc\nsudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'\nrm -f $PROTOC_ZIP\n```\n\nOn MacOS, using Homebrew:\n\n```shell\nbrew install protobuf\n```\n\nThen run:\n\n```shell\nBUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels\ntext-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2\n```\n\n**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:\n\n```shell\nsudo apt-get install libssl-dev gcc -y\n```\n\n### Local install (Nix)\n\nAnother option is to install `text-generation-inference` locally using [Nix](https://nixos.org). Currently,\nwe only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can\nbe pulled from a binary cache, removing the need to build them locally.\n\nFirst follow the instructions to [install Cachix and enable the TGI cache](https://app.cachix.org/cache/text-generation-inference).\nSetting up the cache is important, otherwise Nix will build many of the dependencies\nlocally, which can take hours.\n\nAfter that you can run TGI with `nix run`:\n\n```shell\ncd text-generation-inference\nnix run --extra-experimental-features nix-command --extra-experimental-features flakes . 
-- --model-id meta-llama/Llama-3.1-8B-Instruct\n```\n\n**Note:** when you are using Nix on a non-NixOS system, you have to [make some symlinks](https://danieldk.eu/Nix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library)\nto make the CUDA driver libraries visible to Nix packages.\n\nFor TGI development, you can use the `impure` dev shell:\n\n```shell\nnix develop .#impure\n\n# Only needed the first time the devshell is started or after updating the protobuf.\n(\ncd server\nmkdir text_generation_server/pb || true\npython -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \\\n       --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto\nfind text_generation_server/pb/ -type f -name \"*.py\" -print0 -exec sed -i -e 's/^\\(import.*pb2\\)/from . \\1/g' {} \\;\ntouch text_generation_server/pb/__init__.py\n)\n```\n\nAll development dependencies (cargo, Python, Torch), etc. are available in this\ndev shell.\n\n## Optimized architectures\n\nTGI works out of the box to serve optimized models for all modern models. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).\n\nOther architectures are supported on a best-effort basis using:\n\n`AutoModelForCausalLM.from_pretrained(\u003cmodel\u003e, device_map=\"auto\")`\n\nor\n\n`AutoModelForSeq2SeqLM.from_pretrained(\u003cmodel\u003e, device_map=\"auto\")`\n\n\n\n## Run locally\n\n### Run\n\n```shell\ntext-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2\n```\n\n### Quantization\n\nYou can also run pre-quantized weights (AWQ, GPTQ, Marlin) or on-the-fly quantize weights with bitsandbytes, EETQ, fp8, to reduce the VRAM requirement:\n\n```shell\ntext-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize\n```\n\n4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). 
It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.\n\nRead more about quantization in the [Quantization documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization).\n\n## Develop\n\n```shell\nmake server-dev\nmake router-dev\n```\n\n## Testing\n\n```shell\n# python\nmake python-server-tests\nmake python-client-tests\n# or both server and client tests\nmake python-tests\n# rust cargo tests\nmake rust-tests\n# integration tests\nmake integration-tests\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Ftext-generation-inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Ftext-generation-inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Ftext-generation-inference/lists"}