{"id":14964630,"url":"https://github.com/hpcaitech/swiftinfer","last_synced_at":"2025-04-05T12:07:44.600Z","repository":{"id":216022844,"uuid":"740111119","full_name":"hpcaitech/SwiftInfer","owner":"hpcaitech","description":"Efficient AI Inference \u0026 Serving","archived":false,"fork":false,"pushed_at":"2024-01-08T09:18:42.000Z","size":520,"stargazers_count":470,"open_issues_count":3,"forks_count":28,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-05T12:07:39.963Z","etag":null,"topics":["artificial-intelligence","deep-learning","gpt","inference","llama","llama2","llm-inference","llm-serving"],"latest_commit_sha":null,"homepage":"https://hpc-ai.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hpcaitech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-07T15:03:40.000Z","updated_at":"2025-04-03T05:56:47.000Z","dependencies_parsed_at":"2024-01-08T03:01:59.284Z","dependency_job_id":"5ee60d99-e7a7-4606-8dcb-c423268d9185","html_url":"https://github.com/hpcaitech/SwiftInfer","commit_stats":{"total_commits":4,"total_committers":2,"mean_commits":2.0,"dds":0.25,"last_synced_commit":"239fd3a80a4a35e52cb8a99f959a451229782532"},"previous_names":["hpcaitech/swiftinfer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FSwiftInfer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FSwiftInfer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/hpcaitech%2FSwiftInfer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FSwiftInfer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hpcaitech","download_url":"https://codeload.github.com/hpcaitech/SwiftInfer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332609,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","gpt","inference","llama","llama2","llm-inference","llm-serving"],"created_at":"2024-09-24T13:33:32.553Z","updated_at":"2025-04-05T12:07:44.572Z","avatar_url":"https://github.com/hpcaitech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 SwiftInfer\n\n## 🔗 Table of Contents\n\n- [🚀 SwiftInfer](#-swiftinfer)\n  - [🔗 Table of Contents](#-table-of-contents)\n  - [📌 Overview](#-overview)\n  - [🚗 Quick Start](#-quick-start)\n    - [🛠 Installation](#-installation)\n    - [🕹 Run Llama example](#-run-llama-example)\n  - [⚖️ Benchmark](#-benchmark)\n  - [🗺 Roadmap](#-roadmap)\n  - [📃 Acknowledgement](#-acknowledgement)\n  - [📝 Citation](#-citation)\n\n## 📌 Overview\n\n[**Streaming-LLM**](https://github.com/mit-han-lab/streaming-llm) is a technique to support infinite input length for LLM inference. It leverages [**Attention Sink**](https://arxiv.org/abs/2309.17453) to prevent model collapse when the attention window shifts. 
The original work is implemented in PyTorch; we offer **SwiftInfer**, a TensorRT implementation that makes StreamingLLM more production-grade. Our implementation is built upon the recently released [**TensorRT-LLM**](https://github.com/NVIDIA/TensorRT-LLM) project.\n\n## 🚗 Quick Start\n\n### 🛠 Installation\n\nWe use the API in [**TensorRT-LLM**](https://github.com/NVIDIA/TensorRT-LLM) to construct the model and run inference. As the TensorRT-LLM API is not yet stable and changes rapidly, we pin our implementation to the `42af740db51d6f11442fd5509ef745a4c043ce51` commit, whose version is `v0.6.0`. We may upgrade this repository as TensorRT-LLM's APIs become more stable.\n\nIf you have already built **TensorRT-LLM V0.6.0**, simply run:\n\n```bash\ngit clone https://github.com/hpcaitech/SwiftInfer.git\ncd SwiftInfer\npip install .\n```\n\nOtherwise, you should install TensorRT-LLM first.\n\n#### Install TensorRT-LLM with Docker\n\nIf you use Docker, you can follow [TensorRT-LLM Installation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md) to install **TensorRT-LLM V0.6.0**.\n\nWith Docker, you can then install SwiftInfer by simply running:\n\n```bash\ngit clone https://github.com/hpcaitech/SwiftInfer.git\ncd SwiftInfer\npip install .\n```\n\n#### Install TensorRT-LLM without Docker\n\nIf you are not using Docker, we provide a script to install TensorRT-LLM automatically.\n\n**Prerequisites**\n\nPlease ensure that you have installed the following packages:\n\n- python\n- build essentials, including gcc/g++, make, cmake\n- CUDA toolkit\n- cuDNN\n- NCCL\n- TensorRT\n- PyTorch\n\nMake sure that the TensorRT version is \u003e= 9.1.0 and the CUDA toolkit version is \u003e= 12.2.\n\nTo install TensorRT:\n\n```bash\nARCH=$(uname -m)\nif [ \"$ARCH\" = \"arm64\" ];then ARCH=\"aarch64\";fi\nif [ \"$ARCH\" = \"amd64\" ];then ARCH=\"x86_64\";fi\nif [ \"$ARCH\" = \"aarch64\" ];then OS=\"ubuntu-22.04\"; else OS=\"linux\";fi\nwget 
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.1.0/tars/tensorrt-9.1.0.4.$OS.$ARCH-gnu.cuda-12.2.tar.gz\ntar xzvf tensorrt-9.1.0.4.$OS.$ARCH-gnu.cuda-12.2.tar.gz\nPY_VERSION=$(python -c 'import sys; print(\".\".join(map(str, sys.version_info[0:2])))')\nPARSED_PY_VERSION=$(echo \"${PY_VERSION//./}\")\npip install TensorRT-9.1.0.4/python/tensorrt-*-cp${PARSED_PY_VERSION}-*.whl\nexport TRT_ROOT=$(realpath TensorRT-9.1.0.4)\n```\n\nTo download NCCL, follow the [NCCL download page](https://developer.nvidia.com/nccl/nccl-download).\n\nTo download cuDNN, follow the [cuDNN download page](https://developer.nvidia.com/rdp/cudnn-download).\n\n**Commands**\n\nBefore running the following commands, please ensure that `nvcc` is set up correctly. To check it, run:\n\n```bash\nnvcc --version\n```\n\nTo install TensorRT-LLM and SwiftInfer, run:\n\n```bash\ngit clone https://github.com/hpcaitech/SwiftInfer.git\ncd SwiftInfer\nTRT_ROOT=xxx NCCL_ROOT=xxx CUDNN_ROOT=xxx pip install .\n```\n\n### 🕹 Run Llama example\n\nTo run the Llama example, you first need to clone the Hugging Face repository for the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model or other Llama-based variants such as [lmsys/vicuna-7b-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3). Then, you can run the following command to build the TensorRT engine. 
**You need to replace `\u003cmodel-dir\u003e` with the actual path to the Llama model.**\n\n```bash\ncd examples/llama\n\npython build.py \\\n--model_dir \u003cmodel-dir\u003e \\\n--dtype float16 \\\n--enable_context_fmha \\\n--use_gemm_plugin float16 \\\n--max_input_len 2048 \\\n--max_output_len 1024 \\\n--output_dir ./output/7B-streaming-8k-1k-4-2000/trt_engines/fp16/1-gpu/ \\\n--max_batch_size 1\n```\n\nNext, you need to download the [MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md#mt-bench) data provided by [LMSYS-FastChat](https://github.com/lm-sys/FastChat).\n\n```bash\nmkdir mt_bench_data\nwget -P ./mt_bench_data https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl\n```\n\nFinally, you are ready to run the Llama example with the following command.\n\n❗️❗️❗️ **Before that, please note that:**\n1. The `only_n_first` argument is used to control the number of samples to be evaluated. If you want to evaluate all samples, please remove this argument.\n\n```bash\npython ../run_conversation.py \\\n--max_input_length 2048 \\\n--max_output_len 1024 \\\n--tokenizer_dir \u003cmodel-dir\u003e \\\n--engine_dir ./output/7B-streaming-8k-1k-4-2000/trt_engines/fp16/1-gpu/ \\\n--input_file ./mt_bench_data/question.jsonl \\\n--streaming_llm_start_size 4 \\\n--only_n_first 5\n```\n\nYou should expect to see the generation output as follows:\n\n![generation output](./assets/inference-result.png)\n\n## ⚖️ Benchmark\n\nWe have benchmarked our implementation of Streaming-LLM against the [original PyTorch version](https://github.com/mit-han-lab/streaming-llm). The benchmark command for our implementation is given in the [Run Llama Example](#-run-llama-example) section, while that for the original PyTorch implementation is given in the [torch_streamingllm](./examples/torch_streamingllm/) folder. 
The hardware used is listed below:\n\n- GPU: NVIDIA H800 (80GB)\n- CPU: Intel(R) Xeon(R) Platinum 8468\n- RAM: 2TB\n\nThe results (20 rounds of conversations) are:\n\n![performance](./assets/performance.jpg)\n\nWe are still working on further performance improvements and on adapting to the TensorRT-LLM V0.7.1 APIs. We also note that TensorRT-LLM has integrated StreamingLLM into its [example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm), but it appears to be better suited to single text generation than to multi-round conversations.\n\n## 🗺 Roadmap\n\n- [x] Streaming-LLM attention implementation based on TRT-LLM APIs\n- [x] KV cache adaptation\n- [x] Early stop adaptation\n- [x] Contiguous tensor fix\n- [x] Llama example for multi-round conversation\n\n## 📃 Acknowledgement\n\nThis work is inspired by Streaming-LLM and aims to make it usable in production. Throughout development, we referenced the following materials, and we wish to acknowledge their authors' efforts and contributions to the open-source community and academia.\n\n- Streaming-LLM\n    - [Paper](https://arxiv.org/abs/2309.17453)\n    - [Slides](https://github.com/mit-han-lab/streaming-llm/blob/main/assets/StreamingLLM.pdf)\n    - [GitHub Repository](https://github.com/mit-han-lab/streaming-llm)\n- TensorRT-LLM\n    - [Documentation](https://nvidia.github.io/TensorRT-LLM/)\n    - [GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)\n\n\n## 📝 Citation\n\nIf you find StreamingLLM and our TensorRT implementation useful, please kindly cite our repository and the original work proposed by [Xiao et al.](https://github.com/Guangxuan-Xiao) from [MIT Han Lab](https://github.com/mit-han-lab).\n\n```bibtex\n# our repository\n# NOTE: the listed authors have equal contribution\n@misc{streamingllmtrt2023,\n  title = {SwiftInfer},\n  year = {2023},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/hpcaitech/SwiftInfer}},\n}\n\n# 
Xiao's original paper\n@article{xiao2023streamingllm,\n        title={Efficient Streaming Language Models with Attention Sinks},\n        author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},\n        journal={arXiv},\n        year={2023}\n        }\n\n# TensorRT-LLM repo\n# as TensorRT-LLM team does not provide a bibtex\n# please let us know if there is any change needed\n@misc{trtllm2023,\n  title = {TensorRT-LLM},\n  year = {2023},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/NVIDIA/TensorRT-LLM}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Fswiftinfer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhpcaitech%2Fswiftinfer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Fswiftinfer/lists"}