{"id":13442092,"url":"https://github.com/InternLM/lmdeploy","last_synced_at":"2025-03-20T13:32:16.012Z","repository":{"id":179063850,"uuid":"654122609","full_name":"InternLM/lmdeploy","owner":"InternLM","description":"LMDeploy is a toolkit for compressing, deploying, and serving LLMs.","archived":false,"fork":false,"pushed_at":"2025-03-17T10:38:39.000Z","size":7350,"stargazers_count":5863,"open_issues_count":413,"forks_count":508,"subscribers_count":48,"default_branch":"main","last_synced_at":"2025-03-17T22:42:00.217Z","etag":null,"topics":["codellama","cuda-kernels","deepspeed","fastertransformer","internlm","llama","llama2","llama3","llm","llm-inference","turbomind"],"latest_commit_sha":null,"homepage":"https://lmdeploy.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InternLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-15T12:38:06.000Z","updated_at":"2025-03-17T16:44:10.000Z","dependencies_parsed_at":"2023-12-25T10:29:13.999Z","dependency_job_id":"12baaea0-5050-4b66-9923-78b57366ef4b","html_url":"https://github.com/InternLM/lmdeploy","commit_stats":{"total_commits":1045,"total_committers":86,"mean_commits":"12.151162790697674","dds":0.8086124401913876,"last_synced_commit":"1efed796eeb2555e5194b7a99356100aaeac980e"},"previous_names":["internlm/lmdeploy","open-mmlab/llmdeploy"],"tags_count":46,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2Flmdeploy","tags_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2Flmdeploy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2Flmdeploy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2Flmdeploy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InternLM","download_url":"https://codeload.github.com/InternLM/lmdeploy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244619153,"owners_count":20482369,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["codellama","cuda-kernels","deepspeed","fastertransformer","internlm","llama","llama2","llama3","llm","llm-inference","turbomind"],"created_at":"2024-07-31T03:01:41.600Z","updated_at":"2025-03-20T13:32:16.003Z","avatar_url":"https://github.com/InternLM.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/en/_static/image/lmdeploy-logo.svg\" width=\"450\"/\u003e\n\n[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/lmdeploy)\n[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)\n[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)\n[![open 
issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)\n\n[📘Documentation](https://lmdeploy.readthedocs.io/en/latest/) |\n[🛠️Quick Start](https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html) |\n[🤔Reporting Issues](https://github.com/InternLM/lmdeploy/issues/new/choose)\n\nEnglish | [简体中文](README_zh-CN.md) | [日本語](README_ja.md)\n\n👋 Join us on [![Static Badge](https://img.shields.io/badge/-grey?style=social\u0026logo=wechat\u0026label=WeChat)](https://cdn.vansin.top/internlm/lmdeploy.jpg)\n[![Static Badge](https://img.shields.io/badge/-grey?style=social\u0026logo=twitter\u0026label=Twitter)](https://twitter.com/intern_lm)\n[![Static Badge](https://img.shields.io/badge/-grey?style=social\u0026logo=discord\u0026label=Discord)](https://discord.gg/xa29JuW87d)\n\n\u003c/div\u003e\n\n______________________________________________________________________\n\n## Latest News 🎉\n\n\u003cdetails open\u003e\n\u003csummary\u003e\u003cb\u003e2025\u003c/b\u003e\u003c/summary\u003e\n\u003c/details\u003e\n\n\u003cdetails close\u003e\n\u003csummary\u003e\u003cb\u003e2024\u003c/b\u003e\u003c/summary\u003e\n\n- \\[2024/11\\] Support Mono-InternVL with the PyTorch engine\n- \\[2024/10\\] PyTorchEngine supports graph mode on the Ascend platform, doubling the inference speed\n- \\[2024/09\\] LMDeploy PyTorchEngine adds support for [Huawei Ascend](./docs/en/get_started/ascend/get_started.md).
See supported models [here](docs/en/supported_models/supported_models.md)\n- \\[2024/09\\] LMDeploy PyTorchEngine achieves 1.3x faster Llama3-8B inference by introducing CUDA graphs\n- \\[2024/08\\] LMDeploy is integrated into [modelscope/swift](https://github.com/modelscope/swift) as the default accelerator for VLM inference\n- \\[2024/07\\] Support Llama3.1 8B and 70B, as well as their tool calling\n- \\[2024/07\\] Support [InternVL2](docs/en/multi_modal/internvl.md) full-series models, [InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md) and [function call](docs/en/llm/api_server_tools.md) of InternLM2.5\n- \\[2024/06\\] PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LLaVA-Next\n- \\[2024/05\\] Balance the vision model when deploying VLMs with multiple GPUs\n- \\[2024/05\\] Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA, InternLM-XComposer2\n- \\[2024/04\\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLM-XComposer2.\n- \\[2024/04\\] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer to [here](docs/en/quantization/kv_quant.md) for a detailed guide\n- \\[2024/04\\] TurboMind's latest upgrade boosts GQA, rocketing the [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) model inference to 16+ RPS, about 1.8x faster than vLLM.\n- \\[2024/04\\] Support Qwen1.5-MoE and DBRX.\n- \\[2024/03\\] Support DeepSeek-VL offline inference pipeline and serving.\n- \\[2024/03\\] Support VLM offline inference pipeline and serving.\n- \\[2024/02\\] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE and so on.\n- \\[2024/01\\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamlessly integrates with the [LMDeploy Serving Service](docs/en/llm/api_server.md).\n- \\[2024/01\\] Support for multi-model, multi-machine, multi-card inference services.
For usage instructions, please refer to [here](docs/en/llm/proxy_server.md)\n- \\[2024/01\\] Support the [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, helping to lower the barriers for developers and enabling rapid experimentation with new features and technologies.\n\n\u003c/details\u003e\n\n\u003cdetails close\u003e\n\u003csummary\u003e\u003cb\u003e2023\u003c/b\u003e\u003c/summary\u003e\n\n- \\[2023/12\\] TurboMind supports multimodal input.\n- \\[2023/11\\] TurboMind supports loading HF models directly. Click [here](docs/en/inference/load_hf.md) for details.\n- \\[2023/11\\] TurboMind receives major upgrades, including Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75\n- \\[2023/09\\] TurboMind supports Qwen-14B\n- \\[2023/09\\] TurboMind supports InternLM-20B\n- \\[2023/09\\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click [here](./docs/en/llm/codellama.md) for the deployment guide\n- \\[2023/09\\] TurboMind supports Baichuan2-7B\n- \\[2023/08\\] TurboMind supports FlashAttention-2.\n- \\[2023/08\\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling\n- \\[2023/08\\] TurboMind supports Windows (tp=1)\n- \\[2023/08\\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation.
Check [this](docs/en/quantization/w4a16.md) guide for detailed info\n- \\[2023/08\\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.\n- \\[2023/08\\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.\n- \\[2023/07\\] TurboMind supports Llama-2 70B with GQA.\n- \\[2023/07\\] TurboMind supports Llama-2 7B/13B.\n- \\[2023/07\\] TurboMind supports tensor-parallel inference of InternLM.\n\n\u003c/details\u003e\n\n______________________________________________________________________\n\n# Introduction\n\nLMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:\n\n- **Efficient Inference**: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split\u0026fuse, tensor parallelism, and high-performance CUDA kernels.\n\n- **Effective Quantization**: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16.
The quantization quality has been confirmed via OpenCompass evaluation.\n\n- **Effortless Distributed Serving**: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.\n\n- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, the engine remembers the dialogue history, thus avoiding repeated processing of historical sessions.\n\n- **Excellent Compatibility**: LMDeploy allows [KV Cache Quant](docs/en/quantization/kv_quant.md), [AWQ](docs/en/quantization/w4a16.md) and [Automatic Prefix Caching](docs/en/inference/turbomind_config.md) to be used simultaneously.\n\n# Performance\n\n![v0 1 0-benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/8e455cf1-a792-4fa8-91a2-75df96a2a5ba)\n\n# Supported Models\n\n\u003ctable\u003e\n\u003ctbody\u003e\n\u003ctr align=\"center\" valign=\"middle\"\u003e\n\u003ctd\u003e\n  \u003cb\u003eLLMs\u003c/b\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n  \u003cb\u003eVLMs\u003c/b\u003e\n\u003c/td\u003e\n\u003ctr valign=\"top\"\u003e\n\u003ctd align=\"left\" valign=\"top\"\u003e\n\u003cul\u003e\n  \u003cli\u003eLlama (7B - 65B)\u003c/li\u003e\n  \u003cli\u003eLlama2 (7B - 70B)\u003c/li\u003e\n  \u003cli\u003eLlama3 (8B, 70B)\u003c/li\u003e\n  \u003cli\u003eLlama3.1 (8B, 70B)\u003c/li\u003e\n  \u003cli\u003eLlama3.2 (1B, 3B)\u003c/li\u003e\n  \u003cli\u003eInternLM (7B - 20B)\u003c/li\u003e\n  \u003cli\u003eInternLM2 (7B - 20B)\u003c/li\u003e\n  \u003cli\u003eInternLM3 (8B)\u003c/li\u003e\n  \u003cli\u003eInternLM2.5 (7B)\u003c/li\u003e\n  \u003cli\u003eQwen (1.8B - 72B)\u003c/li\u003e\n  \u003cli\u003eQwen1.5 (0.5B - 110B)\u003c/li\u003e\n  \u003cli\u003eQwen1.5 - MoE (0.5B - 72B)\u003c/li\u003e\n  \u003cli\u003eQwen2 (0.5B - 72B)\u003c/li\u003e\n  \u003cli\u003eQwen2-MoE (57B-A14B)\u003c/li\u003e\n  \u003cli\u003eQwen2.5 (0.5B - 32B)\u003c/li\u003e\n  \u003cli\u003eBaichuan
(7B)\u003c/li\u003e\n  \u003cli\u003eBaichuan2 (7B-13B)\u003c/li\u003e\n  \u003cli\u003eCode Llama (7B - 34B)\u003c/li\u003e\n  \u003cli\u003eChatGLM2 (6B)\u003c/li\u003e\n  \u003cli\u003eGLM4 (9B)\u003c/li\u003e\n  \u003cli\u003eCodeGeeX4 (9B)\u003c/li\u003e\n  \u003cli\u003eFalcon (7B - 180B)\u003c/li\u003e\n  \u003cli\u003eYI (6B-34B)\u003c/li\u003e\n  \u003cli\u003eMistral (7B)\u003c/li\u003e\n  \u003cli\u003eDeepSeek-MoE (16B)\u003c/li\u003e\n  \u003cli\u003eDeepSeek-V2 (16B, 236B)\u003c/li\u003e\n  \u003cli\u003eDeepSeek-V2.5 (236B)\u003c/li\u003e\n  \u003cli\u003eMixtral (8x7B, 8x22B)\u003c/li\u003e\n  \u003cli\u003eGemma (2B - 7B)\u003c/li\u003e\n  \u003cli\u003eDbrx (132B)\u003c/li\u003e\n  \u003cli\u003eStarCoder2 (3B - 15B)\u003c/li\u003e\n  \u003cli\u003ePhi-3-mini (3.8B)\u003c/li\u003e\n  \u003cli\u003ePhi-3.5-mini (3.8B)\u003c/li\u003e\n  \u003cli\u003ePhi-3.5-MoE (16x3.8B)\u003c/li\u003e\n  \u003cli\u003eMiniCPM3 (4B)\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\u003cul\u003e\n  \u003cli\u003eLLaVA(1.5,1.6) (7B-34B)\u003c/li\u003e\n  \u003cli\u003eInternLM-XComposer2 (7B, 4khd-7B)\u003c/li\u003e\n  \u003cli\u003eInternLM-XComposer2.5 (7B)\u003c/li\u003e\n  \u003cli\u003eQwen-VL (7B)\u003c/li\u003e\n  \u003cli\u003eQwen2-VL (2B, 7B, 72B)\u003c/li\u003e\n  \u003cli\u003eQwen2.5-VL (3B, 7B, 72B)\u003c/li\u003e\n  \u003cli\u003eDeepSeek-VL (7B)\u003c/li\u003e\n  \u003cli\u003eDeepSeek-VL2 (3B, 16B, 27B)\u003c/li\u003e\n  \u003cli\u003eInternVL-Chat (v1.1-v1.5)\u003c/li\u003e\n  \u003cli\u003eInternVL2 (1B-76B)\u003c/li\u003e\n  \u003cli\u003eInternVL2.5(MPO) (1B-78B)\u003c/li\u003e\n  \u003cli\u003eMono-InternVL (2B)\u003c/li\u003e\n  \u003cli\u003eChemVLM (8B-26B)\u003c/li\u003e\n  \u003cli\u003eMiniGeminiLlama (7B)\u003c/li\u003e\n  \u003cli\u003eCogVLM-Chat (17B)\u003c/li\u003e\n  \u003cli\u003eCogVLM2-Chat (19B)\u003c/li\u003e\n  \u003cli\u003eMiniCPM-Llama3-V-2_5\u003c/li\u003e\n  \u003cli\u003eMiniCPM-V-2_6\u003c/li\u003e\n  
\u003cli\u003ePhi-3-vision (4.2B)\u003c/li\u003e\n  \u003cli\u003ePhi-3.5-vision (4.2B)\u003c/li\u003e\n  \u003cli\u003eGLM-4V (9B)\u003c/li\u003e\n  \u003cli\u003eLlama3.2-vision (11B, 90B)\u003c/li\u003e\n  \u003cli\u003eMolmo (7B-D,72B)\u003c/li\u003e\n  \u003cli\u003eGemma3 (1B - 27B)\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\nLMDeploy has developed two inference engines - [TurboMind](./docs/en/inference/turbomind.md) and [PyTorch](./docs/en/inference/pytorch.md), each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.\n\nThey differ in the models they support and the inference data types they handle. Please refer to [this table](./docs/en/supported_models/supported_models.md) for each engine's capabilities, and choose the one that best fits your needs.\n\n# Quick Start [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Dh-YlSwg78ZO3AlleO441NF_QP2shs95#scrollTo=YALmXnwCG1pQ)\n\n## Installation\n\nIt is recommended to install LMDeploy with pip in a conda environment (Python 3.8 - 3.12):\n\n```shell\nconda create -n lmdeploy python=3.8 -y\nconda activate lmdeploy\npip install lmdeploy\n```\n\nThe default prebuilt package has been compiled with **CUDA 12** since v0.3.0.\nFor information on installing on a CUDA 11+ platform, or for instructions on building from source, please refer to the [installation guide](docs/en/get_started/installation.md).\n\n## Offline Batch Inference\n\n```python\nimport lmdeploy\nwith lmdeploy.pipeline(\"internlm/internlm3-8b-instruct\") as pipe:\n    response = pipe([\"Hi, pls intro yourself\", \"Shanghai is\"])\n    print(response)\n```\n\n\u003e \\[!NOTE\\]\n\u003e By default, LMDeploy downloads models from Hugging Face.
If you would like to use models from ModelScope, please install ModelScope with `pip install modelscope` and set the environment variable:\n\u003e\n\u003e `export LMDEPLOY_USE_MODELSCOPE=True`\n\u003e\n\u003e If you would like to use models from openMind Hub, please install openMind Hub with `pip install openmind_hub` and set the environment variable:\n\u003e\n\u003e `export LMDEPLOY_USE_OPENMIND_HUB=True`\n\nFor more information about the inference pipeline, please refer to [here](docs/en/llm/pipeline.md).\n\n# Tutorials\n\nPlease review the [getting_started](docs/en/get_started/get_started.md) section for the basic usage of LMDeploy.\n\nFor detailed user guides and advanced guides, please refer to our [tutorials](https://lmdeploy.readthedocs.io/en/latest/):\n\n- User Guide\n  - [LLM Inference pipeline](docs/en/llm/pipeline.md) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Dh-YlSwg78ZO3AlleO441NF_QP2shs95#scrollTo=YALmXnwCG1pQ)\n  - [VLM Inference pipeline](docs/en/multi_modal/vl_pipeline.md) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1nKLfnPeDA3p-FMNw2NhI-KOpk7-nlNjF?usp=sharing)\n  - [LLM Serving](docs/en/llm/api_server.md)\n  - [VLM Serving](docs/en/multi_modal/api_server_vl.md)\n  - [Quantization](docs/en/quantization)\n- Advanced Guide\n  - [Inference Engine - TurboMind](docs/en/inference/turbomind.md)\n  - [Inference Engine - PyTorch](docs/en/inference/pytorch.md)\n  - [Customize chat templates](docs/en/advance/chat_template.md)\n  - [Add a new model](docs/en/advance/pytorch_new_model.md)\n  - GEMM tuning\n  - [Long context inference](docs/en/advance/long_context.md)\n  - [Multi-model inference service](docs/en/llm/proxy_server.md)\n\n# Third-party projects\n\n- Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: [LMDeploy-Jetson](https://github.com/BestAnHongjun/LMDeploy-Jetson)\n\n- Example project for deploying
LLMs using LMDeploy and BentoML: [BentoLMDeploy](https://github.com/bentoml/BentoLMDeploy)\n\n# Contributing\n\nWe appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.\n\n# Acknowledgement\n\n- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)\n- [llm-awq](https://github.com/mit-han-lab/llm-awq)\n- [vLLM](https://github.com/vllm-project/vllm)\n- [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII)\n\n# Citation\n\n```bibtex\n@misc{2023lmdeploy,\n    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},\n    author={LMDeploy Contributors},\n    howpublished = {\\url{https://github.com/InternLM/lmdeploy}},\n    year={2023}\n}\n```\n\n# License\n\nThis project is released under the [Apache 2.0 license](LICENSE).\n","funding_links":[],"categories":["Python","A01_文本生成_文本对话","LLM Deployment","NLP","推理 Inference","Inference","Deployment and Serving","🔓 Open Source Inference Engines","LLM Inference","Inference Engine","📋 Contents","2. **Production Tools**","One-Click Runners \u0026 Installers (15)","LLM Serving / Inference","Inference \u0026 Serving","Open-Source Local LLM Projects","LLM 部署与推理 (Deployment \u0026 Inference)"],"sub_categories":["大语言对话模型及数据","3. Pretraining","Inference Engine","⚡ 3. Inference Engines \u0026 Serving","Inference Engines","推理引擎 (Inference Engines)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInternLM%2Flmdeploy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FInternLM%2Flmdeploy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInternLM%2Flmdeploy/lists"}