{"id":13753987,"url":"https://github.com/vectorch-ai/ScaleLLM","last_synced_at":"2025-05-09T22:30:39.004Z","repository":{"id":183542996,"uuid":"670332256","full_name":"vectorch-ai/ScaleLLM","owner":"vectorch-ai","description":"A high-performance inference system for large language models, designed for production environments.","archived":false,"fork":false,"pushed_at":"2025-05-06T22:53:37.000Z","size":19955,"stargazers_count":436,"open_issues_count":57,"forks_count":35,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-06T23:27:59.573Z","etag":null,"topics":["cuda","efficiency","gpu","inference","llama","llama3","llm","llm-inference","model","performance","production","serving","speculative","transformer"],"latest_commit_sha":null,"homepage":"https://docs.vectorch.com/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vectorch-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-24T20:14:28.000Z","updated_at":"2025-05-06T10:14:13.000Z","dependencies_parsed_at":"2024-03-16T23:13:13.362Z","dependency_job_id":"6d42a12d-ec89-4b4b-af05-aa5216ba3a0c","html_url":"https://github.com/vectorch-ai/ScaleLLM","commit_stats":{"total_commits":624,"total_committers":6,"mean_commits":104.0,"dds":0.04647435897435892,"last_synced_commit":"a7abc225e7d6019ac9c0246011828810529a8fc8"},"previous_names":["vectorch-ai/llm-serving","vectorch-ai/llminfer"],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/vectorch-ai%2FScaleLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorch-ai%2FScaleLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorch-ai%2FScaleLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorch-ai%2FScaleLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vectorch-ai","download_url":"https://codeload.github.com/vectorch-ai/ScaleLLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335170,"owners_count":21892620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","efficiency","gpu","inference","llama","llama3","llm","llm-inference","model","performance","production","serving","speculative","transformer"],"created_at":"2024-08-03T09:01:36.552Z","updated_at":"2025-05-09T22:30:38.986Z","avatar_url":"https://github.com/vectorch-ai.png","language":"C++","readme":"\u003cdiv align=\"center\"\u003e\n\nScaleLLM\n=================\n\u003ch3\u003e An efficient LLM Inference solution \u003c/h3\u003e\n\n[![Discord][discord-shield]][discord-url]\n[![X][x-shield]][x-url]\n\u003cbr\u003e\n[![Docs][docs-shield]][docs-url]\n[![PyPI][pypi-shield]][pypi-url]\n[![downloads][github-downloads-shield]][github-downloads-link]\n[![License][license-shield]][license-url]\n\n[discord-shield]: https://dcbadge.vercel.app/api/server/PKe5gvBZfn?compact=true\u0026style=flat\n[discord-url]: https://discord.gg/PKe5gvBZfn\n[x-shield]: 
https://img.shields.io/twitter/url?label=%20%40VectorchAI\u0026style=social\u0026url=https://x.com/VectorchAI\n[x-url]: https://x.com/VectorchAI\n\n[docs-shield]: https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat\n[docs-url]: https://docs.vectorch.com/\n[pypi-shield]: https://badge.fury.io/py/scalellm.svg\n[pypi-url]: https://pypi.org/project/scalellm/\n[github-downloads-shield]: https://img.shields.io/github/downloads/vectorch-ai/ScaleLLM/total?style=flat\n[github-downloads-link]: https://github.com/vectorch-ai/ScaleLLM/releases\n[build-shield]: https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml/badge.svg?branch=main\n[build-url]:https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml\n[license-shield]: https://img.shields.io/badge/License-Apache_2.0-blue.svg\n[license-url]: https://opensource.org/licenses/Apache-2.0\n\n---\n\n\u003cdiv align=\"left\"\u003e\n\n[ScaleLLM](#) is a cutting-edge inference system engineered for large language models (LLMs), designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including [Llama3.1](https://github.com/meta-llama/llama3), [Gemma2](https://github.com/google-deepmind/gemma), Bloom, GPT-NeoX, and more.\n\nScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.\n\n## News:\n* [06/2024] - ScaleLLM is now available on [PyPI](https://pypi.org/project/scalellm/). 
You can install it using `pip install scalellm`.\n* [03/2024] - [Advanced features](#advanced-features) support for ✅ [CUDA graph](#cuda-graph), ✅ [prefix cache](#prefix-cache), ✅ [chunked prefill](#chunked-prefill) and ✅ [speculative decoding](#speculative-decoding).\n* [11/2023] - [First release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular [open-source models](#supported-models).\n\n## Key Features\n\n- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.\n- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.\n- [OpenAI-compatible API](): An OpenAI-compatible REST API server that supports both chat and completions.\n- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.\n- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.\n- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.\n\n## Table of contents\n\n- [Get Started](#get-started)\n  - [Installation](#installation)\n  - [Chatbot UI](#chatbot-ui)\n  - [Usage Examples](#usage-examples)\n- [Advanced Features](#advanced-features)\n  - [CUDA Graph](#cuda-graph)\n  - [Prefix Cache](#prefix-cache)\n  - [Chunked Prefill](#chunked-prefill)\n  - [Speculative Decoding](#speculative-decoding)\n  - [Quantization](#quantization)\n- [Supported Models](#supported-models)\n- [Limitations](#limitations)\n- [Contributing](#contributing)\n- [Acknowledgements](#acknowledgements)\n- 
[License](#license)\n\n## Get Started\n\nScaleLLM is available as a Python wheel package on PyPI. You can install it using pip:\n```bash\n# Install scalellm with CUDA 12.4 and PyTorch 2.6.0\npip install -U scalellm\n```\n\nIf you want to install ScaleLLM with a different version of CUDA and PyTorch, you can install it by providing the index URL for that version. For example, to install ScaleLLM with CUDA 12.1 and PyTorch 2.4.1, use the following command:\n\n```bash\npip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/\n```\n\n### Build from source\nIf no wheel package is available for your configuration, you can build ScaleLLM from source. Clone the repository, then build and install it locally with the following commands:\n```bash\npython setup.py bdist_wheel\npip install dist/scalellm-*.whl\n```\n\n### OpenAI-Compatible Server\nYou can start the OpenAI-compatible REST API server with the following command:\n```bash\npython3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct\n```\n\n### Chatbot UI\n\nA local Chatbot UI is also available at [localhost:3000](http://localhost:3000). You can start it with the [latest image](https://hub.docker.com/r/vectorchai/chatbot-ui/tags) using the following command:\n\n```bash\ndocker pull docker.io/vectorchai/chatbot-ui:latest\ndocker run -it --net=host \\\n  -e OPENAI_API_HOST=http://127.0.0.1:8080 \\\n  -e OPENAI_API_KEY=YOUR_API_KEY \\\n  docker.io/vectorchai/chatbot-ui:latest\n```\n\n### Usage Examples\nYou can use ScaleLLM for offline batch inference or online distributed inference. Below are some examples to help you get started. 
More examples can be found in the [examples](https://github.com/vectorch-ai/ScaleLLM/tree/main/examples) folder.\n\n#### Chat Completions\n\nStart the REST API server with the following command:\n```bash\npython3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct\n```\n\nYou can query chat completions with curl:\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ]\n  }'\n```\n\nor with the OpenAI Python client:\n\n```python {linenos=true}\nimport openai\n\nclient = openai.Client(\n    base_url=\"http://localhost:8080/v1\",\n    api_key=\"EMPTY\",\n)\n\n# List available models\nmodels = client.models.list()\nprint(\"==== Available models ====\")\nfor model in models.data:\n    print(model.id)\n\n# Choose the first model\nmodel = models.data[0].id\n\nstream = client.chat.completions.create(\n    model=model,\n    messages=[\n        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n        {\"role\": \"user\", \"content\": \"Hello\"},\n    ],\n    stream=True,\n)\n\nprint(f\"==== Model: {model} ====\")\nfor chunk in stream:\n    choice = chunk.choices[0]\n    delta = choice.delta\n    if delta.content:\n        print(delta.content, end=\"\")\nprint()\n```\n\n#### Completions\nStart the REST API server with the following command:\n```bash\npython3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B\n```\n\nFor regular completions, you can use this example:\n\n```bash\ncurl http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"meta-llama/Meta-Llama-3.1-8B\",\n    \"prompt\": \"hello\",\n    \"max_tokens\": 32,\n    \"temperature\": 
0.7,\n    \"stream\": true\n  }'\n```\n\n```python {linenos=true}\nimport openai\n\nclient = openai.Client(\n    base_url=\"http://localhost:8080/v1\",\n    api_key=\"EMPTY\",\n)\n\n# List available models\nmodels = client.models.list()\n\nprint(\"==== Available models ====\")\nfor model in models.data:\n    print(model.id)\n\n# Choose the first model\nmodel = models.data[0].id\n\nstream = client.completions.create(\n    model=model,\n    prompt=\"hello\",\n    max_tokens=32,\n    temperature=0.7,\n    stream=True,\n)\n\nprint(f\"==== Model: {model} ====\")\nfor chunk in stream:\n    choice = chunk.choices[0]\n    if choice.text:\n        print(choice.text, end=\"\")\nprint()\n```\n\n## Advanced Features\n### CUDA Graph\nCUDA Graph can improve performance by reducing the overhead of launching kernels. ScaleLLM supports CUDA Graph for decoding by default. In addition, it allows users to specify which batch sizes to capture by setting the `--cuda_graph_batch_sizes` flag.\n\nFor example:\n```bash\npython3 -m scalellm.serve.api_server \\\n  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \\\n  --enable_cuda_graph=true \\\n  --cuda_graph_batch_sizes=1,2,4,8\n```\n\nThe limitations of CUDA Graph can cause problems during development and debugging. If you encounter any issues related to it, you can disable CUDA Graph by setting the `--enable_cuda_graph=false` flag.\n\n### Prefix Cache\nThe KV cache is a technique that caches intermediate key-value (KV) states to avoid redundant computation during LLM inference. Prefix cache extends this idea by allowing KV caches with the same prefix to be shared among different requests.\n\nScaleLLM supports Prefix Cache and enables it by default. You can disable it by setting the `--enable_prefix_cache=false` flag.\n\n### Chunked Prefill\nChunked Prefill splits a long user prompt into multiple chunks and populates the remaining slots with decodes. 
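The split-and-fill idea can be sketched in a few lines. This is an illustrative scheduler, not ScaleLLM's internal implementation; the 512-token budget mirrors the `--max_tokens_per_batch` default, and the function and field names are made up for the example:

```python
def plan_batches(prompt_len: int, num_decodes: int, max_tokens_per_batch: int = 512):
    """Split a long prefill into chunks, filling leftover slots with decode tokens.

    Illustrative only: each running decode sequence contributes one token per
    batch, and whatever budget remains goes to the next chunk of the new prompt.
    """
    batches = []
    remaining = prompt_len
    while remaining > 0:
        # Budget left for prefill after reserving one slot per decode sequence.
        chunk = max(1, max_tokens_per_batch - num_decodes)
        take = min(chunk, remaining)
        batches.append({"prefill": take, "decode": num_decodes})
        remaining -= take
    return batches

# A 1000-token prompt alongside 12 decoding sequences fits in two batches:
# [{'prefill': 500, 'decode': 12}, {'prefill': 500, 'decode': 12}]
```

Because decodes ride along with every prefill chunk, running sequences keep producing tokens instead of stalling behind one long prompt.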
This technique can improve decoding throughput and reduce the long stalls that degrade the user experience. However, it may slightly increase Time to First Token (TTFT). ScaleLLM supports Chunked Prefill, and its behavior can be controlled by setting the following flags:\n- `--max_tokens_per_batch`: The maximum number of tokens per batch; the default is 512.\n- `--max_seqs_per_batch`: The maximum number of sequences per batch; the default is 128.\n\n### Speculative Decoding\nSpeculative Decoding is a commonly used technique to speed up LLM inference without changing the output distribution. During inference, it employs an economical approximation to generate speculative tokens, which are subsequently validated by the target model. Currently, ScaleLLM supports Speculative Decoding with a draft model, which can be enabled by configuring the draft model and setting the number of speculative tokens.\n\nFor example:\n```bash\npython3 -m scalellm.serve.api_server \\\n  --model=google/gemma-7b-it \\\n  --draft_model=google/gemma-2b-it \\\n  --num_speculative_tokens=5 \\\n  --device=cuda:0 \\\n  --draft_device=cuda:0\n```\n\n### Quantization\nQuantization is a crucial process for reducing the memory footprint of models. 
ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization ([GPTQ](https://arxiv.org/abs/2210.17323)) and Activation-aware Weight Quantization ([AWQ](https://arxiv.org/abs/2306.00978)), with seamless integration into the following libraries: autogptq and awq.\n\n\n## Supported Models\n\n|   Models   | Tensor Parallel | Quantization | Chat API | HF models examples |\n| :--------: | :-------------: | :----------: | :------: | :---------------------------:|\n|   Aquila   |       Yes       |     Yes      |    Yes   | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |\n|   Bloom    |       Yes       |     Yes      |    No    | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |\n|   Baichuan |       Yes       |     Yes      |    Yes   | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) |\n| ChatGLM4/3 |       Yes       |     Yes      |    Yes   | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |\n|   Gemma2   |       Yes       |     Yes      |    Yes   | [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) |\n|   GPT_j    |       Yes       |     Yes      |    No    | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |\n|  GPT_NeoX  |       Yes       |     Yes      |    No    | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |\n|    GPT2    |       Yes       |     Yes      |    No    | [gpt2](https://huggingface.co/gpt2)|\n| InternLM   |       Yes       |     Yes      |    Yes   | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |\n|   Llama3/2 |       Yes       |     Yes      |    Yes   | [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) |\n|  Mistral   |       Yes       |     Yes      |    Yes   | 
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |\n|    MPT     |       Yes       |     Yes      |    Yes   | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |\n|   Phi2     |       Yes       |     Yes      |    No    | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |\n|   Qwen2    |       Yes       |     Yes      |    Yes   | [Qwen/Qwen-72B-Chat](https://huggingface.co/Qwen/Qwen-72B-Chat) |\n|    Yi      |       Yes       |     Yes      |    Yes   | [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), [01-ai/Yi-34B-Chat-4bits](https://huggingface.co/01-ai/Yi-34B-Chat-4bits), [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K) |\n\nIf your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on [GitHub Issues](https://github.com/vectorch-ai/ScaleLLM/issues).\n\n## Limitations\n\nThere are several known limitations we are looking to address in the coming months, including:\n\n- Only GPUs newer than the Turing architecture are supported.\n\n## Contributing\n\nIf you have any questions or want to contribute, please don't hesitate to ask in our [\"Discussions\" forum](https://github.com/vectorch-ai/ScaleLLM/discussions) or join our [\"Discord\" chat room](https://discord.gg/PKe5gvBZfn). We welcome your input and contributions to make ScaleLLM even better. 
Please follow the [Contributing.md](https://github.com/vectorch-ai/ScaleLLM/blob/main/CONTRIBUTING.md) to get started.\n\n## Acknowledgements\nThe following open-source projects have been used in this project, either in their original form or modified to meet our needs:\n* [pytorch](https://github.com/pytorch/pytorch)\n* [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)\n* [vllm](https://github.com/vllm-project/vllm)\n* [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)\n* [llm-awq](https://github.com/mit-han-lab/llm-awq)\n* [exllama](https://github.com/turboderp/exllamav2)\n* [tokenizers](https://github.com/huggingface/tokenizers)\n* [safetensors](https://github.com/huggingface/safetensors/)\n* [sentencepiece](https://github.com/google/sentencepiece)\n* [grpc-gateway](https://github.com/grpc-ecosystem/grpc-gateway)\n\n## License\nThis project is released under the [Apache 2.0 license](https://github.com/vectorch-ai/ScaleLLM/blob/main/LICENSE).\n","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorch-ai%2FScaleLLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvectorch-ai%2FScaleLLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorch-ai%2FScaleLLM/lists"}