https://github.com/mani-kantap/llm-inference-solutions
A collection of available inference solutions for LLMs
llm-inference llm-serving llmops
- Host: GitHub
- URL: https://github.com/mani-kantap/llm-inference-solutions
- Owner: mani-kantap
- License: MIT
- Created: 2023-07-23T20:39:23.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-25T21:01:31.000Z (about 1 year ago)
- Last Synced: 2024-04-25T23:22:19.921Z (about 1 year ago)
- Topics: llm-inference, llm-serving, llmops
- Homepage:
- Size: 19.5 KB
- Stars: 43
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-llm - llm-inference-solutions
- Awesome-LLM - llm-inference-solutions
README
# llm-inference-solutions
A collection of available inference solutions for LLMs.

| Name | Organization | Description | Supported Hardware | Key Features | License |
|------|--------------|-------------|--------------------|--------------|---------|
| [vLLM](https://github.com/vllm-project/vllm) | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this table). | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
| [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs. | CPU, GPU | Multi-model serving, dynamic batching, optimized for transformers. | Apache 2.0 |
| [llm-engine](https://github.com/scaleapi/llm-engine) | Scale AI | Scale LLM Engine public repository for efficient inference. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
| [DeepSpeed](https://github.com/microsoft/DeepSpeed) | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
| [OpenLLM](https://github.com/bentoml/OpenLLM) | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
| [LMDeploy](https://github.com/InternLM/lmdeploy) | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
| [FlexFlow](https://github.com/flexflow/FlexFlow) | CMU, Stanford, UCSD | A distributed deep learning framework. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
| [CTranslate2](https://github.com/OpenNMT/CTranslate2) | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
| [FastChat](https://github.com/lm-sys/FastChat) | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
| [Triton Inference Server](https://github.com/triton-inference-server/server) | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
| [Lepton.AI](https://github.com/leptonai/leptonai) | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
| [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM) | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
| [Lorax](https://predibase.com/blog/lorax-the-open-source-framework-for-serving-100s-of-fine-tuned-llms-in) | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
| [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) | NVIDIA | Provides users with an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
| [mistral.rs](https://github.com/EricLBuehler/mistral.rs) | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
| [NanoFlow](https://github.com/efeslab/Nanoflow) | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
| [LMCache](https://github.com/LMCache/LMCache) | LMCache | Fast and cost-efficient inference. | CPU, GPU | Caching mechanisms, cost optimization, scalable serving. | Apache 2.0 |
| [LitServe](https://github.com/Lightning-AI/LitServe) | Lightning AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale. | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
| [DeepSeek Inference System Overview](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md) | DeepSeek | Higher throughput and lower latency inference system. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
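
Many of the in-process engines above follow a similar "load a model, set sampling parameters, generate" workflow. As one illustration, here is a minimal offline-inference sketch using vLLM's Python API; the model name (`facebook/opt-125m`) and sampling values are placeholders chosen for illustration, so adjust them for your hardware and use case.

```python
# Minimal vLLM offline-inference sketch.
# Assumes `pip install vllm` and a model small enough for your hardware;
# `facebook/opt-125m` is only a placeholder.
from vllm import LLM, SamplingParams

# Load the model; vLLM manages KV-cache memory with PagedAttention internally.
llm = LLM(model="facebook/opt-125m")

# Placeholder sampling parameters; tune temperature / top_p / max_tokens as needed.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Explain what an LLM inference engine does."]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```

Server-oriented entries in the table (Text-Generation-Inference, Triton Inference Server, TensorRT-LLM, and similar) are typically deployed as standalone services behind an HTTP or gRPC API rather than called in-process like this.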