# llm-inference-solutions
A collection of available inference and serving solutions for LLMs.

| Name | Organization | Description | Supported Hardware | Key Features | License |
|------|--------------|-------------|--------------------|--------------|---------|
| [vLLM](https://github.com/vllm-project/vllm) | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs. | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
| [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs. | CPU, GPU | Multi-model serving, dynamic batching, optimized for transformers. | Apache 2.0 |
| [llm-engine](https://github.com/scaleapi/llm-engine) | Scale AI | Scale AI's LLM Engine for fine-tuning and serving LLMs. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
| [DeepSpeed](https://github.com/microsoft/DeepSpeed) | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
| [OpenLLM](https://github.com/bentoml/OpenLLM) | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
| [LMDeploy](https://github.com/InternLM/lmdeploy) | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
| [FlexFlow](https://github.com/flexflow/FlexFlow) | CMU, Stanford, UCSD | A distributed deep learning framework. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
| [CTranslate2](https://github.com/OpenNMT/CTranslate2) | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
| [FastChat](https://github.com/lm-sys/FastChat) | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
| [Triton Inference Server](https://github.com/triton-inference-server/server) | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
| [Lepton.AI](https://github.com/leptonai/leptonai) | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
| [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM) | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
| [Lorax](https://predibase.com/blog/lorax-the-open-source-framework-for-serving-100s-of-fine-tuned-llms-in) | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
| [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) | NVIDIA | Provides users with an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
| [mistral.rs](https://github.com/EricLBuehler/mistral.rs) | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
| [NanoFlow](https://github.com/efeslab/Nanoflow) | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
| [LMCache](https://github.com/LMCache/LMCache) | LMCache | KV-cache layer for fast and cost-efficient LLM inference. | CPU, GPU | KV-cache reuse and offloading, cost optimization, scalable serving. | Apache 2.0 |
| [LitServe](https://github.com/Lightning-AI/LitServe) | Lightning AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale. | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
| [DeepSeek Inference System Overview](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md) | DeepSeek | Overview of the DeepSeek-V3/R1 inference system, designed for higher throughput and lower latency. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
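
Most of the engines above expose either a Python API or an HTTP serving endpoint. As a rough, non-authoritative illustration of what using one of them looks like, below is a minimal offline-inference sketch with vLLM's Python API; the prompts and the model name are only examples, and API details may vary between vLLM versions.

```python
# Minimal offline-inference sketch using vLLM (example only).
# Assumes `pip install vllm` and a CUDA-capable GPU; the model name
# below is a small placeholder and can be swapped for any HF model.
from vllm import LLM, SamplingParams

prompts = [
    "Explain the difference between throughput and latency in one sentence.",
    "What is PagedAttention?",
]

# Sampling settings: modest temperature, cap the number of new tokens.
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Load the model once; vLLM manages KV-cache memory via PagedAttention.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
    print("-" * 40)
```

For online serving, several of these projects (e.g. vLLM, LMDeploy, OpenLLM) can also expose an OpenAI-compatible HTTP server, so existing OpenAI client code can be pointed at a self-hosted endpoint.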