# llm-inference-solutions
A collection of available inference and serving solutions for LLMs.

| Name | Organization | Description | Supported Hardware | Key Features | License |
|------|--------------|-------------|--------------------|--------------|---------|
| [vLLM](https://github.com/vllm-project/vllm) | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs. | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
| [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs. | CPU, GPU | Multi-model serving, dynamic batching, optimized for transformers. | Apache 2.0 |
| [llm-engine](https://github.com/scaleapi/llm-engine) | Scale AI | Scale AI's LLM Engine for fine-tuning and serving LLMs. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
| [DeepSpeed](https://github.com/microsoft/DeepSpeed) | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
| [OpenLLM](https://github.com/bentoml/OpenLLM) | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
| [LMDeploy](https://github.com/InternLM/lmdeploy) | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
| [FlexFlow](https://github.com/flexflow/FlexFlow) | CMU, Stanford, UCSD | A distributed deep learning framework. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
| [CTranslate2](https://github.com/OpenNMT/CTranslate2) | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
| [FastChat](https://github.com/lm-sys/FastChat) | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
| [Triton Inference Server](https://github.com/triton-inference-server/server) | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
| [Lepton.AI](https://github.com/leptonai/leptonai) | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
| [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM) | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
| [Lorax](https://predibase.com/blog/lorax-the-open-source-framework-for-serving-100s-of-fine-tuned-llms-in) | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
| [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) | NVIDIA | Provides users with an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
| [mistral.rs](https://github.com/EricLBuehler/mistral.rs) | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
| [NanoFlow](https://github.com/efeslab/Nanoflow) | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
| [LMCache](https://github.com/LMCache/LMCache) | LMCache | KV-cache layer for fast and cost-efficient LLM inference. | CPU, GPU | KV-cache reuse and offloading, cost optimization, scalable serving. | Apache 2.0 |
| [LitServe](https://github.com/Lightning-AI/LitServe) | Lightning AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale. | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
| [DeepSeek Inference System Overview](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md) | DeepSeek | Overview of the DeepSeek-V3/R1 inference system, designed for higher throughput and lower latency. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
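
Most of the engines above expose either a Python API or an HTTP serving endpoint. As a rough, non-authoritative illustration of what using one of them looks like, below is a minimal offline-inference sketch with vLLM's Python API; the prompts and the model name are only examples, and API details may vary between vLLM versions.

```python
# Minimal offline-inference sketch using vLLM (example only).
# Assumes `pip install vllm` and a CUDA-capable GPU; the model name
# below is a small placeholder and can be swapped for any HF model.
from vllm import LLM, SamplingParams

prompts = [
    "Explain the difference between throughput and latency in one sentence.",
    "What is PagedAttention?",
]

# Sampling settings: modest temperature, cap the number of new tokens.
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Load the model once; vLLM manages KV-cache memory via PagedAttention.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
    print("-" * 40)
```

For online serving, several of these projects (e.g. vLLM, LMDeploy, OpenLLM) can also expose an OpenAI-compatible HTTP server, so existing OpenAI client code can be pointed at a self-hosted endpoint.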