https://github.com/LMCache/LMCache
Redis for LLMs
- Host: GitHub
- URL: https://github.com/LMCache/LMCache
- Owner: LMCache
- License: apache-2.0
- Created: 2024-05-28T21:06:04.000Z (almost 2 years ago)
- Default Branch: dev
- Last Pushed: 2025-03-21T03:51:47.000Z (12 months ago)
- Last Synced: 2025-03-21T04:32:08.374Z (12 months ago)
- Language: Python
- Homepage: https://lmcache.ai/
- Size: 4.39 MB
- Stars: 623
- Watchers: 9
- Forks: 66
- Open Issues: 98
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-repositories - LMCache/LMCache - Supercharge Your LLM with the Fastest KV Cache Layer (Python)
- AiTreasureBox - LMCache/LMCache - Supercharge Your LLM with the Fastest KV Cache Layer (Repos)
- StarryDivineSky - LMCache/LMCache - …10x. Its core principle is dynamic cache partitioning plus efficient memory management: intermediate results are cached in shared memory and reused on demand, avoiding repeated KV computation, with multi-threaded parallelism and pipeline optimization to remain stable under high concurrency. The project also provides flexible configuration interfaces that let users tune cache granularity and memory-allocation strategy to their hardware, suiting server-side inference acceleration, dialogue-system optimization, and similar scenarios. Through open-source code and detailed documentation, LMCache lowers the barrier to adoption, aiming to give LLM developers a lightweight, extensible caching solution to the performance bottlenecks of long-sequence generation. (A01_Text generation_Text dialogue / Large-language dialogue models and data)
- Awesome-LLMOps - LMCache - Context LLM By Smart KV Cache Optimizations.     (Inference / Middleware)
- awesome-ai-efficiency - LLMCache - to-first-token and increase throughput, especially under long-context scenarios. (Tools 🛠️)
- awesome-local-llm - LMCache - supercharge your LLM with the fastest KV Cache Layer (Tools / Memory Management)
README
| [**Blog**](https://lmcache.github.io) | [**Documentation**](https://docs.lmcache.ai/) | [**Join Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https://forms.gle/mQfQDUXbKfp2St1z7) | [**Official Email**](mailto:contact@lmcache.ai) |
# 💡 What is LMCache?
TL;DR - Redis for LLMs.
LMCache is an **LLM** serving engine extension that **reduces TTFT** (time to first token) and **increases throughput**, especially under long-context scenarios. By storing the KV caches of reusable texts across various locations (GPU, CPU DRAM, and local disk), LMCache reuses the KV cache of **_any_** repeated text (not necessarily a prefix) in **_any_** serving engine instance. LMCache thus saves precious GPU cycles and reduces response delay for users.
By combining LMCache with vLLM, LMCache achieves 3-10x savings in delay and GPU cycles across many LLM use cases, including multi-round QA and RAG.
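The idea can be pictured with a toy sketch (illustrative only — `KVCacheStore` and its methods are invented for this example and are not the LMCache API): text is split into fixed-size chunks, each chunk's KV entry is keyed by a hash of its tokens, so a lookup hits whenever the same chunk reappears, regardless of its position in the new prompt.

```python
import hashlib

class KVCacheStore:
    """Toy chunk-level KV cache: keys are content hashes, so any
    repeated chunk hits, not just shared prefixes.
    (Illustration only, not the real LMCache implementation.)"""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.store = {}  # chunk hash -> stand-in for a KV tensor

    def _chunks(self, tokens):
        for i in range(0, len(tokens), self.chunk_size):
            yield tuple(tokens[i:i + self.chunk_size])

    def _key(self, chunk):
        return hashlib.sha256(repr(chunk).encode()).hexdigest()

    def put(self, tokens):
        for chunk in self._chunks(tokens):
            self.store[self._key(chunk)] = f"kv({chunk})"

    def hits(self, tokens):
        """Return how many chunks of `tokens` are already cached."""
        return sum(self._key(c) in self.store for c in self._chunks(tokens))

store = KVCacheStore(chunk_size=4)
store.put(list(range(8)))  # cache chunks (0..3) and (4..7)
# A new prompt embedding chunk (4..7) mid-sequence still hits:
reused = store.hits([9, 9, 9, 9] + [4, 5, 6, 7])
print(reused)  # 1 of the 2 chunks is served from cache
```

Note that in this toy version a repeated text only hits when it lands on a chunk boundary; handling arbitrary alignment and cross-chunk attention is exactly the hard part the real system addresses.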
Try LMCache with pre-built vLLM Docker images [here](https://github.com/LMCache/demo).
# 🚀 Performance snapshot

# 💻 Quickstart
LMCache integrates with the latest vLLM (0.6.2). To install LMCache, run:
```bash
# requires python >= 3.10 and nvcc >= 12.1
pip install lmcache lmcache_vllm
```
LMCache has the same interface as vLLM (both online serving and offline inference).
To use the online serving, you can start an OpenAI API-compatible vLLM server with LMCache via:
```bash
lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
```
To use vLLM's offline inference with LMCache, simply prefix the vLLM imports with `lmcache_vllm`. For example:
```python
import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM
```
More detailed documentation will be available soon.
## Sharing the KV cache across multiple vLLM instances
LMCache supports sharing KV caches across different vLLM instances via the `lmcache.server` module. Here is a quick guide:
```bash
# Start lmcache server
lmcache_server localhost 65432
```
Then, start two vLLM instances with the LMCache config file:
```bash
wget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml
# start the first vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000
# start the second vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001
```
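As a rough sketch of the kind of settings involved (the key names and URL scheme below are assumptions for illustration — consult the repo's `examples/example.yaml` for the real contents), a shared-cache config would point both instances at the `lmcache_server` started above:

```yaml
# Sketch only; actual key names may differ from the repo's example.yaml.
chunk_size: 256                     # tokens per KV cache chunk (assumed)
remote_url: "lm://localhost:65432"  # the lmcache_server started above (assumed URL scheme)
```

Because both vLLM instances read the same config and talk to the same server, a KV cache produced by one instance becomes reusable by the other.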
# What's next
We also provide multiple Docker-based demos at the [🔗LMCache-demos repo](https://github.com/LMCache/demo). The demos cover the following use cases:
- Share KV caches across multiple serving engines [(🔗link)](https://github.com/LMCache/demo/tree/master/demo2-multi-node-sharing)
- Loading non-prefix KV caches for RAG [(🔗link)](https://github.com/LMCache/demo/tree/master/demo3-KV-blending)
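Non-prefix reuse matters most for RAG, where the same retrieved passages reappear in different orders across prompts. A prefix-only cache misses as soon as the order changes; a position-independent cache still hits. A toy comparison (illustrative only — this is not the actual KV-blending algorithm, which also has to repair cross-chunk attention):

```python
# Toy comparison of prefix-only vs. position-independent KV reuse for RAG.
# (Illustration only; not LMCache's actual algorithm.)

def prefix_hits(cached_docs, prompt_docs):
    """Prefix cache: reuse stops at the first mismatch."""
    n = 0
    for c, p in zip(cached_docs, prompt_docs):
        if c != p:
            break
        n += 1
    return n

def blended_hits(cached_docs, prompt_docs):
    """Position-independent cache: any previously seen doc hits."""
    cached = set(cached_docs)
    return sum(d in cached for d in prompt_docs)

cached = ["doc_a", "doc_b", "doc_c"]    # KV cached from earlier queries
prompt = ["doc_c", "doc_a", "doc_new"]  # retriever returns a new ordering
print(prefix_hits(cached, prompt))   # 0 — order changed, prefix cache misses
print(blended_hits(cached, prompt))  # 2 — doc_a and doc_c reused anyway
```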
# Interested in Connecting?
Fill out the interest form and our team will reach out to you!
https://forms.gle/mQfQDUXbKfp2St1z7
# 🛣️ Upcoming Milestones
- [x] First release of LMCache
- [x] Support installation through `pip install` and integration with the latest vLLM
- [ ] Stable support for non-prefix KV caches
- [ ] User and developer documentation
# 📖 Blogs and documentations
Our [blog posts](https://lmcache.github.io) and [documentation](https://docs.lmcache.ai/) are available online.
# Community meeting
- :link: Meeting link - https://uchicago.zoom.us/j/91454186439?pwd=Qu3IMJH7c83Qbg9hHsXZ3BxzLaEFoF.1
- :page_facing_up: Community Meeting Document - https://docs.google.com/document/d/1SnCKnB2UFBUyPhIpL9zzdZsn_hGp50spoZue-2SoxJY/edit?usp=sharing
- 🗓️ Calendar - https://calendar.app.google/rsu7Xgq4y4y5YuDj7
## Citation
If you use LMCache for your research, please cite our papers:
```
@inproceedings{liu2024cachegen,
  title={CacheGen: KV cache compression and streaming for fast large language model serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}
@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}
@article{yao2024cacheblend,
  title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  journal={arXiv preprint arXiv:2405.16444},
  year={2024}
}
```