{"id":26726967,"url":"https://github.com/LMCache/LMCache","last_synced_at":"2025-03-27T22:04:55.526Z","repository":{"id":260139084,"uuid":"807305060","full_name":"LMCache/LMCache","owner":"LMCache","description":"Redis for LLMs","archived":false,"fork":false,"pushed_at":"2025-03-21T03:51:47.000Z","size":4599,"stargazers_count":623,"open_issues_count":98,"forks_count":66,"subscribers_count":9,"default_branch":"dev","last_synced_at":"2025-03-21T04:32:08.374Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://lmcache.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LMCache.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-28T21:06:04.000Z","updated_at":"2025-03-21T03:51:52.000Z","dependencies_parsed_at":"2024-11-18T07:28:13.069Z","dependency_job_id":"8acc6f1c-8824-4fd6-bd2e-d3091ff32694","html_url":"https://github.com/LMCache/LMCache","commit_stats":null,"previous_names":["lmcache/lmcache"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMCache%2FLMCache","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMCache%2FLMCache/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMCache%2FLMCache/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMCache%2FLMCache/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LMCache","download_url":"https://codelo
ad.github.com/LMCache/LMCache/tar.gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245931863,"owners_count":20695963,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-27T22:02:19.312Z","updated_at":"2025-03-27T22:04:55.477Z","avatar_url":"https://github.com/LMCache.png","language":"Python","funding_links":[],"categories":["Python","Repos","A01_文本生成_文本对话","Tools 🛠️","Inference","4. Context Optimization","Tools","3. Inference Engines \u0026 Serving","Caching"],"sub_categories":["大语言对话模型及数据","Middleware","Rust","Memory Management","Inference infrastructure KV cache"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/a0809748-3cb1-4732-9c5a-acfa90cc72d1\" width=\"720\" alt=\"lmcache logo\"\u003e\n\u003c/div\u003e\n\n| [**Blog**](https://lmcache.github.io) | [**Documentation**](https://docs.lmcache.ai/) | [**Join Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https://forms.gle/mQfQDUXbKfp2St1z7) | [**Official Email**](mailto:contact@lmcache.ai) |\n\n# 💡 What is LMCache?\n\nTL;DR - Redis for LLMs.\n\nLMCache is an **LLM** serving engine extension to **reduce TTFT** (time to first token) and **increase throughput**, especially under long-context scenarios. By storing the KV caches of reusable texts across various locations (GPU, CPU DRAM, and local disk), LMCache reuses the KV caches of **_any_** reused text (not necessarily a prefix) in **_any_** serving engine instance. 
Thus, LMCache saves precious GPU cycles and reduces response delay for users.\n\nCombined with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.\n\nTry LMCache with pre-built vLLM Docker images [here](https://github.com/LMCache/demo).\n\n# 🚀 Performance snapshot\n![image](https://github.com/user-attachments/assets/7db9510f-0104-4fb3-9976-8ad5d7fafe26)\n\n# 💻 Quickstart\n\nLMCache integrates with the latest vLLM (0.6.2). To install LMCache, use the following command:\n```bash\n# requires Python \u003e= 3.10 and nvcc \u003e= 12.1\npip install lmcache lmcache_vllm\n```\n\nLMCache has the same interface as vLLM (both online serving and offline inference).\nTo use online serving, start an OpenAI API-compatible vLLM server with LMCache via:\n```bash\nlmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8\n```\n\nTo use vLLM's offline inference with LMCache, prefix the imports of vLLM components with `lmcache_vllm`. For example:\n```python\nimport lmcache_vllm.vllm as vllm\nfrom lmcache_vllm.vllm import LLM\n```\n\nMore detailed documentation will be available soon.\n\n## - Sharing KV cache across multiple vLLM instances\n\nLMCache supports sharing KV caches across different vLLM instances through the `lmcache.server` module. 
Here is a quick guide:\n\n```bash\n# Start the LMCache server\nlmcache_server localhost 65432\n```\n\nThen, start two vLLM instances with the LMCache config file:\n```bash\nwget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml\n\n# Start the first vLLM instance\nLMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000\n\n# Start the second vLLM instance\nLMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001\n```\n\n\n# - What's next\nWe also provide multiple Docker-based demos at [🔗LMCache-demos repo](https://github.com/LMCache/demo). The demos cover the following use cases:\n- Share KV caches across multiple serving engines [(🔗link)](https://github.com/LMCache/demo/tree/master/demo2-multi-node-sharing)\n- Load non-prefix KV caches for RAG [(🔗link)](https://github.com/LMCache/demo/tree/master/demo3-KV-blending)\n\n# Interested in Connecting?\nFill out the interest form and our team will reach out to you!\nhttps://forms.gle/mQfQDUXbKfp2St1z7\n\n# 🛣️ Incoming Milestones\n\n- [x] First release of LMCache\n- [x] Support installation through pip install and integrate with latest vLLM\n- [ ] Stable support for non-prefix KV caches\n- [ ] User and developer documentation\n\n# 📖 Blogs and documentation\n\nOur [blog posts](https://lmcache.github.io) and [documentation](https://docs.lmcache.ai/) are available online.\n\n# Community meeting\n\n- :link: Meeting link - https://uchicago.zoom.us/j/91454186439?pwd=Qu3IMJH7c83Qbg9hHsXZ3BxzLaEFoF.1\n- :page_facing_up: Community Meeting Document - https://docs.google.com/document/d/1SnCKnB2UFBUyPhIpL9zzdZsn_hGp50spoZue-2SoxJY/edit?usp=sharing\n- 🗓️ Calendar - https://calendar.app.google/rsu7Xgq4y4y5YuDj7\n\n\n## Citation\nIf you use LMCache for your research, please cite our papers:\n\n```\n@inproceedings{liu2024cachegen,\n  
title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},\n  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},\n  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},\n  pages={38--56},\n  year={2024}\n}\n\n@article{cheng2024large,\n  title={Do Large Language Models Need a Content Delivery Network?},\n  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},\n  journal={arXiv preprint arXiv:2409.13761},\n  year={2024}\n}\n\n@article{yao2024cacheblend,\n  title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},\n  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},\n  journal={arXiv preprint arXiv:2405.16444},\n  year={2024}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLMCache%2FLMCache","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLMCache%2FLMCache","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLMCache%2FLMCache/lists"}