{"id":13618405,"url":"https://github.com/kvcache-ai/Mooncake","last_synced_at":"2025-04-14T10:31:59.062Z","repository":{"id":246138028,"uuid":"819733173","full_name":"kvcache-ai/Mooncake","owner":"kvcache-ai","description":"Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.","archived":false,"fork":false,"pushed_at":"2024-07-31T14:42:11.000Z","size":2435,"stargazers_count":939,"open_issues_count":2,"forks_count":19,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-08-01T20:53:27.030Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2407.00079","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kvcache-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-25T05:21:05.000Z","updated_at":"2024-08-01T06:10:29.000Z","dependencies_parsed_at":"2024-06-27T10:43:38.085Z","dependency_job_id":"dcc0cf93-58e6-4b99-8a0d-01d809d50333","html_url":"https://github.com/kvcache-ai/Mooncake","commit_stats":null,"previous_names":["kvcache-ai/mooncake"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvcache-ai%2FMooncake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvcache-ai%2FMooncake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvcache-ai%2FMooncake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvcache-ai%2FMooncake/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/kvcache-ai","download_url":"https://codeload.github.com/kvcache-ai/Mooncake/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223627514,"owners_count":17175721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T20:02:00.720Z","updated_at":"2024-11-08T03:31:11.308Z","avatar_url":"https://github.com/kvcache-ai.png","language":null,"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003eMooncake: A KVCache-centric Disaggregated\u003cbr/\u003e Architecture for LLM Serving\u003c/h1\u003e\n  \u003ca href=\"https://arxiv.org/abs/2407.00079\" target=\"_blank\"\u003e\u003cstrong\u003e📃 Technical Report\u003c/strong\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cbr/\u003e\n\nMooncake is the serving platform for \u003ca href=\"https://kimi.ai/\"\u003e\u003cimg src=\"image/kimi.png\" alt=\"icon\" style=\"height: 16px; vertical-align: middle;\"\u003e Kimi\u003c/a\u003e, a leading LLM service provided by \u003ca href=\"https://www.moonshot.cn/\"\u003e\u003cimg src=\"image/moonshot.jpg\" alt=\"icon\" style=\"height: 16px; vertical-align: middle;\"\u003e Moonshot AI\u003c/a\u003e.\nThis repository hosts its technical report as well as the open-sourced traces.
\n\nMore will come - perhaps not very soon, but stay tuned!\n\n\u003ch2 id=\"updates\"\u003e🔥 Updates\u003c/h2\u003e\n\n - **July 9, 2024**: We open sourced the trace as a \u003ca href=\"https://github.com/kvcache-ai/Mooncake/blob/main/mooncake_trace.jsonl\" target=\"_blank\"\u003ejsonl file\u003c/a\u003e!\n - **June 27, 2024**: We published a series of Chinese blog posts with further discussion on \u003ca href=\"https://zhuanlan.zhihu.com/p/705754254\"\u003eZhihu 1\u003c/a\u003e, \u003ca href=\"https://zhuanlan.zhihu.com/p/705910725\"\u003e2\u003c/a\u003e, \u003ca href=\"https://zhuanlan.zhihu.com/p/706204757\"\u003e3\u003c/a\u003e, \u003ca href=\"https://zhuanlan.zhihu.com/p/707997501\"\u003e4\u003c/a\u003e.\n - **June 26, 2024**: Initial technical report release.\n\n\n\u003ch2 id=\"overview\"\u003e🎉 Overview\u003c/h2\u003e\n\nMooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.\n\n![architecture](image/architecture.png)\n\nThe core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against latency-related Service Level Objective (SLO) requirements. Unlike traditional studies that assume all requests will be processed, Mooncake faces the challenge of highly overloaded scenarios. To mitigate this, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs.
Under real workloads, Mooncake’s innovative architecture enables \u003ca href=\"https://kimi.ai/\"\u003eKimi\u003c/a\u003e to handle 75% more requests.\n\n\u003ch2 id=\"trace\"\u003e📦 Open Source Trace\u003c/h2\u003e\n\n```json\n{\n    \"timestamp\": 27482,\n    \"input_length\": 6955,\n    \"output_length\": 52,\n    \"hash_ids\": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]\n}\n{\n    \"timestamp\": 30535,\n    \"input_length\": 6472,\n    \"output_length\": 26,\n    \"hash_ids\": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]\n}\n```\nThe above shows two samples from our trace dataset. The trace records each request's arrival time, the number of input tokens, the number of output tokens, and the remapped block hash IDs. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. Further details about the trace (e.g., its up to 50% cache hit ratio) can be found in Section 4 of the paper's Version 3.\n\n\u003ch2 id=\"citation\"\u003e📑 Citation\u003c/h2\u003e\nPlease cite our paper if you find the paper or the trace useful:\n\n```bibtex\n@article{qin2024mooncake,\n  title        = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},\n  author       = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},\n  year         = {2024},\n  url          = {https://arxiv.org/abs/2407.00079}\n}\n```\n","funding_links":[],"categories":["A01_文本生成_文本对话","C++","Inference","Community Projects"],"sub_categories":["大语言对话模型及数据","Inference Platform","Quantized Models"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvcache-ai%2FMooncake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkvcache-ai%2FMooncake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvcache-ai%2FMooncake/lists"}