{"id":27984403,"url":"https://github.com/jy-yuan/KIVI","last_synced_at":"2025-05-08T05:01:55.443Z","repository":{"id":220903022,"uuid":"751957145","full_name":"jy-yuan/KIVI","owner":"jy-yuan","description":"[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","archived":false,"fork":false,"pushed_at":"2025-01-19T02:55:29.000Z","size":17537,"stargazers_count":291,"open_issues_count":8,"forks_count":30,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-30T03:36:38.761Z","etag":null,"topics":["inference","large-language-models","llama","llm","natural-language-processing","quantization","transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.02750","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jy-yuan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-02T17:41:32.000Z","updated_at":"2025-04-25T23:58:59.000Z","dependencies_parsed_at":"2025-04-30T03:31:53.833Z","dependency_job_id":"a0064740-cd9a-467c-bf43-1a85a8b280fa","html_url":"https://github.com/jy-yuan/KIVI","commit_stats":null,"previous_names":["jy-yuan/kivi"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy-yuan%2FKIVI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy-yuan%2FKIVI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy-yuan%2FKIVI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy-yu
an%2FKIVI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jy-yuan","download_url":"https://codeload.github.com/jy-yuan/KIVI/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252092130,"owners_count":21693324,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inference","large-language-models","llama","llm","natural-language-processing","quantization","transformer"],"created_at":"2025-05-08T05:01:50.768Z","updated_at":"2025-05-08T05:01:55.392Z","avatar_url":"https://github.com/jy-yuan.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache\n\nImplementation of [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750)\n\n## Updates\n- [2025.01.18]: We add a KIVI implementation with GQA support that is compatible with transformers 4.43. It now supports the Llama 3 family. Please reinstall KIVI.\n- [2024.06.07]: 🎉 KIVI largely inspired the [HuggingFace Transformers KV Cache quantization](https://huggingface.co/docs/transformers/main/en/kv_cache).\n- [2024.06.06]: (Beta) We extensively optimized the codebase in [branch develop](https://github.com/jy-yuan/KIVI/tree/develop) to reduce the latency of KIVI. Note that **you need to reinstall our CUDA implementation** under the ```quant``` folder. We will release a blog post about the detailed optimization soon.\n- [2024.05.01]: 🎉 KIVI has been accepted by ICML 2024! 
See you in Vienna!\n- [2024.04.12]: We add support for the Mistral model family. The performance of LongChat-7b-v1.5-32K and Mistral-7B-Instruct-v0.2 on 15 tasks from LongBench can be found in [long_bench.md](./docs/long_bench.md).\n\n- [2024.04.05]: We release the code for reproducing our CoQA/TruthfulQA/GSM8K results using LM-Eval. Please check the [README of branch lmeval](https://github.com/jy-yuan/KIVI/tree/lmeval).\n\n- [2024.04.04]: 🔥🔥 We add a new 5-digit [passkey example](./long_context_example.py) with 12k context length to show the performance of 2bit KIVI in the long-context scenario.\n\n- [2024.04.04]: (Beta) We add flash-attention support for KIVI during the prefill phase.\n\n- [2024.04.03]: We add a new [5-shot GSM8K example.py](./example.py) to show the performance of 2/4 bit KIVI with 32 full-precision tokens.\n\n- [2024.02.05]: KIVI ver. 2 is released on [arXiv](https://arxiv.org/abs/2402.02750).\n\n- [2024.02.03]: KIVI code is released.\n\n- [2023.12.29]: KIVI ver. 1 is released on [researchgate](https://www.researchgate.net/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization).\n\n## Overview\n\nKIVI is a new plug-and-play 2bit KV cache quantization algorithm that requires no fine-tuning. The algorithm reduces memory usage by quantizing the key cache per-channel and the value cache per-token to 2bit. KIVI's hardware-friendly design allows LLMs like Llama-2, Falcon, and Mistral to maintain comparable quality levels while reducing peak memory usage by 2.6 times. 
This enables batch sizes up to 4 times larger and increases throughput by 2.35 to 3.47 times in real LLM inference workloads, addressing both the speed and memory bottlenecks.\n\nIllustration of the KIVI quantization scheme: key cache per-channel and value cache per-token.\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"300\" src=\"./img/quant_scheme.png\"\u003e\n\u003c/p\u003e\n\nIllustration of the KIVI algorithm during the prefill and decoding phases of inference:\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"700\" src=\"./img/algo.png\"\u003e\n\u003c/p\u003e\n\n## How to use KIVI\n\n### Setup\n\nTo install the required packages:\n\n```bash\nconda create -n kivi python=3.10\nconda activate kivi\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\nThen install our CUDA implementation:\n\n```bash\ncd quant \u0026\u0026 pip install -e .\n```\n\n### Example\n\nLoad a model with KIVI (e.g., Llama-2-7b):\n\n```python\n# LLaMA model with KIVI\nimport torch\nimport os\nfrom models.llama_kivi import LlamaForCausalLM_KIVI\nfrom transformers import LlamaConfig, AutoTokenizer\nconfig = LlamaConfig.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\n\nconfig.k_bits = K_BITS # key cache bit width; 2 and 4 are currently supported\nconfig.v_bits = V_BITS # value cache bit width; 2 and 4 are currently supported\nconfig.group_size = GROUP_SIZE\nconfig.residual_length = RESIDUAL_LENGTH # the number of recent fp16 tokens\nCACHE_DIR = PATH_TO_YOUR_SAVE_DIR\n\nmodel = LlamaForCausalLM_KIVI.from_pretrained(\n    pretrained_model_name_or_path='meta-llama/Llama-2-7b-hf',\n    config=config,\n    cache_dir=CACHE_DIR,\n    torch_dtype=torch.float16,\n    low_cpu_mem_usage=True,\n    device_map=\"auto\",\n)\n\ntokenizer = AutoTokenizer.from_pretrained(\n    'meta-llama/Llama-2-7b-hf',\n    use_fast=False,\n    trust_remote_code=True,\n    tokenizer_type='llama')\n\n# Inference\n# e.g., model.generate(...)\n```\n\n#### GSM8K example\nWe use GSM8K as an example to show how to 
use KIVI. You can check [example.py](./example.py):\n\n```bash\npython example.py\n```\n\n#### Passkey retrieval example\n\nPasskey retrieval with KIVI. You can check [long_context_example.py](./long_context_example.py):\n\n```bash\npython long_context_example.py\n```\n\n#### Evaluate KIVI on LongBench\n\nWe currently support the Llama and Mistral model families. We recently tested KIVI on Mistral-7B-Instruct-v0.2 and Longchat-7b-v1.5-32k. Please check [long_bench.md](./docs/long_bench.md) for more details.\n\n```bash\nbash scripts/long_test.sh {GPU_ID} {K_BITS} {V_BITS} {GROUP_LENGTH} {RESIDUAL_LENGTH} {MODEL_NAME}\npython eval_long_bench.py --model {MODEL} # MODEL is the dir name under pred/. Currently the Llama and Mistral model families are supported.\n```\n\n## Citation\n\nIf you find our method useful, please kindly cite our paper.\n\n```bibtex\n@article{liu2024kivi,\n  title={KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache},\n  author={Liu, Zirui and Yuan, Jiayi and Jin, Hongye and Zhong, Shaochen and Xu, Zhaozhuo and Braverman, Vladimir and Chen, Beidi and Hu, Xia},\n  journal={arXiv preprint arXiv:2402.02750},\n  year={2024}\n}\n```\n\n## Contributing\nWe welcome contributions from the research community to improve KIVI. If you have an idea or would like to report a bug, please open an issue or submit a pull request.\n\n## License\nThe code is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjy-yuan%2FKIVI","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjy-yuan%2FKIVI","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjy-yuan%2FKIVI/lists"}