https://github.com/SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
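As background for the title above: the project targets shrinking the Key/Value attention cache to a few bits per value so that very long contexts fit in memory. The snippet below is a minimal NumPy sketch of the basic ingredient, per-axis uniform quantization of a KV tensor, under the assumption (reported in the linked paper) that Key activations contain outlier channels. It is not the repository's code or API; `quantize_dequantize` and its parameters are illustrative names only.

```python
import numpy as np

def quantize_dequantize(x, n_bits=4, axis=0):
    """Simulate uniform n-bit quantization of x with scales shared along `axis`.

    Computes a min/max range per slice (reducing over `axis`), rounds to
    integers in [0, 2**n_bits - 1], then dequantizes back to float so the
    rounding error can be measured.
    """
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    qmax = 2 ** n_bits - 1
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant slices
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo

# Toy Key cache: (seq_len, head_dim). Per the paper's observation, Keys
# tend to have outlier *channels*, so giving each channel its own scale
# (statistics over tokens, axis=0) isolates the outlier, while per-token
# scales (axis=1) let one outlier channel inflate every token's range.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))
K[:, 7] *= 20.0  # inject a single outlier channel

for name, axis in [("per-channel scales", 0), ("per-token scales", 1)]:
    err = np.abs(quantize_dequantize(K, n_bits=4, axis=axis) - K).mean()
    print(f"Key, {name}: mean abs error = {err:.4f}")
```

On this toy tensor the per-channel variant should print a much smaller error, since the injected outlier channel receives its own scale instead of inflating the quantization range of every token.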
- Host: GitHub
- URL: https://github.com/SqueezeAILab/KVQuant
- Owner: SqueezeAILab
- Created: 2024-01-31T17:30:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-13T11:19:28.000Z (about 1 year ago)
- Last Synced: 2025-04-30T02:04:45.176Z (5 months ago)
- Topics: compression, efficient-inference, efficient-model, large-language-models, llama, llm, localllama, localllm, mistral, model-compression, natural-language-processing, quantization, small-models, text-generation, transformer
- Language: Python
- Homepage: https://arxiv.org/abs/2401.18079
- Size: 19.8 MB
- Stars: 342
- Watchers: 10
- Forks: 30
- Open Issues: 15
Metadata Files:
- Readme: README.md
Awesome Lists containing this project:
- StarryDivineSky - SqueezeAILab/KVQuant