More for Keys, Less for Values: Adaptive KV Cache Quantization ☝️🔑👇🔢
https://github.com/mohsenhariri/kvq
- Host: GitHub
- URL: https://github.com/mohsenhariri/kvq
- Owner: mohsenhariri
- License: gpl-3.0
- Created: 2025-02-27T19:38:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-08T22:19:36.000Z (3 months ago)
- Last Synced: 2025-04-22T20:13:22.951Z (about 1 month ago)
- Language: Python
- Homepage: https://mohsenhariri.github.io/kvq/
- Size: 32.2 KB
- Stars: 12
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# More for Keys, Less for Values: Adaptive KV Cache Quantization ☝️🔑👇🔢
[Source code](https://github.com/mohsenhariri/kvq) for [More for Keys, Less for Values: Adaptive KV Cache Quantization](https://arxiv.org/pdf/2502.15075v1).
## Supported backends
- [x] [Quanto](https://github.com/huggingface/optimum-quanto)
- [ ] [HQQ](https://mobiusml.github.io/hqq_blog/)
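The backend used for quantization is selected through the `backend` field of the cache configuration (see the initialization examples below).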
## Usage

### 1. Installation
`KVQ` can be installed via pip:
```bash
pip install kvq
```

Please note that the NVIDIA `nvcc` compiler is required to build the package. Before installing, ensure that the following dependencies are properly set up on your system:
- **GNU Binutils (e.g., GNU assembler 2.42)**
- **C/C++ compiler and build tools** (e.g., GCC via `build-essential`, plus `cmake`)
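A quick way to sanity-check the toolchain and the install (a minimal sketch; it assumes nothing beyond the standard library and a completed `pip install kvq`):

```python
# Sanity-check sketch: confirm nvcc is on PATH and the package imports cleanly.
import shutil

assert shutil.which("nvcc") is not None, "nvcc not found on PATH"

import kvq  # should import without errors after `pip install kvq`
print("kvq imported from:", kvq.__file__)
```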
### 2. Initialization

#### 2.1. Creating a KVQ object using a configuration object:
```python
import torch
from kvq import KVQ, KVQCacheConfig

config = KVQCacheConfig(
    nbits_k=4,             # more bits for keys
    nbits_v=2,             # fewer bits for values
    axis_key=0,
    axis_value=0,
    q_group_size=64,
    residual_length=128,
    compute_dtype=torch.bfloat16,
    backend="quanto",
    device=model.device,   # 'model' is your already-loaded language model
)
kvq = KVQ(config)
```
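Note that `nbits_k` is set higher than `nbits_v`: the paper's central observation is that attention quality degrades more when keys are quantized aggressively than when values are, hence "more for keys, less for values."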
#### 2.2. Alternatively, you can create a KVQ object using a dictionary:

```python
kvq_dict = {
    "nbits_k": 4,
    "nbits_v": 2,
    "axis_key": 0,
    "axis_value": 0,
    "q_group_size": 64,
    "residual_length": 128,
    "compute_dtype": torch.float16,
    "backend": "quanto",
    "device": model.device,
}
kvq = KVQ(kvq_dict)
```

### 3. Using KVQ during text generation
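The snippet below assumes that a model and tokenized inputs already exist. A minimal setup sketch (the model name is illustrative; any Hugging Face causal LM that supports caching past key-value states should work):

```python
# Hypothetical setup (sketch): load a cache-enabled causal LM and tokenize a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # illustrative choice, not prescribed by kvq
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
```

With `model`, `inputs`, and `kvq` in place, pass the KVQ cache to `generate` via `past_key_values`: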
```python
# Assume 'model' is a transformer-like model (e.g., Llama, Mistral, ...)
# that supports caching past key-value states.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=True,
    past_key_values=kvq,  # plug in the quantized KV cache
)
print(outputs)
```
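`generate` returns a tensor of token IDs rather than text; decode it with your tokenizer, e.g. `print(tokenizer.decode(outputs[0], skip_special_tokens=True))`.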
## Citation

If you find our method useful, please cite our paper:
```bibtex
@article{hariri2025kvq,
title={More for Keys, Less for Values: Adaptive KV Cache Quantization},
author={Hariri, Mohsen and Nguyen, Lam and Chen, Sixu and Zhong, Shaochen and Wang, Qifan and Hu, Xia and Han, Xiaotian and Chaudhary, Vipin},
journal={arXiv preprint arXiv:2502.15075},
year={2025}
}
```
## Contributing
We welcome contributions from the research community to improve this work. If you have an idea or would like to report a bug, please open an issue or submit a pull request.

## License
The code is released under the GPL-3.0 License (see the `LICENSE` file).