https://github.com/StargazerX0/ScaleKV

ScaleKV: Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

auto-regressive-model efficient-image-generation model-acceleration transformers

Host: GitHub
URL: https://github.com/StargazerX0/ScaleKV
Owner: StargazerX0
License: MIT
Created: 2025-05-26T06:01:22.000Z
Default Branch: main
Last Pushed: 2025-05-27T02:23:28.000Z
Last Synced: 2025-05-27T03:30:04.527Z
Topics: auto-regressive-model, efficient-image-generation, model-acceleration, transformers
Language: Python
Homepage: https://arxiv.org/abs/2505.19602
Size: 3.48 MB
Stars: 15
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

> **Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression**
> [Kunjun Li](https://kunjun-li.github.io/), [Zigeng Chen](https://github.com/czg1225), [Cheng-Yen Yang](https://yangchris11.github.io/), [Jenq-Neng Hwang](https://people.ece.uw.edu/hwang/)
> [University of Washington](https://www.washington.edu/), [National University of Singapore](https://nus.edu.sg/)

## 💡 Introduction
We propose Scale-Aware KV Cache (ScaleKV), a novel KV cache compression framework tailored to the next-scale prediction paradigm of Visual Autoregressive (VAR) modeling. ScaleKV builds on two critical observations: cache demands vary across transformer layers, and attention patterns differ across scales. Based on these insights, we categorize transformer layers into two functional groups, termed drafters and refiners, and apply an adaptive cache management strategy to each role. ScaleKV further optimizes multi-scale inference by identifying each layer's function at every scale, so that cache allocation matches the specific computational demands of each layer. On Infinity-8B, it achieves a 10x memory reduction, from 85 GB to 8.5 GB, with negligible quality degradation (the GenEval score remains at 0.79 and the DPG score decreases marginally from 86.61 to 86.49).
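This README does not specify how ScaleKV decides which layers are drafters or refiners, or how it sizes each layer's cache. The sketch below is a toy illustration under our own assumptions, not the authors' criterion: it uses normalized attention entropy as a stand-in for a layer's attention pattern, assumes that layers with more spread-out attention act as drafters and need larger caches, splits layers at the median entropy, gives each drafter a fixed multiple (`drafter_weight`) of a refiner's share of a 10% total budget, and evicts cached tokens by accumulated attention (a common heuristic from earlier KV cache compression work, swapped in here). All function names and parameters are hypothetical. Since roles are identified per scale, the split would be recomputed at every scale.

```python
import numpy as np


def attention_entropy(attn: np.ndarray) -> float:
    """Mean normalized entropy of attention rows, in [0, 1].

    attn: (num_queries, num_cached_keys); each row is a softmax distribution.
    Higher values mean attention is spread over more cached tokens.
    """
    eps = 1e-12
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean() / np.log(attn.shape[-1]))


def split_layers(attn_maps: list[np.ndarray]) -> list[str]:
    """Hypothetical rule: layers above the median entropy act as drafters."""
    scores = np.array([attention_entropy(a) for a in attn_maps])
    threshold = np.median(scores)
    return ["drafter" if s > threshold else "refiner" for s in scores]


def allocate_budgets(roles: list[str], total_budget: int,
                     drafter_weight: float = 4.0) -> list[int]:
    """Split total_budget (in cached tokens) so a drafter gets drafter_weight
    times a refiner's share; the per-layer budgets sum to total_budget."""
    weights = np.array([drafter_weight if r == "drafter" else 1.0 for r in roles])
    raw = total_budget * weights / weights.sum()
    budgets = np.floor(raw).astype(int)
    # Hand the tokens lost to flooring to the layers with the largest remainders.
    leftover = int(total_budget - budgets.sum())
    order = np.argsort(-(raw - budgets), kind="stable")
    budgets[order[:leftover]] += 1
    return budgets.tolist()


def compress_layer(keys: np.ndarray, values: np.ndarray,
                   attn: np.ndarray, budget: int):
    """Keep the `budget` cached tokens that received the most attention
    (accumulated-attention eviction, not necessarily ScaleKV's policy)."""
    scores = attn.sum(axis=0)                     # attention received per key
    keep = np.sort(np.argsort(-scores)[:budget])  # preserve original token order
    return keys[keep], values[keep]


# Toy example: 4 layers, 64 cached tokens each, 10% total KV budget.
num_keys = 64
uniform = np.full((8, num_keys), 1.0 / num_keys)  # dispersed attention
peaked = np.full((8, num_keys), 1e-4)
peaked[:, :4] = 1.0
peaked /= peaked.sum(axis=-1, keepdims=True)      # concentrated on 4 tokens

roles = split_layers([uniform, peaked, uniform, peaked])
budgets = allocate_budgets(roles, total_budget=int(0.1 * num_keys * 4))
print(roles, budgets)  # ['drafter', 'refiner', 'drafter', 'refiner'] [10, 3, 10, 2]
```

The fixed drafter:refiner ratio keeps the example simple; a real allocator would more likely derive per-layer budgets from measured statistics at each scale.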

## 🔥Updates
* 🔥 **May 26, 2025**: Our [paper](https://arxiv.org/abs/2505.19602) is available now!
* 🔥 **May 25, 2025**: Code repo is released! The arXiv paper will come soon!

## 🔧 Installation
### Requirements
```bash
pip install -r requirements.txt
```

### Model Checkpoints
Download the Flan-T5-XL text encoder (`google/flan-t5-xl`):
```bash
pip install -U huggingface_hub
huggingface-cli download google/flan-t5-xl --local-dir ./weights/flan-t5-xl
```

Download Infinity-2B:
```bash
huggingface-cli download FoundationVision/Infinity --include "infinity_2b_reg.pth" --local-dir ./weights/
huggingface-cli download FoundationVision/Infinity --include "infinity_vae_d32reg.pth" --local-dir ./weights/
```

Download Infinity-8B:
```bash
huggingface-cli download FoundationVision/Infinity --include "infinity_8b_weights/**" --local-dir ./weights/infinity_8b_weights
huggingface-cli download FoundationVision/Infinity --include "infinity_vae_d56_f8_14_patchify.pth" --local-dir ./weights/
```
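Before running the inference scripts, it can help to confirm that the downloads landed where expected. The helper below is a hypothetical convenience (`CHECKPOINTS` and `missing_checkpoints` are our names, not part of the repo); the paths are copied from the `--local-dir` and `--include` arguments above, and whether `infer_2B.py` and `infer_8B.py` load exactly these paths is an assumption.

```python
from pathlib import Path

# Expected locations, mirroring the download commands above.
# Assumption: the inference scripts read checkpoints from these paths.
CHECKPOINTS = {
    "2b": [
        "weights/flan-t5-xl",
        "weights/infinity_2b_reg.pth",
        "weights/infinity_vae_d32reg.pth",
    ],
    "8b": [
        "weights/flan-t5-xl",
        "weights/infinity_8b_weights",
        "weights/infinity_vae_d56_f8_14_patchify.pth",
    ],
}


def missing_checkpoints(model: str, root: str = ".") -> list[str]:
    """Return the expected checkpoint paths for `model` ("2b" or "8b")
    that do not exist under the repository root `root`."""
    base = Path(root)
    return [p for p in CHECKPOINTS[model] if not (base / p).exists()]
```

Run from the repository root, `missing_checkpoints("8b")` should return an empty list once all Infinity-8B downloads have finished. It only checks that each file or directory exists, not that the download is complete or uncorrupted.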

## ⚡ Quick Start

Sample images with ScaleKV-compressed Infinity-8B (10% KV Cache):
```bash
python infer_8B.py
```

Sample images with ScaleKV-compressed Infinity-2B (10% KV Cache):
```bash
python infer_2B.py
```

## ⚡ Sampling & Evaluation
### Sampling 5000 images from COCO-2017 captions with Infinity-8B

Sample reference images with the original Infinity-8B (full KV cache), setting `N_GPUS` to the number of GPUs to use:
```bash
torchrun --nproc_per_node=$N_GPUS scripts/sample_8b.py
```

Sample images with ScaleKV-compressed Infinity-8B (10% KV Cache):
```bash
torchrun --nproc_per_node=$N_GPUS scripts/sample_kv_8b.py
```

After sampling all images, calculate PSNR, LPIPS, and FID with:
```bash
python scripts/compute_metrics.py --input_root0 samples/gt_8b --input_root1 samples/scalekv_8b
```
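As a reminder of what the first of these metrics measures, here is a minimal NumPy sketch of PSNR, `10 * log10(MAX^2 / MSE)`, for 8-bit images. It is not the implementation in `compute_metrics.py`; pairing images by identical filename across `samples/gt_8b` and `samples/scalekv_8b` is an assumption, and image loading (e.g. with Pillow) is left out.

```python
import numpy as np


def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between two same-shape images (8-bit range by default)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_psnr(pairs) -> float:
    """Average PSNR over (reference, compressed) image pairs."""
    return float(np.mean([psnr(ref, test) for ref, test in pairs]))


ref = np.zeros((8, 8, 3), dtype=np.uint8)
test = np.full((8, 8, 3), 5, dtype=np.uint8)  # every pixel off by 5, so MSE = 25
print(round(psnr(ref, test), 2))              # 34.15
```

Higher PSNR means the compressed-cache samples stay closer, pixel for pixel, to the full-cache references; LPIPS and FID complement it with perceptual and distribution-level comparisons.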

### Sampling 5000 images from COCO captions with Infinity-2B

Sample reference images with the original Infinity-2B:
```bash
torchrun --nproc_per_node=$N_GPUS scripts/sample_2b.py
```

Sample images with ScaleKV-compressed Infinity-2B:
```bash
torchrun --nproc_per_node=$N_GPUS scripts/sample_kv_2b.py
```

Calculate PSNR, LPIPS, and FID:
```bash
python scripts/compute_metrics.py --input_root0 samples/gt_2b --input_root1 samples/scalekv_2b
```

## 📚 Key Results

## Acknowledgement
Thanks to [Infinity](https://github.com/FoundationVision/Infinity) for their wonderful work and codebase!

## Citation
If our research assists your work, please give us a star ⭐ or cite us using:
```bibtex
@article{li2025scalekv,
title={Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression},
author={Li, Kunjun and Chen, Zigeng and Yang, Cheng-Yen and Hwang, Jenq-Neng},
journal={arXiv preprint arXiv:2505.19602},
year={2025}
}
```
