https://github.com/bytedance/shadowkv

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Host: GitHub
URL: https://github.com/bytedance/shadowkv
Owner: bytedance
License: apache-2.0
Created: 2024-10-22T02:32:21.000Z
Default Branch: main
Last Pushed: 2024-10-30T00:10:32.000Z
Last Synced: 2025-03-28T13:07:23.961Z
Topics: cpu-offload, high-throughput, llm-inference, long-context, low-rank, research, sparse-attention
Language: Python
Homepage: https://bytedance.github.io/ShadowKV/
Size: 19.9 MB
Stars: 153
Watchers: 3
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

**training-free, high-throughput long-context LLM inference**

Hanshi Sun¹,², Li-Wen Chang², Wenlei Bao², Size Zheng², Ningxin Zheng², Xin Liu², Harry Dong¹, Yuejie Chi¹, Beidi Chen¹

¹Carnegie Mellon University
²ByteDance

[Paper] | [Blog]

ShadowKV Framework

## Environment Set Up
To reproduce the results in the paper, set up the environment as follows on a machine with a single A100 GPU:
```bash
# create env
conda create -n ShadowKV python=3.10 -y
conda activate ShadowKV

# install packages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# nemo dependencies (for dataset building)
pip install wheel
pip install Cython
pip install youtokentome
pip install "nemo_toolkit[all]==1.23"  # quoted so the shell does not glob the brackets

# flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# cutlass
mkdir -p 3rdparty
git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass

# build kernels for ShadowKV
python setup.py build_ext --inplace
```
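
After installation, a quick check that the key Python packages can be located may save a failed run later. The sketch below is not part of the repository. It assumes the import names `torch`, `flash_attn`, `flashinfer`, and `nemo` for the packages installed above, and it only tests whether each module can be found, not whether its CUDA kernels actually work.

```python
import importlib.util

# Assumed mapping from pip package name to importable module name.
REQUIRED_MODULES = {
    "torch": "torch",
    "flash-attn": "flash_attn",
    "flashinfer": "flashinfer",
    "nemo_toolkit": "nemo",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names whose module cannot be located in this environment."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("all packages found" if not missing else f"missing: {', '.join(missing)}")
```

Running this inside the activated `ShadowKV` environment lists any package that still needs to be installed.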
## Supported Models
Currently, we support the following LLMs:
- Llama-3-8B-1M: [gradientai/Llama-3-8B-Instruct-Gradient-1048k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k)
- GLM-4-9B-1M: [THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)
- Llama-3.1-8B: [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- Yi-9B-200K: [01-ai/Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-200K)
- Phi-3-Mini-128K: [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) (only NIAH test supported)
- Qwen2-7B-128K: [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) (only NIAH test supported)

## Accuracy Evaluations
The following example builds the dataset and runs the evaluation for the [RULER](https://github.com/hsiehjackson/RULER) benchmark with Llama-3-8B-1M.

### Build Datasets
To build the RULER dataset, run the following commands:
```bash
# build RULER
python -c "import nltk; nltk.download('punkt')"
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"
```

### Run Evaluations
For the accuracy evaluation, run the following commands on 8×A100 GPUs:

```bash
# Full attention
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py \
  --datalen 131072 --method full \
  --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" \
  --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

# ShadowKV
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py \
  --datalen 131072 --method shadowkv \
  --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" \
  --sparse_budget 2048 --rank 160 --chunk_size 8
```
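
To put the ShadowKV flags above in proportion, the arithmetic below relates `--sparse_budget`, `--chunk_size`, and `--datalen`. It is a sketch under two assumptions that this README does not state: that the sparse budget is counted in tokens, and that it is spent in whole chunks of `chunk_size` tokens. The function name `budget_summary` is illustrative and does not exist in the repository.

```python
def budget_summary(datalen: int, sparse_budget: int, chunk_size: int) -> dict:
    """Relate the sparse budget to the context length (assumes a token-based budget)."""
    return {
        "total_chunks": datalen // chunk_size,           # chunks in the full context
        "selected_chunks": sparse_budget // chunk_size,  # chunks the budget can cover
        "budget_fraction": sparse_budget / datalen,      # share of tokens covered
    }


summary = budget_summary(datalen=131072, sparse_budget=2048, chunk_size=8)
print(summary)
# {'total_chunks': 16384, 'selected_chunks': 256, 'budget_fraction': 0.015625}
```

Under these assumptions, the budget covers 256 of 16,384 chunks, about 1.56% of the 131,072-token context.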

#### Compatibility with MInference
ShadowKV is compatible with pre-filling acceleration techniques, such as MInference. To enable MInference, please add the `--minference` flag to the command. For example:

```bash
# Full attention with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py \
  --datalen 131072 --method full \
  --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" \
  --minference

# ShadowKV with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py \
  --datalen 131072 --method shadowkv \
  --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" \
  --sparse_budget 2048 --rank 160 --chunk_size 8 --minference
```
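
The four evaluation commands differ only in `--method`, the ShadowKV-specific flags, and `--minference`, so they can be assembled programmatically instead of edited by hand. The helper below is a hypothetical convenience and not part of the repository; it uses only the flags that appear in the commands above, and the names `RULER_TASKS` and `build_eval_command` are illustrative.

```python
import shlex

# RULER tasks listed in the commands above.
RULER_TASKS = [
    "niah_single_1", "niah_single_2", "niah_single_3",
    "niah_multikey_1", "niah_multikey_2", "niah_multiquery",
    "niah_multivalue", "vt", "fwe", "qa_1", "qa_2",
]


def build_eval_command(method, datalen=131072, nproc=8, minference=False,
                       model_name=None, **method_flags):
    """Assemble the torchrun argument list for test/eval_acc.py."""
    cmd = [
        "torchrun", "--standalone", "--nnodes=1", "--nproc_per_node", str(nproc),
        "test/eval_acc.py",
        "--datalen", str(datalen),
        "--method", method,
        "--dataset_name", ",".join(f"ruler/{task}" for task in RULER_TASKS),
    ]
    if model_name is not None:
        cmd += ["--model_name", model_name]
    # Extra keyword arguments become flags, e.g. sparse_budget=2048 -> --sparse_budget 2048.
    for flag, value in method_flags.items():
        cmd += [f"--{flag}", str(value)]
    if minference:
        cmd.append("--minference")
    return cmd


cmd = build_eval_command("shadowkv", minference=True,
                         sparse_budget=2048, rank=160, chunk_size=8)
print("OMP_NUM_THREADS=48 " + shlex.join(cmd))
```

The printed line can be pasted into a shell. Alternatively, the list can be passed to `subprocess.run` with `OMP_NUM_THREADS` set in the environment.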

## Efficiency Evaluations
For the efficiency evaluation, please run the following command with a single A100 GPU:

```bash
python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"
```
## Citation
If you find ShadowKV useful or relevant to your project or research, please cite our paper:

```bibtex
@article{sun2024shadowkv,
  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.21465},
  year={2024}
}
```
