https://github.com/fasterdecoding/bitdelta
- Host: GitHub
- URL: https://github.com/fasterdecoding/bitdelta
- Owner: FasterDecoding
- License: apache-2.0
- Created: 2024-02-14T05:29:56.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-05T04:09:06.000Z (10 months ago)
- Last Synced: 2025-03-28T19:07:41.021Z (6 months ago)
- Topics: llm, quantization, serving
- Language: Jupyter Notebook
- Homepage: https://fasterdecoding.github.io/BitDelta/
- Size: 7.21 MB
- Stars: 195
- Watchers: 5
- Forks: 15
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# BitDelta: Your Fine-Tune May Only Be Worth One Bit
[[Paper](https://arxiv.org/abs/2402.10193)][[Blog](https://fasterdecoding.github.io/BitDelta/)]
BitDelta compresses the weight delta between a fine-tuned LLM and its base model down to 1 bit, enabling accurate and efficient multi-tenant serving.
The current release supports:
- Llama-2 and Mistral based models.
- Memory efficient 16-bit + 1-bit Δ Linear in PyTorch (see the sketch after this list).
- Triton kernel for fast inference (TODO: Update repo with faster [BitBLAS](https://github.com/microsoft/BitBLAS) W1A16 kernel)
- Gradio demo showcasing batched inference over 6 Mistral-7B based models, using only **30 GB** of GPU memory!
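The sketch below illustrates what the 16-bit + 1-bit Δ Linear bullet refers to. It is a simplified illustration (unpacked sign matrix, no fused kernel), not the repository's module:

```python
# Simplified sketch of a 16-bit base weight + 1-bit delta linear layer.
# Illustration only: the sign matrix is kept unpacked here, whereas the repo
# packs it to 1 bit per weight and uses a fused Triton/BitBLAS kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryDeltaLinear(nn.Module):
    def __init__(self, w_base: torch.Tensor, alpha: float, sign: torch.Tensor):
        super().__init__()
        self.register_buffer("w_base", w_base)           # shared fp16 base weight [out, in]
        self.register_buffer("sign", sign)               # per-model 1-bit delta (+1/-1 entries)
        self.alpha = nn.Parameter(torch.tensor(alpha))   # per-matrix high-precision scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x @ (W_base + alpha * sign(Delta))^T
        return F.linear(x, self.w_base + self.alpha * self.sign)
```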
## News

- [10/2024] 🔥 BitDelta is accepted to NeurIPS 2024!
- [02/2024] 🔥 arXiv release!

## Abstract
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
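The compression step itself is simple to describe. A minimal sketch of 1-bit delta quantization for a single weight matrix (before the scale distillation step mentioned above) might look like:

```python
# Minimal sketch of 1-bit delta compression for one weight matrix.
# alpha = mean(|Delta|) minimizes the L2 error of a sign quantizer; the paper
# then further calibrates these scales with a short distillation pass.
import torch

def compress_delta(w_base: torch.Tensor, w_finetuned: torch.Tensor):
    delta = w_finetuned - w_base
    alpha = delta.abs().mean()     # one high-precision scale per matrix
    sign = torch.sign(delta)       # 1-bit component (packed to bits in practice)
    return alpha, sign

def apply_delta(w_base: torch.Tensor, alpha: torch.Tensor, sign: torch.Tensor):
    # Approximate fine-tuned weight: W_base + alpha * sign(Delta)
    return w_base + alpha * sign
```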
## Contents
- [Install](#Install)
- [Demo](#Demo)
- [Usage](#Usage)
- [Citation](#citation)

## Install
1. Clone the repo and navigate to BitDelta:
```bash
git clone https://github.com/FasterDecoding/BitDelta
cd BitDelta
```

2. Set up environment:
```bash
conda create -yn bitdelta python=3.9
conda activate bitdelta
pip install -e .
```

## Demo
See [`demo/README.md`](https://github.com/FasterDecoding/BitDelta/blob/main/demo/README.md) for instructions on how to set up the demo.
[BitDelta Demo.webm](https://github.com/FasterDecoding/BitDelta/assets/51351043/b56747df-1108-42f2-ae6f-05e1c460080c)
## Usage
We provide scripts in `./scripts` so you can compress your own models! As an example, we will compress `lmsys/vicuna-7b-v1.5` using the base model `meta-llama/Llama-2-7b-hf`.
### Compress Model
Compress the weight delta and perform scale distillation:
```bash
CUDA_VISIBLE_DEVICES=0,1 python \
bitdelta/train.py \
--base_model meta-llama/Llama-2-7b-hf \
--finetuned_model lmsys/vicuna-7b-v1.5 \
--save_dir $MODEL_SAVE_DIR \
--batch_size 4 \
--num_steps 200 \
--save_full_model True
```

where `$MODEL_SAVE_DIR` is the output directory you specify.
If `--save_full_model` is specified, the compressed model will also be saved in HuggingFace format at `$MODEL_SAVE_DIR/calibrated_model`. Otherwise, only the delta will be saved.
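If the full model was saved, it can be loaded like any other Hugging Face checkpoint. A minimal, illustrative example (the path below is a placeholder for your `$MODEL_SAVE_DIR`):

```python
# Illustrative only: load the calibrated model written by --save_full_model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/MODEL_SAVE_DIR/calibrated_model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("What does BitDelta compress?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```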
### Perplexity Check
Double check the perplexity of the compressed model:
```bash
CUDA_VISIBLE_DEVICES=0 python \
bitdelta/eval_ppl.py \
--base_model meta-llama/Llama-2-7b-hf \
--dataset_name wikitext \
--subset wikitext-2-raw-v1 \
--save_dir $PPL_SAVE_DIR \
--num_eval_samples 100 \
--model_diff $MODEL_SAVE_DIR/diff.pt
```
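For reference, perplexity is the exponential of the average token-level negative log-likelihood. A generic sketch of the computation (not the repository's `eval_ppl.py`):

```python
# Generic perplexity sketch over a list of texts; not the repo's eval_ppl.py.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, max_length=2048):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).input_ids.to(model.device)
        if ids.shape[1] < 2:
            continue
        # Labels are the inputs; the HF causal LM shifts them internally
        # and returns the mean negative log-likelihood as out.loss.
        out = model(ids, labels=ids)
        n_tokens = ids.shape[1] - 1
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```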
### Replicate Results
To replicate our other results, please use `--save_full_model` to run the model in Llama format for compatibility with eval harnesses.
## Citation
If you find BitDelta useful, please consider citing:
```
@misc{liu2024bitdelta,
title={BitDelta: Your Fine-Tune May Only Be Worth One Bit},
author={James Liu and Guangxuan Xiao and Kai Li and Jason D. Lee and Song Han and Tri Dao and Tianle Cai},
year={2024},
eprint={2402.10193},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

[# Compressing Model Diffs for High-Throughput Multi-Model Serving]: #