https://github.com/Aaronhuang-778/BiLLM

(ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Host: GitHub
URL: https://github.com/Aaronhuang-778/BiLLM
Owner: Aaronhuang-778
License: mit
Created: 2024-01-24T08:37:04.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-05-27T09:35:57.000Z (11 months ago)
Last Synced: 2024-08-03T09:07:02.681Z (9 months ago)
Language: Python
Homepage: https://arxiv.org/abs/2402.04291
Size: 1.73 MB
Stars: 159
Watchers: 6
Forks: 12
Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

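StarryDivineSky - Aaronhuang-778/BiLLM - 8.41 perplexity on 70B), with only 1.08-bit weights across various LLM families and evaluation metrics, far outperforming SOTA LLM quantization methods. In addition, BiLLM can binarize an LLM with 7 billion weights within 0.5 hours on a single GPU, showing satisfactory time efficiency. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)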

README

# BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [[PDF]](https://arxiv.org/pdf/2402.04291.pdf)

![intuition](imgs/author.png)

**¹The University of Hong Kong ²Beihang University ³ETH Zürich**

![intuition](imgs/main.png)

## Abstract

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to as little as 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present *BiLLM*, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, *BiLLM* first identifies and structurally selects salient weights, and minimizes the compression loss through an effective *binary residual approximation* strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an *optimal splitting search* to group and binarize them accurately. *BiLLM* achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA LLM quantization methods by significant margins. Moreover, *BiLLM* can binarize an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.
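
The two ideas named in the abstract can be illustrated with a toy sketch in plain Python. This is not the repository's implementation: it assumes a simple `alpha * sign(w)` binarizer with `alpha = mean(|w|)`, skips the Hessian-based salient-weight selection entirely, and uses a brute-force grid over thresholds in place of the paper's search. All function names and the `num_candidates` parameter are hypothetical.

```python
import random


def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha = mean(|w|)."""
    if not weights:
        return []
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def residual_binarize(weights):
    """Binarize once, then binarize the leftover residual and add it back."""
    first = binarize(weights)
    residual = [w - b for w, b in zip(weights, first)]
    second = binarize(residual)
    return [b + r for b, r in zip(first, second)]


def squared_error(weights, approx):
    """Sum of squared differences between weights and their approximation."""
    return sum((w - a) ** 2 for w, a in zip(weights, approx))


def split_binarize(weights, num_candidates=32):
    """Grid-search a magnitude threshold p, then binarize the groups
    |w| <= p and |w| > p separately, keeping the p with the lowest error."""
    max_abs = max(abs(w) for w in weights)
    best_err, best_p, best_approx = float("inf"), None, None
    for i in range(1, num_candidates + 1):
        p = max_abs * i / num_candidates
        inner = iter(binarize([w for w in weights if abs(w) <= p]))
        outer = iter(binarize([w for w in weights if abs(w) > p]))
        # Reassemble the two binarized groups in the original order.
        approx = [next(inner) if abs(w) <= p else next(outer) for w in weights]
        err = squared_error(weights, approx)
        if err < best_err:
            best_err, best_p, best_approx = err, p, approx
    return best_p, best_approx


if __name__ == "__main__":
    random.seed(0)
    w = [random.gauss(0.0, 1.0) for _ in range(1000)]  # bell-shaped toy weights
    print("plain    :", squared_error(w, binarize(w)))
    print("residual :", squared_error(w, residual_binarize(w)))
    print("split    :", squared_error(w, split_binarize(w)[1]))
```

In this sketch the residual pass costs a second bit per weight, which is why the abstract applies it only to the small set of salient weights, while the split search keeps one sign bit per weight plus a group assignment.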

## News

- [2024/2] *BiLLM* source code is open now!

## Dependencies

* `torch`: tested on v2.0.1+cu117
* `transformers`: tested on v4.35.0 (the LLaMA integration currently requires installing `transformers` from source on the `main` branch, plus `sentencepiece`)
* `datasets`: tested on v2.14.6
* `huggingface-hub`: tested on v0.16.4
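
A quick way to compare a local environment against the tested versions above is sketched below. It uses only the standard-library `importlib.metadata`; the `TESTED_VERSIONS` table and the helper names are illustrative, and the pip distribution names are assumed to match the package names in the list.

```python
from importlib import metadata

# Versions copied from the dependency list; distribution names are assumed.
TESTED_VERSIONS = {
    "torch": "2.0.1+cu117",
    "transformers": "4.35.0",
    "datasets": "2.14.6",
    "huggingface-hub": "0.16.4",
}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(tested=TESTED_VERSIONS):
    """Build one human-readable line per package comparing installed vs tested."""
    lines = []
    for name, have in installed_versions(tested).items():
        want = tested[name]
        if have is None:
            status = "missing"
        elif have == want:
            status = "matches tested version"
        else:
            status = "differs from tested version"
        lines.append(f"{name}: installed={have}, tested={want} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A differing version is not necessarily a problem; the list only records what the authors tested.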

All binarization processes and experiments were run on a single 80GB NVIDIA A100. However, the whole process can also be run on a single 24GB NVIDIA 3090 Ti for models with fewer than 70B parameters.

## LLMs Binarization

#### Binarization for OPT families

```
python3 run.py facebook/opt-6.7b c4 braq --blocksize 128 --salient_metric hessian
```

#### Binarization for LLaMA families

```
python3 run.py meta-llama/Llama-2-7b-hf c4 braq --blocksize 128 --salient_metric hessian
```
or
```
python3 run.py huggyllama/llama-7b c4 braq --blocksize 128 --salient_metric hessian
```

#### Binarization for Vicuna families (Instruction Fine-tuning Models)

```
python3 run.py lmsys/vicuna-7b-v1.5 c4 braq --blocksize 128 --salient_metric hessian
```
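
The commands above share one shape, so running them in sequence can be scripted. The following is a minimal sketch, assuming it is launched from the repository root where `run.py` lives; `build_command` and `run_all` are hypothetical helpers, and only the arguments and flags shown in the commands above are used.

```python
import shlex
import subprocess

# Model identifiers taken from the commands in this section.
MODELS = [
    "facebook/opt-6.7b",
    "meta-llama/Llama-2-7b-hf",
    "huggyllama/llama-7b",
    "lmsys/vicuna-7b-v1.5",
]


def build_command(model, dataset="c4", method="braq",
                  blocksize=128, salient_metric="hessian"):
    """Assemble the run.py argument list for one model."""
    return [
        "python3", "run.py", model, dataset, method,
        "--blocksize", str(blocksize),
        "--salient_metric", salient_metric,
    ]


def run_all(models=MODELS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for model in models:
        cmd = build_command(model)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to confirm them before committing GPU time.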

## Results

- *BiLLM* achieves superior perplexity on the Wikitext2 dataset with an average weight bit-width of only **1.11** on the OPT family.

![intuition](imgs/opt_wiki_results.png)

- *BiLLM* achieves superior perplexity on the Wikitext2 dataset with an average weight bit-width of only **1.09** on the LLaMA family and **1.08** on the LLaMA2 family.

![intuition](imgs/llama_wiki_results.png)

- We also evaluated the performance of *BiLLM* on the PTB and C4 datasets.

![intuition](imgs/ptb1.png)

![intuition](imgs/ptb2.png)

- We further evaluated *BiLLM* on 7 zero-shot datasets to give broader insight into binarized LLMs.

![intuition](imgs/zero_shot.png)

- *BiLLM* achieves superior perplexity on the Wikitext2 dataset with an average weight bit-width of only **1.10** on the Vicuna family (instruction fine-tuned models).

![intuition](imgs/vicuna.png)
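
To put the average bit-widths reported above in perspective, the arithmetic below converts a parameter count and a bit-width into nominal weight storage. It is a back-of-the-envelope sketch, assuming every parameter is stored at the average bit-width and ignoring scaling factors, group bookkeeping, and any layers left unquantized; the function name is hypothetical.

```python
def weight_storage_gb(num_params, bits_per_weight):
    """Nominal weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for label, params in [("7B", 7e9), ("70B", 70e9)]:
        fp16 = weight_storage_gb(params, 16)
        low = weight_storage_gb(params, 1.08)
        print(f"{label}: FP16 {fp16:.2f} GB, 1.08-bit {low:.3f} GB, "
              f"ratio {fp16 / low:.1f}x")
```

Under these assumptions a 7B-parameter model drops from 14 GB of FP16 weights to roughly 0.95 GB at 1.08 bits, a reduction of about 14.8x.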

## Related Projects
[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://github.com/IST-DASLab/gptq)

[PB-LLM: Partially Binarized Large Language Models](https://github.com/hahnyuan/PB-LLM)

[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://github.com/mit-han-lab/llm-awq)

## Citation

If you find *BiLLM* useful in your work, please cite this paper:

```
@article{huang2024billm,
title={BiLLM: Pushing the Limit of Post-Training Quantization for LLMs},
author={Huang, Wei and Liu, Yangdong and Qin, Haotong and Li, Ying and Zhang, Shiming and Liu, Xianglong and Magno, Michele and Qi, Xiaojuan},
journal={arXiv preprint arXiv:2402.04291},
year={2024}
}
```
