Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/IST-DASLab/gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
- Host: GitHub
- URL: https://github.com/IST-DASLab/gptq
- Owner: IST-DASLab
- License: apache-2.0
- Created: 2022-10-19T15:11:53.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-27T01:46:02.000Z (11 months ago)
- Last Synced: 2024-08-02T16:32:14.254Z (7 months ago)
- Language: Python
- Homepage: https://arxiv.org/abs/2210.17323
- Size: 289 KB
- Stars: 1,813
- Watchers: 29
- Forks: 146
- Open Issues: 21
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - IST-DASLab/gptq - Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers". Project highlights include an efficient implementation of the GPTQ algorithm; 2/3/4-bit compression of the OPT and BLOOM model families (including weight grouping); perplexity evaluation of quantized models on several language generation tasks; evaluation of quantized models on several zero-shot tasks; a CUDA kernel for products of 3-bit quantized matrices with full-precision vectors; and benchmarking code for individual matrix-vector products and for language generation with quantized models. The project also adds newer features such as a static-groups option, optimized 3-bit kernels, and a LLaMa integration with new tricks such as `--act-order` and `--true-sequential` that noticeably improve accuracy. It depends on PyTorch, transformers, datasets, and related libraries; all experiments were run on a single 80GB NVIDIA A100. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
- awesome-production-machine-learning - GPTQ - Accurate Post-training Quantization of Generative Pretrained Transformers. (Model Storage Optimisation)
- awesome-llm-and-aigc - GPTQ - "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". (**[ICLR 2023](https://arxiv.org/abs/2210.17323)**). (Summary)
README
# GPTQ
This repository contains the code for the ICLR 2023 paper [GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://arxiv.org/abs/2210.17323).
The current release includes the following features:

* An efficient implementation of the GPTQ algorithm (sketched briefly after this list): `gptq.py`
* Compressing all models from the OPT and BLOOM families to 2/3/4 bits, including weight grouping: `opt.py`, `bloom.py`, `zeroShot/`
* Evaluating the perplexity of quantized models on several language generation tasks: `opt.py`, `bloom.py`
* Evaluating the performance of quantized models on several ZeroShot tasks: `zeroShot/`
* A 3-bit quantized matrix full-precision vector product CUDA kernel: `quant_cuda_kernel.cu`, `quant_cuda.cpp`, `setup_cuda.py`
* Benchmarking code for individual matrix-vector products and for language generation with quantized models: `test_kernel.py`, `opt.py`
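For readers new to the method, the core idea of GPTQ is to quantize a weight matrix one column at a time while using second-order information (a Hessian built from calibration inputs) to adjust the not-yet-quantized columns so that the layer's output error stays small. The snippet below is a loose, simplified sketch of that idea, not the repository's `gptq.py` (which adds dampening, Cholesky-based updates, lazy batching, grouping, and more); the helper names are invented for illustration.

```
# Illustrative sketch only -- NOT the repository's optimized gptq.py.
# Quantizes W column by column and spreads each column's rounding error
# over the remaining columns via the inverse Hessian of the layer inputs.
import torch

def rtn(x, scale, wbits=4):
    # plain round-to-nearest onto a symmetric grid (hypothetical helper)
    qmax = 2 ** (wbits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def gptq_sketch(W, X, wbits=4, percdamp=0.01):
    """W: (rows, cols) layer weights; X: (n_samples, cols) calibration inputs."""
    W = W.clone()
    H = X.t() @ X                                        # proxy for the layer Hessian
    H += percdamp * torch.diag(H).mean() * torch.eye(H.shape[0], dtype=H.dtype)
    Hinv = torch.inverse(H)
    scale = W.abs().max() / (2 ** (wbits - 1) - 1)
    for j in range(W.shape[1]):
        q = rtn(W[:, j], scale, wbits)
        err = (W[:, j] - q) / Hinv[j, j]                 # normalized rounding error
        W[:, j] = q
        # compensate the remaining columns; the real algorithm keeps Hinv
        # up to date via a Cholesky factorization and lazy batch updates
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return W
```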
## New Features

Update July 2023:
* Added a `--static-groups` option, which determines all group grids in advance rather than dynamically during quantization; as a result, `--act-order` no longer requires any inference-time changes (which could cause slowdowns) when used together with this option. A brief sketch follows.
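As a rough illustration of what determining the group grids "in advance" means (this is not the repository's code, and the helper is hypothetical): with grouping, each block of `groupsize` columns gets its own quantization scale. With static groups these scales are all precomputed from the original weights before quantization starts, so they are unaffected by the `--act-order` column reordering and by the error compensation applied along the way.

```
# Illustrative sketch (not the repository's implementation): per-group
# symmetric scales computed up front from the original weights.
import torch

def static_group_scales(W, groupsize=128, wbits=4):
    """Precompute one scale per group of `groupsize` columns of W."""
    qmax = 2 ** (wbits - 1) - 1
    scales = []
    for start in range(0, W.shape[1], groupsize):
        group = W[:, start:start + groupsize]
        scales.append(group.abs().max() / qmax)
    # The grids are now fixed before quantization begins; with dynamic
    # grouping they would instead be derived from the partially updated
    # weights once quantization reaches each group.
    return scales
```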
Together with the camera-ready version of the paper, we have added several updates to this repository:
* Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag `--new-eval`.
* Optimized 3-bit kernels, which are considerably faster, especially on the A100, e.g. a 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via `--faster-kernel`.
* A minimal LLaMa integration (for more complete features see the [GPTQ-for-LLaMA](https://github.com/qwopqwop200/GPTQ-for-LLaMa) repository), which demonstrates two new tricks: `--act-order` (quantizing columns in order of decreasing activation size; sketched briefly after the sample command below) and `--true-sequential` (performing sequential quantization even within a single Transformer block). Those fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.

Here is a summary of LLaMa results:
| Wiki2 PPL | FP16 | 4bit-RTN | 4bit-GPTQ | 3bit-RTN | 3bit-GPTQ | 3g128-GPTQ |
|:---------:|:----:|:--------:|:---------:|:--------:|:---------:|:----------:|
| LLaMa-7B | 5.68 | 6.29 | **6.09** | 25.54 | **8.07** | 6.61 |
| LLaMa-13B | 5.09 | 5.53 | **5.36** | 11.40 | **6.63** | 5.62 |
| LLaMa-30B | 4.10 | 4.54 | **4.45** | 14.89 | **5.69** | 4.80 |
| LLaMa-65B | 3.53 | 3.92 | **3.84** | 10.59 | **5.04** | 4.17 |

Here is a sample command:
```
python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order --new-eval
```

The `--act-order` heuristic also dramatically improves accuracy on the OPT-66B outlier model: 9.55 to 9.34 and 14.16 to 9.95 PPL on Wiki2 for 4-bit and 3-bit, respectively.
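As a loose sketch of the `--act-order` heuristic (illustrative only; the function below is not part of the repository): columns can be ranked by a per-column activation statistic, such as the diagonal of the calibration Hessian X^T X, and quantized from largest to smallest, so the most sensitive columns are handled while the most error-compensation budget remains.

```
# Illustrative sketch (not the repository's code): pick a quantization
# order for the columns of a linear layer by decreasing activation size.
import torch

def act_order_permutation(X):
    """X: (n_samples, cols) calibration inputs for the layer."""
    h_diag = (X * X).sum(dim=0)                   # diagonal of X^T X per column
    perm = torch.argsort(h_diag, descending=True)
    # quantize columns in this order; invert the permutation afterwards
    # so the stored weights keep their original layout
    return perm
```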
## Dependencies
* `torch`: tested on v1.10.1+cu111
* `transformers`: tested on v4.21.2 (the LLaMa integration currently requires installing the main branch from source as well as `sentencepiece`)
* `datasets`: tested on v1.17.0
* (to run 3-bit kernels: setup for compiling PyTorch CUDA extensions, see also https://pytorch.org/tutorials/advanced/cpp_extension.html, tested on CUDA 11.4)

All experiments were run on a single 80GB NVIDIA A100. However, most experiments will work on a GPU with a lot less memory as well.
## Language Generation
### OPT
```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 [--groupsize 1024]
```

To run other OPT models replace `opt-125m` with one of: `opt-350m`, `opt-1.3b`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-66b`.
For the 175B-parameter model, you have to request access from Meta and then convert it to a local HuggingFace checkpoint using their scripts in `metaseq`.
Once you have such a checkpoint, simply pass its path instead of `facebook/opt-125m`.
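The generation scripts report perplexity on the held-out split of the chosen dataset. As a rough, simplified sketch of how such a perplexity number is typically computed for a causal language model (the repository's `opt.py` handles tokenization and sequence windowing differently; the text argument below is a placeholder):

```
# Simplified perplexity sketch for a causal LM (illustrative only; not
# the exact evaluation procedure used by opt.py / bloom.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model_name="facebook/opt-125m", text="held-out evaluation text", seqlen=2048):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    nlls = []
    for start in range(0, ids.shape[1] - 1, seqlen):
        chunk = ids[:, start:start + seqlen]
        out = model(chunk, labels=chunk)        # mean cross-entropy over next tokens
        nlls.append(out.loss * chunk.shape[1])  # approximate total NLL for the chunk
    return torch.exp(torch.stack(nlls).sum() / ids.shape[1])
```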
### BLOOM

```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 [--groupsize 1024]
```

To run other BLOOM models replace `bloom-560m` with one of: `bloom-1b1`, `bloom-1b7`, `bloom-3b`, `bloom-7b1`, `bloom`.
## ZeroShot
See `zeroShot/` folder.
## 3-bit CUDA Kernels
```
# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 3-bit OPT-175B:
# OPT175B denotes the name of the folder with the HuggingFace OPT-175b checkpoint (see above)

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --wbits 3 --save opt175b-3bit.pt
# Benchmark generating a 128 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --load opt175b-3bit.pt --benchmark 128
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python opt.py OPT175B c4 --benchmark 128
```

Please note that our 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
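For intuition about what the kernel computes (the reference below is plain PyTorch with invented names, not the CUDA code): a 3-bit quantized matrix is stored as integer codes plus quantization parameters, and the kernel fuses dequantization with the matrix-vector product so the full-precision matrix is never materialized.

```
# Reference computation for a 3-bit quantized matrix x full-precision
# vector product (illustrative only; the repository implements this as
# a fused CUDA kernel in quant_cuda_kernel.cu).
import torch

def quant3_matvec_reference(qweight, scales, zeros, x):
    """
    qweight: (rows, cols) integer codes in [0, 7] (unpacked 3-bit values)
    scales, zeros: (rows, 1) quantization parameters
    x: (cols,) full-precision input vector
    """
    W = scales * (qweight.float() - zeros)   # dequantize
    return W @ x                             # the kernel fuses this with unpacking
```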
## Cite
If you found this work useful, please consider citing:
```
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}
```