# LLM Compression Benchmark

Made in Vancouver, Canada by [Picovoice](https://picovoice.ai)

This repository is a minimalist and extensible framework for benchmarking LLM compression algorithms.

## Table of Contents

- [Algorithms](#algorithms)
    - [GPTQ](#gptq)
    - [picoLLM Compression](#picollm-compression)
- [Tasks](#tasks)
    - [MMLU Score](#mmlu-score)
    - [ARC Score](#arc-score)
    - [Perplexity Loss](#perplexity-loss)
- [Data](#data)
    - [MMLU](#mmlu)
    - [ARC](#arc)
    - [Perplexity (C4)](#perplexity-c4)
    - [Quantization (C4)](#quantization-c4)
- [Models](#models)
- [Usage](#usage)
- [Results](#results)
    - [MMLU](#mmlu-1)
    - [ARC-Easy](#arc-easy)
    - [ARC-Challenge](#arc-challenge)
    - [Perplexity](#perplexity)

## Algorithms

### GPTQ

[GPTQ](https://arxiv.org/abs/2210.17323) is arguably the most popular quantization algorithm for LLMs. GPTQ fully
reconstructs weights so that the quantized version closely mimics the full-precision one.
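
As a point of reference for what quantization does to weights, the sketch below implements plain round-to-nearest quantization over fixed-size groups in pure Python. It is not GPTQ: GPTQ goes further and adjusts the not-yet-quantized weights to compensate for the error that each rounding step introduces. The function names and the asymmetric min/max scheme are illustrative assumptions, not code from this repository.

```python
import random


def quantize_group(weights, bits):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    low, high = min(weights), max(weights)
    # A constant group has zero range; any non-zero scale reproduces it exactly.
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((w - low) / scale) for w in weights]
    return codes, scale, low


def dequantize_group(codes, scale, low):
    """Recover approximate floats from integer codes."""
    return [code * scale + low for code in codes]


def round_to_nearest(weights, bits, group_size=128):
    """Quantize then dequantize `weights`, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored


def mean_squared_error(reference, approximation):
    """Average squared difference between two equally long sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
    for bits in (4, 3, 2):
        error = mean_squared_error(weights, round_to_nearest(weights, bits))
        print(f"{bits}-bit reconstruction MSE: {error:.6f}")
```

Running the script prints the reconstruction error at each bit width, which gives a feel for how quickly naive rounding degrades as bits are removed.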

### picoLLM Compression

picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally
distributes the available bits within and across the LLM's weights.

## Tasks

### MMLU Score

[MMLU](https://huggingface.co/datasets/lukaemon/mmlu) (Massive Multitask Language Understanding) is a
multiple-choice dataset that can measure the models' ability to understand natural language.

### ARC Score

[ARC](https://allenai.org/data/arc) (AI2 Reasoning Challenge) is a multiple-choice dataset that measures
the models' reasoning ability. The ARC dataset has two partitions: `Easy` and `Challenge`. We perform the benchmark on
both partitions and report the results separately.
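
Since MMLU and ARC are both multiple-choice datasets, a score can be read as accuracy: the percentage of questions for which the model's preferred option matches the answer key. The sketch below shows one common way to compute it. This README does not describe how the repository scores options, so `score_fn` (a stand-in for something like the model's log-likelihood of an option) and the example layout are assumptions.

```python
def pick_answer(option_scores):
    """Return the label of the highest-scoring option."""
    return max(option_scores, key=option_scores.get)


def accuracy(examples, score_fn):
    """Percentage of examples whose top-scoring option is the labeled answer.

    `score_fn(question, option_text)` is a hypothetical callable that returns
    a number; higher means the model prefers that option.
    """
    correct = 0
    for example in examples:
        scores = {
            label: score_fn(example["question"], text)
            for label, text in example["options"].items()
        }
        if pick_answer(scores) == example["answer"]:
            correct += 1
    return 100.0 * correct / len(examples)


if __name__ == "__main__":
    # Toy data and a toy scorer, only to show the call shape.
    toy_examples = [
        {"question": "Which is larger?", "options": {"A": "cat", "B": "elephant"}, "answer": "B"},
        {"question": "What do plants need?", "options": {"A": "photosynthesis", "B": "sun"}, "answer": "B"},
    ]
    print(accuracy(toy_examples, lambda question, text: len(text)))
```

Running the same function once per ARC partition yields the separate `Easy` and `Challenge` numbers.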

### Perplexity Loss

Perplexity measures the models' language modeling capabilities.
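
For reference, perplexity is the exponential of the average negative log-likelihood that the model assigns to each token, so lower is better. The sketch below computes it from a list of per-token natural-log probabilities. How the repository obtains those values and aggregates them across snippets is not described here, so treat the function as an illustration of the definition only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1 / N) * sum(log p_i)), where N is the number of tokens.
    """
    if not token_log_probs:
        raise ValueError("at least one token is required")
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token is as uncertain
    # as a uniform choice among four tokens.
    print(perplexity([math.log(0.25)] * 10))
```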

## Data

The `/res` folder contains all the data required for the benchmark. To reproduce it, follow the sections below.

### MMLU

Download the [MMLU dataset](https://huggingface.co/datasets/lukaemon/mmlu) and run the following from the
repository's root to extract and format it:

```console
python3 data/mmlu.py --dataset-folder ${DATASET_FOLDER}
```

### ARC

Download the [ARC dataset](https://allenai.org/data/arc) and run the following from the repository's root to extract and
format the `Challenge` portion:

```console
python3 data/arc.py --dataset-folder ${DATASET_FOLDER}
```

Repeat the above with the `--easy` flag for the `Easy` portion:

```console
python3 data/arc.py --dataset-folder ${DATASET_FOLDER} --easy
```

### Perplexity (C4)

For the perplexity measurement, we use 128 randomly selected text snippets from the validation portion of the
[C4 dataset](https://huggingface.co/datasets/c4). Once you download the dataset, run the following from the root of the
repository to extract and normalize the data:

```console
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${VALIDATION_FOLDER} \
--portion validation
```

Replace `${REPOSITORY_FOLDER}` with the path to the downloaded dataset repository and `${VALIDATION_FOLDER}` with the
folder that will store the normalized data.

Then we sample 128 sequences from the normalized data:

```console
python3 data/c4-sample.py \
--dataset-folder ${VALIDATION_FOLDER} \
--portion valid
```
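
The sampling step can be pictured as drawing 128 snippets without replacement using a fixed seed, so that every model is evaluated on the same text. The sketch below is an assumption about the general idea, not the contents of `data/c4-sample.py`; the function name, the seed, and the in-memory list of snippets are hypothetical.

```python
import random


def sample_snippets(snippets, num_samples=128, seed=0):
    """Draw `num_samples` distinct snippets reproducibly."""
    if len(snippets) < num_samples:
        raise ValueError("not enough snippets to sample from")
    # A dedicated generator keeps the draw independent of global random state.
    rng = random.Random(seed)
    return rng.sample(snippets, num_samples)


if __name__ == "__main__":
    pool = [f"normalized snippet {index}" for index in range(10_000)]
    subset = sample_snippets(pool)
    print(len(subset), subset[0])
```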

### Quantization (C4)

We need a sample dataset for quantization algorithms (GPTQ, picoLLM). We use 128 randomly selected text snippets from
the train portion of the [C4 dataset](https://huggingface.co/datasets/c4). Once you download the dataset, run the
following from the root of the repository to extract and normalize the data:

```console
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${TRAIN_FOLDER} \
--portion train
```

Replace `${REPOSITORY_FOLDER}` with the path to the downloaded dataset repository and `${TRAIN_FOLDER}` with the
folder that will store the normalized data.

Then we sample 128 sequences from the normalized data:

```console
python3 data/c4-sample.py \
--dataset-folder ${TRAIN_FOLDER} \
--portion train
```

## Models

We use six models:

- `Gemma-2b`
- `Gemma-7b`
- `Llama-2-7b`
- `Llama-3-8b`
- `Mistral-7b-v0.1`
- `Phi-2`

The corresponding picoLLM compressed models are on [Picovoice Console](https://console.picovoice.ai/). We create GPTQ
models using the package [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ). You can quantize the models by running the
following:

```console
python3 model/autogptq.py \
--model-uri ${MODEL_URI} \
--quantized-model-folder ${QUANTIZED_MODEL_FOLDER} \
--bits ${BITS}
```
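
For readers who want to see what such a script might do, here is a minimal sketch built on AutoGPTQ's `BaseQuantizeConfig`, `AutoGPTQForCausalLM.from_pretrained`, `quantize`, and `save_quantized`. It uses the GPTQ settings reported in the Results section of this README (group size 128, damp percent 0.1, activation reordering, which AutoGPTQ exposes as `desc_act`). The helper names and the way calibration text is tokenized are assumptions; `model/autogptq.py` may differ.

```python
def gptq_settings(bits):
    """Keyword arguments for auto_gptq.BaseQuantizeConfig."""
    if bits not in (2, 3, 4):
        raise ValueError("this benchmark reports 2-, 3-, and 4-bit results only")
    return {
        "bits": bits,
        "group_size": 128,     # weights sharing one set of quantization parameters
        "damp_percent": 0.1,   # dampening added to the Hessian diagonal
        "desc_act": True,      # activation reordering ("act-order")
    }


def quantize_with_autogptq(model_uri, calibration_texts, output_folder, bits):
    """Quantize `model_uri` with GPTQ and write the result to `output_folder`."""
    # Imported lazily so that gptq_settings() works without the heavy dependencies.
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_uri)
    # Each example is a mapping with `input_ids` and `attention_mask`.
    examples = [tokenizer(text) for text in calibration_texts]

    quantize_config = BaseQuantizeConfig(**gptq_settings(bits))
    model = AutoGPTQForCausalLM.from_pretrained(model_uri, quantize_config)
    model.quantize(examples)
    model.save_quantized(output_folder)
```

The calibration texts would be the 128 snippets sampled from the C4 train portion.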

## Usage

To measure the MMLU score for a given model, run the following:

```console
python3 mmlu.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.

To measure the ARC score for a given model, run the following:

```console
python3 arc.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.

To measure the perplexity for a given model, run the following:

```console
python3 perplexity.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.

When running picoLLM Compressed models, you must also provide your Picovoice AccessKey, which is available on
[Picovoice Console](https://console.picovoice.ai/).

```console
... --picollm-access-key ${PICOLLM_ACCESS_KEY}
```
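
When sweeping over several models and tasks, it can help to assemble these commands programmatically. The sketch below builds the argument list from the script names and flags shown in this section only; `build_command` is a hypothetical helper, and the exact spelling of the compression values is taken from the text above.

```python
import subprocess

TASK_SCRIPTS = {"mmlu": "mmlu.py", "arc": "arc.py", "perplexity": "perplexity.py"}
COMPRESSIONS = ("NONE", "GPTQ", "picoLLM")


def build_command(task, compression, model_uri, picollm_access_key=None):
    """Return the argument list for one benchmark run."""
    if task not in TASK_SCRIPTS:
        raise ValueError(f"unknown task: {task}")
    if compression not in COMPRESSIONS:
        raise ValueError(f"unknown compression: {compression}")

    command = [
        "python3", TASK_SCRIPTS[task],
        "--compression", compression,
        "--model-uri", model_uri,
    ]
    if compression == "picoLLM":
        # picoLLM-compressed models need an AccessKey from Picovoice Console.
        if not picollm_access_key:
            raise ValueError("picoLLM runs require an AccessKey")
        command += ["--picollm-access-key", picollm_access_key]
    return command


def run(task, compression, model_uri, picollm_access_key=None):
    """Execute one benchmark run from the repository root."""
    subprocess.run(
        build_command(task, compression, model_uri, picollm_access_key),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems when a model path contains spaces.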

## Results

Below are our benchmark results comparing GPTQ against picoLLM for all [models](model). We perform 2-, 3-, and 4-bit
quantization using GPTQ, then measure the resulting model size in GB and set that as the target size for picoLLM
Compression. Hence, the two compressed models have the same size in bytes. When performing GPTQ, we set the group size
parameter to 128 and the damp percent to 0.1, and enable activation reordering.
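
Matching sizes requires measuring the quantized model on disk. One simple way to do that is to sum the file sizes under the model folder, as sketched below. The helper name is hypothetical, and dividing by 10^9 is an assumption: this README does not say whether "GB" means 10^9 or 2^30 bytes.

```python
import os


def folder_size_gb(folder):
    """Total size of all files under `folder`, in units of 10**9 bytes."""
    total_bytes = 0
    for root, _directories, files in os.walk(folder):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e9
```

The returned value would then serve as the target size when requesting the corresponding picoLLM-compressed model.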

### MMLU

The table below depicts the MMLU score of the original models.


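| Model           | Size (GB) | MMLU  |
|-----------------|-----------|-------|
| Gemma-2b        | 5.0       | 40.21 |
| Gemma-7b        | 17.1      | 64.48 |
| Llama-3-8b      | 16.1      | 64.88 |
| Llama-2-7b      | 13.5      | 46.38 |
| Mistral-7b-v0.1 | 15.0      | 62.41 |
| Phi-2           | 5.6       | 56.04 |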

The table below depicts the MMLU score of the quantized models.


| Model           | Size (GB) | GPTQ  | picoLLM |
|-----------------|-----------|-------|---------|
| Gemma-2b        | 3.1       | 39.07 | 41.12   |
| Gemma-2b        | 2.9       | 27.51 | 41.12   |
| Gemma-2b        | 2.6       | 24.93 | 41.12   |
| Gemma-7b        | 7.2       | 62.58 | 64.98   |
| Gemma-7b        | 6.2       | 53.30 | 64.57   |
| Gemma-7b        | 5.2       | 25.58 | 64.32   |
| Llama-2-7b      | 3.9       | 45.26 | 44.99   |
| Llama-2-7b      | 3.1       | 40.40 | 40.68   |
| Llama-2-7b      | 2.3       | 25.36 | 28.72   |
| Llama-3-8b      | 5.7       | 63.09 | 64.96   |
| Llama-3-8b      | 4.9       | 53.86 | 64.76   |
| Llama-3-8b      | 4.0       | 25.05 | 61.26   |
| Mistral-7b-v0.1 | 4.2       | 61.00 | 59.19   |
| Mistral-7b-v0.1 | 3.3       | 23.73 | 57.72   |
| Mistral-7b-v0.1 | 2.4       | 25.70 | 43.53   |
| Phi-2           | 1.8       | 54.61 | 54.11   |
| Phi-2           | 1.5       | 50.64 | 52.24   |
| Phi-2           | 1.2       | 26.05 | 48.86   |
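
One way to read the two MMLU tables together is to express each quantized score as a percentage of the full-precision score for the same model. The sketch below does this for the smallest size of three models, using values copied from the tables above; `retention` is a hypothetical helper, not part of the repository. Since MMLU questions have four options, scores near 25 are roughly what random guessing would produce.

```python
# Full-precision MMLU scores, copied from the first table.
ORIGINAL_MMLU = {
    "Llama-3-8b": 64.88,
    "Mistral-7b-v0.1": 62.41,
    "Phi-2": 56.04,
}

# (model, size in GB) -> (GPTQ score, picoLLM score), copied from the second table.
QUANTIZED_MMLU = {
    ("Llama-3-8b", 4.0): (25.05, 61.26),
    ("Mistral-7b-v0.1", 2.4): (25.70, 43.53),
    ("Phi-2", 1.2): (26.05, 48.86),
}


def retention(score, baseline):
    """Quantized score as a percentage of the full-precision score."""
    return 100.0 * score / baseline


if __name__ == "__main__":
    for (model, size_gb), (gptq, picollm) in QUANTIZED_MMLU.items():
        baseline = ORIGINAL_MMLU[model]
        print(
            f"{model} at {size_gb} GB: "
            f"GPTQ keeps {retention(gptq, baseline):.1f}%, "
            f"picoLLM keeps {retention(picollm, baseline):.1f}%"
        )
```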

### ARC Easy

The table below depicts the ARC Easy score of the original models.


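| Model           | Size (GB) | ARC Easy |
|-----------------|-----------|----------|
| Gemma-2b        | 5.0       | 33.75    |
| Gemma-7b        | 17.1      | 75.51    |
| Llama-2-7b      | 13.5      | 44.87    |
| Llama-3-8b      | 16.1      | 75.80    |
| Mistral-7b-v0.1 | 15.0      | 80.56    |
| Phi-2           | 5.6       | 75.25    |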

The table below depicts the ARC Easy score of the quantized models.


| Model           | Size (GB) | GPTQ  | picoLLM |
|-----------------|-----------|-------|---------|
| Gemma-2b        | 3.1       | 30.39 | 34.39   |
| Gemma-2b        | 2.9       | 24.37 | 34.39   |
| Gemma-2b        | 2.6       | 23.82 | 34.39   |
| Gemma-7b        | 7.2       | 76.52 | 84.18   |
| Gemma-7b        | 6.2       | 44.28 | 84.51   |
| Gemma-7b        | 5.2       | 23.95 | 84.13   |
| Llama-2-7b      | 3.9       | 39.23 | 41.96   |
| Llama-2-7b      | 3.1       | 32.95 | 33.96   |
| Llama-2-7b      | 2.3       | 23.91 | 24.49   |
| Llama-3-8b      | 5.7       | 72.85 | 78.83   |
| Llama-3-8b      | 4.9       | 43.39 | 77.02   |
| Llama-3-8b      | 4.0       | 24.71 | 71.76   |
| Mistral-7b-v0.1 | 4.2       | 77.27 | 73.95   |
| Mistral-7b-v0.1 | 3.3       | 23.91 | 72.10   |
| Mistral-7b-v0.1 | 2.4       | 24.92 | 46.46   |
| Phi-2           | 1.8       | 70.45 | 75.04   |
| Phi-2           | 1.5       | 56.61 | 70.66   |
| Phi-2           | 1.2       | 22.10 | 62.42   |

### ARC Challenge

The table below depicts the ARC Challenge score of the original models.


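| Model           | Size (GB) | ARC Challenge |
|-----------------|-----------|---------------|
| Gemma-2b        | 5.0       | 30.38         |
| Gemma-7b        | 17.1      | 64.93         |
| Llama-2-7b      | 13.5      | 37.03         |
| Llama-3-8b      | 16.1      | 63.05         |
| Mistral-7b-v0.1 | 15.0      | 67.49         |
| Phi-2           | 5.6       | 61.60         |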

The table below depicts the ARC Challenge score of the quantized models.


| Model           | Size (GB) | GPTQ  | picoLLM |
|-----------------|-----------|-------|---------|
| Gemma-2b        | 3.1       | 26.37 | 30.97   |
| Gemma-2b        | 2.9       | 23.55 | 30.97   |
| Gemma-2b        | 2.6       | 24.83 | 30.97   |
| Gemma-7b        | 7.2       | 66.30 | 72.35   |
| Gemma-7b        | 6.2       | 33.62 | 72.35   |
| Gemma-7b        | 5.2       | 24.06 | 72.61   |
| Llama-2-7b      | 3.9       | 32.42 | 34.30   |
| Llama-2-7b      | 3.1       | 27.56 | 28.24   |
| Llama-2-7b      | 2.3       | 21.16 | 23.63   |
| Llama-3-8b      | 5.7       | 60.24 | 64.33   |
| Llama-3-8b      | 4.9       | 36.18 | 63.48   |
| Llama-3-8b      | 4.0       | 23.29 | 57.85   |
| Mistral-7b-v0.1 | 4.2       | 64.42 | 60.49   |
| Mistral-7b-v0.1 | 3.3       | 24.06 | 59.04   |
| Mistral-7b-v0.1 | 2.4       | 23.21 | 37.80   |
| Phi-2           | 1.8       | 57.42 | 62.46   |
| Phi-2           | 1.5       | 44.97 | 57.51   |
| Phi-2           | 1.2       | 24.49 | 47.87   |

### Perplexity

The table below depicts the perplexity of the original models.


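| Model           | Size (GB) | Perplexity |
|-----------------|-----------|------------|
| Gemma-2b        | 5.0       | 16.79      |
| Gemma-7b        | 17.1      | 14.67      |
| Llama-2-7b      | 13.5      | 8.40       |
| Llama-3-8b      | 16.1      | 11.61      |
| Mistral-7b-v0.1 | 15.0      | 10.50      |
| Phi-2           | 5.6       | 17.38      |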

The table below depicts the perplexity of the quantized models.


| Model           | Size (GB) | GPTQ        | picoLLM |
|-----------------|-----------|-------------|---------|
| Gemma-2b        | 3.1       | 17.85       | 16.86   |
| Gemma-2b        | 2.9       | 24.11       | 16.86   |
| Gemma-2b        | 2.6       | 8377.74     | 16.86   |
| Gemma-7b        | 7.2       | 15.47       | 14.82   |
| Gemma-7b        | 6.2       | 27.29       | 14.84   |
| Gemma-7b        | 5.2       | 33370970.40 | 15.08   |
| Llama-2-7b      | 3.9       | 8.59        | 8.50    |
| Llama-2-7b      | 3.1       | 9.66        | 8.86    |
| Llama-2-7b      | 2.3       | 67.43       | 10.87   |
| Llama-3-8b      | 5.7       | 12.31       | 11.73   |
| Llama-3-8b      | 4.9       | 17.47       | 11.90   |
| Llama-3-8b      | 4.0       | 712.70      | 12.67   |
| Mistral-7b-v0.1 | 4.2       | 10.43       | 10.62   |
| Mistral-7b-v0.1 | 3.3       | 2909.83     | 10.81   |
| Mistral-7b-v0.1 | 2.4       | 1176.43     | 14.87   |
| Phi-2           | 1.8       | 18.15       | 17.76   |
| Phi-2           | 1.5       | 19.94       | 18.14   |
| Phi-2           | 1.2       | 76.55       | 20.22   |