Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
LLM Compression Benchmark
https://github.com/picovoice/llm-compression-benchmark
llm llm-compression llm-inference
- Host: GitHub
- URL: https://github.com/picovoice/llm-compression-benchmark
- Owner: Picovoice
- License: apache-2.0
- Created: 2024-05-01T16:58:09.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-05-28T17:51:50.000Z (8 months ago)
- Last Synced: 2024-05-29T08:53:03.903Z (8 months ago)
- Topics: llm, llm-compression, llm-inference
- Language: Python
- Homepage: https://picovoice.ai/
- Size: 13.7 MB
- Stars: 2
- Watchers: 6
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# LLM Compression Benchmark
Made in Vancouver, Canada by [Picovoice](https://picovoice.ai)
This repository is a minimalist and extensible framework for benchmarking LLM compression algorithms.
## Table of Contents
- [Algorithms](#algorithms)
  - [GPTQ](#gptq)
  - [picoLLM Compression](#picollm-compression)
- [Tasks](#tasks)
  - [MMLU Score](#mmlu-score)
  - [ARC Score](#arc-score)
  - [Perplexity Loss](#perplexity-loss)
- [Data](#data)
  - [MMLU](#mmlu)
  - [ARC](#arc)
  - [Perplexity (C4)](#perplexity-c4)
  - [Quantization (C4)](#quantization-c4)
- [Models](#models)
- [Usage](#usage)
- [Results](#results)
  - [MMLU](#mmlu-1)
  - [ARC-Easy](#arc-easy)
  - [ARC-Challenge](#arc-challenge)
  - [Perplexity](#perplexity)

## Algorithms
### GPTQ
[GPTQ](https://arxiv.org/abs/2210.17323) is arguably the most popular quantization algorithm for LLMs. GPTQ fully reconstructs weights so that the quantized version closely mimics the full-precision one.
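At a high level (paraphrasing the GPTQ paper rather than this repository's code), GPTQ quantizes one layer at a time by minimizing the reconstruction error on a small calibration set:

$$
\hat{W} = \underset{\hat{W}}{\arg\min}\; \lVert W X - \hat{W} X \rVert_2^2,
$$

where $W$ is the layer's full-precision weight matrix, $X$ holds calibration activations, and $\hat{W}$ is constrained to the chosen low-bit grid.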
### picoLLM Compression

picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across the LLM's weights.

## Tasks
### MMLU Score
[MMLU](https://huggingface.co/datasets/lukaemon/mmlu) (Massive Multitask Language Understanding) is a multiple-choice dataset that measures a model's ability to understand natural language.

### ARC Score
[ARC](https://allenai.org/data/arc) (AI2 Reasoning Challenge) is a multiple-choice dataset that measures a model's reasoning ability. The ARC dataset has two partitions, `Easy` and `Challenge`. We perform the benchmark on both partitions and report the results separately.

### Perplexity Loss
Perplexity measures a model's language-modeling capability.
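Concretely (this is the standard definition, not something specific to this repository), perplexity is the exponentiated average negative log-likelihood of a token sequence, so lower is better:

$$
\mathrm{PPL}(x_1, \ldots, x_N) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
$$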
## Data
The `/res` folder contains all required data for the benchmark. To reproduce it, follow the sections below.
### MMLU
Download the [MMLU dataset](https://huggingface.co/datasets/lukaemon/mmlu) and run the following from the repository's root to extract and format it:

```console
python3 data/mmlu.py --dataset-folder ${DATASET_FOLDER}
```

### ARC
Download the [ARC dataset](https://allenai.org/data/arc) and run the following from the repository's root to extract and format the `Challenge` portion:

```console
python3 data/arc.py --dataset-folder ${DATASET_FOLDER}
```

Repeat the above for the `Easy` portion:
```console
python3 data/arc.py --dataset-folder ${DATASET_FOLDER} --easy
```

### Perplexity (C4)
For the perplexity measurement, we use 128 randomly selected text snippets from the validation portion of the [C4 dataset](https://huggingface.co/datasets/c4). Once you download the dataset, run the following from the root of the repository to extract and normalize the data:

```console
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${VALIDATION_FOLDER} \
--portion validation
```

Replace `${REPOSITORY_FOLDER}` with the path to the downloaded dataset repository and `${VALIDATION_FOLDER}` with a folder to hold the normalized data.

Then we sample 128 sequences from the normalized data:
```console
python3 data/c4-sample.py \
--dataset-folder ${VALIDATION_FOLDER} \
--portion valid
```
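Conceptually, this sampling step just draws 128 snippets uniformly at random from the normalized files. The sketch below is a hypothetical illustration only; the folder name, file layout, and seed are assumptions, and `data/c4-sample.py` is the source of truth.

```python
# Hypothetical sketch of the sampling step; see data/c4-sample.py for the actual logic.
import random
from pathlib import Path

VALIDATION_FOLDER = "path/to/normalized/validation"  # placeholder

# Assumption: each normalized file stores one text snippet per line.
snippets = []
for path in sorted(Path(VALIDATION_FOLDER).iterdir()):
    if path.is_file():
        snippets.extend(line for line in path.read_text().splitlines() if line.strip())

random.seed(0)  # assumed fixed seed so the 128-snippet sample is reproducible
sample = random.sample(snippets, k=128)
```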
### Quantization (C4)

We need a sample dataset for quantization algorithms (GPTQ, picoLLM). We use 128 randomly selected text snippets from the train portion of the [C4 dataset](https://huggingface.co/datasets/c4). Once you download the dataset, run the following from the root of the repository to extract and normalize the data:

```console
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${TRAIN_FOLDER} \
--portion train
```

Replace `${REPOSITORY_FOLDER}` with the path to the downloaded dataset repository and `${TRAIN_FOLDER}` with a folder to hold the normalized data.

Then we sample 128 sequences from the normalized data:
```console
python3 data/c4-sample.py \
--dataset-folder ${TRAIN_FOLDER} \
--portion train
```

## Models
We use six models:
- `Gemma-2b`
- `Gemma-7b`
- `Llama-2-7b`
- `Llama-3-8b`
- `Mistral-7b-v0.1`
- `Phi-2`

The corresponding picoLLM compressed models are available on [Picovoice Console](https://console.picovoice.ai/). We create GPTQ models using the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) package. You can quantize the models by running the following:

```console
python3 model/autogptq.py \
--model-uri ${MODEL_URI} \
--quantized-model-folder ${QUANTIZED_MODEL_FOLDER} \
--bits ${BITS}
```

## Usage
To measure the MMLU score for a given model, run the following:
```console
python3 mmlu.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.
To measure the ARC score for a given model, run the following:
```console
python3 arc.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.
To measure the perplexity for a given model, run the following:
```console
python3 perplexity.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
```

Replace `${COMPRESSION}` with the model's compression method: `NONE` for full-precision models, `GPTQ`, or `picoLLM`.
When running picoLLM Compressed models, you must also provide your Picovoice AccessKey, which is available on
[Picovoice Console](https://console.picovoice.ai/):

```console
... --picollm-access-key ${PICOLLM_ACCESS_KEY}
```
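For example, a perplexity run on a picoLLM-compressed model combines the flags shown above along these lines (the model path is a placeholder):

```console
python3 perplexity.py \
--compression picoLLM \
--model-uri ${PICOLLM_MODEL_PATH} \
--picollm-access-key ${PICOLLM_ACCESS_KEY}
```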
## Results

Below are our benchmark results comparing GPTQ against picoLLM for all [models](model). We perform 2-, 3-, and 4-bit quantization using GPTQ, then find the resulting model size in GB and set it as the target size for picoLLM Compression. Hence, both models have the same size in terms of the number of bytes. When performing GPTQ, we set the group size to 128, set the damp percent to 0.1, and enable activation reordering.
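In AutoGPTQ terms, those settings correspond roughly to a quantize config like the sketch below; this is an illustration under stated assumptions, not the repository's `model/autogptq.py`.

```python
# Illustrative sketch of the GPTQ settings described above, using AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,            # we also run 3- and 2-bit variants
    group_size=128,    # group size
    damp_percent=0.1,  # damp percent
    desc_act=True,     # activation reordering
)

# "meta-llama/Llama-2-7b-hf" is an example model id; the repository drives this
# step via `python3 model/autogptq.py --model-uri ... --bits ...` instead.
model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)
# model.quantize(calibration_examples)  # calibration_examples: tokenized C4 snippets
```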
### MMLU

The table below depicts the MMLU score of the original models.

| Model | MMLU |
| :--- | ---: |
| Gemma-2b 5.0G | 40.21 |
| Gemma-7b 17.1G | 64.48 |
| Llama-3-8b 16.1G | 64.88 |
| Llama-2-7b 13.5G | 46.38 |
| Mistral-7b-v0.1 15.0G | 62.41 |
| Phi-2 5.6G | 56.04 |
The table below depicts the MMLU score of the quantized models.

| Model | GPTQ | picoLLM |
| :--- | ---: | ---: |
| Gemma-2b 3.1G | 39.07 | 41.12 |
| Gemma-2b 2.9G | 27.51 | 41.12 |
| Gemma-2b 2.6G | 24.93 | 41.12 |
| Gemma-7b 7.2G | 62.58 | 64.98 |
| Gemma-7b 6.2G | 53.30 | 64.57 |
| Gemma-7b 5.2G | 25.58 | 64.32 |
| Llama-2-7b 3.9G | 45.26 | 44.99 |
| Llama-2-7b 3.1G | 40.40 | 40.68 |
| Llama-2-7b 2.3G | 25.36 | 28.72 |
| Llama-3-8b 5.7G | 63.09 | 64.96 |
| Llama-3-8b 4.9G | 53.86 | 64.76 |
| Llama-3-8b 4.0G | 25.05 | 61.26 |
| Mistral-7b-v0.1 4.2G | 61.00 | 59.19 |
| Mistral-7b-v0.1 3.3G | 23.73 | 57.72 |
| Mistral-7b-v0.1 2.4G | 25.70 | 43.53 |
| Phi-2 1.8G | 54.61 | 54.11 |
| Phi-2 1.5G | 50.64 | 52.24 |
| Phi-2 1.2G | 26.05 | 48.86 |
### ARC Easy

The table below depicts the ARC Easy score of the original models.

| Model | ARC Easy |
| :--- | ---: |
| Gemma-2b 5.0G | 33.75 |
| Gemma-7b 17.1G | 75.51 |
| Llama-2-7b 13.5G | 44.87 |
| Llama-3-8b 16.1G | 75.80 |
| Mistral-7b-v0.1 15.0G | 80.56 |
| Phi-2 5.6G | 75.25 |
The table below depicts the ARC Easy score of the quantized models.

| Model | GPTQ | picoLLM |
| :--- | ---: | ---: |
| Gemma-2b 3.1G | 30.39 | 34.39 |
| Gemma-2b 2.9G | 24.37 | 34.39 |
| Gemma-2b 2.6G | 23.82 | 34.39 |
| Gemma-7b 7.2G | 76.52 | 84.18 |
| Gemma-7b 6.2G | 44.28 | 84.51 |
| Gemma-7b 5.2G | 23.95 | 84.13 |
| Llama-2-7b 3.9G | 39.23 | 41.96 |
| Llama-2-7b 3.1G | 32.95 | 33.96 |
| Llama-2-7b 2.3G | 23.91 | 24.49 |
| Llama-3-8b 5.7G | 72.85 | 78.83 |
| Llama-3-8b 4.9G | 43.39 | 77.02 |
| Llama-3-8b 4.0G | 24.71 | 71.76 |
| Mistral-7b-v0.1 4.2G | 77.27 | 73.95 |
| Mistral-7b-v0.1 3.3G | 23.91 | 72.10 |
| Mistral-7b-v0.1 2.4G | 24.92 | 46.46 |
| Phi-2 1.8G | 70.45 | 75.04 |
| Phi-2 1.5G | 56.61 | 70.66 |
| Phi-2 1.2G | 22.10 | 62.42 |
### ARC Challenge

The table below depicts the ARC Challenge score of the original models.

| Model | ARC Challenge |
| :--- | ---: |
| Gemma-2b 5.0G | 30.38 |
| Gemma-7b 17.1G | 64.93 |
| Llama-2-7b 13.5G | 37.03 |
| Llama-3-8b 16.1G | 63.05 |
| Mistral-7b-v0.1 15.0G | 67.49 |
| Phi-2 5.6G | 61.60 |
The table below depicts the ARC Challenge score of the quantized models.

| Model | GPTQ | picoLLM |
| :--- | ---: | ---: |
| Gemma-2b 3.1G | 26.37 | 30.97 |
| Gemma-2b 2.9G | 23.55 | 30.97 |
| Gemma-2b 2.6G | 24.83 | 30.97 |
| Gemma-7b 7.2G | 66.30 | 72.35 |
| Gemma-7b 6.2G | 33.62 | 72.35 |
| Gemma-7b 5.2G | 24.06 | 72.61 |
| Llama-2-7b 3.9G | 32.42 | 34.30 |
| Llama-2-7b 3.1G | 27.56 | 28.24 |
| Llama-2-7b 2.3G | 21.16 | 23.63 |
| Llama-3-8b 5.7G | 60.24 | 64.33 |
| Llama-3-8b 4.9G | 36.18 | 63.48 |
| Llama-3-8b 4.0G | 23.29 | 57.85 |
| Mistral-7b-v0.1 4.2G | 64.42 | 60.49 |
| Mistral-7b-v0.1 3.3G | 24.06 | 59.04 |
| Mistral-7b-v0.1 2.4G | 23.21 | 37.80 |
| Phi-2 1.8G | 57.42 | 62.46 |
| Phi-2 1.5G | 44.97 | 57.51 |
| Phi-2 1.2G | 24.49 | 47.87 |
### Perplexity

The table below depicts the perplexity of the original models.

| Model | Perplexity |
| :--- | ---: |
| Gemma-2b 5.0G | 16.79 |
| Gemma-7b 17.1G | 14.67 |
| Llama-2-7b 13.5G | 8.40 |
| Llama-3-8b 16.1G | 11.61 |
| Mistral-7b-v0.1 15.0G | 10.50 |
| Phi-2 5.6G | 17.38 |
The table below depicts the perplexity of the quantized models.

| Model | GPTQ | picoLLM |
| :--- | ---: | ---: |
| Gemma-2b 3.1G | 17.85 | 16.86 |
| Gemma-2b 2.9G | 24.11 | 16.86 |
| Gemma-2b 2.6G | 8377.74 | 16.86 |
| Gemma-7b 7.2G | 15.47 | 14.82 |
| Gemma-7b 6.2G | 27.29 | 14.84 |
| Gemma-7b 5.2G | 33370970.40 | 15.08 |
| Llama-2-7b 3.9G | 8.59 | 8.50 |
| Llama-2-7b 3.1G | 9.66 | 8.86 |
| Llama-2-7b 2.3G | 67.43 | 10.87 |
| Llama-3-8b 5.7G | 12.31 | 11.73 |
| Llama-3-8b 4.9G | 17.47 | 11.90 |
| Llama-3-8b 4.0G | 712.70 | 12.67 |
| Mistral-7b-v0.1 4.2G | 10.43 | 10.62 |
| Mistral-7b-v0.1 3.3G | 2909.83 | 10.81 |
| Mistral-7b-v0.1 2.4G | 1176.43 | 14.87 |
| Phi-2 1.8G | 18.15 | 17.76 |
| Phi-2 1.5G | 19.94 | 18.14 |
| Phi-2 1.2G | 76.55 | 20.22 |