
# Simple Accuracy Benchmark for Optimized LLMs
This project has been moved to [openvino.genai](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark).

A simple and quick accuracy test for compressed, quantized, pruned, and distilled LLMs. It works with any model that supports the HuggingFace Transformers text generation API, including:
* HuggingFace Transformers compressed models via [Bitsandbytes](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig)
* [GPTQ](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig) via HuggingFace API
* Llama.cpp via [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm)
* [OpenVINO](https://github.com/openvinotoolkit/openvino) and [NNCF](https://github.com/openvinotoolkit/nncf) via [Optimum-Intel](https://github.com/huggingface/optimum-intel)

The main idea is to compare the similarity of text generated by the baseline and the optimized LLM.

The API also provides a way to investigate the worst generated text examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import whowhatbench

model_id = "facebook/opt-1.3b"
# Baseline (full-precision) model and its 4-bit optimized counterpart
base_small = AutoModelForCausalLM.from_pretrained(model_id)
optimized_model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The evaluator generates reference answers with the baseline model
evaluator = whowhatbench.Evaluator(base_model=base_small, tokenizer=tokenizer)
metrics_per_prompt, metrics = evaluator.score(optimized_model)

metric_of_interest = "similarity"
print(metric_of_interest, ": ", metrics["similarity"][0])

# Inspect the prompts where the optimized model diverges most from the baseline
worst_examples = evaluator.worst_examples(top_k=5, metric=metric_of_interest)
print("Metric: ", metric_of_interest)
for e in worst_examples:
    print("\t=========================")
    print("\tPrompt: ", e["prompt"])
    print("\tBaseline Model:\n ", "\t" + e["source_model"])
    print("\tOptimized Model:\n ", "\t" + e["optimized_model"])

```

You can use your own list of prompts for the comparison (e.g. from a dataset):
```python
from datasets import load_dataset
val = load_dataset("lambada", split="validation[20:40]")
prompts = val["text"]
...
metrics_per_prompt, metrics = evaluator.score(optimized_model, test_data=prompts)
```
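
The evaluator is backend-agnostic: any model object that implements the HuggingFace text generation API can be scored. As a minimal sketch (assuming `optimum-intel` with OpenVINO support is installed; the model id and export-on-the-fly are illustrative), an OpenVINO model loaded via Optimum-Intel can be compared against the PyTorch baseline in the same way:

```python
from optimum.intel import OVModelForCausalLM  # assumes optimum-intel is installed
from transformers import AutoModelForCausalLM, AutoTokenizer
import whowhatbench

model_id = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# PyTorch baseline and an OpenVINO version exported on the fly
base_model = AutoModelForCausalLM.from_pretrained(model_id)
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)

evaluator = whowhatbench.Evaluator(base_model=base_model, tokenizer=tokenizer)
metrics_per_prompt, metrics = evaluator.score(ov_model)
print("similarity:", metrics["similarity"][0])
```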

### Installing

```
git clone https://github.com/andreyanufr/who_what_benchmark.git
cd who_what_benchmark
python -m venv eval_env
source eval_env/bin/activate
pip install -r requirements.txt
```

### CLI example

```
# run text generation for original model
python3 generate.py --modeltype causal --model meta-llama/Llama-2-7b-chat-hf --save_generations_path gold_llama-2-7b-chat-hf.csv --csv simple.csv --trust_remote_code

# convert and compress llama with optimum-intel and nncf and save it to some folder
...

# run text generation for the compressed models
python3 generate.py --modeltype ov_causal --model /home/user/models/meta-llama/Llama-2-7b-chat-hf-int8 --save_generations_path predict_llama-2-7b-chat-hf_int8.csv --csv simple.csv --trust_remote_code

python3 generate.py --modeltype ov_causal --model /home/user/models/meta-llama/Llama-2-7b-chat-hf-int4_sym --save_generations_path predict_llama-2-7b-chat-hf_int4_sym.csv --csv simple.csv --trust_remote_code

python3 generate.py --modeltype ov_causal --model /home/user/models/meta-llama/Llama-2-7b-chat-hf-int4_asym --save_generations_path predict_llama-2-7b-chat-hf_int4_asym.csv --csv simple.csv --trust_remote_code

for file in predict_llama-2-7b*; do
    python3 evaluate.py --gold gold_llama-2-7b-chat-hf.csv --prediction $file --save_evaluation_path eval_$file 2>&1 | tee -a eval.log
done
```
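
The `--csv` input used above is a plain CSV file with a `questions` column (see the Notes section). As a minimal sketch, such a file could be produced with pandas; the prompts below are made up for illustration:

```python
import pandas as pd

# Hypothetical prompts; use questions relevant to your own use case
questions = [
    "What is the capital of France?",
    "Explain the difference between a list and a tuple in Python.",
]
pd.DataFrame({"questions": questions}).to_csv("simple.csv", index=False)
```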

### Supported metrics

* `similarity` - averaged similarity measured by a neural network trained for sentence embeddings. The best value is 1.0, the minimum is 0.0; higher is better.
* `FDT` - average position of the first divergent token between sentences generated by the two LLMs. The worst value is 0; higher is better. [Paper.](https://arxiv.org/abs/2311.01544)
* `FDT norm` - average share of matched tokens up to the first divergent one between sentences generated by the two LLMs. The best value is 1; higher is better. [Paper.](https://arxiv.org/abs/2311.01544)
* `SDT` - average number of divergent tokens in the evaluated outputs between sentences generated by the two LLMs. The best value is 0; lower is better. [Paper.](https://arxiv.org/abs/2311.01544)
* `SDT norm` - average share of divergent tokens in the evaluated outputs between sentences generated by the two LLMs. The best value is 0, the maximum is 1; lower is better. [Paper.](https://arxiv.org/abs/2311.01544)
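
To illustrate how the divergence metrics relate to each other, here is a simplified, position-wise sketch of FDT and SDT for one pair of already tokenized generations. It is not the benchmark's actual implementation; see the linked paper for the exact definitions:

```python
# Simplified illustration of the divergence metrics for one pair of generations
# (lists of token ids); not the benchmark's actual code.
def divergence_metrics(ref_tokens, opt_tokens):
    n = min(len(ref_tokens), len(opt_tokens))
    # FDT: index of the first token at which the two generations diverge
    fdt = next((i for i in range(n) if ref_tokens[i] != opt_tokens[i]), n)
    # SDT: number of positions at which the generations diverge
    sdt = sum(1 for a, b in zip(ref_tokens, opt_tokens) if a != b)
    return {
        "FDT": fdt,
        "FDT norm": fdt / n if n else 1.0,
        "SDT": sdt,
        "SDT norm": sdt / n if n else 0.0,
    }

print(divergence_metrics([1, 2, 3, 4], [1, 2, 5, 4]))
# {'FDT': 2, 'FDT norm': 0.5, 'SDT': 1, 'SDT norm': 0.25}
```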

### Notes

* The file written to `--save_evaluation_path` contains per-sample similarity metrics.
* The input CSV file for generation must contain a column named `questions` (see `simple.csv` for an example).
* An example of generated output can be found in `generations.csv`.
* `evaluate.py` uses [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) for the similarity measurement, but other comparable networks can be used; a minimal sketch of this measurement is shown below.
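
The sketch below shows how such a similarity score can be computed with the sentence-transformers library (cosine similarity between embeddings of the baseline and optimized outputs). The example texts are made up, and the benchmark's own implementation may differ in details:

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model referenced above; any comparable model works
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

baseline_text = "The quick brown fox jumps over the lazy dog."
optimized_text = "The quick brown fox leaps over the lazy dog."

embeddings = model.encode([baseline_text, optimized_text], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"similarity: {similarity:.3f}")  # 1.0 means identical outputs, 0.0 means no similarity
```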