https://github.com/arbml/taqyim
Python interface for evaluation on ChatGPT models
- Host: GitHub
- URL: https://github.com/arbml/taqyim
- Owner: ARBML
- License: mit
- Created: 2023-05-27T10:43:49.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-13T12:56:11.000Z (about 1 year ago)
- Last Synced: 2024-08-02T01:25:51.348Z (9 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 5.72 MB
- Stars: 19
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ChatGPT-repositories - Taqyim - Python interface for evaluation on ChatGPT models (Others)
README
# Taqyim تقييم
A library for evaluating Arabic NLP datasets on ChatGPT models.
## Installation

Clone the repository, then install it in editable mode from the repository root:

```
pip install -e .
```

## Example
```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="",  # your OpenAI API key
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
```
## Run on custom dataset
[custom_dataset.ipynb](notebooks/custom_dataset.ipynb) has a complete example on how to run evaluation on a custom dataset.
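If you prefer a script over the notebook, one plausible route, assuming the pipeline loads datasets by name from the Hugging Face Hub as in the example above, is to publish your data as a Hub dataset and pass its name as `dataset_name`. This is only a sketch under that assumption; the repository id `your-username/my-sentiment-data` and the records below are placeholders, and the notebook remains the reference workflow.

```python
from datasets import Dataset, DatasetDict

# Build train/test splits from in-memory records (Dataset.from_csv works too).
train = Dataset.from_dict({"content": ["خدمة ممتازة", "تجربة سيئة"], "label": [1, 0]})
test = Dataset.from_dict({"content": ["منتج رائع"], "label": [1]})

# Push the splits to the Hugging Face Hub (requires `huggingface-cli login`),
# then pass "your-username/my-sentiment-data" as dataset_name to tq.Pipeline.
DatasetDict({"train": train, "test": test}).push_to_hub("your-username/my-sentiment-data")
```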
## Parameters
- `eval_name` a name for the evaluation run
- `task_class` class name from the supported class names below
- `task_description` short description of the task
- `dataset_name` dataset name used for evaluation
- `subset` subset name, if the dataset has one
- `train_split` train split name in the dataset
- `test_split` test split name in the dataset
- `input_column_name` input column name in the dataset
- `target_column_name` target column name in the dataset
- `prompt` the prompt fed to the model
- `api_key` API key from [keys](https://platform.openai.com/account/api-keys)
- `preprocessing_fn` function used to preprocess inputs and targets (see the sketch after this list)
- `threads` number of threads used to call the API
- `threads_timeout` thread timeout
- `max_samples` maximum number of samples from the dataset used for evaluation
- `model_name` choose either `gpt-3.5-turbo-0301` or `gpt-4-0314`
- `temperature` sampling temperature passed to the model, between 0 and 2; higher values give more random outputs
- `num_few_shot` number of few-shot samples used for evaluation
- `resume_from_record` if `True`, the run resumes from the first sample that has no recorded result
- `seed` seed to reproduce the results
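As a minimal sketch, assuming every parameter above maps directly to a keyword argument of `tq.Pipeline`, a fuller configuration might look as follows. The `normalize` helper, the threading and few-shot values, and the exact signature expected by `preprocessing_fn` are illustrative assumptions; see the scripts under [examples](examples/) for maintained usage.

```python
import taqyim as tq

def normalize(example):
    # Hypothetical preprocessing helper: strips whitespace from the input text.
    # The exact input/output expected by preprocessing_fn is an assumption here.
    example["content"] = example["content"].strip()
    return example

pipeline = tq.Pipeline(
    eval_name="ajgt-fewshot",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="",                  # OpenAI API key
    train_split="train",
    test_split="test",
    model_name="gpt-4-0314",
    preprocessing_fn=normalize,  # optional preprocessing of inputs and targets
    threads=4,                   # number of parallel API calls
    threads_timeout=60,          # per-thread timeout
    max_samples=100,
    temperature=0.0,             # lower temperature, less random outputs
    num_few_shot=3,              # number of few-shot samples
    resume_from_record=True,     # resume an interrupted run
    seed=42,                     # for reproducibility
)
pipeline.run()
pipeline.get_final_report()
```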
## Supported Classes and Tasks

* `Classification` classification tasks, see [classification.py](examples/classification.py).
* `Pos_Tagging` part-of-speech tagging, see [pos_tagging.py](examples/pos_tagging.py).
* `Translation` machine translation, see [translation.py](examples/translation.py) (a configuration sketch follows this list).
* `Summarization` text summarization, see [summarization.py](examples/summarization.py).
* `MCQ` multiple-choice question answering, see [mcq.py](examples/mcq.py).
* `Rating` rating the outputs of multiple LLMs, see [rating.py](examples/rating.py).
* `Diacritization` diacritization, see [diacritization.py](examples/diacritization.py).
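Switching tasks mostly means changing `task_class`, the column names, and the prompt. The sketch below shows what a translation run could look like; the dataset name, column names, and prompt are illustrative placeholders rather than taqyim defaults, and [translation.py](examples/translation.py) is the maintained example.

```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="translation-demo",
    dataset_name="your-username/ar-en-parallel",  # placeholder parallel corpus
    task_class="Translation",
    task_description="Arabic to English machine translation",
    input_column_name="arabic",    # placeholder source column
    target_column_name="english",  # placeholder reference column
    prompt="Translate the following sentence from Arabic to English",
    api_key="",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=10,
)
pipeline.run()
pipeline.get_final_report()  # eval metrics for the run
```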
# Evaluation on Arabic Tasks

|Tasks |Dataset |Size |Metrics |GPT-3.5 |GPT-4 |SoTA|
| :--- | :---: | :---: | :---: | :---: | :---: |:---:|
|Summarization |[EASC](https://huggingface.co/datasets/arbml/EASC) |153 |RougeL |23.5 |18.25 |13.3|
|PoS Tagging |[PADT](https://huggingface.co/datasets/universal_dependencies/viewer/ar_padt/train) |680 |Accuracy |75.91 |86.29 |96.83|
|Classification |[AJGT](https://huggingface.co/datasets/ajgt_twitter_ar) |360 |Accuracy |86.94 |90.30 |96.11|
|Transliteration |[BOLT Egyptian](https://catalog.ldc.upenn.edu/LDC2021T17)✢ |6,653 |BLEU |13.76 |27.66 |65.88|
|Translation |[UN v1](https://drive.google.com/file/d/13GI1F1hvwpMUGBSa0QC6ov4eE57GC_Zx/view) |4,000 |BLEU |35.05 |38.83 |53.29|
|Paraphrasing |[APB](https://github.com/marwah2001/Arabic-Paraphrasing-Benchmark) |1,010 |BLEU |4.295 |6.104 |17.52|
|Diacritization |[WikiNews](https://aclanthology.org/W17-1302/)✢✢ |393 |WER/DER |32.74/10.29 |38.06/11.64 |4.49/1.21|

✢ BOLT requires an LDC subscription
✢✢ WikiNews is not public; contact the [authors](https://aclanthology.org/W17-1302/) to access the dataset
## Citation

```
@misc{alyafeai2023taqyim,
title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
year={2023},
eprint={2306.16322},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```