https://github.com/arbml/taqyim
Python interface for evaluation on ChatGPT models
- Host: GitHub
- URL: https://github.com/arbml/taqyim
- Owner: ARBML
- License: mit
- Created: 2023-05-27T10:43:49.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-13T12:56:11.000Z (about 1 year ago)
- Last Synced: 2024-08-02T01:25:51.348Z (9 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 5.72 MB
- Stars: 19
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ChatGPT-repositories - Taqyim - Python interface for evaluation on ChatGPT models (Others)
README
# Taqyim تقييم
A library for evaluating Arabic NLP datasets on ChatGPT models.
## Installation

Clone the repository, then install it in editable mode from the repository root:

```
pip install -e .
```

## Example
```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="",  # your OpenAI API key
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
```
## Run on custom dataset
[custom_dataset.ipynb](notebooks/custom_dataset.ipynb) has a complete example on how to run evaluation on a custom dataset.
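If you prefer a script over the notebook, one plausible route, assuming the pipeline loads datasets by name from the Hugging Face Hub as in the example above, is to publish your data as a Hub dataset and pass its name as `dataset_name`. This is only a sketch under that assumption; the repository id `your-username/my-sentiment-data` and the records below are placeholders, and the notebook remains the reference workflow.

```python
from datasets import Dataset, DatasetDict

# Build train/test splits from in-memory records (Dataset.from_csv works too).
train = Dataset.from_dict({"content": ["خدمة ممتازة", "تجربة سيئة"], "label": [1, 0]})
test = Dataset.from_dict({"content": ["منتج رائع"], "label": [1]})

# Push the splits to the Hugging Face Hub (requires `huggingface-cli login`),
# then pass "your-username/my-sentiment-data" as dataset_name to tq.Pipeline.
DatasetDict({"train": train, "test": test}).push_to_hub("your-username/my-sentiment-data")
```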
## Parameters
- `eval_name` a name for the evaluation run
- `task_class` class name from the supported class names below
- `task_description` short description of the task
- `dataset_name` dataset name used for evaluation
- `subset` subset name, if the dataset has one
- `train_split` train split name in the dataset
- `test_split` test split name in the dataset
- `input_column_name` input column name in the dataset
- `target_column_name` target column name in the dataset
- `prompt` the prompt fed to the model
- `api_key` API key from [keys](https://platform.openai.com/account/api-keys)
- `preprocessing_fn` function used to preprocess inputs and targets (see the sketch after this list)
- `threads` number of threads used to call the API
- `threads_timeout` thread timeout
- `max_samples` maximum number of samples from the dataset used for evaluation
- `model_name` choose either `gpt-3.5-turbo-0301` or `gpt-4-0314`
- `temperature` sampling temperature passed to the model, between 0 and 2; higher values give more random outputs
- `num_few_shot` number of few-shot samples used for evaluation
- `resume_from_record` if `True`, the run resumes from the first sample that has no recorded result
- `seed` seed to reproduce the results
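As a minimal sketch, assuming every parameter above maps directly to a keyword argument of `tq.Pipeline`, a fuller configuration might look as follows. The `normalize` helper, the threading and few-shot values, and the exact signature expected by `preprocessing_fn` are illustrative assumptions; see the scripts under [examples](examples/) for maintained usage.

```python
import taqyim as tq

def normalize(example):
    # Hypothetical preprocessing helper: strips whitespace from the input text.
    # The exact input/output expected by preprocessing_fn is an assumption here.
    example["content"] = example["content"].strip()
    return example

pipeline = tq.Pipeline(
    eval_name="ajgt-fewshot",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="",                  # OpenAI API key
    train_split="train",
    test_split="test",
    model_name="gpt-4-0314",
    preprocessing_fn=normalize,  # optional preprocessing of inputs and targets
    threads=4,                   # number of parallel API calls
    threads_timeout=60,          # per-thread timeout
    max_samples=100,
    temperature=0.0,             # lower temperature, less random outputs
    num_few_shot=3,              # number of few-shot samples
    resume_from_record=True,     # resume an interrupted run
    seed=42,                     # for reproducibility
)
pipeline.run()
pipeline.get_final_report()
```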
## Supported Classes and Tasks

* `Classification` classification tasks, see [classification.py](examples/classification.py).
* `Pos_Tagging` part-of-speech tagging, see [pos_tagging.py](examples/pos_tagging.py).
* `Translation` machine translation, see [translation.py](examples/translation.py) (a configuration sketch follows this list).
* `Summarization` text summarization, see [summarization.py](examples/summarization.py).
* `MCQ` multiple-choice question answering, see [mcq.py](examples/mcq.py).
* `Rating` rating the outputs of multiple LLMs, see [rating.py](examples/rating.py).
* `Diacritization` diacritization, see [diacritization.py](examples/diacritization.py).
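Switching tasks mostly means changing `task_class`, the column names, and the prompt. The sketch below shows what a translation run could look like; the dataset name, column names, and prompt are illustrative placeholders rather than taqyim defaults, and [translation.py](examples/translation.py) is the maintained example.

```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="translation-demo",
    dataset_name="your-username/ar-en-parallel",  # placeholder parallel corpus
    task_class="Translation",
    task_description="Arabic to English machine translation",
    input_column_name="arabic",    # placeholder source column
    target_column_name="english",  # placeholder reference column
    prompt="Translate the following sentence from Arabic to English",
    api_key="",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=10,
)
pipeline.run()
pipeline.get_final_report()  # eval metrics for the run
```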
# Evaluation on Arabic Tasks

|Tasks |Dataset |Size |Metrics |GPT-3.5 |GPT-4 |SoTA|
| :--- | :---: | :---: | :---: | :---: | :---: |:---:|
|Summarization |[EASC](https://huggingface.co/datasets/arbml/EASC) |153 |RougeL |23.5 |18.25 |13.3|
|PoS Tagging |[PADT](https://huggingface.co/datasets/universal_dependencies/viewer/ar_padt/train) |680 |Accuracy |75.91 |86.29 |96.83|
|Classification |[AJGT](https://huggingface.co/datasets/ajgt_twitter_ar) |360 |Accuracy |86.94 |90.30 |96.11|
|Transliteration |[BOLT Egyptian](https://catalog.ldc.upenn.edu/LDC2021T17)✢ |6,653 |BLEU |13.76 |27.66 |65.88|
|Translation |[UN v1](https://drive.google.com/file/d/13GI1F1hvwpMUGBSa0QC6ov4eE57GC_Zx/view) |4,000 |BLEU |35.05 |38.83 |53.29|
|Paraphrasing |[APB](https://github.com/marwah2001/Arabic-Paraphrasing-Benchmark) |1,010 |BLEU |4.295 |6.104 |17.52|
|Diacritization |[WikiNews](https://aclanthology.org/W17-1302/)✢✢ |393 |WER/DER |32.74/10.29 |38.06/11.64 |4.49/1.21|

✢ BOLT requires an LDC subscription
✢✢ WikiNews is not public; contact the [authors](https://aclanthology.org/W17-1302/) to access the dataset
## Citation

```
@misc{alyafeai2023taqyim,
title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
year={2023},
eprint={2306.16322},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```