# GPT-4 Shows Comparable Performance to Human Examiners in Ranking Open-Text Answers
Accompanying code for the paper "Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers" by:

Zubaer, Abdullah Al; Granitzer, Michael; Geschwind, Stephan; Graf Lambsdorff, Johann; Voss, Deborah

Affiliation (all authors): University of Passau, Passau, Germany.

Pre-print: https://doi.org/10.21203/rs.3.rs-4431780/v1

Dataset DOI: [10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)
# Inter-Rater Reliability Evaluation
This repository includes Python scripts to evaluate Inter-Rater Reliability (IRR) on the provided data files and save the results to a designated folder. It also provides scripts to generate rank and point assessments using GPT-4 for all the prompts mentioned in our paper.
It consists of two Python scripts:
1. `create_rank_point_gpt4.py`: Generates rank and point data using GPT-4 models.
2. `rank_point_assessment.py`: Evaluates IRR based on generated data or original data.
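For orientation, the sketch below illustrates the kind of rank-agreement statistic an IRR evaluation is built on. It uses Kendall's tau from SciPy on made-up ranks; it is not the statistic or code used by `rank_point_assessment.py`, just a minimal illustration.

```python
# Minimal illustration of a rank-agreement (IRR) statistic: Kendall's tau between
# two raters' rankings of the same answers. This is NOT the repository's implementation.
from scipy.stats import kendalltau

# Hypothetical example: ranks assigned to five answers by a human examiner and by GPT-4.
human_ranks = [1, 2, 3, 4, 5]
gpt4_ranks = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(human_ranks, gpt4_ranks)
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3f})")
```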
## Installation

Tested on Ubuntu 22.04.4 LTS.

> To set up the project, you need [Anaconda](https://www.anaconda.com/) installed on your system. If you don't have it installed, you can download it from [here](https://www.anaconda.com/download/success).
> To use the OpenAI GPT-4 API, you need an API key. You can obtain one from [here](https://beta.openai.com/signup/), and to set the key as an environment variable, follow the instructions from [here](https://mkyong.com/linux/how-to-set-environment-variable-in-ubuntu/). The key is read in the `helper_functions.py` file like this: `openai.api_key = os.getenv("OPENAI_API_KEY")`.
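In other words, the key is picked up from the environment roughly as follows (a minimal sketch; the extra sanity check is illustrative and not necessarily part of `helper_functions.py`):

```python
import os

import openai

# The key is read from the environment, as in helper_functions.py.
openai.api_key = os.getenv("OPENAI_API_KEY")

# Optional sanity check (illustrative; not necessarily part of the repository code).
if openai.api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the scripts.")
```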
1. Create a conda environment
```bash
# Replace <env_name> with an environment name of your choice
conda create -n <env_name> python=3.10
conda activate <env_name>
```

2. **Clone the Repository**:
```bash
git clone https://github.com/abdullahalzubaer/Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers.git
cd Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers
```

3. **Install Dependencies**
Install the required Python packages using pip:
```bash
pip install -r requirements.txt
```
# Usage
The two scripts can be used sequentially or independently, depending on the data you have. Below are instructions for each scenario.
## 1. Working with Original Data
If you want to evaluate the original data ([10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)) directly, without generating any new ranks or points using GPT-4, use the following commands:

> The data must be present in the `./original_data/` directory.

### 1.1 Rank and Point Assessment for Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -pq
```
This will calculate the rank and point assessments for per-question IRR and save results in the `./original_data_results/...` directories.

### 1.2 Rank and Point Assessment Without Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point
```
This calculates the rank and point assessments and saves pooled results in the `./original_data_results/...` directories.

## 2. Generating and Evaluating New Ranks and Points Using GPT-4
To generate new ranks and points using GPT-4, follow these steps:
### 2.1 Run the script to generate ranks and points:
```bash
python create_rank_point_gpt4.py
```
The generated data will be saved in the `./newly_generated_data/` directory.
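For context, `create_rank_point_gpt4.py` relies on the OpenAI GPT-4 API configured above. A heavily simplified sketch of such a call is shown below; the prompt wording, model identifier, and response handling are illustrative assumptions, not the script's actual code.

```python
import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Hypothetical prompt asking GPT-4 to rank a set of open-text answers and assign points.
answers = ["Answer A ...", "Answer B ...", "Answer C ..."]
prompt = "Rank the following answers from best to worst and assign points to each:\n" + "\n".join(
    f"{i + 1}. {answer}" for i, answer in enumerate(answers)
)

response = openai.ChatCompletion.create(
    model="gpt-4",  # model identifier is an assumption; see the script for the exact configuration
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```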
> Note: In rare instances, the GPT-4-generated data may not be perfectly parseable due to the non-deterministic nature of the model. Parsing issues can result in incomplete data. In such cases, it may be necessary to manually inspect and correct the rank and point information in the `data_gpt4_ranks_points` file, using the metadata file `metadata_gpt4_ranks_points` as a reference.
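If parsing does fail, a small throwaway check can help locate the affected entries. The sketch below is purely illustrative and assumes rank/point pairs appear as `Rank: <n>, Points: <n>` in the raw model output, which may not match the actual format used by the scripts.

```python
import re

# Purely illustrative: scan raw GPT-4 output for "Rank: <n>, Points: <n>" pairs.
# The actual format produced and parsed by the repository scripts may differ.
raw_output = "Answer 3 -> Rank: 1, Points: 10\nAnswer 1 -> Rank: 2, Points: 7"

pattern = re.compile(r"Rank:\s*(\d+),\s*Points:\s*(\d+)")
matches = pattern.findall(raw_output)

if not matches:
    print("No rank/point pairs found - inspect the metadata file and fix the data manually.")
else:
    for rank, points in matches:
        print(f"rank={rank}, points={points}")
```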
### 2.2 Evaluate the generated data using the rank and point assessment script:

#### 2.2.1 Rank and Point Assessment with Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -pq -nd
```
This will calculate the rank and point assessments for per-question IRR and save results in the `./newly_generated_data_results/...` directories.

#### 2.2.2 Rank and Point Assessment Without Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -nd
```
This calculates the rank and point assessments and saves pooled results in the `./newly_generated_data_results/...` directories.

# License
This code is licensed under the Apache-2.0 license. See the LICENSE file for details.
The dataset license is mentioned here: [10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)

# Citation
If you use our dataset or code, please cite the data source and our paper. Proper citation helps to ensure continued support for the project and acknowledges the work of the authors.
Dataset Citation:
```
@dataset{zubaer_2024_11085379,
author = {Zubaer, Abdullah Al and
Granitzer, Michael and
Geschwind, Stephan and
Graf Lambsdorff, Johann and
Voss, Deborah},
title = {{GPT-4 Shows Comparable Performance to Human
Examiners in Ranking Open-Text Answers}},
month = apr,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.11085379},
url = {https://doi.org/10.5281/zenodo.11085379}
}
```

Paper Citation [pre-print]:
```
Abdullah Al Zubaer, Michael Granitzer, Stephan Geschwind et al.
Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers,
10 July 2024, PREPRINT (Version 1) available at Research Square
[https://doi.org/10.21203/rs.3.rs-4431780/v1]
```