# GPT-4 Shows Comparable Performance to Human Examiners in Ranking Open-Text Answers
Accompanying code for the paper "Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers" by:

Zubaer, Abdullah Al; Granitzer, Michael; Geschwind, Stephan; Graf Lambsdorff, Johann; Voss, Deborah

Affiliation (all authors): University of Passau, Passau, Germany.

Pre-print: https://doi.org/10.21203/rs.3.rs-4431780/v1

Dataset DOI: [10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)
# Inter-Rater Reliability Evaluation
This repository includes Python scripts to evaluate Inter-Rater Reliability (IRR) on the provided data files and save the results to a designated folder. It also provides scripts to generate rank and point assessments using GPT-4 for all the prompts mentioned in our paper.
It consists of two Python scripts:
1. `create_rank_point_gpt4.py`: Generates rank and point data using GPT-4 models.
2. `rank_point_assessment.py`: Evaluates IRR based on generated data or original data.
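For orientation, the sketch below illustrates the kind of rank-agreement statistic an IRR evaluation is built on. It uses Kendall's tau from SciPy on made-up ranks; it is not the statistic or code used by `rank_point_assessment.py`, just a minimal illustration.

```python
# Minimal illustration of a rank-agreement (IRR) statistic: Kendall's tau between
# two raters' rankings of the same answers. This is NOT the repository's implementation.
from scipy.stats import kendalltau

# Hypothetical example: ranks assigned to five answers by a human examiner and by GPT-4.
human_ranks = [1, 2, 3, 4, 5]
gpt4_ranks = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(human_ranks, gpt4_ranks)
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3f})")
```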
## Installation

Tested on Ubuntu 22.04.4 LTS.

> To set up the project, you need [Anaconda](https://www.anaconda.com/) installed on your system. If you don't have it installed, you can download it from [here](https://www.anaconda.com/download/success).
> To use the OpenAI GPT-4 API, you need an API key. You can obtain one from [here](https://beta.openai.com/signup/), and to set the key as an environment variable, follow the instructions from [here](https://mkyong.com/linux/how-to-set-environment-variable-in-ubuntu/). The key is read in the `helper_functions.py` file like this: `openai.api_key = os.getenv("OPENAI_API_KEY")`.
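In other words, the key is picked up from the environment roughly as follows (a minimal sketch; the extra sanity check is illustrative and not necessarily part of `helper_functions.py`):

```python
import os

import openai

# The key is read from the environment, as in helper_functions.py.
openai.api_key = os.getenv("OPENAI_API_KEY")

# Optional sanity check (illustrative; not necessarily part of the repository code).
if openai.api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the scripts.")
```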
1. Create a conda environment
```bash
# Replace <env_name> with an environment name of your choice
conda create -n <env_name> python=3.10
conda activate <env_name>
```

2. **Clone the Repository**:
```bash
git clone https://github.com/abdullahalzubaer/Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers.git
cd Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers
```

3. **Install Dependencies**
Install the required Python packages using pip:
```bash
pip install -r requirements.txt
```
# Usage
The two scripts can be used sequentially or independently, depending on the data you have. Below are instructions for each scenario.
## 1. Working with Original Data
If you want to evaluate the original data ([10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)) directly, without generating any new ranks or points using GPT-4, use the following commands:

> The data must be present in the `./original_data/` directory.

### 1.1 Rank and Point Assessment for Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -pq
```
This will calculate the rank and point assessments for per-question IRR and save results in the `./original_data_results/...` directories.

### 1.2 Rank and Point Assessment Without Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point
```
This calculates the rank and point assessments and saves pooled results in the `./original_data_results/...` directories.

## 2. Generating and Evaluating New Ranks and Points Using GPT-4
To generate new ranks and points using GPT-4, follow these steps:
### 2.1 Run the script to generate ranks and points:
```bash
python create_rank_point_gpt4.py
```
The generated data will be saved in the `./newly_generated_data/` directory.
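For context, `create_rank_point_gpt4.py` relies on the OpenAI GPT-4 API configured above. A heavily simplified sketch of such a call is shown below; the prompt wording, model identifier, and response handling are illustrative assumptions, not the script's actual code.

```python
import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Hypothetical prompt asking GPT-4 to rank a set of open-text answers and assign points.
answers = ["Answer A ...", "Answer B ...", "Answer C ..."]
prompt = "Rank the following answers from best to worst and assign points to each:\n" + "\n".join(
    f"{i + 1}. {answer}" for i, answer in enumerate(answers)
)

response = openai.ChatCompletion.create(
    model="gpt-4",  # model identifier is an assumption; see the script for the exact configuration
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```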
> Note: In rare instances, the GPT-4-generated data may not be perfectly parseable due to the non-deterministic nature of the model. Parsing issues can result in incomplete data. In such cases, it may be necessary to manually inspect and correct the rank and point information in the `data_gpt4_ranks_points` file, using the metadata file `metadata_gpt4_ranks_points` as a reference.
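If parsing does fail, a small throwaway check can help locate the affected entries. The sketch below is purely illustrative and assumes rank/point pairs appear as `Rank: <n>, Points: <n>` in the raw model output, which may not match the actual format used by the scripts.

```python
import re

# Purely illustrative: scan raw GPT-4 output for "Rank: <n>, Points: <n>" pairs.
# The actual format produced and parsed by the repository scripts may differ.
raw_output = "Answer 3 -> Rank: 1, Points: 10\nAnswer 1 -> Rank: 2, Points: 7"

pattern = re.compile(r"Rank:\s*(\d+),\s*Points:\s*(\d+)")
matches = pattern.findall(raw_output)

if not matches:
    print("No rank/point pairs found - inspect the metadata file and fix the data manually.")
else:
    for rank, points in matches:
        print(f"rank={rank}, points={points}")
```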
### 2.2 Evaluate the generated data using the rank and point assessment script:

#### 2.2.1 Rank and Point Assessment with Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -pq -nd
```
This will calculate the rank and point assessments for per-question IRR and save results in the `./newly_generated_data_results/...` directories.

#### 2.2.2 Rank and Point Assessment Without Per-Question IRR:
```bash
python rank_point_assessment.py -at rank point -nd
```
This calculates the rank and point assessments and saves pooled results in the `./newly_generated_data_results/...` directories.

# License
This code is licensed under the Apache-2.0 license. See the LICENSE file for details.
The dataset license is mentioned here: [10.5281/zenodo.11085379](https://doi.org/10.5281/zenodo.11085379)

# Citation
If you use our dataset or code, please cite the data source and our paper. Proper citation helps to ensure continued support for the project and acknowledges the work of the authors.
Dataset Citation:
```
@dataset{zubaer_2024_11085379,
author = {Zubaer, Abdullah Al and
Granitzer, Michael and
Geschwind, Stephan and
Graf Lambsdorff, Johann and
Voss, Deborah},
title = {{GPT-4 Shows Comparable Performance to Human
Examiners in Ranking Open-Text Answers}},
month = apr,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.11085379},
url = {https://doi.org/10.5281/zenodo.11085379}
}
```

Paper Citation [pre-print]:
```
Abdullah Al Zubaer, Michael Granitzer, Stephan Geschwind et al.
Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers,
10 July 2024, PREPRINT (Version 1) available at Research Square
[https://doi.org/10.21203/rs.3.rs-4431780/v1]
```