https://github.com/mideind/automaticqapipeline


Host: GitHub
URL: https://github.com/mideind/automaticqapipeline
Owner: mideind
License: apache-2.0
Created: 2024-06-25T14:24:35.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-06-25T15:28:27.000Z (almost 2 years ago)
Last Synced: 2025-01-26T03:08:13.018Z (over 1 year ago)
Language: Python
Size: 14.6 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Automatic QA Pipeline

A pipeline to automatically generate questions and answers, which pertain to Icelandic culture and/or history, from a dataset.

To run the script, you need a Python 3 environment, version 3.10.13 or older. Install the required dependencies by running:
> pip install -r requirements.txt

The input dataset is assumed to be a JSONL file in which each document has the keys "url", "title" and "text", where "text" is the text used to create the questions and answers. An example of such a dataset is the [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia).
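
To check a dataset before running the pipeline, something like the following sketch may help. It only verifies that each line is valid JSON and carries the three keys named above. The function name `validate_jsonl_lines` and the example documents are illustrative and not part of this repository.

```python
import json

# Keys the README says every input document must have.
REQUIRED_KEYS = {"url", "title", "text"}


def validate_jsonl_lines(lines):
    """Return (line_number, missing_keys) for each document lacking a required key."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        document = json.loads(line)
        missing = REQUIRED_KEYS - set(document)
        if missing:
            problems.append((line_number, sorted(missing)))
    return problems


# Illustrative documents only; the second one is missing "text".
example_lines = [
    json.dumps({"url": "https://example.org/a", "title": "A", "text": "Some text."}),
    json.dumps({"url": "https://example.org/b", "title": "B"}),
]
print(validate_jsonl_lines(example_lines))  # [(2, ['text'])]
```

For a real file, pass the open file handle instead of the list, for example `validate_jsonl_lines(open("path/to/dataset", encoding="utf-8"))`.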

The pipeline consists of the following steps:
1. Convert the documents to requests for a GPT model.
2. Make API calls to a GPT model to generate questions and answers.
3. Filter the generated questions and answers based on scores given by GPT.
4. Correct spelling and grammar in the questions and answers.
5. Add information to the questions and answers.
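
As a rough illustration of step 3, the sketch below filters question and answer pairs by their scores. It assumes that the scores are numeric, that higher is better, and that a pair is kept only when both scores reach a threshold. The function name and the threshold values are hypothetical; the actual criteria used by the pipeline are not described here.

```python
def filter_by_score(pairs, min_question_score, min_document_score):
    """Keep pairs whose question and document scores both meet the thresholds.

    Assumption: scores are numbers and higher means better.
    """
    return [
        pair
        for pair in pairs
        if pair["question_score"] >= min_question_score
        and pair["document_score"] >= min_document_score
    ]


# Placeholder records, not real pipeline output.
example_pairs = [
    {"question": "Q1", "answer": "A1", "question_score": 5, "document_score": 4},
    {"question": "Q2", "answer": "A2", "question_score": 2, "document_score": 5},
]
kept = filter_by_score(example_pairs, min_question_score=4, min_document_score=4)
print([pair["question"] for pair in kept])  # ['Q1']
```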

Run the pipeline from the command line with:
> python run_qa_pipe.py --dataset-path path/to/dataset

The default GPT model is 'gpt-4-turbo'. To use a different model, pass its name with the `--gpt-model` flag:
> python run_qa_pipe.py --dataset-path path/to/dataset --gpt-model name-of-gpt-model

In order to make an API call to GPT, you need an API key. Instructions on how to obtain such a key are in the [OpenAI API reference](https://platform.openai.com/docs/api-reference/authentication).
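
A quick check that a key is available before starting a long run could look like the sketch below. It assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which is the variable the official OpenAI Python client reads by default. Whether this pipeline reads the key the same way is an assumption, and `get_api_key` is a hypothetical helper.

```python
import os


def get_api_key(environ=os.environ, variable="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    Assumption: the key is provided via the OPENAI_API_KEY variable.
    """
    key = environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(f"{variable} is not set; export it before running the pipeline.")
    return key
```

Calling `get_api_key()` at the start of a session fails fast with a clear message instead of partway through the API calls.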

By default, the pipeline corrects spelling and grammar in the questions and answers. To skip this step, pass the `--skip-correction` flag:
> python run_qa_pipe.py --dataset-path path/to/dataset --skip-correction
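
When the pipeline is launched from another Python script, the command can be assembled programmatically. The sketch below uses only the three flags shown above; `build_command` is a hypothetical helper, and it assumes `python` on the path points at the right environment.

```python
import shlex


def build_command(dataset_path, gpt_model=None, skip_correction=False):
    """Build the argument list for run_qa_pipe.py from the documented flags."""
    command = ["python", "run_qa_pipe.py", "--dataset-path", dataset_path]
    if gpt_model is not None:
        command += ["--gpt-model", gpt_model]
    if skip_correction:
        command.append("--skip-correction")
    return command


command = build_command("path/to/dataset", gpt_model="gpt-4-turbo", skip_correction=True)
print(shlex.join(command))
# python run_qa_pipe.py --dataset-path path/to/dataset --gpt-model gpt-4-turbo --skip-correction
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.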

In order to correct spelling and grammar, the [Byte-Level Neural Error Correction Model for Icelandic](http://hdl.handle.net/20.500.12537/324) must be downloaded and placed within this repository.

The pipeline writes the final questions and answers to the file `data/queries.jsonl`. This is a JSONL file in which each question and answer pair has the following keys: "question", "answer", "question_id", "question_score", "document_score", "url", "title" and "context". Outputs from each step in the pipeline are written to `data/extra-data/`.
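
The output can be read back with the standard library. The following is a minimal sketch that loads the file and reports any record missing one of the keys listed above; `load_queries` and `missing_keys` are illustrative names, not functions from this repository.

```python
import json
from pathlib import Path

# Keys the README lists for each question and answer pair.
OUTPUT_KEYS = [
    "question", "answer", "question_id", "question_score",
    "document_score", "url", "title", "context",
]


def load_queries(path):
    """Read a JSONL file into a list of dictionaries, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def missing_keys(record):
    """Return the expected keys that are absent from a record."""
    return [key for key in OUTPUT_KEYS if key not in record]


output_path = Path("data/queries.jsonl")
if output_path.exists():  # only runs once the pipeline has produced output
    for index, record in enumerate(load_queries(output_path), start=1):
        absent = missing_keys(record)
        if absent:
            print(f"record {index} is missing: {absent}")
```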
