# K-QA Benchmark
Dataset and evaluation code for the K-QA benchmark.

This repository provides the dataset and evaluation code for K-QA, a comprehensive question-and-answer dataset in the real-world medical domain.
You can find detailed information on the dataset curation and evaluation metric computation in our full [paper](https://arxiv.org/abs/2401.14493).

To explore the results of 7 state-of-the-art models, check out [this space](https://huggingface.co/spaces/Itaykhealth/K-QA).

## The Dataset
K-QA consists of two portions: a medium-scale corpus of diverse real-world medical
inquiries written by patients on an online platform, and a subset of rigorous and granular
answers annotated by a team of in-house medical experts.

The dataset comprises 201 questions and answers, incorporating more than 1,589 ground-truth statements.
Additionally, we provide 1,212 authentic patient questions.



## Evaluation Framework
To evaluate models against the K-QA benchmark, we propose a Natural Language Inference (NLI) framework.
We consider a predicted answer as a `premise` and each gold statement derived from an annotated answer as a `hypothesis`. Intuitively, a correctly predicted answer should entail every gold statement.
This formulation aims to quantify the extent to which the model's answer captures the semantic meaning of the gold answer, abstracting over the wording chosen by a particular expert annotator.
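As an illustration of this premise/hypothesis framing: the evaluation script below relies on GPT-4 for the entailment judgments, but any NLI classifier can play the same role. The public `roberta-large-mnli` model used in this sketch is only an assumption for illustration, not the repository's evaluator.

```python
from transformers import pipeline

# Illustration only: the repository's evaluation uses GPT-4, but any NLI
# classifier can judge whether the predicted answer entails a gold statement.
nli = pipeline("text-classification", model="roberta-large-mnli")

premise = "Lexapro is an SSRI commonly prescribed for depression and anxiety."  # predicted answer
hypothesis = "Lexapro belongs to the SSRI class of drugs."  # one gold statement

# The pipeline accepts a premise/hypothesis pair via the "text" and "text_pair" keys.
print(nli({"text": premise, "text_pair": hypothesis}))
# e.g. [{'label': 'ENTAILMENT', 'score': 0.97}]
```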

Two evaluation metrics (a small computation sketch follows the list):
- **Hall** (Hallucination rate) - This metric measures how many of the gold statements contradict the model's answer.
- **Comp** (Comprehensiveness) - This metric measures how many of the clinically crucial claims are entailed by the predicted answer.
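A minimal sketch of how these two metrics could be computed once each gold statement has an NLI label against the predicted answer. The field names and the `must_have` flag are assumptions based on the description above, not the repository's exact schema.

```python
def hall_and_comp(statements):
    """Compute Hall and Comp for a single question.

    `statements` is a list of dicts, one per gold statement, each with:
      - "label": NLI label of the predicted answer (premise) vs. the gold
        statement (hypothesis): "Entailment", "Neutral", or "Contradiction"
      - "must_have": True if the statement is clinically crucial
    (Field names are illustrative, not the repository's schema.)
    """
    # Hall: number of gold statements the predicted answer contradicts.
    hall = sum(s["label"] == "Contradiction" for s in statements)

    # Comp: fraction of clinically crucial statements the answer entails.
    must_have = [s for s in statements if s["must_have"]]
    comp = (
        sum(s["label"] == "Entailment" for s in must_have) / len(must_have)
        if must_have
        else 1.0  # convention assumed here when a question has no must-have statements
    )
    return hall, comp
```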

The figure below provides an example illustrating the complete process of evaluating a
generated answer and deriving these metrics.


*(Figure: end-to-end evaluation of a generated answer and derivation of the Hall and Comp metrics.)*

## How to Evaluate New Results
#### Organize Results in a Formatted Way
Before running the evaluation script, ensure that your results are stored in a JSON file with keys `Question` and `result`. Here's an example:
```json
[
  {
    "Question": "Alright so I dont know much about Lexapro would you tell me more about it?",
    "result": "Lexapro is a medication that belongs to a class of drugs\ncalled selective serotonin reuptake inhibitors (SSRIs)"
  },
  {
    "Question": "Also what is the oral option to get rid of scabies?",
    "result": "The oral option to treat scabies is ivermectin, which is a prescription medication that is taken by mouth."
  }
]
```
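A small sketch of how such a file could be produced; `answer_question` is a hypothetical stand-in for whatever model you are evaluating.

```python
import json

def answer_question(question: str) -> str:
    """Hypothetical placeholder: call the model you want to evaluate here."""
    raise NotImplementedError

questions = [
    "Alright so I dont know much about Lexapro would you tell me more about it?",
    "Also what is the oral option to get rid of scabies?",
]

# Build the expected structure: one entry per question, with keys `Question` and `result`.
results = [{"Question": q, "result": answer_question(q)} for q in questions]

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```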

#### Align Your Evaluation Model with Physicians (optional)
Due to the variability of language models, we provide [here](https://github.com/Itaymanes/K-QA/blob/main/dataset/NLI_medical_annotator.csv) a dataset of almost 400 statements paired with answers generated by various LLMs. Three different physicians labeled each statement as "Entailment," "Neutral," or "Contradiction." This dataset can be used to test different evaluation models, in order to find the one most closely aligned with the physicians' majority vote.
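As an illustration, the sketch below loads the file, takes the physicians' majority vote per row, and measures how often a candidate evaluation model agrees with it. The column names (`annotator_1`, `annotator_2`, `annotator_3`, `model_label`) are assumptions, not the actual CSV header; adjust them to the real columns.

```python
import pandas as pd

# Load the physician-annotated NLI dataset (path as in the repository).
df = pd.read_csv("dataset/NLI_medical_annotator.csv")

# Hypothetical column names -- replace with the actual CSV header.
physician_cols = ["annotator_1", "annotator_2", "annotator_3"]

# Majority vote across the three physicians for each (answer, statement) pair.
df["majority"] = df[physician_cols].mode(axis=1)[0]

# `model_label` is a placeholder for the labels produced by the evaluation
# model under test ("Entailment" / "Neutral" / "Contradiction").
agreement = (df["model_label"] == df["majority"]).mean()
print(f"Agreement with physician majority vote: {agreement:.2%}")
```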

#### Install Requirements
Clone the repository, then run the following to install the dependencies and activate the virtual environment:
```
poetry install
poetry shell
```
Set the keys for GPT-4, either for OpenAI or for Azure (the original paper uses models hosted on Azure).
```
export OPENAI_API_KEY=""
export OPENAI_API_BASE=""
```
For Azure, also set the following keys:
```
export OPENAI_API_VERSION=""
export OPENAI_TYPE=""
```

Then, run the evaluation script as follows:
```bash
python run_eval.py \
    --result_file <path_to_results_json> \
    --version <run_name> \
    --on_openai  # Include this flag if using OpenAI
```
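For example (the file name and version string below are illustrative values, not fixed names from the repository):

```bash
python run_eval.py --result_file results.json --version my_model_v1 --on_openai
```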

#### Cite Us
```bibtex
@misc{manes2024kqa,
  title={K-QA: A Real-World Medical Q&A Benchmark},
  author={Itay Manes and Naama Ronn and David Cohen and Ran Ilan Ber and Zehavi Horowitz-Kugler and Gabriel Stanovsky},
  year={2024},
  eprint={2401.14493},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```