https://github.com/karthiksoman/biomixqa

Repository for BiomixQA benchmark dataset



Host: GitHub
URL: https://github.com/karthiksoman/biomixqa
Owner: karthiksoman
License: apache-2.0
Created: 2024-09-05T05:37:07.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-09-05T06:42:59.000Z (9 months ago)
Last Synced: 2025-01-12T14:45:04.650Z (4 months ago)
Topics: benchmark-datasets, bioinformatics, bioinformatics-data, biomedical-informatics, gpt, large-language-models, llama, retrieval-augmented-generation
Language: Jupyter Notebook
Homepage: https://huggingface.co/datasets/kg-rag/BiomixQA
Size: 13.7 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# BiomixQA Dataset

## Overview

BiomixQA is a curated biomedical question-answering dataset comprising two distinct components:
1. Multiple Choice Questions (MCQ)
2. True/False Questions

This dataset has been used to validate the Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework across different Large Language Models (LLMs). Its mix of multiple choice and true/false questions, together with its coverage of various biomedical concepts, makes it particularly suitable for assessing the performance of the KG-RAG framework.

Hence, this dataset is designed to support research and development in biomedical natural language processing, knowledge graph reasoning, and question-answering systems.

## Dataset Description

- **Huggingface Repository:** https://huggingface.co/datasets/kg-rag/BiomixQA
- **Paper:** [Biomedical knowledge graph-optimized prompt generation for large language models](https://arxiv.org/abs/2311.17330)
- **Point of Contact:** [Karthik Soman](mailto:[email protected])

## Dataset Components

### 1. Multiple Choice Questions (MCQ)

- **File**: `mcq_biomix.csv`
- **Size**: 306 questions
- **Format**: Each question has five choices with a single correct answer

### 2. True/False Questions

- **File**: `true_false_biomix.csv`
- **Size**: 311 questions
- **Format**: Binary (True/False) questions

## Access data using Hugging Face

The following snippets show how to load the data in Python.

(i) MCQ data

```
from datasets import load_dataset

mcq_data = load_dataset("kg-rag/BiomixQA", "mcq")
```

(ii) True/False data

```
from datasets import load_dataset

tf_data = load_dataset("kg-rag/BiomixQA", "true_false")
```

## Potential Uses

1. Evaluating biomedical question-answering systems
2. Testing natural language processing models in the biomedical domain
3. Assessing retrieval capabilities of various RAG (Retrieval-Augmented Generation) frameworks
4. Supporting research in biomedical ontologies and knowledge graphs

## Performance Analysis

We conducted a comprehensive analysis of the performance of three Large Language Models (LLMs) - Llama-2-13b, GPT-3.5-Turbo (0613), and GPT-4 - on the BiomixQA dataset. We compared their performance using both a standard prompt-based approach (zero-shot) and our novel Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework.
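As a rough illustration of how an accuracy figure with a spread (such as `0.53 ± 0.03`) can be produced, the sketch below resamples graded answers with replacement and reports the mean and standard deviation of the per-sample accuracy. This is a generic bootstrap, not necessarily the procedure used in the paper; the function name `bootstrap_accuracy`, its parameters, and the toy data are all hypothetical.

```python
import random
import statistics


def bootstrap_accuracy(is_correct, n_iterations=1000, sample_size=None, seed=0):
    """Return (mean, std) of accuracy over bootstrap resamples.

    is_correct: list of booleans, one per graded question (hypothetical input).
    n_iterations, sample_size, seed: illustrative defaults, not values
    taken from the paper.
    """
    if not is_correct:
        raise ValueError("is_correct must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = sample_size or len(is_correct)
    accuracies = []
    for _ in range(n_iterations):
        sample = rng.choices(is_correct, k=size)  # sampling with replacement
        accuracies.append(sum(sample) / size)
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Toy example: 306 graded answers, 200 of them correct (made-up numbers).
graded = [True] * 200 + [False] * 106
mean_acc, std_acc = bootstrap_accuracy(graded)
print(f"accuracy = {mean_acc:.2f} ± {std_acc:.2f}")
```

Grading itself (turning a model response into `True` or `False` per question) depends on the dataset columns and the prompt format, so it is left out of this sketch.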

### Performance Summary

Table 1: Performance (accuracy) of LLMs on the BiomixQA datasets using prompt-based (zero-shot) and KG-RAG approaches (for more details, refer to [this paper](https://arxiv.org/abs/2311.17330))

| Model | True/False: Prompt-based | True/False: KG-RAG | MCQ: Prompt-based | MCQ: KG-RAG |
|-------|-------------------------:|-------------------:|------------------:|------------:|
| Llama-2-13b | 0.89 ± 0.02 | 0.94 ± 0.01 | 0.31 ± 0.03 | 0.53 ± 0.03 |
| GPT-3.5-Turbo (0613) | 0.87 ± 0.02 | 0.95 ± 0.01 | 0.63 ± 0.03 | 0.79 ± 0.02 |
| GPT-4 | 0.90 ± 0.02 | 0.95 ± 0.01 | 0.68 ± 0.03 | 0.74 ± 0.03 |
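The relative gains implied by Table 1 can be worked out directly from the mean accuracies. The sketch below copies the means from the table (ignoring the ± spread) and prints the relative change from the prompt-based to the KG-RAG approach; the names `table_1` and `relative_gain` are illustrative only.

```python
# Mean accuracies copied from Table 1 as (prompt-based, KG-RAG) pairs.
table_1 = {
    "Llama-2-13b": {"true_false": (0.89, 0.94), "mcq": (0.31, 0.53)},
    "GPT-3.5-Turbo (0613)": {"true_false": (0.87, 0.95), "mcq": (0.63, 0.79)},
    "GPT-4": {"true_false": (0.90, 0.95), "mcq": (0.68, 0.74)},
}


def relative_gain(baseline, improved):
    """Relative change in accuracy, as a percentage of the baseline."""
    return 100.0 * (improved - baseline) / baseline


for model, scores in table_1.items():
    for dataset, (prompt_based, kg_rag) in scores.items():
        gain = relative_gain(prompt_based, kg_rag)
        print(f"{model:22s} {dataset:10s} {gain:+.0f}%")
```

For Llama-2-13b on the MCQ dataset this is (0.53 - 0.31) / 0.31, or about 71%.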

### Key Observations

1. **Consistent Performance Enhancement**: We observed a consistent performance enhancement for all LLMs when using the KG-RAG framework on both the True/False and MCQ datasets.

2. **Significant Improvement for Llama-2**: The KG-RAG framework significantly elevated the performance of Llama-2-13b, particularly on the more challenging MCQ dataset, where accuracy rose from 0.31 ± 0.03 to 0.53 ± 0.03, a relative increase of 71%.

3. **GPT-4 vs GPT-3.5-Turbo on MCQ**: Intriguingly, with the KG-RAG framework, GPT-4 (0.74 ± 0.03) scored slightly but statistically significantly lower than GPT-3.5-Turbo (0.79 ± 0.02) on the MCQ dataset. This drop was not observed with the prompt-based approach.
   - Statistical significance: t-test, p-value < 0.0001, t-statistic = -47.7, N = 1000

4. **True/False Dataset Performance**: All models showed high performance on the True/False dataset, with the KG-RAG approach yielding slightly better results across all models.
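To give a feel for the size of the t-statistic reported in observation 3, the sketch below computes Welch's t-statistic from summary statistics alone. It assumes, without confirmation from this README, that the ± values in Table 1 are standard deviations and that N = 1000 is the number of samples per model; the helper `welch_t_statistic` is hypothetical.

```python
import math


def welch_t_statistic(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t-statistic for two independent samples, from summary statistics."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


# KG-RAG accuracy on the MCQ dataset, as rounded in Table 1.
# Assumption: the ± values are standard deviations and N = 1000 applies
# to each model separately.
t_stat = welch_t_statistic(0.74, 0.03, 1000, 0.79, 0.02, 1000)  # GPT-4 vs GPT-3.5-Turbo
print(f"t = {t_stat:.1f}")
```

With the rounded table values this gives a t-statistic of roughly -44, the same sign and order of magnitude as the reported -47.7. The gap is plausibly due to rounding in the table, since the unrounded means and spreads are not listed here.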

## Source Data

1. SPOKE: A large-scale biomedical knowledge graph that consists of ~40 million biomedical concepts and ~140 million biologically meaningful relationships (Morris et al. 2023).
2. DisGeNET: Consolidates data about genes and genetic variants linked to human diseases from curated repositories, the GWAS catalog, animal models, and scientific literature (Piñero et al. 2016).
3. MONDO: Provides information about the ontological classification of Disease entities in the Open Biomedical Ontologies (OBO) format (Vasilevsky et al. 2022).
4. SemMedDB: Contains semantic predications extracted from PubMed citations (Kilicoglu et al. 2012).
5. Monarch Initiative: A platform for disease-gene association data (Mungall et al. 2017).
6. ROBOKOP: A knowledge graph-based system for biomedical data integration and analysis (Bizon et al. 2019).

## Citation

If you use this dataset in your research, please cite the following paper:
```
@article{soman2023biomedical,
title={Biomedical knowledge graph-enhanced prompt generation for large language models},
author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta-Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk-Jackson, Angela and others},
journal={arXiv preprint arXiv:2311.17330},
year={2023}
}
```
