https://github.com/FreedomIntelligence/Huatuo-26M

The largest-scale Chinese medical QA dataset, with 26,000,000 question-answer pairs.

Host: GitHub
URL: https://github.com/FreedomIntelligence/Huatuo-26M
Owner: FreedomIntelligence
Created: 2023-05-02T14:59:21.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-03-14T04:54:23.000Z (over 1 year ago)
Last Synced: 2024-08-03T09:06:55.916Z (11 months ago)
Size: 671 KB
Stars: 188
Watchers: 8
Forks: 13
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

Awesome-Medical-Healthcare-Dataset-For-LLM - Download link
StarryDivineSky - FreedomIntelligence/Huatuo-26M

README

# Huatuo-26M

📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa

## 👩🏻‍⚕Introduction

- Huatuo-26M is currently the largest Chinese medical question-and-answer dataset. This dataset contains over 26 million high-quality medical Q&A pairs, covering various aspects such as diseases, symptoms, treatment methods, and drug information.
- Huatuo-Lite is a refined version of Huatuo-26M, produced through multiple rounds of cleaning and rewriting. It offers more data fields and higher data quality.

## 📚Data Content

The Huatuo-26M dataset is collected and integrated from multiple sources, including:

- Online Medical Encyclopedia [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
- Online Medical Knowledge Bases [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
- Online Medical Consultation Records (answers provided as URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
- Streamlined version [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)

Each question-answer pair in the dataset contains the following fields:

- questions: the problem description
- answers: the doctor's or expert's answers
- The Huatuo-Lite dataset also includes **Hospital Department** and **Related Diseases** fields
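As a minimal sketch of working with these fields, the helper below turns one record into flat (question, answer) pairs. Only the field names `questions` and `answers` come from the list above. Whether each field holds a string, a list, or a nested list may differ between sub-datasets, so the helper accepts all three, and pairing every question with every answer is an assumption made for illustration. The names `_as_list` and `iter_qa_pairs` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Iterator


def _as_list(value: Any) -> list[str]:
    """Flatten a string, or an arbitrarily nested list of strings, into a flat list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    flat: list[str] = []
    for item in value:
        flat.extend(_as_list(item))
    return flat


def iter_qa_pairs(record: dict[str, Any]) -> Iterator[tuple[str, str]]:
    """Yield (question, answer) pairs from one record.

    Assumption: every question in the record is paired with every answer.
    """
    questions = _as_list(record.get("questions"))
    answers = _as_list(record.get("answers"))
    for question in questions:
        for answer in answers:
            yield question, answer


# Hand-written stand-in record, not real dataset content.
example_record = {"questions": [["q1", "q2"]], "answers": ["a1"]}
pairs = list(iter_qa_pairs(example_record))
```

Inspect a few real records before relying on this layout, since the actual nesting is not documented here.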

The Huatuo test set used in the paper, built by randomly sampling data from multiple sources, is available here:

- Test datasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)

## 🤖Data Usage

The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:

- Natural language processing: including but not limited to Q&A systems, text classification, and sentiment analysis.
- Machine learning model training: such as disease prediction and personalized treatment recommendation.
- AI applications in the medical field: such as intelligent diagnosis systems and medical consultation chatbots.

## 🚀Quick Start

To start using the Huatuo-26M dataset, load each part with the Hugging Face `datasets` library:

```python
import datasets  # Hugging Face `datasets` library

# Part 1: Q&A pairs from online medical knowledge bases
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# Part 2: Q&A pairs from online medical encyclopedias
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# Part 3: online medical consultation records (answers are provided as URLs only)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')

# Test set used in the paper (6k)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
```
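Once a dataset is loaded, it helps to check which splits exist and what a record looks like. The sketch below assumes only that the loaded object behaves like a mapping from split name to an indexable sequence of records, which is how the object returned by `datasets.load_dataset` behaves when no split is specified. The helper name `preview_splits` and the stand-in data are hypothetical; the stand-in lets the example run offline.

```python
from __future__ import annotations

from typing import Any, Mapping, Sequence


def preview_splits(
    dataset_dict: Mapping[str, Sequence[Any]], n: int = 2
) -> dict[str, dict[str, Any]]:
    """Return the row count and the first `n` records of every split."""
    summary: dict[str, dict[str, Any]] = {}
    for split_name, split in dataset_dict.items():
        count = len(split)
        summary[split_name] = {
            "num_rows": count,
            # Index one row at a time so each example is a single record.
            "examples": [split[i] for i in range(min(n, count))],
        }
    return summary


# Hand-written stand-in data, not real dataset content.
toy = {
    "train": [
        {"questions": "q1", "answers": "a1"},
        {"questions": "q2", "answers": "a2"},
        {"questions": "q3", "answers": "a3"},
    ]
}
summary = preview_splits(toy, n=1)
print(summary["train"]["num_rows"], summary["train"]["examples"])
```

With the real data, the same call would be `preview_splits(encyclopedia_dataset)`; the split names it reports depend on how each dataset is published.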

## 👩🏻‍🔬Experiment Record

### Benchmark

- Retrieval evaluation
- Answer generation evaluation

### Application

- Zero-shot transfer to other QA datasets
- As external knowledge for RAG
- As pre-training data for language models (LMs)
- As fine-tuning data for medical LLMs

## 🚁License

The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.

## 📱Contact Us

If you have any questions or need help, please feel free to contact us via email ([[email protected]](mailto:[email protected])) or in the Issues section.

------

## 😁Citation

```
@misc{li2023huatuo26m,
title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
year={2023},
eprint={2305.01526},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
