Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/FreedomIntelligence/Huatuo-26M
The Largest-scale Chinese Medical QA Dataset: with 26,000,000 question answer pairs.
https://github.com/FreedomIntelligence/Huatuo-26M
Last synced: 3 months ago
JSON representation
The Largest-scale Chinese Medical QA Dataset: with 26,000,000 question answer pairs.
- Host: GitHub
- URL: https://github.com/FreedomIntelligence/Huatuo-26M
- Owner: FreedomIntelligence
- Created: 2023-05-02T14:59:21.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-14T04:54:23.000Z (8 months ago)
- Last Synced: 2024-04-28T04:30:58.125Z (6 months ago)
- Size: 671 KB
- Stars: 162
- Watchers: 8
- Forks: 10
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - FreedomIntelligence/Huatuo-26M
- Awesome-Medical-Healthcare-Dataset-For-LLM - 下载链接
README
# Huatuo-26M
📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
中文 | English## 👩🏻⚕Introduction
- Huatuo-26M is currently the largest Chinese medical question-and-answer dataset. This dataset contains over 26 million high-quality medical Q&A pairs, covering various aspects such as diseases, symptoms, treatment methods, and drug information.
- Huatuo-Lite is a refined and optimized dataset based on Huatuo-26M, having undergone multiple purifications and rewrites. It features more data dimensions and higher data quality.## 📚Data Content
The Huatuo-26M dataset is collected and integrated from multiple sources, including:
- Online Medical Encyclopedia [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
- Online Medical Knowledge Bases [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
- Online Medical Consultation Records(answer in the form of URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
- Streamlined version [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)Each question-answer pair in the dataset contains the following fields:
- questions:Problem Description
- answers:Doctor/Expert Answers
- Huatuo-Lite dataset also includes **Hospital Department** and **Related Diseases** fieldsThe following is the huatuo test set we used in the paper, which consists of random sampling of data from multiple sources.
- Testdatasets:[huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)
## 🤖Data Usage
The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:
- Natural Language Processing: Including but not limited to Q&A systems, text classification, sentiment analysis, etc.
- Machine Learning model training: Such as disease prediction, personalized treatment recommendation, etc.
- AI applications in the medical field: Such as intelligent diagnosis systems, medical consultation chatbots, etc.## 🚀Quick Start
To start using the Huatuo-26M dataset, you can follow the steps below:
```python
import datasets
# part 1
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# part 2
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# part 3 (only url)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')# testdatasets (6k)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
```## 👩🏻🔬Experiment Record
### Benchmark
- Retrieval Evaluation:
Click to expand
- Answer Generation Evaluation:
Click to expand
### Application
- Zero-shot transfer to other QA datasets:
Click to expand
- As external knowledge for RAG:
Click to expand
- As pre-training data for language model (LM):
Click to expand
- As fine-tuning data for Medical LLM:
Click to expand
## 🚁License
The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.
## 📱Contact Us
If you have any questions or need help, please feel free to ask us via email ([[email protected]](mailto:[email protected]))or in the Issues section.
------
## 😁Citation
```
@misc{li2023huatuo26m,
title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
year={2023},
eprint={2305.01526},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```