{"id":13754165,"url":"https://github.com/FreedomIntelligence/Huatuo-26M","last_synced_at":"2025-05-09T22:31:12.346Z","repository":{"id":209814363,"uuid":"635374839","full_name":"FreedomIntelligence/Huatuo-26M","owner":"FreedomIntelligence","description":"The Largest-scale Chinese Medical QA Dataset： with 26,000,000 question answer pairs.","archived":false,"fork":false,"pushed_at":"2024-03-14T04:54:23.000Z","size":687,"stargazers_count":188,"open_issues_count":0,"forks_count":13,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-08-03T09:06:55.916Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FreedomIntelligence.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-02T14:59:21.000Z","updated_at":"2024-07-23T02:23:59.000Z","dependencies_parsed_at":"2024-08-03T09:17:17.896Z","dependency_job_id":null,"html_url":"https://github.com/FreedomIntelligence/Huatuo-26M","commit_stats":null,"previous_names":["freedomintelligence/huatuo-26m"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FreedomIntelligence%2FHuatuo-26M","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FreedomIntelligence%2FHuatuo-26M/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FreedomIntelligence%2FHuatuo-26M/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FreedomIntelligence%2FHuatuo-26M/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FreedomIntelligence","download_url":"https://codeload.github.com/FreedomIntelligence/Huatuo-26M/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224884612,"owners_count":17386121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:46.211Z","updated_at":"2024-11-16T06:31:47.381Z","avatar_url":"https://github.com/FreedomIntelligence.png","language":null,"funding_links":[],"categories":["Datasets","A01_文本生成_文本对话"],"sub_categories":["中文","大语言对话模型及数据"],"readme":"# Huatuo-26M \n\n\u003cp align=\"center\"\u003e\n   📃 \u003ca href=\"https://arxiv.org/abs/2305.01526\" target=\"_blank\"\u003ePaper\u003c/a\u003e  • 🤗 \u003ca href=\"https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite\" target=\"_blank\"\u003eHuatuo-Lite\u003c/a\u003e • 🤗 \u003ca href=\"https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa\" target=\"_blank\"\u003ehuatuo_encyclopedia_qa\u003c/a\u003e  • 🤗 \u003ca href=\"https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa\" target=\"_blank\"\u003eknowledge_graph_qa\u003c/a\u003e  • 🤗 \u003ca href=\"https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa\" target=\"_blank\"\u003ehuatuo_consultation_qa\u003c/a\u003e  \n   \u003cbr\u003e  \u003ca href=\"README_zh.md\"\u003e   中文\u003c/a\u003e | \u003ca href=\"README.md\"\u003e English\n\u003c/p\u003e\n\n## 👩🏻‍⚕Introduction\n\n- Huatuo-26M is currently the largest Chinese medical 
question-and-answer dataset. It contains over 26 million high-quality medical Q\u0026A pairs covering diseases, symptoms, treatment methods, and drug information.\n- Huatuo-Lite is a refined and optimized dataset based on Huatuo-26M, produced through multiple rounds of purification and rewriting. It offers more data dimensions and higher data quality.\n\n\n## 📚Data Content\n\nThe Huatuo-26M dataset is collected and integrated from multiple sources, including:\n\n- Online Medical Encyclopedia [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)\n- Online Medical Knowledge Bases [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)\n- Online Medical Consultation Records (answers are provided as URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)\n- Streamlined version [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)\n\n\nEach question-answer pair in the dataset contains the following fields:\n\n- questions: Problem description\n- answers: Doctor/expert answers\n- The Huatuo-Lite dataset also includes **Hospital Department** and **Related Diseases** fields\n\n\nThe following is the Huatuo test set used in the paper, randomly sampled from multiple sources.\n\n- Test datasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)\n\n\n\n## 🤖Data Usage\n\nThe Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:\n\n- Natural Language Processing: Including but not limited to Q\u0026A systems, text classification, sentiment analysis, etc.\n- Machine Learning model training: Such as disease prediction, personalized treatment recommendation, etc.\n- AI applications in the medical field: Such as intelligent diagnosis systems, medical consultation 
chatbots, etc.\n\n\n## 🚀Quick Start\n\nTo start using the Huatuo-26M dataset, you can follow the steps below:\n\n```python\nimport datasets\n# part 1\nknowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')\n# part 2\nencyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')\n# part 3 (only url)\nconsultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')\n\n# testdatasets (6k)\nhuatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')\n```\n\n\n\n## 👩🏻‍🔬Experiment Record\n\n### Benchmark\n\n- Retrieval Evaluation:\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/retrieve.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  \u003c/details\u003e\n\n- Answer Generation Evaluation:\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/NLG.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  \u003c/details\u003e\n\n### Application\n\n- Zero-shot transfer to other QA datasets:\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/zero-shot.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  \u003c/details\u003e\n\n\n- As external knowledge for RAG:\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/rag.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  \u003c/details\u003e\n\n\n- As pre-training data for language model (LM):\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/cblue.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  \u003c/details\u003e\n\n\n- As fine-tuning data for Medical LLM:\n\n  \u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cimg src=\"img/sft.png\" alt=\"retrieve\" style=\"zoom:100%;\" /\u003e\n  
\u003c/details\u003e\n\n\n\n## 🚁License\n\nThe Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.\n\n\n## 📱Contact Us\n\nIf you have any questions or need help, please contact us by email ([xidongw@163.com](mailto:xidongw@163.com)) or open an issue in the Issues section.\n\n------\n\n\n\n## 😁Citation\n\n```\n@misc{li2023huatuo26m,\n      title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset}, \n      author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},\n      year={2023},\n      eprint={2305.01526},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFreedomIntelligence%2FHuatuo-26M","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFreedomIntelligence%2FHuatuo-26M","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFreedomIntelligence%2FHuatuo-26M/lists"}