{"id":13639210,"url":"https://github.com/voidful/awesome-chatgpt-dataset","last_synced_at":"2025-09-04T02:31:20.546Z","repository":{"id":154220762,"uuid":"631036297","full_name":"voidful/awesome-chatgpt-dataset","owner":"voidful","description":"Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!","archived":false,"fork":false,"pushed_at":"2024-04-26T23:17:28.000Z","size":1286,"stargazers_count":735,"open_issues_count":0,"forks_count":61,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-08-22T11:02:09.451Z","etag":null,"topics":["awesome","chatgpt","dataset","gpt4","instructions"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voidful.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-21T18:58:39.000Z","updated_at":"2025-08-07T12:28:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"9b4b0450-b29a-4321-b6f6-515ffd1ebb02","html_url":"https://github.com/voidful/awesome-chatgpt-dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/voidful/awesome-chatgpt-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-chatgpt-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-chatgpt-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-chatgpt-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/voidful%2Fawesome-chatgpt-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voidful","download_url":"https://codeload.github.com/voidful/awesome-chatgpt-dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-chatgpt-dataset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273541897,"owners_count":25124056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome","chatgpt","dataset","gpt4","instructions"],"created_at":"2024-08-02T01:00:58.676Z","updated_at":"2025-09-04T02:31:20.243Z","avatar_url":"https://github.com/voidful.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Awesome-lists","Other","Natural Language Processing","Python","NLP","Building","Other Lists","Topics","Awesome Surveys"],"sub_categories":["大语言对话模型及数据","Other sdk/libraries","Datasets","TeX Lists","LLM Training Datasets","Previous Venues"],"readme":"# awesome-chatgpt-dataset\n![Alt Text](https://github.com/voidful/awesome-chatgpt-dataset/raw/main/A%20cat%20%20to%20Unlock%20the%20Power%20of%20LLM%20Explore%20These%20Datasets%20to%20Train%20Your%20Own%20ChatGPT!.gif)    \n\n## Unlock the Power of LLM: Explore These Datasets to Train 
Your Own ChatGPT!\n\n## Select your own mixed dataset\n\u003e ```bash\n\u003e git clone https://github.com/voidful/awesome-chatgpt-dataset.git\n\u003e cd awesome-chatgpt-dataset/mixed/dataset\n\u003e ```\n\u003e Pick whichever datasets you want to use, then merge and upload them:\n\u003e ```bash\n\u003e python preprocess.py your_dataset_name_to_HuggingFaceHub\n\u003e ```\n\n## Dataset Detail\n| Dataset Name | Size | Languages | Source | License |\n|---|---|---|---|---|\n| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | 1K | English | We annotated 800 QA pairs covering 350+ theorems spanning Math, EE\u0026CS, Physics and Finance. | mit |\n| [lima](https://huggingface.co/datasets/GAIR/lima) | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |\n| [im-feeling-curious](https://huggingface.co/datasets/xiyuez/im-feeling-curious) | 3K | English | This public dataset is an extract from Google's \"i'm feeling curious\" feature. To learn more about this feature, search for \"i'm feeling curious\" on Google. | - |\n| [Puffin](https://huggingface.co/datasets/LDJnr/Puffin) | 3K | English | Puffin dataset. Exactly 3,000 examples with each response created using GPT-4. | apache-2.0 |\n| [cc_sbu_align](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |\n| [qa_feedback](https://github.com/allenai/FineGrainedRLHF/tree/main/tasks/qa_feedback/data) | 4K | English | We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback. | - |\n| [SLF5K](https://huggingface.co/datasets/JeremyAlain/SLF5K) | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. 
| apache-2.0 |\n| [blended_skill_talk](https://huggingface.co/datasets/blended_skill_talk) | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |\n| [GSM-IC](https://github.com/google-research-datasets/GSM-IC) | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |\n| [ChatAlpaca](https://github.com/cascip/ChatAlpaca) | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |\n| [PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K) | 10K | English | PKU-SafeRLHF-10K is the first dataset of its kind and contains 10k instances with safety preferences. | - |\n| [Dolly](https://github.com/databrickslabs/dolly/tree/master/data) | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC 3.0 |\n| [WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |\n| [Code Alpaca](https://github.com/sahil280114/codealpaca) | 20K | English | Code generation task involving 20,022 samples | - |\n| [openapi-function-invocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k) | 25K | English | The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. | mit |\n| [LongForm](https://github.com/akoksal/LongForm/tree/main/dataset) | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. 
| The LongForm project is subject to an MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the licenses of the language models (OPT, LLaMA, and T5). |\n| [chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) | 33K | English | This dataset contains 33K cleaned conversations with pairwise human preferences. It was collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. |  |\n| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and humans | - |\n| [Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden) | 45K | English | This repository contains a new preference dataset extending the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. The original positive responses in HH are generated by a supervised fine-tuned model of Anthropic, where harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT-4. |  |\n| [Mol-Instructions](https://huggingface.co/datasets/zjunlp/Mol-Instructions) | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |\n| [RefGPT](https://github.com/ziliwangnlp/RefGPT) | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q\u0026A content. 
| - |\n| [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |\n| [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | 51K | English | The \"ArtifactAI/arxiv-math-instruct-50k\" dataset consists of question-answer pairs derived from ArXiv abstracts from the math categories. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. |  |\n| [Traditional Chinese Alpaca Dataset](https://github.com/ntunlplab/traditional-chinese-alpaca) | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |\n| [Cabrita Dataset](https://github.com/22-hours/cabrita) | 52K | Portuguese | Translated from Alpaca Data |  |\n| [Japanese Alpaca Dataset](https://github.com/shi3z/alpaca_ja) | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC By NC 4.0; OpenAI terms of use |\n| [Alpaca Dataset](https://github.com/tatsu-lab/stanford_alpaca) | 52K | English | 175 seed instructions by OpenAI API | CC By NC 4.0; OpenAI terms of use |\n| [Alpaca Data Cleaned](https://github.com/gururise/AlpacaDataCleaned) | 52K | English | Revised version of Alpaca Dataset | - |\n| [Alpaca GPT-4 Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52K | English | Generated by GPT-4 using Alpaca prompts | - |\n| [Alpaca GPT-4 Data (Chinese)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |\n| [Dynosaur](https://dynosaur-it.github.io) | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. 
| Apache-2.0 license |\n| [Finance](https://huggingface.co/datasets/gbharti/finance-alpaca) | 69K | English | 68,912 financial related instructions | - |\n| [evol](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k) | 70K | English | This is the training data of WizardLM. | - |\n| [Vicuna Dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) | 75K | English | ~100k ShareGPT conversations | - |\n| [InstructionTranslation](https://huggingface.co/datasets/theblackcat102/instruction_translations) | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |\n| [Self-Instruct](https://github.com/yizhongw/self-instruct/tree/main) | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |\n| [OASST1](https://huggingface.co/datasets/OpenAssistant/oasst1) | 89K | Multi-lingual | a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |\n| [HH-RLHF](https://github.com/anthropics/hh-rlhf/tree/master) | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. 
| MIT |\n| [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 98K | English, Simplified Chinese, Traditional Chinese HK \u0026 TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |\n| [InstructionWild](https://github.com/XueFuzhao/InstructionWild) | 104K | English, Chinese | 429 seed instructions, following Alpaca to generate 52K | Research only; OpenAI terms of use |\n| [Camel Dataset](https://github.com/lightaime/camel) | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |\n| [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k) | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |\n| [WizardLM_evol_instruct_V2_196k](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |\n| [LLaVA Visual Instruct](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |\n| [Prosocial Dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | 166K | English | 165,681 instructions produced from GPT-3-rewritten questions and human feedback | - |\n| [COIG](https://huggingface.co/datasets/BAAI/COIG) | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |\n| [orca-chat](https://huggingface.co/datasets/shahules786/orca-chat) | 198K | English | This is a cleaned, pruned, and clustered version of orca to form a conversation-style dataset. 
The process involves removing samples with very high similarity and also grouping instructions to form conversations. |  |\n| [Unnatural Instructions](https://github.com/orhonovich/unnatural-instructions) | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |\n| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | 358K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |\n| [dromedary](https://huggingface.co/datasets/zhiqings/dromedary-65b-verbose-clone-v0) | 361K | English | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |\n| [ultrachat](https://huggingface.co/datasets/stingning/ultrachat) | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |\n| [ign_clean_instruct_dataset_500k](https://huggingface.co/datasets/ignmilton/ign_clean_instruct_dataset_500k) | 509K | English | This dataset contains ~508k prompt-instruction pairs with high-quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment-focused responses or NSFW content. | apache-2.0 |\n| [ELI5](https://huggingface.co/datasets/eli5) | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |\n| [GPT4All Dataset](https://github.com/nomic-ai/gpt4all) | 806K | Multi-lingual | Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset. Answered by OpenAI API. 
| - |\n| [Instruct](https://huggingface.co/datasets/swype/instruct) | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |\n| [MOSS](https://github.com/OpenLMLab/MOSS#数据) | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |\n| [LaMini-Instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | 3M | English | a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |\n| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) | 3M | English | The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. |  |\n| [Natural Instructions](https://github.com/allenai/natural-instructions) | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |\n| [BELLE](https://github.com/LianjiaTech/BELLE/tree/main/data) | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |\n| [Firefly](https://github.com/yangjianxin1/Firefly) | 16M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |\n| [OIG-43M Dataset](https://laion.ai/blog/oig-dataset/) | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |\n| [xP3](https://huggingface.co/datasets/bigscience/xP3) | 79M | Multi-lingual | 78,883,588 instructions collected by prompts \u0026 datasets across 46 languages \u0026 16 NLP tasks | - |\n| [CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot#dataset) | - | python | The database was queried for all Python files with less than 1MB in size resulting in a 180GB dataset with over 20M files. 
| - |\n| [Alpaca-CoT Dataset](https://github.com/PhoebusSi/Alpaca-CoT/tree/main/data) | - | Multi-lingual | Instruction Data Collection | ODC-By |\n| [stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |\n| [LangChainDatasets](https://huggingface.co/LangChainDatasets) | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |\n| [ParlAI](https://github.com/facebookresearch/ParlAI/tree/main/parlai/tasks) | - | English | 100+ popular datasets for dialogue models available in one place, from open-domain chitchat to task-oriented dialogue to visual question answering. | - |\n| [GPTeacher](https://github.com/teknium1/GPTeacher) | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |\n| [silk-road/Wizard-LM-Chinese-instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol) | - | Chinese | Wizard-LM-Chinese | - |\n| [MultiWOZ](https://huggingface.co/datasets/multi_woz_v22) | - | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fawesome-chatgpt-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoidful%2Fawesome-chatgpt-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fawesome-chatgpt-dataset/lists"}