
https://github.com/voidful/awesome-chatgpt-dataset


# awesome-chatgpt-dataset
![Alt Text](https://github.com/voidful/awesome-chatgpt-dataset/raw/main/A%20cat%20%20to%20Unlock%20the%20Power%20of%20LLM%20Explore%20These%20Datasets%20to%20Train%20Your%20Own%20ChatGPT!.gif)

## Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

## Select your own mixed dataset
> ```bash
> git clone https://github.com/voidful/awesome-chatgpt-dataset.git
> cd awesome-chatgpt-dataset/mixed/dataset
> ```
> Pick the datasets you want to use, then merge and upload them:
> ```bash
> python preprocess.py your_dataset_name_to_HuggingFaceHub
> ```
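
The merge step can be pictured with a small sketch. This is not the repository's `preprocess.py`; it only assumes that the chosen datasets use Alpaca-style (`instruction`/`input`/`output`) or Dolly-style (`instruction`/`context`/`response`) records. The names `to_unified` and `merge_datasets` and the `prompt`/`response`/`source` output fields are hypothetical, and the upload to the Hugging Face Hub is left out.

```python
import random
from typing import Dict, Iterable, List


def to_unified(record: Dict[str, str]) -> Dict[str, str]:
    """Map an Alpaca- or Dolly-style record onto a prompt/response pair."""
    prompt = record["instruction"]
    # Alpaca calls the optional extra text "input"; Dolly calls it "context".
    extra = record.get("input") or record.get("context") or ""
    if extra:
        prompt = f"{prompt}\n\n{extra}"
    # Alpaca stores the answer under "output"; Dolly under "response".
    response = record["output"] if "output" in record else record.get("response")
    if response is None:
        raise ValueError(f"unrecognised record layout: {sorted(record)}")
    return {"prompt": prompt, "response": response}


def merge_datasets(sources: Dict[str, Iterable[dict]], seed: int = 0) -> List[dict]:
    """Normalise, de-duplicate, and shuffle records from several datasets."""
    merged: List[dict] = []
    seen = set()
    for name, records in sources.items():
        for record in records:
            row = to_unified(record)
            key = (row["prompt"], row["response"])
            if key in seen:  # keep only the first copy of an exact duplicate
                continue
            seen.add(key)
            row["source"] = name  # remember which dataset the row came from
            merged.append(row)
    # A fixed seed keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(merged)
    return merged


if __name__ == "__main__":
    # Tiny made-up records, used only to illustrate the two layouts.
    alpaca_like = [{"instruction": "Add the numbers.", "input": "1 + 1", "output": "2"}]
    dolly_like = [{"instruction": "Say hello.", "context": "", "response": "Hello!"}]
    for row in merge_datasets({"alpaca": alpaca_like, "dolly": dolly_like}):
        print(row)
```

Tagging each row with its `source` makes it easier to check license terms later, since the datasets listed below are released under different licenses.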

## Dataset Detail
| Dataset Name | Size | Languages | Source | License |
|---|---|---|---|---|
| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | 1K | English | We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. | mit |
| [lima](https://huggingface.co/datasets/GAIR/lima) | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |
| [im-feeling-curious](https://huggingface.co/datasets/xiyuez/im-feeling-curious) | 3K | English | This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. | - |
| [Puffin](https://huggingface.co/datasets/LDJnr/Puffin) | 3K | English | Puffin dataset. Exactly 3,000 examples with each response created using GPT-4. | apache-2.0 |
| [cc_sbu_align](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
| [qa_feedback](https://github.com/allenai/FineGrainedRLHF/tree/main/tasks/qa_feedback/data) | 4K | English | We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback. | - |
| [SLF5K](https://huggingface.co/datasets/JeremyAlain/SLF5K) | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
| [blended_skill_talk](https://huggingface.co/datasets/blended_skill_talk) | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
| [GSM-IC](https://github.com/google-research-datasets/GSM-IC) | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
| [ChatAlpaca](https://github.com/cascip/ChatAlpaca) | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
| [PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K) | 10K | English | PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
| [Dolly](https://github.com/databrickslabs/dolly/tree/master/data) | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC 3.0 |
| [WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
| [Code Alpaca](https://github.com/sahil280114/codealpaca) | 20K | English | Code generation task involving 20,022 samples | - |
| [openapi-function-invocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k) | 25K | English | The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. | mit |
| [LongForm](https://github.com/akoksal/LongForm/tree/main/dataset) | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. | The LongForm project is subject to a MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5). |
| [chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) | 33K | English | This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. | |
| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and human | - |
| [Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden) | 45K | English | This repository contains a new preference dataset extending the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. The original positive responses in HH are generated by a supervised fine-tuned model of Anthropic, where harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT-4. | |
| [Mol-Instructions](https://huggingface.co/datasets/zjunlp/Mol-Instructions) | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
| [RefGPT](https://github.com/ziliwangnlp/RefGPT) | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
| [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |
| [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | 51K | English | The "ArtifactAI/arxiv-math-instruct-50k" dataset consists of question-answer pairs derived from ArXiv abstracts from the math categories. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. | |
| [Traditional Chinese Alpaca Dataset](https://github.com/ntunlplab/traditional-chinese-alpaca) | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
| [Cabrita Dataset](https://github.com/22-hours/cabrita) | 52K | Portuguese | Translated from Alpaca Data | |
| [Japanese Alpaca Dataset](https://github.com/shi3z/alpaca_ja) | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC By NC 4.0; OpenAI terms of use |
| [Alpaca Dataset](https://github.com/tatsu-lab/stanford_alpaca) | 52K | English | 175 seed instructions by OpenAI API | CC By NC 4.0; OpenAI terms of use |
| [Alpaca Data Cleaned](https://github.com/gururise/AlpacaDataCleaned) | 52K | English | Revised version of Alpaca Dataset | - |
| [Alpaca GPT-4 Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
| [Alpaca GPT-4 Data (Chinese)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
| [Dynosaur](https://dynosaur-it.github.io) | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
| [Finance](https://huggingface.co/datasets/gbharti/finance-alpaca) | 69K | English | 68,912 financial related instructions | - |
| [evol](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k) | 70K | English | This is the training data of WizardLM. | - |
| [Vicuna Dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) | 75K | English | ~100k ShareGPT conversations | - |
| [InstructionTranslation](https://huggingface.co/datasets/theblackcat102/instruction_translations) | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |
| [Self-Instruct](https://github.com/yizhongw/self-instruct/tree/main) | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
| [OASST1](https://huggingface.co/datasets/OpenAssistant/oasst1) | 89K | Multi-lingual | a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
| [HH-RLHF](https://github.com/anthropics/hh-rlhf/tree/master) | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
| [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
| [InstructionWild](https://github.com/XueFuzhao/InstructionWild) | 104K | English, Chinese | 429 seed instructions, expanded with the Alpaca approach to generate 52K instructions | Research only; OpenAI terms of use |
| [Camel Dataset](https://github.com/lightaime/camel) | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
| [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k) | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |
| [WizardLM_evol_instruct_V2_196k](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |
| [LLaVA Visual Instruct](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models toward GPT-4 vision/language capability. | cc-by-nc-4.0 |
| [Prosocial Dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | 166K | English | 165,681 instructions produced from GPT-3 rewrites of questions and human feedback | - |
| [COIG](https://huggingface.co/datasets/BAAI/COIG) | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
| [orca-chat](https://huggingface.co/datasets/shahules786/orca-chat) | 198K | English | This is a cleaned, pruned, and clustered version of orca to form a conversation-style dataset. The process involves removing samples with very high similarity and also grouping instructions to form conversations. | |
| [Unnatural Instructions](https://github.com/orhonovich/unnatural-instructions) | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | 358K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| [dromedary](https://huggingface.co/datasets/zhiqings/dromedary-65b-verbose-clone-v0) | 361K | English | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
| [ultrachat](https://huggingface.co/datasets/stingning/ultrachat) | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
| [ign_clean_instruct_dataset_500k](https://huggingface.co/datasets/ignmilton/ign_clean_instruct_dataset_500k) | 509K | English | This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. | apache-2.0 |
| [ELI5](https://huggingface.co/datasets/eli5) | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
| [GPT4All Dataset](https://github.com/nomic-ai/gpt4all) | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset. Answered by OpenAI API. | - |
| [Instruct](https://huggingface.co/datasets/swype/instruct) | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
| [MOSS](https://github.com/OpenLMLab/MOSS#数据) | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
| [LaMini-Instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | 3M | English | a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) | 3M | English | The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. | |
| [Natural Instructions](https://github.com/allenai/natural-instructions) | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
| [BELLE](https://github.com/LianjiaTech/BELLE/tree/main/data) | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
| [Firefly](https://github.com/yangjianxin1/Firefly) | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
| [OIG-43M Dataset](https://laion.ai/blog/oig-dataset/) | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
| [xP3](https://huggingface.co/datasets/bigscience/xP3) | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
| [CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot#dataset) | - | Python | The database was queried for all Python files smaller than 1MB, resulting in a 180GB dataset with over 20M files. | - |
| [Alpaca-CoT Dataset](https://github.com/PhoebusSi/Alpaca-CoT/tree/main/data) | - | Multi-lingual | Instruction Data Collection | ODC-By |
| [stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
| [LangChainDatasets](https://huggingface.co/LangChainDatasets) | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
| [ParlAI](https://github.com/facebookresearch/ParlAI/tree/main/parlai/tasks) | - | English | 100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
| [GPTeacher](https://github.com/teknium1/GPTeacher) | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
| [silk-road/Wizard-LM-Chinese-instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol) | - | Chinese | Wizard-LM-Chinese | - |
| [MultiWOZ](https://huggingface.co/datasets/multi_woz_v22) | - | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |
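
When choosing datasets from the table above, it can help to filter rows by size and language programmatically. The following is a minimal sketch that parses a few rows copied from the table; `parse_size`, `parse_row`, and `select` are hypothetical helper names, and the parser assumes that no cell contains a literal `|`.

```python
import re
from typing import List, Optional

# A few rows copied verbatim from the table above.
SAMPLE_ROWS = [
    "| [lima](https://huggingface.co/datasets/GAIR/lima) | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |",
    "| [Finance](https://huggingface.co/datasets/gbharti/finance-alpaca) | 69K | English | 68,912 financial related instructions | - |",
    "| [Instruct](https://huggingface.co/datasets/swype/instruct) | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |",
    "| [Alpaca-CoT Dataset](https://github.com/PhoebusSi/Alpaca-CoT/tree/main/data) | - | Multi-lingual | Instruction Data Collection | ODC-By |",
]

_UNITS = {"K": 1_000, "M": 1_000_000}


def parse_size(cell: str) -> Optional[int]:
    """Turn a Size cell such as '52K' or '3M' into a count; '-' gives None."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM])", cell.strip())
    if match is None:
        return None
    return round(float(match.group(1)) * _UNITS[match.group(2)])


def parse_row(line: str) -> dict:
    """Split one Markdown table row into the fields used for filtering."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    link = re.match(r"\[(.+?)\]\((.*?)\)", cells[0])
    return {
        "name": link.group(1) if link else cells[0],
        "url": link.group(2) if link else None,
        "size": parse_size(cells[1]),
        "languages": cells[2],
        "license": cells[-1],
    }


def select(rows: List[dict], language: str, max_size: int) -> List[dict]:
    """Keep rows with a known size up to max_size that list the language."""
    return [
        row for row in rows
        if row["size"] is not None
        and row["size"] <= max_size
        and language.lower() in row["languages"].lower()
    ]


if __name__ == "__main__":
    parsed = [parse_row(line) for line in SAMPLE_ROWS]
    for row in select(parsed, language="English", max_size=100_000):
        print(row["name"], row["size"], row["license"])
```

Rows whose size is listed as `-` are skipped by `select`, because their scale cannot be compared. The License column is free text, so it is returned as-is for manual review.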