Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-chatgpt-dataset

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!
https://github.com/voidful/awesome-chatgpt-dataset

Last synced: 4 days ago

  • Dataset Detail

    • lima | CC BY-NC-SA |
    • im-feeling-curious
    • qa_feedback | Constructed from the ASQA data with human feedback collected for it; the resulting dataset is named qa-feedback. | - |
    • blended_skill_talk
    • WebGPT
    • Finance
    • evol
    • Vicuna Dataset
    • InstructionTranslation | Multi-lingual | Translations were generated by M2M 12B; output generations were limited to 512 tokens due to a VRAM limit (40 GB). | MIT |
    • Self-Instruct
    • OASST1 | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
    • HH-RLHF
    • Guanaco Dataset
    • Tapir-Cleaned | A cleaned and adjusted dataset for instruction-tuning. | CC BY-NC 4.0 |
    • WizardLM_evol_instruct_V2_196k
    • LLaVA Visual Instruct | GPT-4-generated multimodal instruction-following data, constructed for visual instruction tuning and for building large multimodal models with GPT-4-level vision/language capability. | cc-by-nc-4.0 |
    • Prosocial Dialog | GPT-3 rewrites questions and human feedback | - |
    • COIG | Apache-2.0 |
    • SHP | Non-exclusive, non-transferable, non-sublicensable, and revocable license |
    • dromedary | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
    • ultrachat | cc-by-nc-4.0 |
    • ign_clean_instruct_dataset_500k | Prompt-instruction pairs with high-quality responses, synthetically created from a subset of Ultrachat prompts; contains no alignment-focused responses or NSFW content. | apache-2.0 |
    • Instruct
    • LaMini-Instruction | Generated with gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
    • BELLE
    • OIG-43M Dataset | Multi-lingual | Created by Together, LAION, and Ontocord.ai. | - |
    • xP3 | Multi-lingual | 78,883,588 instructions collected from prompts & datasets across 46 languages & 16 NLP tasks | - |
    • Alpaca-CoT Dataset | Multi-lingual | Instruction Data Collection | ODC-By |
    • LangChainDatasets | English | A community-driven repository of datasets for evaluating LangChain chains and agents. | - |
    • ParlAI | English | 100+ popular dialogue datasets and models in one place, from open-domain chitchat to task-oriented dialogue to visual question answering. | - |
    • silk-road/Wizard-LM-Chinese-instruct-evol | Chinese | Wizard-LM-Chinese | - |
    • GSM-IC | Grade-School Math with Irrelevant Context (GSM-IC) | - |
    • ChatAlpaca | Apache-2.0 license |
    • Code Alpaca
    • Traditional Chinese Alpaca Dataset | Apache-2.0 license |
    • Cabrita Dataset
    • Japanese Alpaca Dataset
    • Alpaca Dataset
    • Alpaca Data Cleaned
    • InstructionWild
    • Unnatural Instructions | Creative and diverse instructions, collected with virtually no human labor. | MIT |
    • GPT4All Dataset | Multi-lingual | Subsets of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset, answered via the OpenAI API. | - |
    • Natural Instructions | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
    • Firefly
    • GPTeacher | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. | - |
    • Alpaca GPT-4 Data (Chinese) | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
    • Camel Dataset | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
    • MOSS | Generated with gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
    • Puffin | Each response is created with GPT-4. | apache-2.0 |
    • SLF5K | An English-language dataset containing 5K unique samples for the task of abstractive summarization. | apache-2.0 |
    • PKU-SafeRLHF-10K | The first dataset of its kind, containing 10k instances with safety preferences. | - |
    • chatbot_arena_conversations
    • Anthropic_HH_Golden | Extends the harmless dataset from Anthropic's instruction-tuned model, where harmful and unhelpful responses are frequently encountered; the positive responses are replaced with responses rewritten by GPT-4. | - |
    • orca-chat | An ORCA-style conversational dataset; the cleaning process involves removing samples with very high similarity and grouping instructions to form conversations. | - |
    • OpenOrca | Augmented FLAN data, currently ~1M GPT-4 completions and ~3.2M GPT-3.5 completions. | - |
    • CodeParrot | Python | The database was queried for all Python files under 1 MB in size, resulting in a 180 GB dataset with over 20M files. | - |
    • stack-exchange-paired | English | Questions and answers from the Stack Overflow Data Dump, intended for preference-model training. | cc-by-sa-4.0 |
    • MultiWOZ | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. | apache-2.0 |
    • TheoremQA
    • cc_sbu_align | Aligned multimodal dataset from MiniGPT-4 | BSD 3-Clause License |
    • Dolly | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
    • LongForm
    • HC3
    • Mol-Instructions | A large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
    • RefGPT | A cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
    • ELI5 | An English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
    • arxiv-math-instruct-50k | Question-answer pairs derived from arXiv abstracts in the math categories; questions are generated using the t5-base model, while answers are generated using the gpt-3.5-turbo model. | - |
    • Dynosaur | A dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
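
Most of the datasets above are hosted on the Hugging Face Hub, so they can be loaded with the `datasets` library and flattened into prompt/response strings for fine-tuning. Below is a minimal sketch using databricks-dolly-15k (the Dolly entry in the list); the `instruction`/`context`/`response` field names match that dataset's schema, while the prompt template itself is an illustrative assumption, not a format prescribed by any listed dataset.

```python
from datasets import load_dataset

# Pull databricks-dolly-15k (the "Dolly" entry above) from the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly)  # ~15k records with instruction / context / response fields

def to_prompt(example):
    """Flatten one record into a single training string.

    The template here is an illustrative choice for supervised
    fine-tuning; swap in whatever format your trainer expects.
    """
    context = f"\n\nContext:\n{example['context']}" if example["context"] else ""
    return {
        "text": f"Instruction:\n{example['instruction']}{context}"
                f"\n\nResponse:\n{example['response']}"
    }

train_data = dolly.map(to_prompt, remove_columns=dolly.column_names)
print(train_data[0]["text"][:300])
```

For the multi-million-row collections such as OpenOrca or xP3, passing `streaming=True` to `load_dataset` iterates over records without downloading the whole corpus up front.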