awesome-chatgpt-dataset
Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!
https://github.com/voidful/awesome-chatgpt-dataset
Dataset Detail
- im-feeling-curious
- Finance
- Vicuna Dataset
- qa_feedback | Re-constructs the ASQA data and collects human feedback for it; the resulting dataset is named qa-feedback. | - |
- InstructionTranslation | Multi-lingual | Translations were generated by M2M 12B; output generations were limited to 512 tokens due to the VRAM limit (40 GB). | MIT |
- Self-Instruct
- OASST1 | Multi-lingual | Human-generated assistant conversations (35 languages). | apache-2.0
- HH-RLHF
- blended_skill_talk
- Tapir-Cleaned | Cleaned and revised data adjusted for instruction-tuning. | CC BY-NC 4.0 |
- WizardLM_evol_instruct_V2_196k
- LLaVA Visual Instruct 150K | cc-by-nc-4.0
- ProsocialDialog
- COIG | apache-2.0
- BELLE
- OIG-43M Dataset | Multi-lingual | Composite instruction pool from Together, LAION, and Ontocord.ai. | - |
- xP3 | Multi-lingual | 78,883,588 instructions from prompted datasets across 46 languages & 16 tasks. | -
- SHP
- dromedary | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
- ultrachat | cc-by-nc-4.0 |
- ign_clean_instruct_dataset_500k | ~500k instruction pairs with high-quality responses, synthetically created from a subset of Ultrachat prompts; it contains no alignment-focused responses or NSFW content. | apache-2.0 |
- Instruct
- Alpaca-CoT Dataset | Multi-lingual | Instruction Data Collection | ODC-By |
- LangChainDatasets | English | A community-driven repository of datasets for evaluating LangChain chains and agents. | - |
- ParlAI | English | 100+ popular dialogue datasets in one place, spanning open-domain chitchat, task-oriented dialogue, and visual question answering. | - |
- silk-road/Wizard-LM-Chinese-instruct-evol | Chinese | Wizard-LM-Chinese | - |
- ChatAlpaca | Apache-2.0 license |
- Code Alpaca
- Traditional Chinese Alpaca Dataset | Apache-2.0 license |
- Cabrita Dataset
- Japanese Alpaca Dataset
- Alpaca Dataset
- Alpaca Data Cleaned
- GPT4All Dataset | Multi-lingual | Subset of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset, answered by the OpenAI API. | - |
- InstructionWild
- Unnatural Instructions | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
- Natural Instructions | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks. | - |
- Firefly
- GPTeacher | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. | - |
- Chatbot Arena Conversations
- Anthropic_HH_Golden | Based on Anthropic's harmlessness data, whose original responses come from a fine-tuned model of Anthropic where harmful and unhelpful responses are frequently encountered; in this dataset the positive responses are replaced by rewritten responses generated by GPT-4. | - |
- orca-chat | A conversation-style dataset derived from Orca; the process involves removing samples with very high similarity and grouping instructions to form conversations. | - |
- stack-exchange-paired | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
- MultiWOZ | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |
- TheoremQA
- cc_sbu_align | BSD 3-clause
- Dolly | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
- LongForm
- HC3
- Mol-Instructions | cc-by-4.0
- ELI5 | An English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
- blended_skill_talk
- LaMini-Instruction | cc-by-nc-4.0
- arxiv‑math‑instruct‑50k (ArtifactAI)
- Alpaca GPT-4 Data (Chinese) | Data generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT. | - |
- MOSS | Conversation data generated by gpt-3.5-turbo. | Apache-2.0, AGPL-3.0 licenses |
- InstructionWild | Research use only; OpenAI terms
- CAMEL Dataset
- TAPIR-Cleaned | cc-by-nc-4.0
- OASST2 (final) | Multi-lingual | Open Assistant Conversations Release 2 (train+val). | apache-2.0
- M2Lingual | Multi-lingual | Multilingual mixed-modal (code+text) chat/instruct SFT. | -
- OpenOrca (full)
- OpenR1-Math-220k | apache-2.0
- Unnatural Instructions
- WildJailbreak | odc-by
- UltraChat | cc-by-nc-4.0
- IGN Clean Instruct 500K | apache-2.0
- GPT4All | Multi-lingual | LAION OIG + StackOverflow + P3 prompts; OpenAI outputs. | -
- TheoremQA
- LIMA | cc-by-nc-sa-4.0
- WildGuardMix | Moderation data with human-annotator labels. | odc-by
- Berkeley Function Calling Leaderboard (BFCL) | Function-calling eval covering parallel/multi-call scenarios across languages. | -
- Puffin | Multi-turn examples; each response via GPT-4. | apache-2.0
- QA-Feedback
- SLF5K | apache-2.0
- GSM‑IC
- ChatAlpaca-10K | apache-2.0
- PKU‑SafeRLHF‑10K
- Dolly-15K | cc-by-sa-3.0
- WebGPT (comparisons)
- CodeAlpaca‑20K
- HelpSteer2 | Open-source helpfulness data for reward models and preference learning. | cc-by-4.0
- openapi-function-invocations-25k | Function-call traces. | mit
- LongForm
- HH‑RLHF
- RefGPT
- arxiv‑math‑instruct‑50k
- Traditional Chinese Alpaca | apache-2.0
- Cabrita Dataset
- Japanese Alpaca | cc-by-nc-4.0; OpenAI terms
- Alpaca Dataset | cc-by-nc-4.0; OpenAI terms
- Alpaca Data Cleaned
- Alpaca GPT‑4 Data
- Alpaca GPT‑4 Chinese
- xLAM Function Calling 60K | Function-calling data for executable agents. | apache-2.0
- Dynosaur | Dynamic growth paradigm for instruction-tuning data curation. | apache-2.0
- WizardLM evol
- Vicuna Dataset
- InstructionTranslation | Multi-lingual | M2M-12B translated instructions (≤512 tokens). | mit
- Instruct
- Guanaco Dataset | gpl-3.0
- MOSS | apache-2.0 + agpl-3.0
- WildChat-4.8M (nontoxic subset) | odc-by
- smolTalk | apache-2.0
- Open-PerfectBlend | apache-2.0
- The Tome
- NaturalReasoning | cc-by-nc-4.0
- Infinity-Instruct | Multi-lingual | 7.4M base + ~1.5M chat instruction data. | cc-by-sa-4.0
- BELLE-10M | Research use only; OpenAI terms
- Firefly
- OIG-43M | Multi-lingual | LAION + Together + OntoCord composite instruction pool. | -
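
Most of the entries above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. The sketch below loads three of the listed datasets; the repo IDs are the commonly used uploads (an assumption, not taken from this list), so verify them before training.

```python
# Minimal sketch: load a few of the datasets listed above from the Hugging Face Hub.
# Repo IDs are the commonly used uploads and may differ from the sources linked in this list.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                # Alpaca Dataset: instruction / input / output
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # Dolly-15K: instruction / context / response
oasst1 = load_dataset("OpenAssistant/oasst1", split="train")            # OASST1: message-tree rows

print(alpaca[0]["instruction"])   # inspect one record from each schema
print(dolly[0]["instruction"])
```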
Unknown / mixed-size datasets
- CodeParrot | Python | 180 GB of Python files (<1 MB each), 20M+ files. | -
- Alpaca-CoT | Multi-lingual | Instruction data with chain-of-thought traces. | odc-by
- stack-exchange-paired | English | StackExchange Q&A pairs for preference modeling. | cc-by-sa-4.0
- LangChainDatasets | English | Community datasets to evaluate chains & agents. | -
- ParlAI | English | Dialog research platform with many tasks/datasets. | -
- GPTeacher | English | Instruction datasets consolidated for general SFT. | -
- Wizard-LM Chinese Evol | Chinese | Chinese evol-instruct corpus. | -
- ToolACE | English | Multi-tool calling SFT (functions, API JSON, tool plans). | -
- UltraFeedback (cleaned binarized) | English | UltraFeedback preferences cleaned & binarized. | cc-by-nc-4.0
- glaive-function-calling-v2 | English | Function-calling SFT dataset with tool schemas & arguments. | apache-2.0
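
Schemas vary across these datasets (Alpaca-style instruction/input/output triples, preference pairs, message trees, function-call traces), so records usually need to be normalized into one chat format before fine-tuning. Below is a minimal, hypothetical sketch for the Alpaca schema only; the helper name and the `messages` layout are assumptions, not something defined by any dataset above.

```python
# Hypothetical helper: map an Alpaca-style record (instruction / input / output)
# into a list of chat messages, the shape most SFT trainers accept.
def alpaca_to_messages(example: dict) -> dict:
    prompt = example["instruction"]
    if example.get("input"):                     # optional context field in the Alpaca schema
        prompt += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": example["output"]},
        ]
    }

# Usage with the loading sketch above:
# chat_ds = alpaca.map(alpaca_to_messages, remove_columns=alpaca.column_names)
```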