Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-chatgpt-dataset

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!
https://github.com/voidful/awesome-chatgpt-dataset

Last synced: 1 day ago

  • Dataset Detail

    • lima - NC-SA |
    • im-feeling-curious - |
    • qa_feedback - construct the ASQA data and collect human feedback for it. We name the resulting dataset as qa-feedback. | - |
    • blended_skill_talk - |
    • WebGPT - |
    • Finance - |
    • Vicuna Dataset - |
    • InstructionTranslation - lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |
    • Self-Instruct - |
    • OASST1 - lingual | a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
    • HH-RLHF
    • Guanaco Dataset
    • Tapir-Cleaned - tuning. | CC BY-NC 4.0 |
    • WizardLM_evol_instruct_V2_196k - |
    • LLaVA Visual Instruct - generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
    • Prosocial Dialog - 3 rewrites questions and human feedback | - |
    • COIG - 2.0 |
    • SHP - exclusive, non-transferable, non-sublicensable, and revocable license |
    • dromedary - Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
    • ultrachat - by-nc-4.0 |
    • ign_clean_instruct_dataset_500k - instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. | apache-2.0 |
    • Instruct
    • LaMini-Instruction - 3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
    • BELLE
    • OIG-43M Dataset - lingual | Together, LAION, and Ontocord.ai. | - |
    • xP3 - lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
    • Alpaca-CoT Dataset - | Multi-lingual | Instruction Data Collection | ODC-By |
    • LangChainDatasets - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
    • ParlAI - | English | 100+ popular datasets for dialogue models available all in one place, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
    • silk-road/Wizard-LM-Chinese-instruct-evol - | chinese | Wizard-LM-Chinese | - |
    • GSM-IC - Grade-School Math with Irrelevant Context (GSM-IC) | - |
    • ChatAlpaca - 2.0 license |
    • Code Alpaca - |
    • Traditional Chinese Alpaca Dataset - 2.0 license |
    • Cabrita Dataset
    • Japanese Alpaca Dataset
    • Alpaca Dataset
    • Alpaca Data Cleaned - |
    • InstructionWild
    • Unnatural Instructions - ative and diverse instructions, collected with virtually no human labor. | MIT |
    • GPT4All Dataset - lingual | Subset of the LAION OIG, StackOverflow Question, and BigScience/p3 datasets. Answered by OpenAI API. | - |
    • Natural Instructions - lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
    • Firefly - |
    • GPTeacher - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
    • Alpaca GPT-4 Data (Chinese) - 4 using Chinese prompts translated from Alpaca by ChatGPT | - |
    • Camel Dataset - lingual | Role-playing between AIs (Open AI API) | - |
    • MOSS - 3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
    • Puffin - 4. | apache-2.0 |
    • SLF5K - language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
    • PKU-SafeRLHF-10K - SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
    • chatbot_arena_conversations
    • Anthropic_HH_Golden - tuned model of Anthropic, where harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT4. | |
    • orca-chat - style dataset. The process involves removing samples with very high similarity and also grouping instructions to form conversations. | |
    • OpenOrca - 4 completions, and ~3.2M GPT-3.5 completions. | |
    • CodeParrot - | python | The database was queried for all Python files less than 1MB in size, resulting in a 180GB dataset with over 20M files. | - |
    • stack-exchange-paired - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
    • MultiWOZ - | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |
    • TheoremQA
    • cc_sbu_align - 4 dataset | BSD 3-Clause License |
    • Dolly - dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC 3.0 |
    • LongForm
    • HC3 - |
    • Mol-Instructions - scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
    • RefGPT - effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
    • ELI5 - language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
    • arxiv-math-instruct-50k - The "arxiv-math-instruct-50k" dataset consists of question-answer pairs derived from ArXiv abstracts from the math categories. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. | |
  • Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

    • Dynosaur - tuning data curation. | Apache-2.0 license |