Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-instruction-datasets

A collection of prompt and instruction datasets (awesome-prompt-datasets, awesome-instruction-dataset) for training chat LLMs such as ChatGPT.
https://github.com/jianzhnie/awesome-instruction-datasets

Last synced: 4 days ago

  • Statistics

    • Chain of Thought - research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/niv2_few_shot_data) | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |
    • GPT4all - ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |
    • GPTeacher - 4 General-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Instruct)\|[Roleplay-Instruct](https://github.com/teknium1/GPTeacher/tree/main/Roleplay) \|[Code-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Codegen)\| [Toolformer](https://github.com/teknium1/GPTeacher/tree/main/Toolformer) | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |
    • AlpacaDataCleaned - cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |
    • Luotuo - alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json) | 52k | Zh |
    • Natural Instructions - ,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |
    • HuaTuo (华驼) - HI/Huatuo-Llama-Med-Chinese/blob/main/data/llama_data.json) \|[liver cancer](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/data-literature/liver_cancer.json) | SCIR-HI (Harbin Institute of Technology) | 8K | CN | TS | SI | public and self-built Chinese medical knowledge bases | GPT3.5 | |
    • firefly - train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |
    • Code Alpaca - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |
    • dolly - lab/stanford_alpaca/blob/main/alpaca_data.json) | 52k | En |
    • mosaicml/llm-foundry - 15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). | human annotated | |
    • baize - baize/baize-chatbot/tree/main/data) \|[medical_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/medical_chat_data.json) \| [quora_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/quora_chat_data.json) \|[stackoverflow_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/stackoverflow_chat_data.json) | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |
    • hh-rlhf - rlhf](https://huggingface.co/datasets/anthropic/hh-rlhf) | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |
    • GAOKAO - in-the-blank_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Fill-in-the-blank_Questions) \| [Multiple-choice_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Multiple-choice_Questions) \| [Open-ended_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Open-ended_Questions) | OpenLMLab | 2785 | CN | MT | COL | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |
    • camel - ai/code](https://huggingface.co/datasets/camel-ai/ai_society)\|[camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology) \|[camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics) \|[camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry) \|[camel-ai/math](https://huggingface.co/datasets/camel-ai/math) | camel-ai | 760620 | EN | MT | SI | Role-playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |
    • GPT4Tools - Or1of7TJuWvmrJpPoOx0cLdcWry/view?usp=share_link) | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |
    • Auto CoT - takeshi188/zero_shot_cot/dataset](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/dataset) \|[kojima-takeshi188/zero_shot_cot/log](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/log) | amazon-science | | EN | | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |
    • MOSS - 002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data)\| [moss-003-sft-data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data/conversations/conversation_without_plugins) | fnlp | 1583595 | EN/CN | SI | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |
    • ultrachat - CoT/tree/main/ultrachat) |
    • LAION-AI/Open-Assistant - generated, human-annotated | |
    • akoksal/LongForm
    • sail-sg/symbolic-instruction-tuning - instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning) | sail-sg | 800K | ML | SI | | | Human Synthetic Examples | |
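    • michael-wzhu/PromptCBLUE - wzhu | 110113 | CN | SI | | | Medical consultation questions collected from the internet (110,113), reflecting the real-world consultation needs of different users and patients. All responses are currently generated by the OpenAI `GPT-3.5` engine. | |
    • mbzuai-nlp/LaMini-LM - instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | MBZUAI/LaMini-instruction | **2.58M** | EN | MT | SI | | knowledge extracted from large language models through offline distillation | |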
    • WizardLM
    • thu-coai/Safety-Prompts - coai/Safety-Prompts](https://huggingface.co/datasets/thu-coai/Safety-Prompts) | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models by aligning model outputs with human values. |
    • Chatgpt-Comparison-Detection project - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | | 24.3K | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |
    • Chinese-Vicuna
    • ColossalChat
    • cerebras-lora-alpaca - GPT | 2.7B | [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | 52k | En |
    • FLAN-Muffin - CoT/tree/main/FLAN-Muffin) |
    • ShareChat - CoT/tree/main/ShareGPT) |
    • Guanaco - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |
    • belle_cn - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |
    • prosocial dialog - dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | allenai | 165681 | EN | TS | MIX | dialogue | questions rewritten by GPT-3, plus manual human feedback | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |
    • finance_en - alpaca](https://huggingface.co/datasets/allenai/prosocial-dialog) | | 68912 | EN | TS | COL | finance-related QA | GPT3.5 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |
    • xP3 - CoT/tree/main/xP3) |
    • instruct - source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |
    • StackLLaMA - exchange-paired](lvwerra/stack-exchange-paired) | | todo | EN | | HG | | | |
    • Zhihu-KOL - KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | OpenAssistant | 1M | | SI | HG | Zhihu data for training Open Assistant | | |
    • rlhf-reward-datasets
    • Dahoas/full-hh-rlhf
    • Dahoas/synthetic-instruct-gptj-pairwise
    • Dahoas/rm-static - static](https://huggingface.co/datasets/Dahoas/static-hh) used for training reward models after supervised fine-tuning. |
    • guanaco
    • Instruction-Tuning-with-GPT-4/GPT-4-LLM - Tuning-with-GPT-4 | 52k | English | Ranked responses to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate their quality (note: the data is evaluated by the `GPT-4` model, not by humans). The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses". |
    • Chinese-LLaMA-Alpaca - LLaMA-Alpaca/tree/main/data), [pCLUE](https://github.com/CLUEbenchmark/pCLUE), [translation2019zh](https://github.com/brightmart/nlp_chinese_corpus#5%E7%BF%BB%E8%AF%91%E8%AF%AD%E6%96%99translation2019zh), [alpaca_data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json), Self-Instruct | 2M | Zh |
    • alpaca-lora - lab/stanford_alpaca/blob/main/alpaca_data.json), [alpaca_data_cleaned](https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json) | 52k | En |
    • HC3-Chinese - SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | Hello-SimpleAI \| Wind Information (万得资讯) | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
    • HC3 - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Hello-SimpleAI \| Wind Information (万得资讯) | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |
  • [Instruction in the Wild](https://github.com/XueFuzhao/InstructionWild)

  • [PhoebusSi/Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)

  • alpaca_chinese_dataset

  • Med-ChatGLM/data

  • [BigScience/P3](https://huggingface.co/datasets/bigscience/P3)

  • xMTF - BigScience

  • HH-RLHF - Anthropic

  • InstructDial

  • [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP)

  • [allenai/prosocial-dialog](https://huggingface.co/datasets/allenai/prosocial-dialog)

  • [allenai/natural-instructions](https://github.com/allenai/natural-instructions)

  • [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)

  • [bigscience/xP3](https://huggingface.co/datasets/bigscience/xP3)

  • [orhonovich/unnatural-instructions](https://github.com/orhonovich/unnatural-instructions)

  • [Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)

  • [databrickslabs/dolly](https://github.com/databrickslabs/dolly/tree/master/data)

  • [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)

  • BELLE/data/1.5M

  • pCLUE

  • COIG

  • [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)

  • [HuggingFaceH4/stack-exchange-preferences](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences)

  • Natural Instruction / Super-Natural Instruction

  • [Unnatural Instruction](https://github.com/orhonovich/unnatural-instructions)

  • [Self-Instruct](https://github.com/yizhongw/self-instruct)

  • [UnifiedSKG - HKU](https://unifiedskg.com/)

  • [Google/Flan Collection](https://github.com/google-research/FLAN/tree/main/flan/v2)

  • OpenAI WebGPT.

  • OpenAI Summarization.

  • Open Instruction Generalist (OIG).

    • Instruction Generalist dataset - school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.
  • ChatGPT Distillation Data

Categories
  • Statistics (53)
  • [UnifiedSKG - HKU](https://unifiedskg.com/) (3)
  • [Unnatural Instruction](https://github.com/orhonovich/unnatural-instructions) (3)
  • pCLUE (3)
  • [Google/Flan Collection](https://github.com/google-research/FLAN/tree/main/flan/v2) (3)
  • BELLE/data/1.5M (3)
  • InstructDial (3)
  • alpaca_chinese_dataset (3)
  • HH-RLHF - Anthropic (3)
  • xMTF - BigScience (3)
  • [Instruction in the Wild](https://github.com/XueFuzhao/InstructionWild) (3)
  • [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (2)
  • OpenAI Summarization. (2)
  • [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP) (2)
  • Natural Instruction / Super-Natural Instruction (2)
  • COIG (2)
  • [Self-Instruct](https://github.com/yizhongw/self-instruct) (2)
  • [BigScience/P3](https://huggingface.co/datasets/bigscience/P3) (2)
  • OpenAI WebGPT. (2)
  • [allenai/prosocial-dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) (1)
  • [Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (1)
  • [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) (1)
  • [allenai/natural-instructions](https://github.com/allenai/natural-instructions) (1)
  • [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) (1)
  • ChatGPT Distillation Data (1)
  • [HuggingFaceH4/stack-exchange-preferences](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences) (1)
  • Open Instruction Generalist (OIG). (1)
  • [PhoebusSi/Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) (1)
  • Med-ChatGLM/data (1)
  • [bigscience/xP3](https://huggingface.co/datasets/bigscience/xP3) (1)
  • [databrickslabs/dolly](https://github.com/databrickslabs/dolly/tree/master/data) (1)
  • [orhonovich/unnatural-instructions](https://github.com/orhonovich/unnatural-instructions) (1)