awesome-instruction-datasets

A collection of awesome prompt datasets and awesome instruction datasets for training ChatLLM models such as ChatGPT.
https://github.com/jianzhnie/awesome-instruction-datasets

Last synced: 15 days ago

Statistics
- Chain of Thought - research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/niv2_few_shot_data) | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |
- GPT4all - ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |
- GPTeacher - 4 General-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Instruct)\|[Roleplay-Instruct](https://github.com/teknium1/GPTeacher/tree/main/Roleplay) \|[Code-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Codegen)\| [Toolformer](https://github.com/teknium1/GPTeacher/tree/main/Toolformer) | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |
- AlpacaDataCleaned - cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |
- Natural Instructions - ,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |
- 华驼(HuaTuo) - HI/Huatuo-Llama-Med-Chinese/blob/main/data/llama_data.json) \|[liver cancer](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/data-literature/liver_cancer.json) | SCIR-HI (Harbin Institute of Technology) | 8K | CN | TS | SI | public and self-built Chinese medical knowledge bases | GPT3.5 | |
- firefly - train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |
- Code Alpaca - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |
- mosaicml/llm-foundry - 15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). | human annotated | |
- baize - baize/baize-chatbot/tree/main/data) \|[medical_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/medical_chat_data.json) \| [quora_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/quora_chat_data.json) \|[stackoverflow_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/stackoverflow_chat_data.json) | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |
- hh-rlhf - rlhf](https://huggingface.co/datasets/anthropic/hh-rlhf) | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |
- GAOKAO - in-the-blank_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Fill-in-the-blank_Questions) \| [Multiple-choice_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Multiple-choice_Questions) \| [Open-ended_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Open-ended_Questions) | OpenLMLab | 2785 | CN | MT | COL | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |
- camel - ai/code](https://huggingface.co/datasets/camel-ai/ai_society)\|[camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology) \|[camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics) \|[camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry) \|[camel-ai/math](https://huggingface.co/datasets/camel-ai/math) | camel-ai | 760620 | EN | MT | SI | Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |
- GPT4Tools - Or1of7TJuWvmrJpPoOx0cLdcWry/view?usp=share_link) | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |
- Auto CoT - takeshi188/zero_shot_cot/dataset](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/dataset) \|[kojima-takeshi188/zero_shot_cot/log](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/log) | amazon-science | | EN | | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |
- MOSS - 002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data)\| [moss-003-sft-data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data/conversations/conversation_without_plugins) | fnlp | 1583595 | EN/CN | SI | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |
- ultrachat - CoT/tree/main/ultrachat) |
- LAION-AI/Open-Assistant - generated, human-annotated | |
- akoksal/LongForm
- sail-sg/symbolic-instruction-tuning - instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning) | sail-sg | 800K | ML | SI | | | Human Synthetic Examples | |
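- michael-wzhu/PromptCBLUE - wzhu | 110113 | CN | SI | | | Medical consultation questions collected from the internet (110,113), reflecting the real-world consultation needs of different users and patients. All responses are currently generated by OpenAI's `GPT-3.5` engine. | |
- mbzuai-nlp/LaMini-LM - instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | MBZUAI/LaMini-instruction | **2.58M** | EN | MT | SI | | knowledge extracted from large language models via offline distillation | |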
- WizardLM
- thu-coai/Safety-Prompts - coai/Safety-Prompts](https://huggingface.co/datasets/thu-coai/Safety-Prompts) | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models and for aligning model outputs with human values. |
- Chatgpt-Comparison-Detection project - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | | 24.3K | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |
- Chinese-Vicuna
- ColossalChat
- cerebras-lora-alpaca - GPT | 2.7B | [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | 52k | En |
- FLAN-Muffin - CoT/tree/main/FLAN-Muffin) |
- ShareChat - CoT/tree/main/ShareGPT) |
- Guanaco - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |
- belle_cn - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |
- prosocial dialog - dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | allenai | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + manual human feedback | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |
- finance_en - alpaca](https://huggingface.co/datasets/allenai/prosocial-dialog) | | 68912 | EN | TS | COL | finance-related QA | GPT3.5 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |
- xP3 - CoT/tree/main/xP3) |
- instruct - source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |
- StackLLaMA - exchange-paired](lvwerra/stack-exchange-paired) | | todo | EN | | HG | | | |
- Zhihu-KOL - KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | OpenAssistant | 1M | | SI | HG | Zhihu data for training Open Assistant | | |
- rlhf-reward-datasets
- Dahoas/full-hh-rlhf
- Dahoas/synthetic-instruct-gptj-pairwise
- Dahoas/rm-static - static](https://huggingface.co/datasets/Dahoas/static-hh) used for training reward models after supervised fine-tuning. |
- guanaco
- alpaca-lora - lab/stanford_alpaca/blob/main/alpaca_data.json), [alpaca_data_cleaned](https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json) | 52k | En |
- dolly - lab/stanford_alpaca/blob/main/alpaca_data.json) | 52k | En |
- HC3-Chinese - SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | Hello-SimpleAI\|万得资讯 | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
- Chinese-LLaMA-Alpaca - LLaMA-Alpaca/tree/main/data), [pCLUE](https://github.com/CLUEbenchmark/pCLUE), [translation2019zh](https://github.com/brightmart/nlp_chinese_corpus#5%E7%BF%BB%E8%AF%91%E8%AF%AD%E6%96%99translation2019zh), [alpaca_data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json), Self-Instruct | 2M | Zh |
- HC3 - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Hello-SimpleAI \| 万得资讯 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |
- Instruction-Tuning-with-GPT-4/GPT-4-LLM - Tuning-with-GPT-4 | 52k | English | Ranked responses to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate their quality (note: the data is evaluated by the `GPT-4` model, not by humans). The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" |
- Luotuo - alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json) | 52k | Zh |
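Several of the Alpaca-derived entries above ship records with `instruction`, `input`, and `output` fields. As a rough illustration of how such a record might be turned into training text, here is a minimal sketch. The prompt template, the `build_prompt` helper, and the sample record are all hypothetical and are not taken from any of the listed projects; each project defines its own template.

```python
from typing import TypedDict


class InstructionRecord(TypedDict):
    """One Alpaca-style record (field names as used in alpaca_data.json)."""
    instruction: str
    input: str
    output: str


def build_prompt(record: InstructionRecord) -> str:
    """Render a record as a prompt using a hypothetical template."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    # The input field is often empty; only add the section when it has content.
    if record["input"].strip():
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


# Made-up sample record, for illustration only.
example: InstructionRecord = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}

prompt = build_prompt(example)
# For supervised fine-tuning, the target output is appended to the prompt.
full_text = prompt + example["output"]
```

Keeping the prompt and the output separate until the last step makes it easier to mask the prompt tokens out of the loss, if the training setup calls for that.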
alpaca_chinese_dataset
- https://github.com/hikariming/alpaca_chinese_dataset
- stanford_alpaca - lab/stanford_alpaca/blob/main/seed_tasks.jsonl), where the full list of tasks can be found
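The seed tasks file mentioned above uses the JSON Lines layout (one JSON object per line). A minimal sketch of reading such a file with the standard library follows. The `read_jsonl` helper and the two inline sample lines are hypothetical, and the `instruction` field is an assumption about the schema, so check the actual file before relying on it.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


# Made-up stand-in for a seed tasks file; a real file would be opened with
# open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    '{"id": "task_0", "instruction": "List three fruits."}\n'
    "\n"
    '{"id": "task_1", "instruction": "Name a prime number."}\n'
)

tasks = list(read_jsonl(sample))
instructions = [task["instruction"] for task in tasks]
```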
Med-ChatGLM/data
- https://github.com/SCIR-HI/Med-ChatGLM
[BigScience/P3](https://huggingface.co/datasets/bigscience/P3)
- Paper/Project Link
- Dataset Link
xMTF - BigScience
- Project Link
- Dataset Link
- Project Link
InstructDial
- Paper/Project Link
- Paper/Project Link
- Dataset Link
[Instruction in the Wild](https://github.com/XueFuzhao/InstructionWild)
- Dataset Link
- Paper/Project Link
- Paper/Project Link
[Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP)
- SteamSHP
- DataLinks
[allenai/prosocial-dialog](https://huggingface.co/datasets/allenai/prosocial-dialog)
- ProsocialDialog: A Prosocial Backbone for Conversational Agents
[allenai/natural-instructions](https://github.com/allenai/natural-instructions)
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
[nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
- laion/OIG - questions](https://huggingface.co/datasets/pacovaldez/stackoverflow-questions) 3. subset of [bigscience/bloomz-p3](https://huggingface.co/bigscience/bloomz-p3)
- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
[bigscience/xP3](https://huggingface.co/datasets/bigscience/xP3)
- Crosslingual Generalization through Multitask Finetuning
[orhonovich/unnatural-instructions](https://github.com/orhonovich/unnatural-instructions)
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
[Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
- Instruction Tuning with GPT-4
[databrickslabs/dolly](https://github.com/databrickslabs/dolly/tree/master/data)
- Free Dolly
[OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
- OpenAssistant Conversations - Democratizing Large Language Model Alignment
BELLE/data/1.5M
- https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M
- https://github.com/LianjiaTech/BELLE/blob/main/data/1.5M/zh_seed_tasks.json
- https://huggingface.co/datasets
pCLUE
- prompt
- https://github.com/CLUEbenchmark/pCLUE
- prompt
- Text classification
COIG
- Chinese Open Instruction Generalist: A Preliminary Release
- https://huggingface.co/datasets/BAAI/COIG
[Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
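Preference datasets of this kind pair a preferred (`chosen`) response with a less preferred (`rejected`) one. One common way to use such pairs is a pairwise ranking loss for a reward model, `-log(sigmoid(r_chosen - r_rejected))`. The sketch below shows only that loss on scalar scores. The `pairwise_loss` name and the example scores are hypothetical, and this is not presented as the exact objective used by any paper listed here.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)) and picks the
    branch that keeps the exponent non-positive, so large margins in
    either direction do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for one chosen/rejected pair.
loss_when_correct = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_when_wrong = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss is small when the chosen response scores higher than the rejected one and grows roughly linearly with the margin when the ordering is wrong.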
[HuggingFaceH4/stack-exchange-preferences](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences)
- A General Language Assistant as a Laboratory for Alignment
Natural Instruction / Super-Natural Instruction
- Paper/Project link
- Dataset link
HH-RLHF - Anthropic
- Paper/Project Link
- Dataset Link
- Paper/Project Link
[Unnatural Instruction](https://github.com/orhonovich/unnatural-instructions)
- Paper/Project Link
- Dataset Link
- Paper/Project Link
[Self-Instruct](https://github.com/yizhongw/self-instruct)
- Paper/Project Link
- Dataset Link
[UnifiedSKG - HKU](https://unifiedskg.com/)
- Paper/Project Link
- DataSet Link
- Paper/Project Link
[Google/Flan Collection](https://github.com/google-research/FLAN/tree/main/flan/v2)
- Paper/Project Link
- Dataset Link
- Paper/Project Link
OpenAI WebGPT.
- WebGPT paper
- Dataset Link
OpenAI Summarization.
- Dataset Link
- Paper/Project Link
[PhoebusSi/Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
- Github Repo
Open Instruction Generalist (OIG).
- Instruction Generalist dataset - school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.
ChatGPT Distillation Data
- HC3 english dataset - answer examples.

Programming Languages

Python 27 Jupyter Notebook 7 HTML 1 C 1 C++ 1
