Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-instruction-datasets
A collection of prompt and instruction datasets for training chat LLMs such as ChatGPT. It gathers a wide variety of instruction datasets used to train ChatLLM models.
https://github.com/jianzhnie/awesome-instruction-datasets
Last synced: 4 days ago
Statistics
- Chain of Thought - research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data](https://github.com/google-research/FLAN/tree/main/flan/v2/niv2_few_shot_data) | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |
- GPT4all - ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |
- GPTeacher - 4 General-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Instruct)\|[Roleplay-Instruct](https://github.com/teknium1/GPTeacher/tree/main/Roleplay) \|[Code-Instruct ](https://github.com/teknium1/GPTeacher/tree/main/Codegen)\| [Toolformer](https://github.com/teknium1/GPTeacher/tree/main/Toolformer) | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |
- AlpacaDataCleaned - cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |
- Natural Instructions - ,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |
- 华驼 (HuaTuo) - HI/Huatuo-Llama-Med-Chinese/blob/main/data/llama_data.json) \|[liver cancer](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/data-literature/liver_cancer.json) | SCIR-HI (Harbin Institute of Technology) | 8K | CN | TS | SI | public and self-built Chinese medical knowledge bases | GPT3.5 | |
- firefly - train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |
- Code Alpaca - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |
- mosaicml/llm-foundry - 15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). | human annotated | |
- baize - baize/baize-chatbot/tree/main/data) \|[medical_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/medical_chat_data.json) \| [quora_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/quora_chat_data.json) \|[stackoverflow_chat_data.json](https://github.com/project-baize/baize-chatbot/blob/main/data/stackoverflow_chat_data.json) | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |
- hh-rlhf - rlhf](https://huggingface.co/datasets/anthropic/hh-rlhf) | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |
- GAOKAO - in-the-blank_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Fill-in-the-blank_Questions) \| [Multiple-choice_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Multiple-choice_Questions) \| [Open-ended_Questions](https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data/Open-ended_Questions) | OpenLMLab | 2785 | CN | MT | COL | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |
- camel - ai/code](https://huggingface.co/datasets/camel-ai/ai_society)\|[camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology) \|[camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics) \|[camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry) \|[camel-ai/math](https://huggingface.co/datasets/camel-ai/math) | camel-ai | 760620 | EN | MT | SI | Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |
- GPT4Tools - Or1of7TJuWvmrJpPoOx0cLdcWry/view?usp=share_link) | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |
- Auto CoT - takeshi188/zero_shot_cot/dataset](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/dataset) \|[kojima-takeshi188/zero_shot_cot/log](https://github.com/kojima-takeshi188/zero_shot_cot/tree/main/log) | amazon-science | | EN | | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |
- MOSS - 002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data)\| [moss-003-sft-data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data/conversations/conversation_without_plugins) | fnlp | 1583595 | EN/CN | SI | | | | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |
- ultrachat - CoT/tree/main/ultrachat) |
- LAION-AI/Open-Assistant - generated, human-annotated | |
- akoksal/LongForm
- sail-sg/symbolic-instruction-tuning - instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning) | sail-sg | 800K | ML | SI | | | Human Synthetic Examples | |
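- michael-wzhu/PromptCBLUE - wzhu | 110113 | CN | SI | | | Medical consultation questions collected from the internet (110,113), reflecting the real-world consultation needs of different users and patients. All responses are currently generated by the OpenAI `GPT-3.5` engine. | |
- mbzuai-nlp/LaMini-LM - instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | MBZUAI/LaMini-instruction | **2.58M** | EN | MT | SI | | knowledge extracted from large language models via offline distillation | |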
- WizardLM
- thu-coai/Safety-Prompts - coai/Safety-Prompts](https://huggingface.co/datasets/thu-coai/Safety-Prompts) | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models, aligning model outputs with human values. |
- Chatgpt-Comparison-Detection project - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | | 24.3K | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |
- Chinese-Vicuna
- ColossalChat
- cerebras-lora-alpaca - GPT | 2.7B | [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | 52k | En |
- Luotuo - alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json) | 52k | Zh |
- FLAN-Muffin - CoT/tree/main/FLAN-Muffin) |
- ShareChat - CoT/tree/main/ShareGPT) |
- Guanaco - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |
- belle_cn - davinci-003 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |
- prosocial dialog - dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | allenai | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans feedback manually | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |
- finance_en - alpaca](https://huggingface.co/datasets/allenai/prosocial-dialog) | | 68912 | EN | TS | COL | financial related qa | GPT3.5 | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |
- xP3 - CoT/tree/main/xP3) |
- instruct - source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |
- StackLLaMA - exchange-paired](lvwerra/stack-exchange-paired) | | todo | EN | | HG | | | |
- Zhihu-KOL - KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | Open Assistant | 1M | | SI | HG | Zhihu data for training Open Assistant | | |
- rlhf-reward-datasets
- Dahoas/full-hh-rlhf
- Dahoas/synthetic-instruct-gptj-pairwise
- Dahoas/rm-static - static](https://huggingface.co/datasets/Dahoas/static-hh) used for training reward models after supervised fine-tuning. |
- guanaco
- Instruction-Tuning-with-GPT-4/GPT-4-LLM - Tuning-with-GPT-4 | 52k | English | Ranked responses to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate their quality. Note that the data is evaluated by the `GPT-4` model, not by humans. The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" |
- Chinese-LLaMA-Alpaca - LLaMA-Alpaca/tree/main/data), [pCLUE](https://github.com/CLUEbenchmark/pCLUE), [translation2019zh](https://github.com/brightmart/nlp_chinese_corpus#5%E7%BF%BB%E8%AF%91%E8%AF%AD%E6%96%99translation2019zh), [alpaca_data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json), Self-Instruct | 2M | Zh |
- alpaca-lora - lab/stanford_alpaca/blob/main/alpaca_data.json), [alpaca_data_cleaned](https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json) | 52k | En |
- dolly - lab/stanford_alpaca/blob/main/alpaca_data.json) | 52k | En |
- HC3-Chinese - SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese) | Hello-SimpleAI \| Wind Information (万得资讯) | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
- HC3 - SimpleAI/HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | Hello-SimpleAI \| Wind Information (万得资讯) | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | [download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |
-
[PhoebusSi/Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
-
alpaca_chinese_dataset
- https://github.com/hikariming/alpaca_chinese_dataset
- stanford_alpaca - lab/stanford_alpaca/blob/main/seed_tasks.jsonl), where all of the tasks can be found
-
Med-ChatGLM/data
-
[BigScience/P3](https://huggingface.co/datasets/bigscience/P3)
-
xMTF - BigScience
-
InstructDial
-
[Instruction in the Wild](https://github.com/XueFuzhao/InstructionWild)
-
[Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP)
-
[allenai/prosocial-dialog](https://huggingface.co/datasets/allenai/prosocial-dialog)
-
[allenai/natural-instructions](https://github.com/allenai/natural-instructions)
-
[nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
- laion/OIG - questions](https://huggingface.co/datasets/pacovaldez/stackoverflow-questions) 3. subset of [bigscience/bloomz-p3](https://huggingface.co/bigscience/bloomz-p3)
- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
-
[bigscience/xP3](https://huggingface.co/datasets/bigscience/xP3)
-
[orhonovich/unnatural-instructions](https://github.com/orhonovich/unnatural-instructions)
-
[Instruction-Tuning-with-GPT-4/GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-
[databrickslabs/dolly](https://github.com/databrickslabs/dolly/tree/master/data)
-
[OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
-
BELLE/data/1.5M
-
pCLUE
-
COIG
-
[Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
-
[HuggingFaceH4/stack-exchange-preferences](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences)
-
Natural Instruction / Super-Natural Instruction
-
HH-RLHF - Anthropic
-
[Unnatural Instruction](https://github.com/orhonovich/unnatural-instructions)
-
[Self-Instruct](https://github.com/yizhongw/self-instruct)
-
[UnifiedSKG - HKU](https://unifiedskg.com/)
-
[Google/Flan Collection](https://github.com/google-research/FLAN/tree/main/flan/v2)
-
OpenAI WebGPT.
-
OpenAI Summarization.
-
Open Instruction Generalist (OIG).
- Instruction Generalist dataset - school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.
-
ChatGPT Distillation Data
- HC3 english dataset - answer examples.