
An open API service indexing awesome lists of open source software.



A collection of prompt and instruction datasets (awesome-prompt-datasets, awesome-instruction-dataset) for training chat LLMs such as ChatGPT.

Last synced: 4 days ago

  • Statistics

    • Chain of Thought - research/FLAN/tree/main/flan/v2/cot_data) \|[few_shot_data]( | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | [download]( |
    • GPT4all - ai/gpt4all-j-prompt-generations]( | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | [download]( |
    • GPTeacher - 4 General-Instruct ](\|[Roleplay-Instruct]( \|[Code-Instruct ](\| [Toolformer]( | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | [download]( |
    • AlpacaDataCleaned - cleaned]( | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | [download]( |
    • Luotuo - alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json) | 52k | Zh |
    • Natural Instructions - ,.,-x) | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | [download]( |
    • 华驼(HuaTuo) - HI/Huatuo-Llama-Med-Chinese/blob/main/data/llama_data.json) \|[liver cancer]( | SCIR-HI (Harbin Institute of Technology) | 8K | CN | TS | SI | public and self-built Chinese medical knowledge bases | GPT3.5 | |
    • firefly - train-1.1M]( | | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | [download]( |
    • Code Alpaca - davinci-003 | [download]( |
    • dolly - lab/stanford_alpaca/blob/main/alpaca_data.json) | 52 k | En |
    • mosaicml/llm-foundry - 15k]( dataset and a filtered subset of [Anthropic's HH-RLHF]( | human annotated | |
    • baize - baize/baize-chatbot/tree/main/data) \|[medical_chat_data.json]( \| [quora_chat_data.json]( \|[stackoverflow_chat_data.json]( | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | [download]( |
    • hh-rlhf - rlhf]( | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | [download]( |
    • GAOKAO - in-the-blank_Questions]( \| [Multiple-choice_Questions]( \| [Open-ended_Questions]( | OpenLMLab | 2785 | CN | MT | COL | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated | [download]( |
    • camel - ai/code](\|[camel-ai/biology]( \|[camel-ai/physics]( \|[camel-ai/chemistry]( \|[camel-ai/math]( | camel-ai | 760620 | EN | MT | SI | Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | [download]( |
    • GPT4Tools - Or1of7TJuWvmrJpPoOx0cLdcWry/view?usp=share_link) | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | [download]( |
    • Auto CoT - takeshi188/zero_shot_cot/dataset]( \|[kojima-takeshi188/zero_shot_cot/log]( | amazon-science | | EN | | | | | [download]( |
    • MOSS - 002-sft-data](\| [moss-003-sft-data]( | fnlp | 1583595 | EN/CN | SI | | | | [download]( |
    • ultrachat - CoT/tree/main/ultrachat) |
    • LAION-AI/Open-Assistant - generated, human-annotated | |
    • akoksal/LongForm
    • sail-sg/symbolic-instruction-tuning - instruction-tuning]( | sail-sg | 800K | ML | SI | | | Human Synthetic Examples | |
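    • michael-wzhu/PromptCBLUE - wzhu | 110113 | CN | SI | | | Medical consultation questions from the internet (110,113), reflecting the real-world consultation needs of different users/patients. Currently, all responses are generated by OpenAI's `GPT-3.5` engine. | |
    • mbzuai-nlp/LaMini-LM - instruction]( | MBZUAI/LaMini-instruction | **2.58M** | EN | MT | SI | | knowledge extracted from large language models through offline distillation | |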
    • WizardLM
    • thu-coai/Safety-Prompts - coai/Safety-Prompts]( | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models and aligning model outputs with human values. |
    • Chatgpt-Comparison-Detection project - SimpleAI/HC3]( | | 24.3K | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |
    • Chinese-Vicuna
    • ColossalChat
    • cerebras-lora-alpaca - GPT | 2.7B | [AlpacaDataCleaned]( | 52k | En |
    • FLAN-Muffin - CoT/tree/main/FLAN-Muffin) |
    • ShareChat - CoT/tree/main/ShareGPT) |
    • Guanaco - davinci-003 | [download]( |
    • belle_cn - davinci-003 | [download]( |
    • prosocial dialog - dialog]( | allenai | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + manual human feedback | [download]( |
    • finance_en - alpaca]( | | 68912 | EN | TS | COL | financial related qa | GPT3.5 | [download]( |
    • xP3 - CoT/tree/main/xP3) |
    • instruct - source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | [download]( |
    • StackLLaMA - exchange-paired](lvwerra/stack-exchange-paired) | | todo | EN | | HG | | | |
    • Zhihu-KOL - KOL]( | OpenAssistant | 1M | | SI | HG | Zhihu data for training Open Assistant | | |
    • rlhf-reward-datasets
    • Dahoas/full-hh-rlhf
    • Dahoas/synthetic-instruct-gptj-pairwise
    • Dahoas/rm-static - static]( used for training reward models after supervised fine-tuning. |
    • guanaco
    • Instruction-Tuning-with-GPT-4/GPT-4-LLM - Tuning-with-GPT-4 | 52k | English | Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" |
    • Chinese-LLaMA-Alpaca - LLaMA-Alpaca/tree/main/data), [pCLUE](, [translation2019zh](, [alpaca_data](, Self-Instruct | 2M | Zh |
    • alpaca-lora - lab/stanford_alpaca/blob/main/alpaca_data.json), [alpaca_data_cleaned]( | 52 k | En |
    • HC3-Chinese - SimpleAI/HC3-Chinese]( | Hello-SimpleAI\|万得资讯 | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
    • HC3 - SimpleAI/HC3]( | Hello-SimpleAI \| 万得资讯 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | [download]( |
  • [Instruction in the Wild](

  • [PhoebusSi/Alpaca-CoT](

  • alpaca_chinese_dataset

  • Med-ChatGLM/data

  • [BigScience/P3](

  • xMTF - BigScience

  • HH-RLHF - Anthropic

  • InstructDial

  • [Stanford Human Preferences Dataset (SHP)](

  • [allenai/prosocial-dialog](

  • [allenai/natural-instructions](

  • [nomic-ai/gpt4all](

  • [bigscience/xP3](

  • [orhonovich/unnatural-instructions](

  • [Instruction-Tuning-with-GPT-4/GPT-4-LLM](

  • [databrickslabs/dolly](

  • [OpenAssistant/oasst1](

  • BELLE/data/1.5M

  • pCLUE

  • COIG

  • [Anthropic/hh-rlhf](

  • [HuggingFaceH4/stack-exchange-preferences](

  • Natural Instruction / Super-Natural Instruction

  • [Unnatural Instruction](

  • [Self-Instruct](

  • [UnifiedSKG - HKU](

  • [Google/Flan Collection](

  • OpenAI WebGPT.

  • OpenAI Summarization.

  • Open Instruction Generalist (OIG).

    • Instruction Generalist dataset - school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.
  • ChatGPT Distillation Data
