awesome-instruction-dataset

A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)
https://github.com/yaodongC/awesome-instruction-dataset

Last synced: 15 days ago

Uncategorized
[(Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX](https://minigpt-4.github.io/)
- `BSD 3-Clause`
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- Interactive ChatCaptioner for image and video
[(PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
- Github Repo
[(tatsu-lab/Alpaca)|52K|EN|MT|SI](https://github.com/tatsu-lab/stanford_alpaca)
- alpaca-blog
[(haotian-liu/LLaVA)|150K|EN|MT|MIX](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
- Visual Instruction Tuning
[(allenai/natural-instructions)|1.6K|ML|MT|HG](https://github.com/allenai/natural-instructions)
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
[(sunrainyg/InstructCV)|EN|MT|MIX](https://github.com/AlaaLab/InstructCV)
- InstructCV
[(JosephusCheung/GuanacoDataset)|534K|ML|MT|SI](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
- `GPL-3.0`
[(allenai/prosocial-dialog)|58K|EN|MT|MIX](https://huggingface.co/datasets/allenai/prosocial-dialog)
- ProsocialDialog: A Prosocial Backbone for Conversational Agents
- `CC BY 4.0`
[(bigscience/xP3)|N/A|ML|MT|MIX](https://huggingface.co/datasets/bigscience/xP3)
- Crosslingual Generalization through Multitask Finetuning
[(nomic-ai/gpt4all)|437k|EN|MT|COL](https://github.com/nomic-ai/gpt4all)
- 1. laion/OIG 2. [pacovaldez/stackoverflow-questions](https://huggingface.co/datasets/pacovaldez/stackoverflow-questions) 3. subset of [bigscience/bloomz-p3](https://huggingface.co/bigscience/bloomz-p3)
- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
[(google-research/FLAN)|N/A|EN|MT|MIX](https://github.com/google-research/FLAN/tree/main/flan/v2)
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
[(orhonovich/unnatural-instructions)|240K|EN|MT|MIX](https://github.com/orhonovich/unnatural-instructions)
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
[(databrickslabs/dolly)|15K|EN|MT|HG](https://github.com/databrickslabs/dolly/tree/master/data)
- Free Dolly
- `CC BY-SA 3.0`
[(OpenAssistant/oasst1)|161K|ML|MT|HG](https://huggingface.co/datasets/OpenAssistant/oasst1)
- OpenAssistant Conversations - Democratizing Large Language Model Alignment
[(RyokoAI/ShareGPT52K)|90K|ML|MT|SI](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- `CC0 1.0 Universal`
[(zjunlp/Mol-Instructions)|2043K|ML|MT|MIX](https://huggingface.co/datasets/zjunlp/Mol-Instructions)
- Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
- `CC BY 4.0`
[(Anthropic/hh-rlhf)|22k|EN|MT|MIX](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
- (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
[(thu-coai/Safety-Prompts)|100k|CN|MT|MIX](https://github.com/thu-coai/Safety-Prompts)
- Safety Assessment of Chinese Large Language Models
- `Apache License 2.0`
[(HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences)
- A General Language Assistant as a Laboratory for Alignment
- stack-exchange-paired
- `CC BY-SA 4.0`
[(Reddit/eli5)|500k|EN|MT|HG](https://huggingface.co/datasets/eli5)
- r/explainlikeimfive
- eli5 dataset
- [stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired)
[(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
- `CC BY-NC 4.0`
- Instruction Tuning with GPT-4
- (tatsu-lab/Alpaca)|52K|EN|MT|SI
- `CC BY-NC 4.0`
[(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese)
- How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
[(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
- (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
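
Each entry above packs its metadata into the link label as `(owner/name)|…|…`. The following is a minimal sketch of how such a label could be split into fields. The field names `size`, `lang`, `task`, and `gen` are assumptions about what the pipe-separated tags mean (dataset size, language, task type, generation method); this page does not define them, and `parse_entry` is a hypothetical helper, not part of any listed project.

```python
import re
from typing import NamedTuple, Optional

# Matches a full entry line of the form "[(owner/name)|tag|tag|...](url)".
ENTRY_RE = re.compile(
    r"^\[\((?P<name>[^)]+)\)\|(?P<tags>[^\]]+)\]\((?P<url>[^)\s]+)\)$"
)


class Entry(NamedTuple):
    name: str  # e.g. "tatsu-lab/Alpaca"
    size: str  # assumed: number of examples, kept as text ("52K", "N/A")
    lang: str  # assumed: language tag(s); several are joined with "|"
    task: str  # assumed: task-type tag
    gen: str   # assumed: generation-method tag
    url: str


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one entry line; return None if it does not fit the pattern."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    tags = m.group("tags").split("|")
    if len(tags) < 4:
        # A label with fewer than four tags is missing a field; skip it.
        return None
    # First tag is the size, the last two are task and generation method,
    # and whatever sits in between is treated as the language tag(s).
    size, task, gen = tags[0], tags[-2], tags[-1]
    lang = "|".join(tags[1:-2])
    return Entry(m.group("name"), size, lang, task, gen, m.group("url"))


if __name__ == "__main__":
    sample = "[(tatsu-lab/Alpaca)|52K|EN|MT|SI](https://github.com/tatsu-lab/stanford_alpaca)"
    print(parse_entry(sample))
```

Sizes are kept as strings because the list mixes forms such as `52K`, `437k`, and `N/A`; sub-bullets like paper titles and licenses do not match the pattern and return `None`.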

