https://github.com/upbit/awesome-datasets-for-llms
Awesome training/finetuning datasets for LLMs
List: awesome-datasets-for-llms
- Host: GitHub
- URL: https://github.com/upbit/awesome-datasets-for-llms
- Owner: upbit
- License: unlicense
- Created: 2023-04-05T13:20:08.000Z
- Default Branch: main
- Last Pushed: 2023-04-12T11:10:20.000Z
- Last Synced: 2025-03-15T08:01:59.875Z
- Size: 8.79 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Awesome datasets for LLMs
Awesome training/finetuning datasets for LLMs
## fine-tuning
* [`alpaca_data.json`](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json): 52K instruction-following examples generated by `text-davinci-003` (from [tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca))
* [Anthropic HH RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf): Human preference data about helpfulness and harmlessness / Human-generated red teaming data
* [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP): 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice
* BELLE/[1.0M](https://huggingface.co/datasets/BelleGroup/train_1M_CN), [0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN): 1.5M Chinese examples generated by ChatGPT, with subsets covering several instruction types and domains (from [LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE))
* BELLE/[`School Math`](https://huggingface.co/datasets/BelleGroup/school_math_0.25M), [`Multiturn Chat`](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M): part of the 10M dataset open-sourced by the BELLE project; the math problem and multi-turn chat subsets have been released so far (from [LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE))
* [`Alpaca-CoT datasets`](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT): Fine-tuning datasets built on the Stanford Alpaca project, including CoT, dialog, FastChat, and instinwild data (from [PhoebusSi/Alpaca-CoT](https://github.com/PhoebusSi/Alpaca-CoT/blob/main/CN_README.md#%E6%95%B0%E6%8D%AE%E7%BB%9F%E8%AE%A1))
* [ShareGPT Vicuna unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset): 48k ShareGPT conversations used by Vicuna (from [lm-sys/FastChat, issue #90](https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773))
* [`instinwild_en.json`](https://drive.google.com/file/d/1qOfrl0RIWgH2_b1rYCEVxjHF3u3Dwqay/view?usp=sharing), [`instinwild_ch.json`](https://drive.google.com/file/d/1OqfOUWYfrK6riE9erOx-Izp3nItfqz_K/view?usp=sharing): A larger set of instructions generated from 479 seed instructions (from [XueFuzhao/InstructionWild](https://github.com/XueFuzhao/InstructionWild))
* [lvwerra/stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired): a processed version of the HuggingFaceH4/stack-exchange-preferences dataset (from [trl-lib/llama-7b-se-rl-peft](https://huggingface.co/trl-lib/llama-7b-se-rl-peft))
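
Several of the instruction datasets above follow the Alpaca record layout, with `instruction`, `input`, and `output` fields. A minimal sketch of turning such records into prompt/response pairs might look like this. The sample records and the prompt template are invented for illustration and are not the template used by any of the listed projects.

```python
import json

# Hypothetical sample in the Alpaca record layout (instruction / input / output).
# These two records are made up for illustration; they are not from alpaca_data.json.
SAMPLE = """[
  {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
  {"instruction": "Name a primary color.", "input": "", "output": "Red"}
]"""


def to_prompt(record):
    """Join the instruction and the optional input into one prompt string.

    The blank-line separator is an assumption, not an official template.
    """
    if record.get("input"):
        return f"{record['instruction']}\n\n{record['input']}"
    return record["instruction"]


def load_pairs(text):
    """Parse a JSON array of records into (prompt, response) tuples."""
    return [(to_prompt(r), r["output"]) for r in json.loads(text)]


pairs = load_pairs(SAMPLE)
for prompt, response in pairs:
    print(repr(prompt), "->", repr(response))
```

To use this with a downloaded file, read its contents and pass the text to `load_pairs`; records with an empty `input` fall back to the instruction alone.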
## domain-specific
* [HealthCareMagic-200k](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view?usp=sharing): 200k real conversations between patients and doctors from HealthCareMagic.com (from [Kent0n-Li/ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor))
* [icliniq-15k](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view?usp=sharing): 15k real conversations between patients and doctors from icliniq.com (from [Kent0n-Li/ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor))
* [`ADGEN dataset`](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view)/[`Tsinghua mirror`](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1): generate a piece of advertising copy (summary) from the input (content) (from [THUDM/ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning#%E4%B8%8B%E8%BD%BD%E6%95%B0%E6%8D%AE%E9%9B%86))
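
The ADGEN entry above describes a `content` to `summary` mapping. As a rough sketch, and assuming the file stores one JSON object per line (the line-delimited layout is an assumption here, and the sample values are invented), the pairs could be read like this:

```python
import json

# Hypothetical sample: one JSON object per line, using the `content` and
# `summary` field names from the ADGEN entry. The values are invented.
SAMPLE = "\n".join([
    '{"content": "type#dress*color#red", "summary": "A red dress for summer."}',
    '{"content": "type#shoes*material#leather", "summary": "Classic leather shoes."}',
])


def read_pairs(text):
    """Yield (content, summary) tuples from line-delimited JSON, skipping blank lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["content"], record["summary"]


examples = list(read_pairs(SAMPLE))
for content, summary in examples:
    print(content, "=>", summary)
```

If the downloaded file turns out to be a single JSON array instead, `json.loads` on the whole text would be the simpler route.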