Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chaoswork/sft_datasets
A curated collection of open-source SFT (supervised fine-tuning) datasets, updated on an ongoing basis.
- Host: GitHub
- URL: https://github.com/chaoswork/sft_datasets
- Owner: chaoswork
- Created: 2023-06-02T02:13:57.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-06-02T02:15:45.000Z (over 1 year ago)
- Last Synced: 2024-08-03T01:25:58.465Z (6 months ago)
- Topics: chinese-dataset, datasets, large-language-models, llms, supervised-finetuning
- Homepage:
- Size: 3.91 KB
- Stars: 398
- Watchers: 1
- Forks: 31
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-llm-and-aigc - chaoswork/sft_datasets
README
# Open-Source SFT Datasets
| Dataset | Size | Lang | Task | Gen | Type | Source | Link |
|-------------------------------------------------------------------------------------- |-------- |----- |----- |--- |----------------------------------------- |----------------------- |------------------------------------------------------------------------------------------ |
| [belle\_cn](https://huggingface.co/BelleGroup) | 1079517 | CN | TS/MT | SI | General instructions, mathematical reasoning, dialogue | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |
| [firefly](https://github.com/yangjianxin1/Firefly) | 1649398 | CN | MT | COL | 23 NLP tasks | Collected Chinese datasets with manually written instruction templates | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |
| [GAOKAO](https://github.com/OpenLMLab/GAOKAO-Bench) | 2785 | CN | MT | COL | Multiple-choice, fill-in-the-blank, and other questions from the Gaokao (college entrance exam) | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |
| [COIG](https://huggingface.co/datasets/BAAI/COIG) | 298428 | CN | MT | COL | Collected exam, translation, and human-values instruction datasets; counterfactual dialogues based on knowledge graphs | Automated tools plus manual verification | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/COIG) |
| [pCLUE](https://github.com/CLUEbenchmark/pCLUE) | 1200705 | CN | MT | | 73 prompts covering 9 NLP tasks, including classification, inference, keyword recognition, and reading comprehension | | [Download](https://github.com/CLUEbenchmark/pCLUE/tree/main/datasets) |
| [CSL](https://github.com/ydli-ai/CSL) | 396209 | CN | MT | | Metadata for 400,000 Chinese papers, 26 prompts | | [Download](https://drive.google.com/file/d/1xEDgtqHU4qm0Sp-dKjc5KerAmWydmh3-/view?usp=sharing) |
| [CNewSum](https://dqwang122.github.io/projects/CNewSum/) | 304307 | CN | TS | | Chinese summarization dataset released by ByteDance and UCSB | | [Download](https://drive.google.com/u/0/uc?id=1A_YcQ3cBAI7u9iVIoCeVLLgwU7UUzHHv&export=download) |
| [Coco-cn](https://github.com/li-xirong/coco-cn) | | CN | TS | | Image-text multimodal | | [Download](https://github.com/li-xirong/coco-cn) |
| [news\_commentary](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) | 69200 | EN/CN | TS | | English-Chinese translation data | | [Download](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) |
| [Chain of Thought](https://github.com/google-research/FLAN) | 74771 | EN/CN | MT | HG | CoT-related tasks | CoT annotated by humans on existing datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |
| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | 37175 | EN/CN | TS | MIX | Dialogue evaluation | gpt-3.5 or human | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |
| [instinwild](https://github.com/XueFuzhao/InstructionWild) | 52191 | EN/CN | MT | SI | Generation, open-domain QA, brainstorming | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instinwild) |
| [Alpaca\_GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52002 | EN/CN | MT | SI | General instructions | Alpaca data generated by GPT-4 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpacaGPT4) |
| [MOSS](https://github.com/OpenLMLab/MOSS) | 1583595 | EN/CN | SI | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |
| [LLMZoo](https://github.com/FreedomIntelligence/LLMZoo) | | ML | | | | | [Download](https://huggingface.co/datasets/FreedomIntelligence/phoenix-sft-data-v1/tree/main) |
| [Guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 534610 | ML | MT | SI | Various NLP tasks | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |
| [Natural Instructions](https://github.com/allenai/natural-instructions) | 5040134 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |
| [xP3](https://huggingface.co/datasets/bigscience/xP3) | 78883588 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/xP3) |
| [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52002 | EN | MT | SI | General instructions | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |
| [GPT4all](https://github.com/nomic-ai/gpt4all) | 806199 | EN | MT | COL | Code, stories, dialogue | Distilled from GPT-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |
| [GPTeacher](https://github.com/teknium1/GPTeacher) | 29013 | EN | MT | SI | General, role-play, and tool instructions | GPT-4 & toolformer | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |
| [prosocial dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | 165681 | EN | TS | MIX | Dialogue | Questions rewritten by GPT-3, replies written by humans | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |
| [finance\_en](https://huggingface.co/datasets/gbharti/finance-alpaca) | 68912 | EN | TS | COL | Financial-domain QA | GPT3.5 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |
| [instruct](https://huggingface.co/datasets/swype/instruct) | 888969 | EN | MT | COL | Augmentation of GPT4All, Alpaca, and open-source datasets | Built with the NLP augmentation tools provided by AllenAI | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |
| [Code Alpaca](https://github.com/sahil280114/codealpaca) | 20022 | EN | SI | SI | Code generation, editing, optimization | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |
| [webGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | 18994 | EN | TS | MIX | Information-retrieval QA | Fine-tuned GPT-3 + human evaluation | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/webGPT) |
| [dolly 2.0](https://github.com/databrickslabs/dolly) | 15015 | EN | TS | HG | Seven task types: open QA, closed QA, information extraction, summarization, brainstorming, classification, and creative writing | Human-annotated | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/dolly) |
| [baize](https://github.com/project-baize/baize-chatbot) | 653699 | EN | MT | COL | Alpaca and various QA tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |
| [hh-rlhf](https://github.com/anthropics/hh-rlhf) | 284517 | EN | TS | MIX | Dialogue | RLHF models | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |
| [OIG(part)](https://laion.ai/blog/oig-dataset/) | 49237 | EN | MT | COL | Various NLP tasks | Collection of human-annotated datasets plus data augmentation | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/OIG) |
| [camel](https://github.com/lightaime/camel) | 760620 | EN | MT | SI | Role-play dialogues in physics, biology, chemistry, programming, math, society, and other domains | Generated by gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |
| [FLAN-Muffin](https://huggingface.co/datasets/Muennighoff/flan) | 1764800 | EN | MT | COL | 60 NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/FLAN-Muffin) |
| [GPT4Tools](https://github.com/StevenGrove/GPT4Tools) | 71446 | EN | MT | SI | A collection of tool-related instructions | gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |
| [ShareChat](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) | 1663241 | EN | MT | MIX | General instructions | Collected from ShareGPT | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ShareGPT) |
| [Auto CoT](https://github.com/amazon-science/auto-cot) | | EN | | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |
| [ultrachat](https://github.com/thunlp/UltraChat) | 28247446 | EN | | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ultrachat) |
| [StackLLaMA](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) | todo | EN | | | | | |
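
Most links in the table point to a subfolder of the `QingyiSi/Alpaca-CoT` dataset repository on Hugging Face. The following is a minimal sketch, not part of the original README, of how one such subfolder might be fetched without cloning the whole repository. It assumes the optional `huggingface_hub` package is installed and that the subfolder names match the link targets above. The helper names `tree_url`, `allow_pattern`, and `download_subset` are hypothetical.

```python
from urllib.parse import quote

# Dataset repository that most entries in the Link column point to.
ALPACA_COT_REPO = "QingyiSi/Alpaca-CoT"


def tree_url(subfolder: str) -> str:
    """Browser URL of one subfolder, following the pattern used in the Link column."""
    return (
        f"https://huggingface.co/datasets/{ALPACA_COT_REPO}"
        f"/tree/main/{quote(subfolder)}"
    )


def allow_pattern(subfolder: str) -> str:
    """Glob pattern that matches the files under one subfolder of the repository."""
    return f"{subfolder.strip('/')}/*"


def download_subset(subfolder: str, local_dir: str = "data") -> str:
    """Download a single subfolder and return the local path.

    Requires network access and `pip install huggingface_hub`.
    """
    # Imported lazily so the URL helpers above work without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=ALPACA_COT_REPO,
        repo_type="dataset",
        allow_patterns=allow_pattern(subfolder),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Print the browser link for the alpaca subset; no download is triggered here.
    print(tree_url("alpaca"))
```

Calling `download_subset("alpaca")` would fetch only the files under `alpaca/`. Some of these subsets are large (xP3 and ultrachat list tens of millions of samples), so restricting the download to one subfolder is usually preferable to pulling the full repository.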