{"id":13642228,"url":"https://github.com/chaoswork/sft_datasets","last_synced_at":"2026-03-09T17:36:26.323Z","repository":{"id":171812708,"uuid":"648450331","full_name":"chaoswork/sft_datasets","owner":"chaoswork","description":"开源SFT数据集整理,随时补充","archived":false,"fork":false,"pushed_at":"2023-06-02T02:15:45.000Z","size":4,"stargazers_count":507,"open_issues_count":2,"forks_count":41,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-20T16:38:21.035Z","etag":null,"topics":["chinese-dataset","datasets","large-language-models","llms","supervised-finetuning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chaoswork.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-06-02T02:13:57.000Z","updated_at":"2025-04-17T07:49:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"78c0e1ad-1889-4203-94c7-55a46af7beb2","html_url":"https://github.com/chaoswork/sft_datasets","commit_stats":null,"previous_names":["chaoswork/sft_datasets"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/chaoswork/sft_datasets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoswork%2Fsft_datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoswork%2Fsft_datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoswork%2Fsft_datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoswork%2Fsft_datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/chaoswork","download_url":"https://codeload.github.com/chaoswork/sft_datasets/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaoswork%2Fsft_datasets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30304707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T17:35:44.120Z","status":"ssl_error","status_checked_at":"2026-03-09T17:35:43.707Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese-dataset","datasets","large-language-models","llms","supervised-finetuning"],"created_at":"2024-08-02T01:01:28.755Z","updated_at":"2026-03-09T17:36:26.293Z","avatar_url":"https://github.com/chaoswork.png","language":null,"funding_links":[],"categories":["Datasets"],"sub_categories":["数据集"],"readme":"# Open-Source SFT Dataset Collection\n\n\n\n| Dataset | Count | Lang | Task | Gen | Type | Source | Link |\n|-------------------------------------------------------------------------------------- |-------- |----- |----- |--- |----------------------------------------- |----------------------- 
|------------------------------------------------------------------------------------------ |\n| [belle\\_cn](https://huggingface.co/BelleGroup) | 1079517 | CN | TS/MT | SI | General instructions, math reasoning, dialogue | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |\n| [firefly](https://github.com/yangjianxin1/Firefly) | 1649398 | CN | MT | COL | 23 NLP tasks | Collected Chinese datasets with manually written instruction templates | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |\n| [GAOKAO](https://github.com/OpenLMLab/GAOKAO-Bench) | 2785 | CN | MT | COL | Multiple-choice, fill-in-the-blank, and other questions from the Gaokao (Chinese college entrance exam) | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |\n| [COIG](https://huggingface.co/datasets/BAAI/COIG) | 298428 | CN | MT | COL | Exams, translation, a collection of human-value instruction data, counterfactual dialogue based on knowledge graphs | Automated tools + human verification | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/COIG) |\n| [pCLUE](https://github.com/CLUEbenchmark/pCLUE) | 1200705 | CN | MT | | 73 prompts; 9 NLP tasks including classification, inference, keyword recognition, and reading comprehension | | [Download](https://github.com/CLUEbenchmark/pCLUE/tree/main/datasets) |\n| [CSL](https://github.com/ydli-ai/CSL) | 396209 | CN | MT | | Metadata for about 400,000 Chinese papers; 26 prompts | | [Download](https://drive.google.com/file/d/1xEDgtqHU4qm0Sp-dKjc5KerAmWydmh3-/view?usp=sharing) |\n| [CNewSum](https://dqwang122.github.io/projects/CNewSum/) | 304307 | CN | TS | | Chinese summarization dataset released by ByteDance and UCSB | | 
[Download](https://drive.google.com/u/0/uc?id=1A_YcQ3cBAI7u9iVIoCeVLLgwU7UUzHHv\u0026export=download) |\n| [Coco-cn](https://github.com/li-xirong/coco-cn) | | CN | TS | | Image-text multimodal | | [Download](https://github.com/li-xirong/coco-cn) |\n| [news\\_commentary](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) | 69200 | EN/CN | TS | | Chinese-English translation data | | [Download](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) |\n| [Chain of Thought](https://github.com/google-research/FLAN) | 74771 | EN/CN | MT | HG | CoT-related tasks | CoT annotated by humans on existing datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |\n| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | 37175 | EN/CN | TS | MIX | Dialogue evaluation | gpt-3.5 or human | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |\n| [instinwild](https://github.com/XueFuzhao/InstructionWild) | 52191 | EN/CN | MT | SI | Generation, open-domain QA, brainstorming | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instinwild) |\n| [Alpaca\\_GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52002 | EN/CN | MT | SI | General instructions | Alpaca data generated by GPT-4 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpacaGPT4) |\n| [MOSS](https://github.com/OpenLMLab/MOSS) | 1583595 | EN/CN | SI | | | | 
[Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |\n| [LLMZoo](https://github.com/FreedomIntelligence/LLMZoo) | | ML | | | | | [Download](https://huggingface.co/datasets/FreedomIntelligence/phoenix-sft-data-v1/tree/main) |\n| [Guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 534610 | ML | MT | SI | Various NLP tasks | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |\n| [Natural Instructions](https://github.com/allenai/natural-instructions) | 5040134 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |\n| [xP3](https://huggingface.co/datasets/bigscience/xP3) | 78883588 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/xP3) |\n| [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52002 | EN | MT | SI | General instructions | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |\n| [GPT4all](https://github.com/nomic-ai/gpt4all) | 806199 | EN | MT | COL | Code, stories, dialogue | Distilled from GPT-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |\n| [GPTeacher](https://github.com/teknium1/GPTeacher) | 29013 | EN | MT | SI | General, role-play, and tool-use instructions | GPT-4 \u0026 toolformer | 
[Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |\n| [prosocial dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | 165681 | EN | TS | MIX | Dialogue | Questions rewritten by GPT-3, responses written by humans | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |\n| [finance\\_en](https://huggingface.co/datasets/gbharti/finance-alpaca) | 68912 | EN | TS | COL | Finance-domain QA | GPT3.5 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |\n| [instruct](https://huggingface.co/datasets/swype/instruct) | 888969 | EN | MT | COL | Augmentation of GPT4All, Alpaca, and open-source datasets | NLP augmentation tools provided by AllenAI | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |\n| [Code Alpaca](https://github.com/sahil280114/codealpaca) | 20022 | EN | SI | SI | Code generation, editing, optimization | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |\n| [webGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | 18994 | EN | TS | MIX | Information-retrieval QA | fine-tuned GPT-3 + human evaluation | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/webGPT) |\n| [dolly 2.0](https://github.com/databrickslabs/dolly) | 15015 | EN | TS | HG | Seven task types: open QA, closed QA, information extraction, summarization, brainstorming, classification, and creative writing | Human-annotated | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/dolly) |\n| [baize](https://github.com/project-baize/baize-chatbot) | 653699 | EN | MT | COL | Alpaca and various QA tasks | Collection of human-annotated datasets | 
[Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |\n| [hh-rlhf](https://github.com/anthropics/hh-rlhf) | 284517 | EN | TS | MIX | Dialogue | RLHF models | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |\n| [OIG(part)](https://laion.ai/blog/oig-dataset/) | 49237 | EN | MT | COL | Various NLP tasks | Collection of human-annotated datasets plus data augmentation | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/OIG) |\n| [camel](https://github.com/lightaime/camel) | 760620 | EN | MT | SI | Role-play dialogues in physics, biology, chemistry, programming, math, society, and other domains | Generated by gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |\n| [FLAN-Muffin](https://huggingface.co/datasets/Muennighoff/flan) | 1764800 | EN | MT | COL | 60 NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/FLAN-Muffin) |\n| [GPT4Tools](https://github.com/StevenGrove/GPT4Tools) | 71446 | EN | MT | SI | A collection of tool-related instructions | gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |\n| [ShareChat](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) | 1663241 | EN | MT | MIX | General instructions | Collected from ShareGPT | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ShareGPT) |\n| [Auto CoT](https://github.com/amazon-science/auto-cot) | | EN | | | | | 
[Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |\n| [ultrachat](https://github.com/thunlp/UltraChat) | 28247446 | EN | | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ultrachat) |\n| [StackLLaMA](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) | todo | EN | | | | | |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaoswork%2Fsft_datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchaoswork%2Fsft_datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaoswork%2Fsft_datasets/lists"}