{"id":24757999,"url":"https://github.com/upbit/awesome-datasets-for-llms","last_synced_at":"2026-01-04T20:44:05.120Z","repository":{"id":271163124,"uuid":"623968691","full_name":"upbit/awesome-datasets-for-LLMs","owner":"upbit","description":"Awesome training/finetuning datasets for LLMs","archived":false,"fork":false,"pushed_at":"2023-04-12T11:10:20.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-15T08:01:59.875Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/upbit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-05T13:20:08.000Z","updated_at":"2023-04-05T13:20:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"e1571c54-6481-46f1-8cb9-9219a0fadbb4","html_url":"https://github.com/upbit/awesome-datasets-for-LLMs","commit_stats":null,"previous_names":["upbit/awesome-datasets-for-llms"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/upbit%2Fawesome-datasets-for-LLMs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/upbit%2Fawesome-datasets-for-LLMs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/upbit%2Fawesome-datasets-for-LLMs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/upbit%2Fawesome-datasets-for-LLMs/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/upbit","download_url":"https://codeload.github.com/upbit/awesome-datasets-for-LLMs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245078182,"owners_count":20557281,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-28T15:38:32.286Z","updated_at":"2026-01-04T20:44:05.087Z","avatar_url":"https://github.com/upbit.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Awesome datasets for LLMs\n\nAwesome training/finetuning datasets for LLMs\n\n## fine-tuning\n\n* [`alpaca_data.json`](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json): 52K instruction data generated by `text-davinci-003` (from [tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca))\n\n* [Anthropic HH RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf): Human preference data about helpfulness and harmlessness / Human-generated red teaming data\n\n* [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP): 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice\n\n* BELLE/[1.0M](https://huggingface.co/datasets/BelleGroup/train_1M_CN), [0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN): 1.5M Chinese dataset generated by ChatGPT, containing subsets that cover several instruction types and domains (from [LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE))\n\n* BELLE/[`School Math`](https://huggingface.co/datasets/BelleGroup/school_math_0.25M), [`Multiturn 
Chat`](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M): 10M dataset open-sourced by the BELLE project; the math problem and multi-turn dialogue subsets have been released so far (from [LianjiaTech/BELLE](https://github.com/LianjiaTech/BELLE))\n\n* [`Alpaca-CoT datasets`](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT): Fine-tuning based on the Stanford Alpaca project. The datasets include CoT, dialog, FastChat, instinwild (from [PhoebusSi/Alpaca-CoT](https://github.com/PhoebusSi/Alpaca-CoT/blob/main/CN_README.md#%E6%95%B0%E6%8D%AE%E7%BB%9F%E8%AE%A1))\n\n* [ShareGPT Vicuna unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset): 48k ShareGPT conversations used by Vicuna (from [lm-sys/FastChat, issue #90](https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773))\n\n* [`instinwild_en.json`](https://drive.google.com/file/d/1qOfrl0RIWgH2_b1rYCEVxjHF3u3Dwqay/view?usp=sharing), [`instinwild_ch.json`](https://drive.google.com/file/d/1OqfOUWYfrK6riE9erOx-Izp3nItfqz_K/view?usp=sharing): A larger set of instructions generated from 479 seed instructions (from [XueFuzhao/InstructionWild](https://github.com/XueFuzhao/InstructionWild))\n\n* [lvwerra/stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired): a processed version of the HuggingFaceH4/stack-exchange-preferences (from [trl-lib/llama-7b-se-rl-peft](https://huggingface.co/trl-lib/llama-7b-se-rl-peft))\n\n## spec\n\n* [HealthCareMagic-200k](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view?usp=sharing): 200k real conversations between patients and doctors from HealthCareMagic.com (from [Kent0n-Li/ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor))\n\n* [icliniq-15k](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view?usp=sharing): 15k real conversations between patients and doctors from icliniq.com (from [Kent0n-Li/ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor))\n\n* [`ADGEN 
dataset`](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view)/[`Tsinghua mirror`](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1): generate advertising copy (summary) from the input (content) (from [THUDM/ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning#%E4%B8%8B%E8%BD%BD%E6%95%B0%E6%8D%AE%E9%9B%86))\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fupbit%2Fawesome-datasets-for-llms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fupbit%2Fawesome-datasets-for-llms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fupbit%2Fawesome-datasets-for-llms/lists"}