{"id":13541312,"url":"https://github.com/Zjh-819/LLMDataHub","last_synced_at":"2025-04-02T08:31:02.616Z","repository":{"id":160787203,"uuid":"625793423","full_name":"Zjh-819/LLMDataHub","owner":"Zjh-819","description":"A quick guide (especially) for trending instruction finetuning datasets ","archived":false,"fork":false,"pushed_at":"2023-11-28T09:41:28.000Z","size":5147,"stargazers_count":2576,"open_issues_count":3,"forks_count":167,"subscribers_count":48,"default_branch":"main","last_synced_at":"2024-11-03T06:33:11.620Z","etag":null,"topics":["chatbot","chatgpt","dataset","llm"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Zjh-819.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-04-10T05:38:52.000Z","updated_at":"2024-11-02T18:17:01.000Z","dependencies_parsed_at":"2023-11-28T10:46:49.876Z","dependency_job_id":null,"html_url":"https://github.com/Zjh-819/LLMDataHub","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zjh-819%2FLLMDataHub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zjh-819%2FLLMDataHub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zjh-819%2FLLMDataHub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zjh-819%2FLLMDataHub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Zjh-819","download_url":"https://codeload.github.com/Zjh-819/LLMDataHub/tar.gz/refs/he
ads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246781968,"owners_count":20832944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chatgpt","dataset","llm"],"created_at":"2024-08-01T10:00:43.693Z","updated_at":"2025-04-02T08:31:02.313Z","avatar_url":"https://github.com/Zjh-819.png","language":null,"funding_links":[],"categories":["Survey","NLP","A01_文本生成_文本对话","Other Papers","Others","大型语言模型（LLM）排行榜","Documentation and examples"],"sub_categories":["大语言对话模型及数据","LLM 数据","Documentation, lists, guides, or examples"],"readme":"\u003cp align=\"center\" width=\"60%\"\u003e\n\u003cimg src=\"LOGO.png\"  width=\"40%\" height=\"40%\"\u003e\n\u003c/p\u003e\n  \n# \u003cdiv align=\"center\"\u003eLLMDataHub: Awesome Datasets for LLM Training \u003c/div\u003e\n----------------------------------\n\u003cp align=\"center\"\u003e\n  🔥 \u003ca href=\"#general_aligment\" target=\"_blank\"\u003eAlignment Datasets\u003c/a\u003e • 💡 \u003ca href=\"#domain-specific\" target=\"_blank\"\u003eDomain-specific Datasets\u003c/a\u003e • :atom: \u003ca href=\"#pretrain\" target=\"_blank\"\u003ePretraining Datasets\u003c/a\u003e 🖼️ \u003ca href=\"#multimodal\" target=\"_blank\"\u003eMultimodal Datasets\u003c/a\u003e \u003cbr\u003e \n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"GitHub last commit\" src=\"https://img.shields.io/github/last-commit/Zjh-819/LLMDataHub\"\u003e \u003cimg alt=\"GitHub Repo stars\" src=\"https://img.shields.io/github/stars/Zjh-819/LLMDataHub\"\u003e\n \u003c/p\u003e\n\n## Introduction 📄\nLarge language 
models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. Recently, with the emergence of open-source large model frameworks like LlaMa and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs by small organizations or individuals has become an important interest in the open-source community, with some notable works including Alpaca, Vicuna, and Luotuo. In addition to large model frameworks, large-scale and high-quality training corpora are also essential for training large language models. Currently, relevant open-source corpora in the community are still scattered. Therefore, the goal of this repository is to continuously collect high-quality training corpora for LLMs in the open-source community.\n\n\n\nTraining a chatbot LLM that can follow human instruction effectively requires access to high-quality datasets that cover a range of conversation domains and styles. In this repository, we provide a curated collection of datasets specifically designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.\n\n### Contact 📬 \u003cbr/\u003e \nIf you want to contribute, you can contact: \n\n  [Junhao Zhao](zhaol9555@gmail.com) 📧 \u003cbr/\u003e\n  Advised by [Prof. 
Wanyun Cui](https://cuiwanyun.github.io/) [![](https://img.shields.io/badge/GitHub.io-@cuiwanyun-green.svg)](https://cuiwanyun.github.io/)\n\n## \u003cdiv id=\"general_aligment\"\u003eGeneral Open Access Datasets for Alignment 🟢:\u003c/div\u003e\n#### Type Tags 🏷️:\n- SFT: Supervised Finetune\n  - Dialog: Each entry contains continuous conversations \n  - Pairs: Each entry is an input-output pair \n  - Context: Each entry has a context text and related QA pairs\n- PT: Pretraining\n- CoT: Chain-of-Thought Finetune\n- RLHF: Train a reward model in Reinforcement Learning with Human Feedback \n\n\n### Datasets released in November 2023\n| Dataset name                                                         | Used by | Type | Language | Size          | Description ️                                                                                                          |\n|----------------------------------------------------------------------|---------|------|----------|---------------|------------------------------------------------------------------------------------------------------------------------|\n| [helpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)        | /       | RLHF | English  | 37k instances | An RLHF dataset annotated by humans with helpfulness, correctness, coherence, complexity and verbosity measures        |\n| [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) | /       | SFT  | English  | 10k instances | High-quality human-created SFT data, single turn.                                                                      
|\n\n\n### Datasets released in September 2023\n| Dataset name                                                                                                     | Used by | Type       | Language | Size                    | Description ️                                                                                                                                                                                                                                                                                          |\n|------------------------------------------------------------------------------------------------------------------|---------|------------|----------|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [Anthropic_\u003cbr/\u003eHH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden) | ULMA    | SFT / RLHF | English  | train 42.5k + test 2.3k | Improved on the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. Using GPT4 to rewrite the original \"chosen\" answer. Compared with the original Harmless dataset, empirically this dataset improves the performance of RLHF, DPO or ULMA methods significantly on harmless metrics. 
|\n\n\n### Datasets released in August 2023\n| Dataset name                                                                                            | Used by                   | Type                | Language            | Size        | Description ️                                                                                                                                          |\n|---------------------------------------------------------------------------------------------------------|---------------------------|---------------------|---------------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [function_\u003cbr/\u003ecalling_\u003cbr/\u003eextended](https://huggingface.co/datasets/Trelis/function_calling_extended) | /                         | Pairs               | English\u003cbr/\u003ecode    | /           | High-quality human-created dataset to enhance LMs' API-using ability.                                                                                  |\n| [AmericanStories](https://huggingface.co/datasets/dell-research-harvard/AmericanStories)                | /                         | PT                  | English             | /           | Vast corpus scanned from the US Library of Congress.                                                                                                   |\n| [dolma](https://huggingface.co/datasets/allenai/dolma)                                                  | OLMo                      | PT                  | /                   | 3T tokens   | A large diverse open-source corpus for LM pretraining.                                                                                                 
|\n| [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)                                  | Platypus2                 | Pairs               | English             | 25K         | A very high quality dataset for improving LM's STEM reasoning ability.                                                                                 |\n| [Puffin](https://huggingface.co/datasets/LDJnr/Puffin)                                                  | Redmond-Puffin\u003cbr/\u003eSeries | Dialog              | English             | ~3k entries | A dataset of conversations between real humans and GPT-4, featuring long context (over 1k tokens per conversation) and multi-turn dialogs.             |\n| [tiny series](https://huggingface.co/datasets/nampdn-ai/tiny-codes)                                     | /                         | Pairs               | English             | /           | A series of short and concise codes or texts aimed at improving LM's reasoning ability.                                                                |\n| [LongBench](https://huggingface.co/datasets/THUDM/LongBench)                                            | /                         | Evaluation\u003cbr/\u003eOnly | English\u003cbr/\u003eChinese | 17 tasks    | A benchmark for evaluating LLMs' long-context understanding capability.                                                                                
|\n\n\n\n### Datasets released in July 2023\n| Dataset name                                                                                                | Used by      | Type            | Language     | Size              | Description ️                                                                                                                                                                                         |\n|-------------------------------------------------------------------------------------------------------------|--------------|-----------------|--------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [orca-chat](https://huggingface.co/datasets/shahules786/orca-chat)                                          | /            | Dialog          | English      | 198,463 entries   | An Orca-style dialog dataset aims at improving LM's long context conversational ability.                                                                                                              |\n| [DialogStudio](https://github.com/salesforce/DialogStudio)                                                  | /            | Dialog          | Multilingual | /                 | A collection of diverse datasets aim at building conversational Chatbot.                                                                                                                              |\n| [chatbot_arena\u003cbr/\u003e_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)       | /            | RLHF\u003cbr/\u003eDialog | Multilingual | 33k conversations | Cleaned conversations with pairwise human preferences collected on Chatbot Arena.                                                                                                                     
|\n| [WebGLM-qa](https://huggingface.co/datasets/THUDM/webglm-qa)                                                | WebGLM       | Pairs           | English      | 43.6k entries     | Dataset used by WebGLM, which is a QA system based on LLM and Internet. Each entry in this dataset comprises a question, a response and a reference. The response is grounded in the reference.       |\n| [phi-1](https://huggingface.co/datasets/teleprint-me/phi-1)                                                 | phi-1        | Dialog          | English      | /                 | A dataset generated by using the method in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644). It focuses on math and CS problems.                                                        |\n| [Linly-\u003cbr/\u003epretraining-\u003cbr/\u003edataset](https://huggingface.co/datasets/Linly-AI/Chinese-pretraining-dataset) | Linly series | PT              | Chinese      | 3.4GB             | Chinese pretraining dataset used by Linly series model, comprising ClueCorpusSmall, CSL news-crawl, etc.                                                                                              |\n| [FineGrainedRLHF](https://github.com/allenai/FineGrainedRLHF)                                               | /            | RLHF            | English      | ~5K examples      | A repo aiming to develop a new framework to collect human feedback. The collected data is intended to improve LLMs' factual correctness, topic relevance and other abilities.                         |\n| [dolphin](https://huggingface.co/datasets/ehartford/dolphin)                                                | /            | Pairs           | English      | 4.5M entries      | An attempt to replicate Microsoft's Orca. Based on FLANv2.                                                                                                                                            
|\n| [openchat_\u003cbr/\u003esharegpt4_\u003cbr/\u003edataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset) | OpenChat     | Dialog          | English      | 6k dialogs        | A high quality dataset generated by using GPT-4 to complete refined ShareGPT prompts.                                                                                                                 |\n\n\n### Datasets released in June 2023\n| Dataset name                                                                                                                                                                                                                                                                                  | Used by          | Type         | Language              | Size                          | Description ️                                                                                                                                                                           |\n|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|--------------|-----------------------|-------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)                                                                                                                                                                                                                                | /                | Pairs        | English               | 4.5M completions              | A collection of augmented FLAN data. 
Generated using the Orca paper's method.                                                                                                           |\n| [COIG-PC](https://huggingface.co/datasets/BAAI/COIG-PC) \u003cbr/\u003e [COIG-Lite](https://huggingface.co/datasets/BAAI/COIG-PC-Lite)                                                                                                                                                                  | /                | Pairs        | Chinese               | /                             | Enhanced version of COIG.                                                                                                                                                               |\n| [WizardLM_Orca](https://huggingface.co/datasets/psmathur/WizardLM_Orca)                                                                                                                                                                                                                       | orca_mini series | Pairs        | English               | 55K entries                   | Enhanced WizardLM data. Generated using Orca's method.                                                                                                                                  |\n| arxiv instruct datasets\u003cbr/\u003e [math](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) \u003cbr/\u003e [CS](https://huggingface.co/datasets/ArtifactAI/arxiv-beir-cs-ml-generated-queries) \u003cbr/\u003e [Physics](https://huggingface.co/datasets/ArtifactAI/arxiv-physics-instruct-tune-30k) | /                | Pairs        | English               | 50K/\u003cbr/\u003e50K/\u003cbr/\u003e30K entries | Dataset consists of question-answer pairs derived from ArXiv abstracts. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. 
|\n| [im-feeling-\u003cbr/\u003ecurious](https://huggingface.co/datasets/xiyuez/im-feeling-curious)                                                                                                                                                                                                          | /                | Pairs        | English               | 2595 entries                  | Random questions and correspond facts generated by Google **I'm feeling curious** features.                                                                                             |\n| [ign_clean\u003cbr/\u003e_instruct\u003cbr/\u003e_dataset_500k](https://huggingface.co/ignmilton)                                                                                                                                                                                                                 | /                | Pairs        | /                     | 509K entries                  | A large scale SFT dataset which is synthetically created from a subset of Ultrachat prompts. ⚠ lack of detailed datacard                                                                |\n| [WizardLM\u003cbr/\u003eevolve_instruct V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)                                                                                                                                                                                    | WizardLM         | Dialog       | English               | 196k entries                  | The latest version of Evolve Instruct dataset.                                                                                                                                          
|\n| [Dynosaur](https://github.com/WadeYin9712/Dynosaur)                                                                                                                                                                                                                                           | /                | Pairs        | English               | 800K entries                  | The dataset generated by applying method in [this paper](https://dynosaur-it.github.io/). Highlight is generating high-quality data at low cost.                                        |\n| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B)                                                                                                                                                                                                                        | /                | PT           | Primarily\u003cbr/\u003eEnglish | /                             | A cleaned and deduplicated version of RedPajama                                                                                                                                         |\n| [LIMA dataset](https://huggingface.co/datasets/GAIR/lima)                                                                                                                                                                                                                                     | LIMA             | Pairs        | English               | 1k entries                    | High quality SFT dataset used by [LIMA: Less Is More for Alignment](https://arxiv.org/pdf/2305.11206.pdf)                                                                               |\n| [TigerBot Series](https://github.com/TigerResearch/TigerBot#%E5%BC%80%E6%BA%90%E6%95%B0%E6%8D%AE%E9%9B%86)                                                                                                                                                                                  
  | TigerBot         | PT\u003cbr/\u003ePairs | Chinese\u003cbr/\u003eEnglish   | /                             | Datasets used to train the TigerBot, including pretraining data, SFT data and some domain-specific datasets like financial research reports.                                            |\n| [TSI-v0](https://huggingface.co/datasets/tasksource/tasksource-instruct-v0)                                                                                                                                                                                                                   | /                | Pairs        | English               | 30k examples\u003cbr/\u003eper task     | A multi-task instruction-tuning dataset recast from 475 of the tasksource datasets. Similar to Flan dataset and Natural instruction.                                                    |\n| [MNBVC](https://github.com/esbatmop/MNBVC)                                                                                                                                                                                                                                                    | /                | PT           | Chinese               | /                             | A large-scale, continuously updated Chinese pretraining dataset.                                                                                                                        |\n| [StackOverflow\u003cbr/\u003epost](https://huggingface.co/datasets/mikex86/stackoverflow-posts)                                                                                                                                                                                                         | /                | PT           | /                     | 35GB                          | Raw StackOverflow data in markdown format, for pretraining.                                                                                                          
                   |\n\n\n### Datasets released before June 2023\n| Dataset name                                                                                                                                                                                                                                                                       | Used by                                             | Type                         | Language                                             | Size                                                                                    | Description ️                                                                                                                                                                                          |\n|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|------------------------------|------------------------------------------------------|-----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [LaMini-Instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction)                                                                                                                                                                                                    | /                                                   | Pairs                        | English                                              | 2.8M entries                                                                            | A 
dataset distilled from flan collection, p3 and self-instruction.                                                                                                                                     |\n| [ultraChat](https://huggingface.co/datasets/stingning/ultrachat)                                                                                                                                                                                                                   | /                                                   | Dialog                       | English                                              | 1.57M dialogs                                                                           | A large-scale dialog dataset created using two ChatGPT instances, one acting as the user and the other generating responses.                                                                           |\n| [ShareGPT_\u003cbr/\u003eVicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)                                                                                                                                                                       | Vicuna                                              | Pairs                        | Multilingual                                         | 53K entries                                                                             | Cleaned ShareGPT dataset.                                                                                                                                                                              
|\n| [pku-saferlhf-dataset](https://github.com/PKU-Alignment/safe-rlhf#pku-saferlhf-dataset)                                                                                                                                                                                            | Beaver                                              | RLHF                         | English                                              | 10K + 1M                                                                                | The first dataset of its kind and contains 10k instances with safety preferences.                                                                                                                      |\n| RefGPT-Dataset\u003cbr/\u003e[nonofficial link](https://github.com/sufengniu/RefGPT)                                                                                                                                                                                                         | RefGPT                                              | Pairs, Dialog                | Chinese                                              | ~50K entries                                                                            | A Chinese dialog dataset aims at improve the correctness of fact in LLMs (mitigate the hallucination of LLM).                                                                                          |\n| [Luotuo-QA-A\u003cbr/\u003eCoQA-Chinese](https://huggingface.co/datasets/silk-road/Luotuo-QA-A-CoQA-Chinese)                                                                                                                                                                                 | Luotuo project                                      | Context                      | Chinese                                              | 127K QA pairs                                                                           | A dataset built upon translated CoQA. 
Augmented by using OpenAI API.                                                                                                                                   |\n| [Wizard-LM-Chinese\u003cbr/\u003einstruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol)                                                                                                                                                                   | Luotuo project                                      | Pairs                        | Chinese                                              | ~70K entries                                                                            | Chinese version WizardLM 70K. Answers are obtained by feed translated questions in OpenAI's GPT API and then get responses.                                                                            |\n| [alpaca_chinese\u003cbr/\u003edataset](https://github.com/hikariming/alpaca_chinese_dataset)                                                                                                                                                                                                 | /                                                   | Pairs                        | Chinese                                              | /                                                                                       | GPT-4 translated alpaca data includes some complement data (like Chinese poetry, application, etc.). Inspected by human.                                                                               
|\n| [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)                                                                                                                                                                                                                    | Open Assistant                                      | Pairs                        | Chinese                                              | 1.5GB                                                                                   | QA data from Zhihu, a well-known Chinese QA platform.                                                                                                                                                  |\n| [Alpaca-GPT-4_zh-cn](https://huggingface.co/datasets/shibing624/alpaca-zh)                                                                                                                                                                                                         | /                                                   | Pairs                        | Chinese                                              | about 50K entries                                                                       | A Chinese Alpaca-style dataset, generated by GPT-4 originally in Chinese, not translated.                                                                                                              
|\n| [hh-rlhf](https://github.com/anthropics/hh-rlhf) \u003cbr/\u003e [on Huggingface](https://huggingface.co/datasets/Anthropic/hh-rlhf)                                                                                                                                                         | Koala                                               | RLHF                         | English                                              | 161k pairs\u003cbr/\u003e79.3MB                                                                   | A pairwise dataset for training reward models in reinforcement learning for improving language models' harmlessness and helpfulness.                                                                   |\n| [Panther-dataset_v1](https://huggingface.co/datasets/Rardilit/Panther-dataset_v1)                                                                                                                                                                                                  | Panther                                             | Pairs                        | English                                              | 377 entries                                                                             | A dataset derived from hh-rlhf. It rewrites hh-rlhf into the form of input-output pairs.                                                                                                               
|\n| [Baize Dataset](https://github.com/project-baize/baize-chatbot/tree/main/data)                                                                                                                                                                                                     | Baize                                               | Dialog                       | English                                              | 100K dialogs                                                                            | A dialog dataset generated by GPT-4 using self-talking. Questions and topics are collected from Quora, StackOverflow and some medical knowledge sources.                                               |\n| [h2ogpt-fortune2000\u003cbr/\u003epersonalized](https://huggingface.co/datasets/h2oai/h2ogpt-fortune2000-personalized)                                                                                                                                                                       | h2ogpt                                              | Pairs                        | English                                              | 11363 entries                                                                           | An instruction finetuning dataset developed by h2oai, covering various topics.                                                                                                                         
|\n| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)                                                                                                                                                                                                                             | StableVicuna,\u003cbr/\u003echat-opt,\u003cbr/\u003eSteamSHP            | RLHF                         | English                                              | 385K entries                                                                            | An RLHF dataset that, unlike the previously mentioned ones, uses scores and timestamps to infer users' preferences. Covers 18 domains; collected by Stanford.                                          |\n| [ELI5](https://huggingface.co/datasets/eli5#source-data)                                                                                                                                                                                                                           | MiniLM series                                       | FT,\u003cbr/\u003eRLHF                 | English                                              | 270K entries                                                                            | Questions and answers collected from Reddit, including scores. Might be used for RLHF reward model training.                                                                                           
|\n| [WizardLM\u003cbr/\u003eevol_instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) \u003cbr/\u003e [V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)                                                                                                      | WizardLM                                            | Pairs                        | English                                              |                                                                                         | An instruction finetune dataset derived from Alpaca-52K, using the **evolution** method in [this paper](https://arxiv.org/pdf/2304.12244.pdf)                                                          |\n| [MOSS SFT data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data)                                                                                                                                                                                                              | MOSS                                                | Pairs,\u003cbr/\u003eDialog            | Chinese, English                                     | 1.1M entries                                                                            | A conversational dataset collected and developed by the MOSS team. It has usefulness, loyalty and harmlessness labels for every data entry.                                                            
|\n| [ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)                                                                                                                                                                                                                 | Koala, Stable LLM                                   | Pairs                        | Multilingual                                         | 52K                                                                                     | This dataset comprises conversations collected from ShareGPT, with a specific focus on customized creative conversation.                                                                               |\n| [GPT-4all Dataset](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations)                                                                                                                                                                                          | GPT-4all                                            | Pairs                        | English, \u003cbr/\u003e Might have \u003cbr/\u003e a translated version | 400k entries                                                                            | A combination of some subsets of OIG, P3 and Stackoverflow. Covers topics like general QA, customized creative questions.                                                                              |\n| [COIG](https://huggingface.co/datasets/BAAI/COIG)                                                                                                                                                                                                                                  | /                                                   | Pairs                        | Chinese,\u003cbr/\u003ecode                                    | 200K entries                                                                            | A Chinese-based dataset. 
It contains domains like general purpose QA, Chinese exams, code. Its quality is checked by human annotators.                                                                 |\n| [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)                                                                                                                                                                                            | RedPajama                                           | PT                           | Primarily English                                    | 1.2T tokens \u003cbr/\u003e 5TB                                                                   | A fully open pretraining dataset follows the LLaMA's method.                                                                                                                                           |\n| [OASST1](https://huggingface.co/datasets/OpenAssistant/oasst1)                                                                                                                                                                                                                     | OpenAssistant                                       | Pairs,\u003cbr/\u003e Dialog           | Multilingual\u003cbr/\u003e(English, Spanish, etc.)            | 66,497 conversation trees                                                               | A large, human-written, human-annotated high quality conversation dataset. It aims at making LLM generates more natural response.                                                                      
|\n| [Alpaca-COT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)                                                                                                                                                                                                                  | Phoenix                                             | Pairs,\u003cbr/\u003e Dialog,\u003cbr/\u003e CoT | English                                              | /                                                                                       | A mixture of many datasets like the classic Alpaca dataset, OIG, Guanaco and some CoT (Chain-of-Thought) datasets like FLAN-CoT. May be handy to use.                                                  |\n| [Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)                                                                                                                                                                                                                    | /                                                   | Pairs                        | Multilingual\u003cbr/\u003e (52 languages)                     | 67K entries per language                                                                | A multilingual version of **Alpaca** and **Dolly-15K**.                                                                                                                                                
|\n| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) \u003cbr/\u003e [zh-cn Ver](https://huggingface.co/datasets/jaja7744/dolly-15k-cn)                                                                                                                   | Dolly2.0                                            | Pairs                        | English                                              | 15K+ entries                                                                            | A dataset of **human-written** prompts and responses, featuring tasks such as open-domain question-answering, brainstorming, summarization, and more.                                                  |\n| [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned)                                                                                                                                                                                                                 | Some Alpaca/ LLaMA-like models                      | Pairs                        | English                                              | /                                                                                       | Cleaned version of Alpaca, GPT_LLM and GPTeacher.                                                                                                                                                      
|\n| [GPT-4-LLM Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)                                                                                                                                                                                                    | Some Alpaca-like models                             | Pairs,\u003cbr/\u003e RLHF             | English,\u003cbr/\u003e Chinese                                | 52K entries for English and Chinese respectively \u003cbr/\u003e 9K entries unnatural-instruction | NOT the dataset used by GPT-4!! It is generated by GPT-4 and some other LLM for better Pairs and RLHF. It includes instruction data as well as comparison data in RLHF style.                          |\n| [GPTeacher](https://github.com/teknium1/GPTeacher)                                                                                                                                                                                                                                 | /                                                   | Pairs                        | English                                              | 20k entries                                                                             | A dataset contains targets generated by GPT-4 and includes many of the same seed tasks as the Alpaca dataset, with the addition of some new tasks such as roleplay.                                    
|\n| [HC3](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection)                                                                                                                                                                                                              | Koala                                               | RLHF                         | English,\u003cbr/\u003e Chinese                                | 24322 English \u003cbr/\u003e 12853 Chinese                                                       | A multi-domain, human-vs-ChatGPT comparison dataset. Can be used for reward model training or ChatGPT detector training.                                                                               |\n| [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca#data-release) \u003cbr/\u003e [Download](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)                                                                                                                 | Alpaca, ChatGLM-finetune-LoRA, Koala                | Dialog,\u003cbr/\u003e Pairs           | English                                              | 52K entries\u003cbr/\u003e21.4MB                                                                  | A dataset generated by text-davinci-003 to improve language models' ability to follow human instruction.                                                                                               
|\n| [OIG](https://huggingface.co/datasets/laion/OIG) \u003cbr/\u003e [OIG-small-chip2](https://huggingface.co/datasets/0-hero/OIG-small-chip2)                                                                                                                                                   | Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala | Dialog,\u003cbr/\u003e Pairs           | English,\u003cbr/\u003e code                                   | 44M entries                                                                             | A large conversational instruction dataset with medium and high quality subsets *(OIG-small-chip2)* for multi-task learning.                                                                           |\n| [ChatAlpaca data](https://github.com/cascip/ChatAlpaca)                                                                                                                                                                                                                            | /                                                   | Dialog,\u003cbr/\u003e Pairs           | English,\u003cbr/\u003e Chinese version coming soon            | 10k entries\u003cbr/\u003e39.5MB                                                                  | A dataset aims to help researchers develop models for instruction-following in multi-turn conversations.                                                                                               
|\n| [InstructionWild](https://github.com/XueFuzhao/InstructionWild)                                                                                                                                                                                                                    | ColossalChat                                        | Pairs                        | English, Chinese                                     | 10K entries                                                                             | An Alpaca-style dataset, but with seed tasks that come from ChatGPT screenshots.                                                                                                                       |\n| [Firefly(流萤)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)                                                                                                                                                                                                         | Firefly(流萤)                                         | Pairs                        | Chinese                                              | 1.1M entries\u003cbr/\u003e1.17GB                                                                 | A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks, but no conversation.                                                                                     
|\n| [BELLE](https://github.com/LianjiaTech/BELLE) \u003cbr/\u003e [0.5M version](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) \u003cbr/\u003e [1M version](https://huggingface.co/datasets/BelleGroup/train_1M_CN) \u003cbr/\u003e [2M version](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | BELLE series, Chunhua (春华)                          | Pairs                        | Chinese                                              | 2.67B in total                                                                          | A Chinese instruction dataset similar to *Alpaca data* constructed by generating answers from seed tasks, but no conversation.                                                                         |\n| [GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset#guanacodataset)                                                                                                                                                                                     | Guanaco                                             | Dialog,\u003cbr/\u003e Pairs           | English,\u003cbr/\u003e Chinese,\u003cbr/\u003e Japanese                 | 534,530 entries                                                                         | A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition.                   
|\n| [OpenAI WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons)                                                                                                                                                                                                         | WebGPT's reward model, Koala                        | RLHF                         | English                                              | 19,578 pairs                                                                            | Data set used in WebGPT paper. Used for training reward model in RLHF.                                                                                                                                 |\n| [OpenAI\u003cbr/\u003eSummarization\u003cbr/\u003eComparison](https://huggingface.co/datasets/openai/summarize_from_feedback)                                                                                                                                                                          | Koala                                               | RLHF                         | English                                              | ~93K entries\u003cbr/\u003e420MB                                                                  | A dataset of human feedback which helps training a reward model. The reward model was then used to train a summarization model to align with human preferences.                                        
|\n| [self-instruct](https://github.com/yizhongw/self-instruct)                                                                                                                                                                                                                         | /                                                   | Pairs                        | English                                              | 82K entries                                                                             | The dataset generated by using the well-known [self-instruction method](https://arxiv.org/abs/2212.10560)                                                                                              |\n| [unnatural-instructions](https://github.com/orhonovich/unnatural-instructions)                                                                                                                                                                                                     | /                                                   | Pairs                        | English                                              | 240,670 examples                                                                        | An early attempt to use powerful model (text-davinci-002) to generate data.                                                                                                                            
|\n| [xP3 (and some variant)](https://huggingface.co/datasets/bigscience/xP3)                                                                                                                                                                                                           | BLOOMZ, mT0                                         | Pairs                        | Multilingual,\u003cbr/\u003e code                              | 79M entries\u003cbr/\u003e88GB                                                                    | An instruction dataset for improving language models' generalization ability, similar to *Natural Instruct*.                                                                                           |\n| [Flan V2](https://github.com/google-research/FLAN/tree/main/flan/v2)                                                                                                                                                                                                               | /                                                   | /                            | English                                              | /                                                                                       | A dataset compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one and formats them into a mix of zero-shot, few-shot and chain-of-thought templates |\n| [Natural Instruction](https://instructions.apps.allenai.org/) \u003cbr/\u003e [GitHub\u0026Download](https://github.com/allenai/natural-instructions)                                                                                                                                             | tk-instruct series                                  | Pairs, \u003cbr/\u003e evaluation      | Multilingual                                         | /                                                                                       | A benchmark with 
over 1,600 tasks with instructions and definitions for evaluating and improving language models' multi-task generalization under natural language instruction.                        |\n| [CrossWOZ](https://github.com/thu-coai/CrossWOZ)                                                                                                                                                                                                                                   | /                                                   | Dialog                       | English,\u003cbr/\u003eChinese                                 | 6K dialogs                                                                              | The dataset introduced by [this paper](https://arxiv.org/pdf/2002.11893.pdf), mainly about tourism in Beijing. Answers are generated automatically by rules.                                           |\n\n\n#### Potential Overlaps ⚠️\n\nRows are the subject: each cell describes how the row dataset relates to the column dataset.\n\n|                   | OIG     | hh-rlhf  | xP3     | natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |\n|-------------------|---------|----------|---------|------------------|-------------------|-----------|------------|\n| OIG               | /       | contains | overlap | overlap          | overlap           |           | overlap    |\n| hh-rlhf           | part of | /        |         |                  |                   |           | overlap    |\n| xP3               | overlap |          | /       | overlap          |                   |           | overlap    |\n| natural instruct  | overlap |          | overlap | /                |                   |           | overlap    |\n| AlpacaDataCleaned | overlap |          |         |                  | /                 | overlap   | overlap    |\n| GPT-4-LLM         |         |          |         |                  | overlap           | /         | overlap    |\n| Alpaca-CoT        | overlap | overlap  | overlap | overlap          | 
overlap           | overlap   | /          |\n\n## \u003cdiv id=\"pretrain\"\u003eOpen Datasets for Pretraining 🟢 :atom:\u003c/div\u003e\n| Dataset name                                                                                                                                  | Used by                                                                        | Type                                 | Language                | Size        | Description ️                                                                                                                                     |\n|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|--------------------------------------|-------------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------|\n| [proof-pile](https://huggingface.co/datasets/hoskinson-center/proof-pile)                                                                     | proof-GPT                                                                      | PT                                   | English\u003cbr/\u003eLaTeX       | 13GB        | A pretraining dataset which is similar to the pile but have LaTeX corpus to enhance LM's ability in proof.                                        |\n| [peS2o](https://huggingface.co/datasets/allenai/peS2o)                                                                                        | /                                                                              | PT                                   | English                 | 7.5GB       | A high quality academic paper dataset for pretraining.                                                                                            
|\n| [StackOverflow\u003cbr/\u003epost](https://huggingface.co/datasets/mikex86/stackoverflow-posts)                                                         | /                                                                              | PT                                   | /                       | 35GB        | Raw StackOverflow data in markdown format, for pretraining.                                                                                       |\n| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B)                                                                        | /                                                                              | PT                                   | Primarily\u003cbr/\u003eEnglish   | /           | A cleaned and deduplicated version of RedPajama                                                                                                   |\n| [MNBVC](https://github.com/esbatmop/MNBVC)                                                                                                    | /                                                                              | PT                                   | Chinese                 | /           | A large-scale, continuously updated Chinese pretraining dataset.                                                                                  |\n| [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)                                                                 | tiiuae/falcon series                                                           | PT                                   | English                 | /           | A refined subset of CommonCrawl.                                                                                                                  
|\n| [CBook-150K](https://github.com/FudanNLPLAB/CBook-150K)                                                                                       | /                                                                              | PT, \u003cbr/\u003e building dataset           | Chinese                 | 150K+ books | A raw Chinese book dataset. Needs a preprocessing pipeline.                                                                                       |\n| [Common Crawl](https://commoncrawl.org/)                                                                                                      | LLaMA (After some process)                                                     | building datasets, \u003cbr/\u003e PT          | /                       | /           | The most well-known raw dataset, rarely used directly. One possible preprocessing pipeline is [CCNet](https://github.com/facebookresearch/cc_net) |\n| [nlp_Chinese_Corpus](https://github.com/brightmart/nlp_chinese_corpus)                                                                        | /                                                                              | PT,\u003cbr/\u003eTF                           | Chinese                 | /           | A Chinese pretraining corpus. Includes Wikipedia, Baidu Baike, Baidu QA, some forum QA and news corpora.                                          |\n| [The Pile (V1)](https://pile.eleuther.ai/)                                                                                                    | GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b | PT                                   | Multilingual,\u003cbr/\u003e code | 825GB       | A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets that includes many domains and tasks.             
|\n| C4 \u003cbr/\u003e [Huggingface dataset](https://huggingface.co/datasets/c4) \u003cbr/\u003e [TensorFlow dataset](https://www.tensorflow.org/datasets/catalog/c4) | Google T5 Series, LLaMA                                                        | PT                                   | English                 | 305GB       | A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently used.                                                                  |\n| [ROOTS](https://huggingface.co/bigscience-data)                                                                                               | BLOOM                                                                          | PT                                   | Multilingual,\u003cbr/\u003e code | 1.6TB       | A diverse open-source dataset consisting of sub-datasets like Wikipedia and StackExchange for language modeling.                                  |\n| [Pushshift reddit](https://files.pushshift.io/reddit/) \u003cbr/\u003e [paper](https://arxiv.org/pdf/2001.08435.pdf)                                    | OPT-175b                                                                       | PT                                   | /                       | /           | Raw reddit data, one possible processing pipeline in [this paper](https://aclanthology.org/2021.eacl-main.24.pdf)                                 |\n| [Gutenberg project](https://www.gutenberg.org/policy/robot_access.html)                                                                       | LLaMA                                                                          | PT                                   | Multilingual            | /           | A book dataset, mostly novels. Not preprocessed.                                                                                                  
|\n| [CLUECorpus](https://github.com/CLUEbenchmark/CLUE)                                                                                           | /                                                                              | PT, \u003cbr/\u003e finetune, \u003cbr/\u003e evaluation | Chinese                 | 100GB       | A Chinese pretraining Corpus sourced from *Common Crawl*.                                                                                         |\n\n\n## \u003cdiv id=\"domain-specific\"\u003eDomain-specific Datasets 🟢 💡\u003c/div\u003e\n\n| Dataset name                                                                                                         | Used by                                                   | Type             | Language              | Size              | Description ️                                                                                                                                                         |\n|----------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|------------------|-----------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)                                               | starcoder\u003cbr/\u003eseries                                      | PT               | code                  | 783GB             | A large pretraining dataset for improving LM's coding ability.                                                                                                        
|\n| [code_\u003cbr/\u003einstructions\u003cbr/\u003e_120k_alpaca](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca)    | /                                                         | Pairs            | English/code          | 121,959 entries   | [code_instruction](https://huggingface.co/datasets/sahil2801/code_instructions_120k) in instruction finetune format.                                                  |\n| [function-\u003cbr/\u003einvocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k)    | some MPT \u003cbr/\u003e variants                                   | Pairs            | English code          | 25K entries       | A dataset aimed at teaching AI models how to correctly invoke [APIsGuru](https://github.com/APIs-guru/openapi-directory) functions based on natural language prompts. |\n| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA)                                                         | /                                                         | Pairs            | English               | 800               | A high-quality STEM theorem QA dataset.                                                                                                                               |\n| [phi-1](https://huggingface.co/datasets/teleprint-me/phi-1)                                                          | phi-1                                                     | Dialog           | English               | /                 | A dataset generated by using the method in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644). It focuses on math and CS problems.                        |\n| [FinNLP](https://github.com/AI4Finance-Foundation/FinNLP)                                                            | [FinGPT](https://github.com/AI4Finance-Foundation/FinGPT) | Raw data         | English,\u003cbr/\u003eChinese  | /                 | Open-source raw financial text data. 
Includes news, social media, etc.                                                                                                |\n| [PRM800K](https://github.com/openai/prm800k)                                                                         | A variant of\u003cbr/\u003eGPT-4                                    | Context          | English               | 800K entries      | A process supervision dataset for mathematical problems                                                                                                               |\n| [MeChat data](https://github.com/qiuhuachuan/smile)  ⚠️use with care                                                 | MeChat                                                    | Dialog           | Chinese               | 355733 utterances | A Chinese SFT dataset for training a mental healthcare chatbot.                                                                                                       |\n| [ChatGPT-Jailbreak-Prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts) \u003cbr/\u003e ⚠️RISKY        | /                                                         | /                | English               | 163KB file size   | Prompts for bypassing the safety regulation of ChatGPT. Can be used for probing the harmlessness of LLMs                                                              |\n| [awesome chinese\u003cbr/\u003elegal resources](https://github.com/pengxiao-song/awesome-chinese-legal-resources)              | LaWGPT                                                    | /                | Chinese               | /                 | A collection of Chinese legal data for LLM training.                                                                                                                  
|\n| [Long Form](https://github.com/akoksal/LongForm)                                                                     | /                                                         | Pairs            | English               | 23.7K entries     | A dataset aimed at improving the long-text generation ability of LLMs.                                                                                                |\n| [symbolic-instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning)                      | /                                                         | Pairs            | English,\u003cbr/\u003e code    | 796               | A dataset focused on 'symbolic' tasks such as SQL coding, mathematical computation, etc.                                                                              |\n| [Safety Prompt](https://github.com/thu-coai/Safety-Prompts)                                                          | /                                                         | Evaluation only  | Chinese               | 100k entries      | Chinese safety prompts for evaluating and improving the safety of LLMs.                                                                                               
|\n| [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k)                                          | /                                                         | Pairs            | English               | 116k entries      | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning        |\n| [instructional_\u003cbr/\u003ecodesearchnet_python](https://huggingface.co/datasets/Nan-Do/instructional_codesearchnet_python) | /                                                         | Pairs            | English \u0026\u003cbr/\u003e Python | 192MB             | This dataset is a template-generated instructional Python dataset generated from an annotated version of the code-search-net dataset for the Open-Assistant project.  |\n| [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)                                             | /                                                         | Pairs            | English               | 1.3K entries      | An Alpaca-style dataset focused on financial topics                                                                                                                   |\n\n## \u003cdiv id=\"multimodal\"\u003eMultimodal Datasets for VLM \u003c/div\u003e\n| Dataset name                                                                        | Used by            | Type                 | Language     | Size           | Description ️                                                                                               |\n|-------------------------------------------------------------------------------------|--------------------|----------------------|--------------|----------------|-------------------------------------------------------------------------------------------------------------|\n| [ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)                   | /          
        | image-prompt-caption | English      | 1.2M instances | A set of GPT4-Vision-powered multi-modal captions data.                                                     |\n| [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS)                    | idefics\u003cbr/\u003eseries | image-document       | English      | 141M documents | An open, massive, and curated collection of interleaved image-text web documents.                           |\n| [JourneyDB](https://huggingface.co/datasets/JourneyDB/JourneyDB)                    | /                  | image-prompt-caption | English      | 4M instances   | A large-scale dataset comprising QA, caption, and text prompting tasks, based on Midjourney images.         |\n| [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT)                          | Ying-VLM           | instruction-image    | Multilingual | 2.4M instances | A dataset comprising 40 tasks with 400 human-written instructions.                                          |\n| [MIMIC-IT](https://github.com/Luodian/Otter/tree/main/mimic-it)                     | Otter              | instruction-image    | Multilingual | 2.2M instances | High-quality multi-modal instruction-response pairs based on images and videos.                             |\n| [LLaVA Instruction](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | LLaVA              | instruction-image    | English      | 158k samples   | A multimodal dataset generated upon COCO dataset by prompting GPT-4 to get instructions.                    
|\n\n\n## Private Datasets 🔴\n| Dataset name          | Used by            | Type | Language                              | Size  | Description ️                                                                                   |\n|-----------------------|--------------------|------|---------------------------------------|-------|-------------------------------------------------------------------------------------------------|\n| WebText(Reddit links) | GPT-2              | PT   | English                               | /     | Data crawled from Reddit and filtered for GPT-2 pretraining.                                    |\n| MassiveText           | Gopher, Chinchilla | PT   | 99% English, 1% other(including code) |       |                                                                                                 |\n| WuDao(悟道) Corpora     | GLM                | PT   | Chinese                               | 200GB | A large scale Chinese corpus, Possible component originally open-sourced but not available now. |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZjh-819%2FLLMDataHub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FZjh-819%2FLLMDataHub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZjh-819%2FLLMDataHub/lists"}