[![](https://img.shields.io/badge/Language-English-lightgrey)](https://github.com/dsdanielpark/all-about-llm)

# All About LLM
I curate repositories as Git submodules so that, while syncing forks, I can see how active each project is. This repository therefore exists not only for some experiments but mostly for self-checking: it lets me see at a glance where and when commits and pull requests occur frequently. To keep the full list of submodules visible, I intentionally do not organize the repository into folders. You can also view the complete list in the [git submodule file](https://github.com/dsdanielpark/all-about-llm/blob/main/.gitmodules).
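
Since everything is tracked as submodules, plain Git can enumerate the curated list; a minimal sketch using only standard Git commands:

```bash
# List every submodule recorded in .gitmodules (name and URL)
git config --file .gitmodules --get-regexp url

# Show the commit currently checked out for each submodule
git submodule status --recursive
```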

This repository contains only some of the models required for _personal_ research, so please refer to other repositories for detailed information and updates.


- [All About LLM](#all-about-llm)
- [Quick start](#quick-start)
- [Leaderboards](#leaderboards)
- [Open LLM](#open-llm)
- [LLM Model Evaluation](#llm-model-evaluation)
- [Datasets](#datasets)



## Quick start
```bash
$ git clone https://github.com/dsdanielpark/all-about-llm.git
$ cd all-about-llm
$ git submodule update --init --recursive
$ python syncfolk_submodules.py
```
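
Equivalently, the clone and submodule initialization can be combined into a single step (assuming a reasonably recent Git):

```bash
$ git clone --recurse-submodules https://github.com/dsdanielpark/all-about-llm.git
```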

## Leaderboards

| Leaderboard Name | Description |
| --- | --- |
| [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) | An automatic evaluator and leaderboard for instruction-following LLMs. |
| [Chatbot Arena (LMSYS Org)](https://chat.lmsys.org/) | Crowdsourced, randomized head-to-head comparisons of LLMs, with an accompanying leaderboard. |
| [Open LLM Leaderboard (Hugging Face)](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) | Ranks open LLMs on a common set of benchmarks. |
| [The Big Benchmarks Collection](https://huggingface.co/collections/open-llm-leaderboard/the-big-benchmarks-collection-64faca6335a7fc7d4ffe974a) | Gathers benchmark spaces on the Hugging Face Hub beyond the Open LLM Leaderboard. |
| [MTEB Leaderboard](#) | Massive Text Embedding Benchmark (MTEB) Leaderboard. |
| [Chatbot Arena Leaderboard](#) | This leaderboard is based on Chatbot Arena, MT-Bench, and MMLU (5-shot). |
| [LLM-Perf Leaderboard](#) | Benchmarks LLM performance (latency, throughput & memory) across different hardware and optimization settings. |
| [Big Code Models Leaderboard](#) | Compares performance of base multilingual code generation models on benchmarks like HumanEval and MultiPL-E. |
| [Open ASR Leaderboard](#) | Ranks and evaluates speech recognition models, reporting Average WER and RTF. |
| [MT Bench](#) | MT-Bench Browser associated with Chatbot Arena. |
| [Toolbench Leaderboard](#) | - |
| [OpenCompass LLM Leaderboard](#) | - |
| [OpenCompass MMBench Leaderboard](#) | - |
| [Open Ko-LLM Leaderboard](#) | - |


## Open LLM

| LLM | Initial Release | Developer | License |
| --- | --- | --- | --- |
| [GPT-J](#) | 2021-06-09 | EleutherAI | Apache 2.0 |
| [GPT-Neo](#) | 2021-03-21 | EleutherAI, Together | Apache 2.0 |
| [FLAN-T5](#) | 2022-12-06 | Google | Apache 2.0 |
| [BLOOM](#) | 2022-07-06 | Hugging Face | Open RAIL-M v1 |
| [OPT](#) | 2022-05-03 | Meta | NA |
| [Pythia](#) | 2023-02-13 | EleutherAI, Together | Apache 2.0 |
| [LLaMA](#) | 2023-02-24 | Meta | Noncommercial |
| [FLAN-UL2](#) | 2023-03-03 | Google | Apache 2.0 |
| [Alpaca](#) | 2023-03-13 | Stanford | Noncommercial |
| [Cerebras-GPT](#) | 2023-03-28 | Cerebras | Apache 2.0 |
| [Dolly](#) | 2023-03-24 | Databricks | MIT |
| [Vicuna](#) | 2023-03-30 | UC Berkeley, CMU, Stanford, MBZUAI, UCSD | Noncommercial |
| [GPT4All](#) | 2023-03-26 | Nomic AI | Varies |
| [Koala](#) | 2023-04-03 | BAIR | Noncommercial |
| [OpenAssistant](#) | 2023-04-15 | LAION | Varies |
| [StableLM](#) | 2023-04-19 | Stability AI | CC BY-SA 4.0 |
| [OpenLLaMA](#) | 2023-04-28 | OpenLM Research | Apache 2.0 |
| [FastChat](#) | 2023-04-28 | LMSYS | Apache 2.0 |
| [StableVicuna](#) | 2023-04-28 | Stability AI | Noncommercial |
| [BLOOMChat](#) | 2023-05-19 | SambaNova | Apache 2.0 |
| [MPT](https://www.mosaicml.com/blog/mpt-7b) | 2023-05-05 | MosaicML | Apache 2.0 |
| [RedPajama](https://github.com/togethercomputer/RedPajama-Data) | 2023-05-05 | Together | Apache 2.0 |
| [Falcon](https://falconllm.tii.ae/) | 2023-05-23 | TII | Apache 2.0 |
| [Guanaco](https://guanaco-model.github.io/) | 2023-05-23 | UW NLP | Noncommercial |
| [WizardLM](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) | 2023-05-26 | WizardLM | Noncommercial |
| [Orca](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B) | 2023-06-05 | Microsoft | Noncommercial |
| [Llama 2](https://ai.meta.com/llama/) | 2023-07-18 | Meta | Custom (Commercial OK) |
| [Platypus](https://arxiv.org/abs/2308.07317) | 2023-08-14 | - | Noncommercial |
| [Qwen](https://github.com/QwenLM/Qwen) | 2023-08-28 | Alibaba Cloud | Custom (Commercial OK) |
| [Mistral](https://mistral.ai) | 2023-10-10 | Mistral AI | Permissive commercial |
| [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 2023-10-25 | - | MIT |


## LLM Model Evaluation
- [Harness Task Table](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md)
- [Harness Task](https://github.com/EleutherAI/lm-evaluation-harness/tree/master/lm_eval/tasks)

| No. | Task | Description | Year | Few-shot Examples | Random Baseline Accuracy |
| --- | --- | --- | --- | --- | --- |
| 1 | [Jeopardy](https://github.com/aigoopy/llm-jeopardy) | Consists of 2,117 Jeopardy questions from the topics of Literature, American History, World History, Word Origins, and Science, where the model is expected to provide correct answers. | 2022 | 10 | 0% |
| 2 | [MMLU](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu) | Comprises 14,042 multiple-choice questions across 57 categories, with academic-standard test-style questions covering subjects like law, mathematics, ethics, and more. The model must choose between options A, B, C, or D. | 2019 | 10 | 25% |
| 3 | [BIG-bench: wikidata](https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/qa_wikidata/README.md) | Consists of 20,321 questions regarding factual information derived from Wikipedia. The model is expected to complete sentences like "Barack Obama's nationality is..." | 2022 | 10 | ~0% |
| 4 | [ARC easy](https://leaderboard.allenai.org/arc_easy/submissions/get-started) | Comprises 2,376 simple multiple-choice science questions extracted from 3rd to 9th-grade science exams, requiring the model to use basic scientific world knowledge. | 2019 | 10 | 25% |
| 5 | [ARC challenge](https://paperswithcode.com/dataset/arc) | Contains 1,172 challenging multiple-choice science questions extracted from 3rd to 9th-grade science exams, involving scientific world knowledge and some procedural reasoning. | 2019 | 10 | 25% |
| 6 | [BIG-bench misconceptions](https://paperswithcode.com/sota/misconceptions-on-big-bench) | Comprises 219 true/false questions about common misconceptions across various topics, and the model is expected to provide correct answers. | 2022 | 10 | 50% |
| 7 | [BIG-bench: Strategy QA](https://github.com/google/BIG-bench) | Consists of 2,289 yes/no questions related to various common-sense topics, and the model is expected to select the correct answers. | 2022 | 10 | - |
| 8 | [BIG-bench: Strange Stories](https://github.com/google/BIG-bench) | Comprises 174 short stories followed by 2-choice multiple-choice questions regarding characters, their emotions, and common-sense inferences about specific actions. | 2022 | 10 | 50% |
| 9 | [BIG-bench: Novel Concepts](https://github.com/google/BIG-bench) | Contains 32 problems for finding common concepts, and the model is expected to choose the common concept among three given words. | 2022 | 10 | 25% |
| 10 | [COPA](https://paperswithcode.com/sota/question-answering-on-copa) | Involves cause/effect multiple-choice questions where the model receives premises and must select the correct cause/effect among two options. | 2011 | 0 | 50% |
| 11 | [PIQA](https://paperswithcode.com/paper/piqa-reasoning-about-physical-commonsense-in) | Comprises 1,838 2-choice multiple-choice questions about common-sense physics intuition, and the model is expected to select the correct answer. | 2019 | 10 | 50% |
| 12 | [OpenBook QA](https://allenai.org/data/open-book-qa) | Consists of 500 multiple-choice questions about basic physics and scientific intuition for general objects and entities, and the model is expected to select the correct answers. | 2018 | 0 | 25% |
| 13 | [LAMBADA](https://paperswithcode.com/sota/language-modelling-on-lambada) | Contains 5,153 text passages from books where the model reads the first N-1 words of each passage and predicts the last token. | 2016 | 0 | 0% |
| 14 | [HellaSwag](https://paperswithcode.com/dataset/hellaswag) | Consists of 10,042 multiple-choice scenario-based questions where the model must choose the most plausible conclusion among four options. | 2019 | 10 | 25% |
| 15 | [Winograd Schema Challenge](https://paperswithcode.com/dataset/wsc) | Contains 273 scenarios where the model must correctly resolve semantic coreferences in sentences. | 2012 | 0 | 50% |
| 16 | [Winogrande](https://paperswithcode.com/paper/winogrande-an-adversarial-winograd-schema) | Comprises 1,267 scenarios with two starting sentences and a single ending sentence, and the model must select the semantically correct one. | 2012 | 0 | 50% |
| 17 | [BIG bench language identification](https://github.com/google/BIG-bench) | Contains 10,000 multiple-choice questions where the model must recognize sentences written in languages other than English and identify the corresponding language. | 2012 | 10 | 25% |
| 18 | [BIG bench conceptual combinations](https://github.com/google/BIG-bench) | Comprises 103 questions where the model answers multiple-choice questions about the meaning of defined neologisms and sentences using these neologisms. | 2022 | 10 | 25% |
| 19 | [BIG bench conlang translation](https://github.com/google/BIG-bench) | Contains 164 problems where the model provides translations of simple sentences between English and a constructed language. | 2022 | 0 | 0% |
| 20 | [BIG-bench elementary math QA](https://github.com/google/BIG-bench) | Consists of 38,160 multiple-choice arithmetic word problems, and the model is expected to select the correct answer. | 2022 | 10 | 25% |
| 21 | [BIG-bench dyck languages](https://github.com/google/BIG-bench) | Involves 1,000 problems where the model must output the correct tokens required to complete a balanced expression of parentheses and curly braces. | 2022 | 10 | 0% |
| 22 | [BIG-bench algorithms](https://github.com/google/BIG-bench) | Contains 1,320 problems where the model must determine the length of the longest common subsequence of two strings or check the balance of expressions consisting of parentheses and curly braces. | 2022 | 10 | 0% |
| 23 | [BIG-bench logical deduction](https://github.com/google/BIG-bench) | Comprises 1,500 multiple-choice questions requiring the model to select the logically consistent unique proposition among multiple logical constraints describing the relative order of objects. | 2022 | 10 | 25% |
| 24 | [BIG-bench operators](https://github.com/google/BIG-bench) | Contains 210 problems where the model must calculate the result of expressions using mathematical operators, testing the model's ability to apply mathematical concepts. | 2022 | 10 | 0% |
| 25 | [BIG-bench repeat copy logic](https://github.com/google/BIG-bench) | Comprises 32 tasks where the model must repeatedly copy a series of words in a specific order and produce the correct output. | 2022 | 10 | 0% |
| 26 | [Simple arithmetic with spaces](https://github.com/google/BIG-bench) | Contains 1,000 arithmetic problems with three-digit numbers and up to three operations, where the model must calculate the correct result using the right order of operations. | 2023 | 10 | 0% |
| 27 | [Simple arithmetic without spaces](https://github.com/google/BIG-bench) | Comprises 1,000 arithmetic problems with three-digit numbers and up to three operations, where the model must calculate the correct result of expressions with no spaces between numbers and operators. | 2023 | 10 | 0% |
| 28 | [Math QA](https://github.com/google/BIG-bench) | Contains 2,983 multiple-choice math word problems, requiring basic inference, language comprehension, and arithmetic/algebra skills. | 2021 | 10 | 25% |
| 29 | [LogiQA](https://github.com/google/BIG-bench) | Comprises 651 multiple-choice logic word problems based on mathematical and symbolic problems, where the model must make logical conclusions. | 2020 | 10 | 25% |
| 30 | [BIG-bench: Understanding fables](https://github.com/google/BIG-bench) | Consists of 189 short stories followed by 4-choice multiple-choice questions where the model must select the correct moral for the story. | 2022 | 10 | 25% |
| 31 | [Pubmed QA Labeled](https://pubmedqa.github.io/) | Comprises 1,000 hand-labeled medical documents and related questions, where the model must respond with yes/no/maybe. | 2019 | 10 | ~0% |
| 32 | [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | Consists of 10,570 short documents followed by related questions on various topics, and the model is expected to output the exact correct answer. | 2016 | 10 | ~0% |
| 33 | [BoolQ](https://paperswithcode.com/paper/boolq-exploring-the-surprising-difficulty-of) | Contains 3,270 short passages on a diverse range of subjects followed by yes/no questions in multiple-choice format. | 2019 | 10 | ~50% |
| 34 | [HumanEval code generation](https://paperswithcode.com/sota/code-generation-on-humaneval) | Comprises 164 Python programming challenges where the model is presented with the method signature and docstring comment for a Python program and is expected to complete the program. The resulting code's functional correctness is tested on a number of input/output pairs. | 2022 | 0 | 0% |
| 35 | [AI2 Reasoning Challenge (25-shot)](https://allenai.org/data/arc) | Consists of grade-school science questions. | / | 25 | / |
| 36 | [TruthfulQA (0-shot)](https://github.com/sylinrl/TruthfulQA) | A test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minimal 6-shot task, as it is systematically prepended with 6 [examples](https://raw.githubusercontent.com/sylinrl/TruthfulQA/main/data/finetune_truth.jsonl), even when launched with 0 few-shot examples. | / | 0 | / |
| 37 | [AGIEval](https://github.com/ruixiangcui/AGIEval) | AGIEval is a human-centric benchmark designed to assess foundation models on tasks derived from standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. | 2023 | / | / |
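
Many of the tasks above can be run directly with the lm-evaluation-harness linked at the top of this section; a minimal sketch, where the model and task selection are illustrative and the flags assume a recent harness release:

```bash
pip install lm-eval

# Evaluate a Hugging Face model on a few of the tasks listed above
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks arc_easy,hellaswag,lambada_openai \
    --num_fewshot 10 \
    --batch_size 8
```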


## Datasets
- https://github.com/Zjh-819/LLMDataHub
- Curated by Junhao Zhao
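
Most of the Hugging Face-hosted datasets in the table below can be pulled locally the same way; a minimal sketch using the Hugging Face CLI (the dataset id is just one example from the table):

```bash
pip install -U "huggingface_hub[cli]"

# Download a dataset repository from the table to a local folder
huggingface-cli download databricks/databricks-dolly-15k \
    --repo-type dataset \
    --local-dir ./databricks-dolly-15k
```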

| Dataset name | Used by | Type | Language | Size | Description |
| --- | --- | --- | --- | --- | --- |
| [function_calling_extended](https://huggingface.co/datasets/Trelis/function_calling_extended) | / | Pairs | English, code | / | High-quality, human-created dataset for enhancing LMs' API-calling ability. |
| [AmericanStories](https://huggingface.co/datasets/dell-research-harvard/AmericanStories) | / | Pre-trained | English | / | A vast corpus scanned from the US Library of Congress. |
| [dolma](https://huggingface.co/datasets/allenai/dolma) | OLMo | Pre-trained | / | 3T tokens | A large diverse open-source corpus for LM pretraining. |
| [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) | Platypus2 | Pairs | English | 25K | A very high quality dataset for improving LM's STEM reasoning ability. |
| [Puffin](https://huggingface.co/datasets/LDJnr/Puffin) | Redmond-Puffin series | Dialog | English | ~3k entries | Conversations between real humans and GPT-4, featuring long context (over 1k tokens per conversation) and multi-turn dialogue. |
| [tiny series](https://huggingface.co/datasets/nampdn-ai/tiny-codes) | / | Pairs | English | / | A series of short, concise code or text samples aimed at improving LMs' reasoning ability. |
| [LongBench](https://huggingface.co/datasets/THUDM/LongBench) | / | Evaluation only | English, Chinese | 17 tasks | A benchmark for evaluating LLMs' long-context understanding capability. |
| [orca-chat](https://huggingface.co/datasets/shahules786/orca-chat) | / | Dialog | English | 198,463 entries | An Orca-style dialog dataset aimed at improving LMs' long-context conversational ability. |
| [DialogStudio](https://github.com/salesforce/DialogStudio) | / | Dialog | Multilingual | / | A collection of diverse datasets aimed at building conversational chatbots. |
| [chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) | / | RLHF, Dialog | Multilingual | 33k conversations | Cleaned conversations with pairwise human preferences collected on Chatbot Arena. |
| [WebGLM-qa](https://huggingface.co/datasets/THUDM/webglm-qa) | WebGLM | Pairs | English | 43.6k entries | Dataset used by WebGLM, a QA system based on LLMs and the Internet. Each entry comprises a question, a response, and a reference; the response is grounded in the reference. |
| [phi-1](https://huggingface.co/datasets/teleprint-me/phi-1) | phi-1 | Dialog | English | / | A dataset generated by using the method in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644). It focuses on math and CS problems. |
| [Linly-pretraining-dataset](https://huggingface.co/datasets/Linly-AI/Chinese-pretraining-dataset) | Linly series | PT | Chinese | 3.4GB | Chinese pretraining dataset used by the Linly series models; comprises ClueCorpusSmall, CSL, news-crawl, etc. |
| [FineGrainedRLHF](https://github.com/allenai/FineGrainedRLHF) | / | RLHF | English | ~5K examples | A repo aiming to develop a new framework for collecting human feedback; the data is collected to improve LLMs' factual correctness, topic relevance, and other abilities. |
| [dolphin](https://huggingface.co/datasets/ehartford/dolphin) | / | Pairs | English | 4.5M entries | An attempt to replicate Microsoft's Orca. Based on FLANv2. |
| [openchat_sharegpt4_dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset) | OpenChat | Dialog | English | 6k dialogs | A high-quality dataset generated by using GPT-4 to complete refined ShareGPT prompts. |
| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) | / | Pairs | English | 4.5M completions | A collection of augmented FLAN data, generated using the method in the Orca paper. |
| [COIG-PC](https://huggingface.co/datasets/BAAI/COIG-PC) [COIG-Lite](https://huggingface.co/datasets/BAAI/COIG-PC-Lite) | / | Pairs | Chinese | / | Enhanced version of COIG. |
| [WizardLM_Orca](https://huggingface.co/datasets/psmathur/WizardLM_Orca) | orca_mini series | Pairs | English | 55K entries | Enhanced WizardLM data, generated using Orca's method. |
| arXiv instruct datasets: [math](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) [CS](https://huggingface.co/datasets/ArtifactAI/arxiv-beir-cs-ml-generated-queries) [Physics](https://huggingface.co/datasets/ArtifactAI/arxiv-physics-instruct-tune-30k) | / | Pairs | English | 50K/50K/30K entries | Question-answer pairs derived from arXiv abstracts; questions are generated using the t5-base model, while answers are generated using GPT-3.5-turbo. |
| [im-feeling-curious](https://huggingface.co/datasets/xiyuez/im-feeling-curious) | / | Pairs | English | 2595 entries | Random questions and corresponding facts generated by Google's **I'm feeling curious** feature. |
| [ign_clean_instruct_dataset_500k](https://huggingface.co/ignmilton) | / | Pairs | / | 509K entries | A large-scale SFT dataset synthetically created from a subset of Ultrachat prompts. ⚠️ lacks a detailed datacard |
| [WizardLM evol_instruct V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | WizardLM | Dialog | English | 196k entries | The latest version of the Evol-Instruct dataset. |
| [Dynosaur](https://github.com/WadeYin9712/Dynosaur) | / | Pairs | English | 800K entries | Generated by applying the method in [this paper](https://dynosaur-it.github.io/); its highlight is generating high-quality data at low cost. |
| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | / | PT | Primarily English | / | A cleaned and deduplicated version of RedPajama. |
| [LIMA dataset](https://huggingface.co/datasets/GAIR/lima) | LIMA | Pairs | English | 1k entries | High quality SFT dataset used by [LIMA: Less Is More for Alignment](https://arxiv.org/pdf/2305.11206.pdf) |
| [TigerBot Series](https://github.com/TigerResearch/TigerBot#%E5%BC%80%E6%BA%90%E6%95%B0%E6%8D%AE%E9%9B%86) | TigerBot | PT, Pairs | Chinese, English | / | Datasets used to train TigerBot, including pretraining data, SFT data, and some domain-specific datasets such as financial research reports. |
| [TSI-v0](https://huggingface.co/datasets/tasksource/tasksource-instruct-v0) | / | Pairs | English | 30k examples per task | Multi-task instruction-tuning data recast from 475 tasksource datasets; similar to the Flan dataset and Natural Instructions. |
| [MNBVC](https://github.com/esbatmop/MNBVC) | / | PT | Chinese | / | A large-scale, continuously updated Chinese pretraining dataset. |
| [StackOverflow post](https://huggingface.co/datasets/mikex86/stackoverflow-posts) | / | PT | / | 35GB | Raw StackOverflow data in markdown format, for pretraining. |
| [LaMini-Instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) | / | Pairs | English | 2.8M entries | A dataset distilled from the Flan collection, P3, and self-instruct. |
| [ultraChat](https://huggingface.co/datasets/stingning/ultrachat) | / | Dialog | English | 1.57M dialogs | A large-scale dialog dataset created using two ChatGPT instances, one acting as the user and the other generating responses. |
| [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) | Vicuna | Pairs | Multilingual | 53K entries | Cleaned ShareGPT dataset. |
| [pku-saferlhf-dataset](https://github.com/PKU-Alignment/safe-rlhf#pku-saferlhf-dataset) | Beaver | RLHF | English | 10K + 1M | The first dataset of its kind, containing 10k instances with safety preferences. |
| RefGPT-Dataset [unofficial link](https://github.com/sufengniu/RefGPT) | RefGPT | Pairs, Dialog | Chinese | ~50K entries | A Chinese dialog dataset aimed at improving factual correctness in LLMs (mitigating hallucination). |
| [Luotuo-QA-A CoQA-Chinese](https://huggingface.co/datasets/silk-road/Luotuo-QA-A-CoQA-Chinese) | Luotuo project | Context | Chinese | 127K QA pairs | A dataset built upon translated CoQA, augmented using the OpenAI API. |
| [Wizard-LM-Chinese instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol) | Luotuo project | Pairs | Chinese | ~70K entries | Chinese version of WizardLM 70K; answers are obtained by feeding the translated questions to OpenAI's GPT API. |
| [alpaca_chinese dataset](https://github.com/hikariming/alpaca_chinese_dataset) | / | Pairs | Chinese | / | GPT-4-translated Alpaca data plus some complementary data (Chinese poetry, applications, etc.); human-inspected. |
| [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | Open Assistant | Pairs | Chinese | 1.5GB | QA data from Zhihu, a well-known Chinese QA platform. |
| [Alpaca-GPT-4_zh-cn](https://huggingface.co/datasets/shibing624/alpaca-zh) | / | Pairs | Chinese | about 50K entries | A Chinese Alpaca-style dataset, generated by GPT-4 originally in Chinese, not translated. |
| [hh-rlhf](https://github.com/anthropics/hh-rlhf) [on Huggingface](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Koala | RLHF | English | 161k pairs, 79.3MB | A pairwise dataset for training reward models in reinforcement learning, to improve language models' harmlessness and helpfulness. |
| [Panther-dataset_v1](https://huggingface.co/datasets/Rardilit/Panther-dataset_v1) | Panther | Pairs | English | 377 entries | Derived from hh-rlhf, rewriting it into input-output pairs. |
| [Baize Dataset](https://github.com/project-baize/baize-chatbot/tree/main/data) | Baize | Dialog | English | 100K dialogs | A dialog dataset generated by GPT-4 talking to itself; questions and topics are collected from Quora, StackOverflow, and some medical knowledge sources. |
| [h2ogpt-fortune2000 personalized](https://huggingface.co/datasets/h2oai/h2ogpt-fortune2000-personalized) | h2ogpt | Pairs | English | 11363 entries | An instruction-finetuning dataset developed by h2oai, covering various topics. |
| [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | StableVicuna, chat-opt, SteamSHP | RLHF | English | 385K entries | Unlike the RLHF datasets above, this one uses scores plus timestamps to infer users' preferences. Covers 18 domains; collected by Stanford. |
| [ELI5](https://huggingface.co/datasets/eli5#source-data) | MiniLM series | FT, RLHF | English | 270K entries | Questions and answers collected from Reddit, including scores; may be used for RLHF reward-model training. |
| [WizardLM evol_instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) [V2](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | WizardLM | Pairs | English | / | An instruction-finetuning dataset derived from Alpaca-52K, using the **evolution** method in [this paper](https://arxiv.org/pdf/2304.12244.pdf). |
| [MOSS SFT data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data) | MOSS | Pairs, Dialog | Chinese, English | 1.1M entries | A conversational dataset collected and developed by the MOSS team, with usefulness, loyalty, and harmlessness labels for every entry. |
| [ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) | Koala, Stable LLM | Pairs | Multilingual | 52K | This dataset comprises conversations collected from ShareGPT, with a specific focus on customized creative conversation. |
| [GPT-4all Dataset](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) | GPT-4all | Pairs | English (may have a translated version) | 400k entries | A combination of subsets of OIG, P3, and StackOverflow; covers general QA and customized creative questions. |
| [COIG](https://huggingface.co/datasets/BAAI/COIG) | / | Pairs | Chinese, code | 200K entries | A Chinese dataset covering domains such as general-purpose QA, Chinese exams, and code; quality-checked by human annotators. |
| [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) | RedPajama | PT | Primarily English | 1.2T tokens, 5TB | A fully open pretraining dataset following LLaMA's recipe. |
| [OASST1](https://huggingface.co/datasets/OpenAssistant/oasst1) | OpenAssistant | Pairs, Dialog | Multilingual (English, Spanish, etc.) | 66,497 conversation trees | A large, human-written, human-annotated, high-quality conversation dataset aiming to make LLMs generate more natural responses. |
| [Alpaca-COT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) | Phoenix | Pairs, Dialog, CoT | English | / | A mixture of many datasets, such as the classic Alpaca dataset, OIG, Guanaco, and some CoT (Chain-of-Thought) datasets like FLAN-CoT. May be handy to use. |
| [Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) | / | Pairs | Multilingual (52 languages) | 67K entries per language | A multilingual version of **Alpaca** and **Dolly-15K**. |
| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) [zh-cn Ver](https://huggingface.co/datasets/jaja7744/dolly-15k-cn) | Dolly2.0 | Pairs | English | 15K+ entries | A dataset of **human-written** prompts and responses, featuring tasks such as open-domain question answering, brainstorming, summarization, and more. |
| [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) | Some Alpaca/ LLaMA-like models | Pairs | English | / | Cleaned version of Alpaca, GPT_LLM and GPTeacher. |
| [GPT-4-LLM Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | Some Alpaca-like models | Pairs, RLHF | English, Chinese | 52K entries each for English and Chinese, plus 9K unnatural-instruction entries | NOT the dataset used by GPT-4!! Generated by GPT-4 and some other LLMs for better Pairs and RLHF; includes instruction data as well as RLHF-style comparison data. |
| [GPTeacher](https://github.com/teknium1/GPTeacher) | / | Pairs | English | 20k entries | Contains targets generated by GPT-4; includes many of the same seed tasks as the Alpaca dataset, plus some new tasks such as roleplay. |
| [HC3](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection) | Koala | RLHF | English, Chinese | 24,322 English, 12,853 Chinese | A multi-domain human-vs-ChatGPT comparison dataset; can be used for reward-model training or ChatGPT-detector training. |
| [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca#data-release) [Download](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) | Alpaca, ChatGLM-finetune-LoRA, Koala | Dialog, Pairs | English | 52K entries, 21.4MB | A dataset generated by text-davinci-003 to improve language models' ability to follow human instructions. |
| [OIG](https://huggingface.co/datasets/laion/OIG) [OIG-small-chip2](https://huggingface.co/datasets/0-hero/OIG-small-chip2) | Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala | Dialog, Pairs | English, code | 44M entries | A large conversational instruction dataset with medium- and high-quality subsets *(OIG-small-chip2)* for multi-task learning. |
| [ChatAlpaca data](https://github.com/cascip/ChatAlpaca) | / | Dialog, Pairs | English (Chinese version coming soon) | 10k entries, 39.5MB | Aims to help researchers develop models for instruction following in multi-turn conversations. |
| [InstructionWild](https://github.com/XueFuzhao/InstructionWild) | ColossalChat | Pairs | English, Chinese | 10K entries | An Alpaca-style dataset, but with seed tasks coming from ChatGPT screenshots. |
| [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | Firefly(流萤) | Pairs | Chinese | 1.1M entries, 1.17GB | A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks; no conversation data. |
| [BELLE](https://github.com/LianjiaTech/BELLE) [0.5M version](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) [1M version](https://huggingface.co/datasets/BelleGroup/train_1M_CN) [2M version](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | BELLE series, Chunhua (春华) | Pairs | Chinese | 2.67B in total | A Chinese instruction dataset similar to *Alpaca data*, constructed by generating answers from seed tasks; no conversation data. |
| [GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset#guanacodataset) | Guanaco | Dialog, Pairs | English, Chinese, Japanese | 534,530 entries | A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition. |
| [OpenAI WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | WebGPT's reward model, Koala | RLHF | English | 19,578 pairs | Dataset used in the WebGPT paper, for training the reward model in RLHF. |
| [OpenAI Summarization Comparison](https://huggingface.co/datasets/openai/summarize_from_feedback) | Koala | RLHF | English | ~93K entries, 420MB | A dataset of human feedback for training a reward model; the reward model was then used to train a summarization model to align with human preferences. |
| [self-instruct](https://github.com/yizhongw/self-instruct) | / | Pairs | English | 82K entries | Generated using the well-known [self-instruct method](https://arxiv.org/abs/2212.10560). |
| [unnatural-instructions](https://github.com/orhonovich/unnatural-instructions) | / | Pairs | English | 240,670 examples | An early attempt to use a powerful model (text-davinci-002) to generate data. |
| [xP3 (and some variants)](https://huggingface.co/datasets/bigscience/xP3) | BLOOMZ, mT0 | Pairs | Multilingual, code | 79M entries, 88GB | An instruction dataset for improving language models' generalization ability, similar to *Natural Instructions*. |
| [Flan V2](https://github.com/google-research/FLAN/tree/main/flan/v2) | / | / | English | / | Compiles Flan 2021, P3, Super-Natural Instructions, and dozens more datasets into one, formatted into a mix of zero-shot, few-shot, and chain-of-thought templates. |
| [Natural Instructions](https://instructions.apps.allenai.org/) [GitHub&Download](https://github.com/allenai/natural-instructions) | tk-instruct series | Pairs, evaluation | Multilingual | / | A benchmark of over 1,600 tasks with instructions and definitions, for evaluating and improving language models' multi-task generalization under natural-language instruction. |
| [CrossWOZ](https://github.com/thu-coai/CrossWOZ) | / | Dialog | English, Chinese | 6K dialogs | The dataset introduced by [this paper](https://arxiv.org/pdf/2002.11893.pdf), mainly about tourism topics in Beijing; answers are generated automatically by rules. |
| [proof-pile](https://huggingface.co/datasets/hoskinson-center/proof-pile) | proof-GPT | PT | English, LaTeX | 13GB | A pretraining dataset similar to the Pile, but with a LaTeX corpus to enhance LMs' ability in theorem proving. |
| [peS2o](https://huggingface.co/datasets/allenai/peS2o) | / | PT | English | 7.5GB | A high quality academic paper dataset for pretraining. |
| [lvwerra/stack-exchange-paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/rl) | Stack LLaMA 2 | PT | English | 6.3GB | Paired StackExchange human-preference dataset. |
| [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | tiiuae/falcon series | PT | English | / | A refined subset of CommonCrawl. |
| [CBook-150K](https://github.com/FudanNLPLAB/CBook-150K) | / | PT, building dataset | Chinese | 150K+ books | A raw Chinese books dataset; needs some preprocessing pipeline. |
| [Common Crawl](https://commoncrawl.org/) | LLaMA (after some processing) | building datasets, PT | / | / | The most well-known raw dataset, rarely used directly. One possible preprocessing pipeline is [CCNet](https://github.com/facebookresearch/cc_net). |
| [nlp_Chinese_Corpus](https://github.com/brightmart/nlp_chinese_corpus) | / | PT, FT | Chinese | / | A Chinese pretraining corpus; includes Wikipedia, Baidu Baike, Baidu QA, some forum QA, and a news corpus. |
| [The Pile (V1)](https://pile.eleuther.ai/) | GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b | PT | Multilingual, code | 825GB | A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets covering many domains and tasks. |
| C4 [Huggingface dataset](https://huggingface.co/datasets/c4) [TensorFlow dataset](https://www.tensorflow.org/datasets/catalog/c4) | Google T5 series, LLaMA | PT | English | 305GB | A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently used. |
| [ROOTS](https://huggingface.co/bigscience-data) | BLOOM | PT | Multilingual, code | 1.6TB | A diverse open-source dataset consisting of sub-datasets such as Wikipedia and StackExchange, for language modeling. |
| [Pushshift reddit](https://files.pushshift.io/reddit/) [paper](https://arxiv.org/pdf/2001.08435.pdf) | OPT-175b | PT | / | / | Raw Reddit data; one possible processing pipeline is in [this paper](https://aclanthology.org/2021.eacl-main.24.pdf). |
| [Gutenberg project](https://www.gutenberg.org/policy/robot_access.html) | LLaMA | PT | Multilingual | / | A book dataset, mostly novels; not preprocessed. |
| [CLUECorpus](https://github.com/CLUEbenchmark/CLUE) | / | PT, finetune, evaluation | Chinese | 100GB | A Chinese pretraining corpus sourced from *Common Crawl*. |
| [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) | starcoder series | PT | code | 783GB | A large pretraining dataset for improving LMs' coding ability. |
| [code_instructions_120k_alpaca](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca) | / | Pairs | English, code | 121,959 entries | [code_instruction](https://huggingface.co/datasets/sahil2801/code_instructions_120k) in instruction-finetuning format. |
| [function-invocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k) | some MPT variants | Pairs | English, code | 25K entries | Aims at teaching AI models to correctly invoke [APIsGuru](https://github.com/APIs-guru/openapi-directory) functions based on natural-language prompts. |
| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | / | Pairs | English | 800 | A high-quality STEM theorem QA dataset. |
| [FinNLP](https://github.com/AI4Finance-Foundation/FinNLP) | [FinGPT](https://github.com/AI4Finance-Foundation/FinGPT) | Raw data | English, Chinese | / | Open-source raw financial text data, including news, social media, etc. |
| [PRM800K](https://github.com/openai/prm800k) | A variant of GPT-4 | Context | English | 800K entries | A process-supervision dataset for mathematical problems. |
| [MeChat data](https://github.com/qiuhuachuan/smile) ⚠️ | MeChat | Dialog | Chinese | 355733 utterances | A Chinese SFT dataset for training a mental-healthcare chatbot. |
| [ChatGPT-Jailbreak-Prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts) ⚠️ | / | / | English | 163KB file size | Prompts for bypassing the safety regulation of ChatGPT; can be used for probing the harmlessness of LLMs. |
| [awesome chinese legal resources](https://github.com/pengxiao-song/awesome-chinese-legal-resources) | LaWGPT | / | Chinese | / | A collection of Chinese legal data for LLM training. |
| [Long Form](https://github.com/akoksal/LongForm) | / | Pairs | English | 23.7K entries | Aims at improving the long-text generation ability of LLMs. |
| [symbolic-instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning) | / | Pairs | English, code | 796 | Focuses on 'symbolic' tasks, such as SQL coding and mathematical computation. |
| [Safety Prompt](https://github.com/thu-coai/Safety-Prompts) | / | Evaluation only | Chinese | 100k entries | Chinese safety prompts for evaluating and improving the safety of LLMs. |
| [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k) | / | Pairs | English | 116k entries | A revised version of the DAISLab dataset of IFTTT rules, thoroughly cleaned, scored, and adjusted for instruction tuning. |
| [instructional_codesearchnet_python](https://huggingface.co/datasets/Nan-Do/instructional_codesearchnet_python) | / | Pairs | English, Python | 192MB | A template-generated instructional Python dataset, built from an annotated version of the CodeSearchNet dataset for the Open-Assistant project. |
| [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca) | / | Pairs | English | 1.3K entries | An Alpaca-style dataset, but focused on financial topics. |
| [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | idefics series | image-document | English | 141M documents | An open, massive, and curated collection of interleaved image-text web documents. |
| [JourneyDB](https://huggingface.co/datasets/JourneyDB/JourneyDB) | / | image-prompt-caption | English | 4M instances | A large-scale dataset comprising QA, captioning, and text-prompting tasks, based on Midjourney images. |
| [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) | Ying-VLM | instruction-image | Multilingual | 2.4M instances | Comprises 40 tasks with 400 human-written instructions. |
| [MIMIC-IT](https://github.com/Luodian/Otter/tree/main/mimic-it) | Otter | instruction-image | Multilingual | 2.2M instances | High-quality multimodal instruction-response pairs based on images and videos. |
| [LLaVA Instruction](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | LLaVA | instruction-image | English | 158k samples | A multimodal dataset generated on top of the COCO dataset by prompting GPT-4 for instructions. |
| WebText (Reddit links) | GPT-2 | PT | English | / | Data crawled from links shared on Reddit, filtered for GPT-2 pretraining. |
| MassiveText | Gopher, Chinchilla | PT | 99% English, 1% other (including code) | / | / |
| WuDao Corpora | GLM | PT | Chinese | 200GB | A large-scale Chinese corpus; components were originally open-sourced but are no longer available. |