https://github.com/dsdanielpark/open-llm-datasets
Repository for organizing datasets and papers used in Open LLM.
- Host: GitHub
- URL: https://github.com/dsdanielpark/open-llm-datasets
- Owner: dsdanielpark
- License: MIT
- Created: 2023-05-29T12:52:16.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-06T01:59:07.000Z (almost 2 years ago)
- Last Synced: 2025-01-14T10:48:33.316Z (4 months ago)
- Topics: datasets, large-language-models, llm, llm-datasets, llm-training, natural-language-processing
- Homepage: https://huggingface.co/datasets
- Size: 3.13 MB
- Stars: 92
- Watchers: 5
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Open-LLM-datasets
Repository for organizing datasets and papers used in Open LLM.
# Table of Contents
- [Datasets](#datasets)
- [General Open Access Datasets for Alignment](#general-open-access-datasets-for-alignment)
- [Open Datasets for Pretraining](#open-datasets-for-pretraining)
- [Domain-specific datasets and Private datasets](#domain-specific-datasets-and-private-datasets)
- [Potential Overlap](#potential-overlap)
- [Papers](#papers)
- [Pre-trained LLM](#pre-trained-llm)
- [Instruction finetuned LLM](#instruction-finetuned-llm)
- [Aligned LLM](#aligned-llm)
- [Open LLM](#open-llm)
- [LLM Training Frameworks](#llm-training-frameworks)
- [LLM Optimization](#llm-optimization)
- [State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods](#state-of-the-art-parameter-efficient-fine-tuning-peft-methods)
- [Tools for deploying LLM](#tools-for-deploying-llm)
- [Tutorials about LLM](#tutorials-about-llm)
- [Courses about LLM](#courses-about-llm)
- [Opinions about LLM](#opinions-about-llm)
- [Other Awesome Lists](#other-awesome-lists)
- [Other Useful Resources](#other-useful-resources)
- [How to Contribute](#how-to-contribute)
- [References](#references)
# Datasets
To download or access information about the most commonly used datasets, see https://huggingface.co/datasets; a minimal loading sketch is shown below.
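Most of the entries below are hosted on the Hugging Face Hub and can be pulled programmatically with the `datasets` library. A minimal sketch, assuming the `datasets` package is installed (the dataset ID is only illustrative; substitute any Hub ID from the lists that follow):

```python
from datasets import load_dataset

# Download a dataset from the Hugging Face Hub by its ID and select the train split.
# "databricks/databricks-dolly-15k" is an illustrative ID; any Hub dataset ID works.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(dataset)     # features and number of rows
print(dataset[0])  # first record as a Python dict
```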
## General Open Access Datasets for Alignment
- [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- [ultraChat](https://huggingface.co/datasets/stingning/ultrachat)
- [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)
- [pku-saferlhf-dataset](https://github.com/PKU-Alignment/safe-rlhf#pku-saferlhf-dataset)
- [RefGPT-Dataset](https://github.com/ziliwangnlp/RefGPT)
- [Luotuo-QA-A-CoQA-Chinese](https://huggingface.co/datasets/silk-road/Luotuo-QA-A-CoQA-Chinese)
- [Wizard-LM-Chinese-instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol)
- [alpaca_chinese_dataset](https://github.com/hikariming/alpaca_chinese_dataset)
- [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)
- [Alpaca-GPT-4_zh-cn](https://huggingface.co/datasets/shibing624/alpaca-zh)
- [Baize Dataset](https://github.com/project-baize/baize-chatbot/tree/main/data)
- [h2oai/h2ogpt-fortune2000-personalized](https://huggingface.co/datasets/h2oai/h2ogpt-fortune2000-personalized)
- [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- [ELI5](https://huggingface.co/datasets/eli5#source-data)
- [evol_instruct_70k](https://huggingface.co/datasets/victor123/evol_instruct_70k)
- [MOSS SFT data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data)
- [ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [GPT-4all Dataset](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations)
- [COIG](https://huggingface.co/datasets/BAAI/COIG)
- [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- [OpenAssistant Conversations Dataset (OASST1)](https://huggingface.co/datasets/OpenAssistant/oasst1)
- [Alpaca-COT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
- [CBook-150K](https://github.com/FudanNLPLAB/CBook-150K)
- [databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data) ([possible zh-cn version](https://huggingface.co/datasets/jaja7744/dolly-15k-cn))
- [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned)
- [GPT-4-LLM Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
- [GPTeacher](https://github.com/teknium1/GPTeacher)
- [HC3](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection)
- [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca#data-release) [Download](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)
- [OIG](https://huggingface.co/datasets/laion/OIG) [OIG-small-chip2](https://huggingface.co/datasets/0-hero/OIG-small-chip2)
- [ChatAlpaca data](https://github.com/cascip/ChatAlpaca)
- [InstructionWild](https://github.com/XueFuzhao/InstructionWild)
- [Firefly(流萤)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
- [BELLE](https://github.com/LianjiaTech/BELLE) [0.5M version](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) [1M version](https://huggingface.co/datasets/BelleGroup/train_1M_CN) [2M version](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
- [GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset#guanacodataset)
- [xP3 (and some variant)](https://huggingface.co/datasets/bigscience/xP3)
- [OpenAI WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [OpenAI Summarization Comparison](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [Natural Instruction](https://instructions.apps.allenai.org/) [GitHub&Download](https://github.com/allenai/natural-instructions)
- [hh-rlhf](https://github.com/anthropics/hh-rlhf) [on Huggingface](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [OpenAI PRM800k](https://github.com/openai/prm800k)
## Open Datasets for Pretraining
- [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- [Common Crawl](https://commoncrawl.org/)
- [nlp_Chinese_Corpus](https://github.com/brightmart/nlp_chinese_corpus)
- [The Pile (V1)](https://pile.eleuther.ai/)
- [Huggingface dataset for C4](https://huggingface.co/datasets/c4)
- [TensorFlow dataset for C4](https://www.tensorflow.org/datasets/catalog/c4)
- [ROOTS](https://huggingface.co/bigscience-data)
- [Pushshift Reddit](https://files.pushshift.io/reddit/)
- [Gutenberg project](https://www.gutenberg.org/policy/robot_access.html)
- [CLUECorpus](https://github.com/CLUEbenchmark/CLUE)
## Domain-specific datasets and Private datasets
- [ChatGPT-Jailbreak-Prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts)
- [awesome-chinese-legal-resources](https://github.com/pengxiao-song/awesome-chinese-legal-resources)
- [Long Form](https://github.com/akoksal/LongForm)
- [symbolic-instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning)
- [Safety Prompt](https://github.com/thu-coai/Safety-Prompts)
- [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k)
- [instructional_codesearchnet_python](https://huggingface.co/datasets/Nan-Do/instructional_codesearchnet_python)
- [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
- WebText(Reddit links) - Private Dataset
- MassiveText - Private Dataset
- [Korean-Open-LLM-Datasets](https://github.com/dsdanielpark/Korean-Open-LLM-Datasets)
## Potential Overlap
| | OIG | hh-rlhf | xP3 | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
|-------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| OIG | - | Contains | Overlap | Overlap | Overlap | | Overlap |
| hh-rlhf | Part of | - | | | | | Overlap |
| xP3 | Overlap | | - | Overlap | | | Overlap |
| Natural instruct | Overlap | | Overlap | - | | | Overlap |
| AlpacaDataCleaned | Overlap | | | | - | Overlap | Overlap |
| GPT-4-LLM | | | | | Overlap | - | Overlap |
| Alpaca-CoT | Overlap | Overlap | Overlap | Overlap | Overlap | Overlap | - |
# Papers
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [Improving Language Understanding by Generative Pre-Training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423.pdf)
- [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://jmlr.org/papers/v21/20-074.html)
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/pdf/1910.02054.pdf)
- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf)
- [Language models are few-shot learners](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/pdf/2101.03961.pdf)
- [Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf)
- [On the Opportunities and Risks of Foundation Models](https://arxiv.org/pdf/2108.07258.pdf)
- [Finetuned Language Models are Zero-Shot Learners](https://openreview.net/forum?id=gEZrGCozdqR)
- [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
- [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/pdf/2112.06905.pdf)
- [WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing](https://openai.com/blog/webgpt/)
- [Improving language models by retrieving from trillions of tokens](https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens)
- [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/pdf/2112.11446.pdf)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf)
- [LaMDA: Language Models for Dialog Applications](https://arxiv.org/pdf/2201.08239.pdf)
- [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858)
- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](https://arxiv.org/pdf/2201.11990.pdf)
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf)
- [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/pdf/2204.02311.pdf)
- [An empirical analysis of compute-optimal large language model training](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training)
- [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/2205.01068.pdf)
- [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)
- [Emergent Abilities of Large Language Models](https://openreview.net/pdf?id=yzkSU5zdwD)
- [Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models](https://github.com/google/BIG-bench)
- [Language Models are General-Purpose Interfaces](https://arxiv.org/pdf/2206.06336.pdf)
- [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/pdf/2209.14375.pdf)
- [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf)
- [GLM-130B: An Open Bilingual Pre-trained Model](https://arxiv.org/pdf/2210.02414.pdf)
- [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110.pdf)
- [BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/pdf/2211.05100.pdf)
- [Galactica: A Large Language Model for Science](https://arxiv.org/pdf/2211.09085.pdf)
- [OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization](https://arxiv.org/pdf/2212.12017)
- [The Flan Collection: Designing Data and Methods for Effective Instruction Tuning](https://arxiv.org/pdf/2301.13688.pdf)
- [LLaMA: Open and Efficient Foundation Language Models](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/)
- [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045)
- [PaLM-E: An Embodied Multimodal Language Model](https://palm-e.github.io)
- [GPT-4 Technical Report](https://openai.com/research/gpt-4)
- [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373)
- [Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision](https://arxiv.org/abs/2305.03047)
- [PaLM 2 Technical Report](https://ai.google/static/documents/palm2techreport.pdf)
- [RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048)
- [Let’s Verify Step by Step - OpenAI](https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf)
## Pre-trained LLM
- Switch Transformer: [Paper](https://arxiv.org/pdf/2101.03961.pdf)
- GLaM: [Paper](https://arxiv.org/pdf/2112.06905.pdf)
- PaLM: [Paper](https://arxiv.org/pdf/2204.02311.pdf)
- MT-NLG: [Paper](https://arxiv.org/pdf/2201.11990.pdf)
- J1-Jumbo: [api](https://docs.ai21.com/docs/complete-api), [Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
- OPT: [api](https://opt.alpa.ai), [ckpt](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT), [Paper](https://arxiv.org/pdf/2205.01068.pdf), [OPT-175B License Agreement](https://github.com/facebookresearch/metaseq/blob/edefd4a00c24197486a3989abe28ca4eb3881e59/projects/OPT/MODEL_LICENSE.md)
- BLOOM: [api](https://huggingface.co/bigscience/bloom), [ckpt](https://huggingface.co/bigscience/bloom), [Paper](https://arxiv.org/pdf/2211.05100.pdf), [BigScience RAIL License v1.0](https://huggingface.co/spaces/bigscience/license)
- GPT 3.0: [api](https://openai.com/api/), [Paper](https://arxiv.org/pdf/2005.14165.pdf)
- LaMDA: [Paper](https://arxiv.org/pdf/2201.08239.pdf)
- GLM: [ckpt](https://github.com/THUDM/GLM-130B), [Paper](https://arxiv.org/pdf/2210.02414.pdf), [The GLM-130B License](https://github.com/THUDM/GLM-130B/blob/799837802264eb9577eb9ae12cd4bad0f355d7d6/MODEL_LICENSE)
- YaLM: [ckpt](https://github.com/yandex/YaLM-100B), [Blog](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6), [Apache 2.0 License](https://github.com/yandex/YaLM-100B/blob/14fa94df2ebbbd1864b81f13978f2bf4af270fcb/LICENSE)
- LLaMA: [ckpt](https://github.com/facebookresearch/llama), [Paper](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/), [Non-commercial bespoke license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- GPT-NeoX: [ckpt](https://github.com/EleutherAI/gpt-neox), [Paper](https://arxiv.org/pdf/2204.06745.pdf), [Apache 2.0 License](https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE)
- UL2: [ckpt](https://huggingface.co/google/ul2), [Paper](https://arxiv.org/pdf/2205.05131v1.pdf), [Apache 2.0 License](https://huggingface.co/google/ul2)
- T5: [ckpt](https://huggingface.co/t5-11b), [Paper](https://jmlr.org/papers/v21/20-074.html), [Apache 2.0 License](https://huggingface.co/t5-11b)
- CPM-Bee: [api](https://live.openbmb.org/models/bee), [Paper](https://arxiv.org/pdf/2012.00413.pdf)
- rwkv-4: [ckpt](https://huggingface.co/BlinkDL/rwkv-4-pile-7b), [Github](https://github.com/BlinkDL/RWKV-LM), [Apache 2.0 License](https://huggingface.co/BlinkDL/rwkv-4-pile-7b)
- GPT-J: [ckpt](https://huggingface.co/EleutherAI/gpt-j-6B), [Github](https://github.com/kingoflolz/mesh-transformer-jax), [Apache 2.0 License](https://huggingface.co/EleutherAI/gpt-j-6b)
- GPT-Neo: [ckpt](https://github.com/EleutherAI/gpt-neo), [Github](https://github.com/EleutherAI/gpt-neo), [MIT License](https://github.com/EleutherAI/gpt-neo/blob/23485e3c7940560b3b4cb12e0016012f14d03fc7/LICENSE)
## Instruction finetuned LLM
- Flan-PaLM: [Link](https://arxiv.org/pdf/2210.11416.pdf)
- BLOOMZ: [Link](https://huggingface.co/bigscience/bloomz)
- InstructGPT: [Link](https://platform.openai.com/overview)
- Galactica: [Link](https://huggingface.co/facebook/galactica-120b)
- OpenChatKit: [Link](https://github.com/togethercomputer/OpenChatKit)
- Flan-UL2: [Link](https://github.com/google-research/google-research/tree/master/ul2)
- Flan-T5: [Link](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
- T0: [Link](https://huggingface.co/bigscience/T0)
- Alpaca: [Link](https://crfm.stanford.edu/alpaca/)
## Aligned LLM
- GPT 4: [Blog](https://openai.com/research/gpt-4)
- ChatGPT: [Demo](https://openai.com/blog/chatgpt/) | [API](https://share.hsforms.com/1u4goaXwDRKC9-x9IvKno0A4sk30)
- Sparrow: [Paper](https://arxiv.org/pdf/2209.14375.pdf)
- Claude: [Demo](https://poe.com/claude) | [API](https://www.anthropic.com/earlyaccess)
# Open LLM
## LLM Leaderboard

- Visualization of the Open LLM Leaderboard: https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report
- Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) - A foundational, 65-billion-parameter large language model.
- [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
- [Flan-Alpaca](https://github.com/declare-lab/flan-alpaca) - Instruction Tuning from Humans and Machines.
- [Baize](https://github.com/project-baize/baize-chatbot) - Baize is an open-source chat model trained with LoRA.
- [Cabrita](https://github.com/22-hours/cabrita) - A Portuguese finetuned instruction LLaMA.
- [Vicuna](https://github.com/lm-sys/FastChat) - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
- [Llama-X](https://github.com/AetherCortex/Llama-X) - Open Academic Research on Improving LLaMA to SOTA LLM.
- [Chinese-Vicuna](https://github.com/Facico/Chinese-Vicuna) - A Chinese Instruction-following LLaMA-based Model.
- [GPTQ-for-LLaMA](https://github.com/qwopqwop200/GPTQ-for-LLaMa) - 4 bits quantization of LLaMA using GPTQ.
- [GPT4All](https://github.com/nomic-ai/gpt4all) - Demo, data, and code to train open-source assistant-style large language model based on GPT-J and LLaMa.
- [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) - A Dialogue Model for Academic Research.
- [BELLE](https://github.com/LianjiaTech/BELLE) - Be Everyone's Large Language model Engine.
- [StackLLaMA](https://huggingface.co/blog/stackllama) - A hands-on guide to train LLaMA with RLHF.
- [RedPajama](https://github.com/togethercomputer/RedPajama-Data) - An Open Source Recipe to Reproduce LLaMA training dataset.
- [Chimera](https://github.com/FreedomIntelligence/LLMZoo) - the Latin-language counterpart of Phoenix in the LLM Zoo project.
- [CaMA](https://github.com/zjunlp/CaMA) - a Chinese-English Bilingual LLaMA Model.
- [BLOOM](https://huggingface.co/bigscience/bloom) - BigScience Large Open-science Open-access Multilingual Language Model.
- [BLOOMZ&mT0](https://huggingface.co/bigscience/bloomz) - a family of models capable of following human instructions in dozens of languages zero-shot.
- [Phoenix](https://github.com/FreedomIntelligence/LLMZoo)
- [T5](https://arxiv.org/abs/1910.10683) - Text-to-Text Transfer Transformer.
- [T0](https://arxiv.org/abs/2110.08207) - Multitask Prompted Training Enables Zero-Shot Task Generalization.
- [OPT](https://arxiv.org/abs/2205.01068) - Open Pre-trained Transformer Language Models.
- [UL2](https://arxiv.org/abs/2205.05131v1) - a unified framework for pretraining models that are universally effective across datasets and setups.
- [GLM](https://github.com/THUDM/GLM) - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) - ChatGLM-6B is an open-source, supporting Chinese and English dialogue language model based on General Language Model (GLM) architecture.
- [RWKV](https://github.com/BlinkDL/RWKV-LM) - Parallelizable RNN with Transformer-level LLM Performance.
- [ChatRWKV](https://github.com/BlinkDL/ChatRWKV) - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
- [StableLM](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) - Stability AI Language Models.
- [YaLM](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6) - a GPT-like neural network for generating and processing text.
- [GPT-Neo](https://github.com/EleutherAI/gpt-neo) - An implementation of model & data parallel GPT3-like models.
- [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile.
- [Dolly](https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html) - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
- [Pythia](https://github.com/EleutherAI/pythia) - Interpreting Autoregressive Transformers Across Time and Scale.
- [Dolly 2.0](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) - an open-source reproduction of DeepMind's Flamingo model.
- [Cerebras-GPT](https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/) - A Family of Open, Compute-efficient, Large Language Models.
- [GALACTICA](https://github.com/paperswithcode/galai/blob/main/docs/model_card.md) - The GALACTICA models are trained on a large-scale scientific corpus.
- [GALPACA](https://huggingface.co/GeorgiaTechResearchInstitute/galpaca-30b) - GALACTICA 30B fine-tuned on the Alpaca dataset.
- [Palmyra](https://huggingface.co/Writer/palmyra-base) - Palmyra Base was primarily pre-trained with English text.
- [Camel](https://huggingface.co/Writer/camel-5b-hf) - a state-of-the-art instruction-following large language model.
- [h2oGPT](https://github.com/h2oai/h2ogpt)
- [PanGu-α](https://openi.org.cn/pangu/) - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model.
- [MOSS](https://github.com/OpenLMLab/MOSS) - MOSS is an open-source dialogue language model that supports Chinese and English.
- [Open-Assistant](https://github.com/LAION-AI/Open-Assistant) - a project meant to give everyone access to a great chat-based large language model.
- [HuggingChat](https://huggingface.co/chat/) - a chat interface powered by Open Assistant's latest model and the Hugging Face Inference API.
- [StarCoder](https://huggingface.co/blog/starcoder) - Hugging Face LLM for Code
- [MPT-7B](https://www.mosaicml.com/blog/mpt-7b) - Open LLM for commercial use by MosaicML.
## LLM Training Frameworks
- [Serving OPT-175B, BLOOM-176B and CodeGen-16B using Alpa](https://alpa.ai/tutorials/opt_serving.html)
- [Alpa](https://github.com/alpa-projects/alpa)
- [Megatron-LM GPT2 tutorial](https://www.deepspeed.ai/tutorials/megatron/)
- [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)
- [pretrain_gpt3_175B.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt3_175B.sh)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [deepspeed.ai](https://www.deepspeed.ai)
- [Github repo](https://github.com/microsoft/DeepSpeed)
- [Colossal-AI](https://colossalai.org)
- [Open source solution replicates ChatGPT training process! Ready to go with only 1.6GB GPU memory and gives you 7.73 times faster training!](https://www.hpc-ai.tech/blog/colossal-ai-chatgpt)
- [BMTrain](https://github.com/OpenBMB/BMTrain)
- [Mesh TensorFlow `(mtf)`](https://github.com/tensorflow/mesh)
- [JAX tutorial: Distributed arrays and automatic parallelization](https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html)
## LLM Optimization
### State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods
- **github:** https://github.com/huggingface/peft
- **abstract:** Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly; PEFT methods instead fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs, and recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning. The library is seamlessly integrated with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference. A minimal usage sketch follows the method list below.
- **Supported methods:**
1. LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
2. Prefix Tuning: [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://aclanthology.org/2021.acl-long.353/), [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
3. P-Tuning: [GPT Understands, Too](https://arxiv.org/abs/2103.10385)
4. Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)
5. AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512)
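The following is a minimal LoRA sketch with the `peft` library, assuming `peft` and `transformers` are installed; the base model ID and hyperparameters are illustrative only, not a recommended configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a small causal LM as the frozen base model (model ID is illustrative).
base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

# Configure a LoRA adapter: only the low-rank update matrices are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,                           # rank of the update matrices
    lora_alpha=32,                 # scaling factor
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```

The wrapped model can then be trained with a standard `transformers` Trainer or a plain PyTorch loop, and only the adapter weights need to be saved.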
## Tools for deploying LLM
- [Haystack](https://haystack.deepset.ai/)
- [Sidekick](https://github.com/ai-sidekick/sidekick)
- [LangChain](https://github.com/hwchase17/langchain)
- [wechat-chatgpt](https://github.com/fuergaosi233/wechat-chatgpt)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui)
## Tutorials about LLM
- [Andrej Karpathy] State of GPT [video](https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2)
- [Hyung Won Chung] Instruction finetuning and RLHF lecture [Youtube](https://www.youtube.com/watch?v=zjrM-MW-0y0)
- [Jason Wei] Scaling, emergence, and reasoning in large language models [Slides](https://docs.google.com/presentation/d/1EUV7W7X_w0BDrscDhPg7lMGzJCkeaPkGCJ3bN8dluXc/edit?pli=1&resourcekey=0-7Nz5A7y8JozyVrnDtcEKJA#slide=id.g16197112905_0_0)
- [Susan Zhang] Open Pretrained Transformers [Youtube](https://www.youtube.com/watch?v=p9IxoSkvZ-M&t=4s)
- [Ameet Deshpande] How Does ChatGPT Work? [Slides](https://docs.google.com/presentation/d/1TTyePrw-p_xxUbi3rbmBI3QQpSsTI1btaQuAUvvNc8w/edit#slide=id.g206fa25c94c_0_24)
- [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization [Bilibili](https://www.bilibili.com/video/BV1Qs4y1h7pn/?spm_id_from=333.337.search-card.all.click&vd_source=1e55c5426b48b37e901ff0f78992e33f)
- [Hung-yi Lee] ChatGPT: Analyzing the Principle [Youtube](https://www.youtube.com/watch?v=yiY4nPOzJEg&list=RDCMUC2ggjtuuWvxrHHHiaDH1dlQ&index=2)
- [Jay Mody] GPT in 60 Lines of NumPy [Link](https://jaykmody.com/blog/gpt-from-scratch/)
- [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models [Link](https://icml.cc/virtual/2022/tutorial/18440)
- [NeurIPS 2022] Foundational Robustness of Foundation Models [Link](https://nips.cc/virtual/2022/tutorial/55796)
- [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. [Video](https://www.youtube.com/watch?v=kCc8FmEb1nY)|[Code](https://github.com/karpathy/ng-video-lecture)
- [DAIR.AI] Prompt Engineering Guide [Link](https://github.com/dair-ai/Prompt-Engineering-Guide)
- [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers [Link](https://www.philschmid.de/fine-tune-flan-t5-deepspeed)
- [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) [Link](https://huggingface.co/blog/rlhf)
- [HuggingFace] What Makes a Dialog Agent Useful? [Link](https://huggingface.co/blog/dialog-agents)
- [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions [Link](https://zhuanlan.zhihu.com/p/589747432)
- [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources [Link](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1)
- [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? [Link](https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/)
- [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? [Link](https://jingfengyang.github.io/gpt)
- [Hung-yi Lee] ChatGPT (possibly) How It Was Created - The Socialization Process of GPT [Video](https://www.youtube.com/watch?v=e0aKI2GGZNg)
- [OpenAI] Improving Mathematical Reasoning with Process Supervision [Link](https://openai.com/research/improving-mathematical-reasoning-with-process-supervision)
## Courses about LLM
- [DeepLearning.AI] ChatGPT Prompt Engineering for Developers [Homepage](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/)
- [Princeton] Understanding Large Language Models [Homepage](https://www.cs.princeton.edu/courses/archive/fall22/cos597G/)
- [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF [Slides](https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf)
- [Stanford] CS324-Large Language Models [Homepage](https://stanford-cs324.github.io/winter2022/)
- [Stanford] CS25-Transformers United V2 [Homepage](https://web.stanford.edu/class/cs25/)
- [Stanford Webinar] GPT-3 & Beyond [Video](https://www.youtube.com/watch?v=-lnHHWRCDGk)
- [MIT] Introduction to Data-Centric AI [Homepage](https://dcai.csail.mit.edu)
## Opinions about LLM
- [Google "We Have No Moat, And Neither Does OpenAI"](https://www.semianalysis.com/p/google-we-have-no-moat-and-neither) [2023-05-05]
- [AI competition statement](https://petergabriel.com/news/ai-competition-statement/) [2023-04-20] [petergabriel]
- [Noam Chomsky: The False Promise of ChatGPT](https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html) \[2023-03-08][Noam Chomsky]
- [Is ChatGPT 175 Billion Parameters? Technical Analysis](https://orenleung.super.site/is-chatgpt-175-billion-parameters-technical-analysis) \[2023-03-04][Owen]
- [The Next Generation Of Large Language Models ](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-02-07][Forbes]
- [Large Language Model Training in 2023](https://research.aimultiple.com/large-language-model-training/) \[2023-02-03][Cem Dilmegani]
- [What Are Large Language Models Used For? ](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-01-26][NVIDIA]
- [Large Language Models: A New Moore's Law](https://huggingface.co/blog/large-language-models) \[2021-10-26][Hugging Face]
## Other Awesome Lists
- [LLMsPracticalGuide](https://github.com/Mooler0410/LLMsPracticalGuide)
- [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)
- [awesome-chatgpt-prompts-zh](https://github.com/PlexPt/awesome-chatgpt-prompts-zh)
- [Awesome ChatGPT](https://github.com/humanloop/awesome-chatgpt)
- [Chain-of-Thoughts Papers](https://github.com/Timothyxxx/Chain-of-ThoughtsPapers)
- [Instruction-Tuning-Papers](https://github.com/SinclairCoder/Instruction-Tuning-Papers)
- [LLM Reading List](https://github.com/crazyofapple/Reading_groups/)
- [Reasoning using Language Models](https://github.com/atfortes/LM-Reasoning-Papers)
- [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub)
- [Awesome GPT](https://github.com/formulahendry/awesome-gpt)
- [Awesome GPT-3](https://github.com/elyase/awesome-gpt3)
- [Awesome LLM Human Preference Datasets](https://github.com/PolisAI/awesome-llm-human-preference-datasets)
- [RWKV-howto](https://github.com/Hannibal046/RWKV-howto)
- *[Amazing-Bard-Prompts](https://github.com/dsdanielpark/amazing-bard-prompts)*
## Other Useful Resources
- [Arize-Phoenix](https://phoenix.arize.com/)
- [Emergent Mind](https://www.emergentmind.com)
- [ShareGPT](https://sharegpt.com)
- [Major LLMs + Data Availability](https://docs.google.com/spreadsheets/d/1bmpDdLZxvTCleLGVPgzoMTQ0iDP2-7v7QziPrzPdHyM/edit#gid=0)
- [500+ Best AI Tools](https://vaulted-polonium-23c.notion.site/500-Best-AI-Tools-e954b36bf688404ababf74a13f98d126)
- [Cohere Summarize Beta](https://txt.cohere.ai/summarize-beta/)
- [chatgpt-wrapper](https://github.com/mmabrouk/chatgpt-wrapper)
- [Open-evals](https://github.com/open-evals/evals)
- [Cursor](https://www.cursor.so)
- [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT)
- [OpenAGI](https://github.com/agiresearch/OpenAGI)
- [HuggingGPT](https://github.com/microsoft/JARVIS)
## How to Contribute
Since this repository focuses on collecting datasets for LLMs, you are welcome to contribute and add datasets in any form you prefer.
# References
- [1] https://github.com/KennethanCeyer/awesome-llm
- [2] https://github.com/Hannibal046/Awesome-LLM
- [3] https://github.com/Zjh-819/LLMDataHub
- [4] https://huggingface.co/datasets