# awesome-huge-models [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

A collection of AWESOME things about HUGE AI models.

**[2023.06]** We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly, including training and inference code, instruction-tuned weights and datasets, pretrained weights, and [the datasets used for pretraining LLMs](#open-llm-training-dataset). In this update, I try to catch up with the latest developments in the open-source wave of LLMs.

**[2023.03]** Only pretrained models are recorded here. Models are sorted by their first release date. To support the open-source progress of LLMs, we highlight open-sourced models with [[open]]().

**[2022.06]** There is a trend, led by big companies, of training large-scale deep learning models (w.r.t. params, dataset size, FLOPs). These models achieve SoTA performance at a high price, relying on bags of training tricks and distributed training systems. Keeping an eye on this trend tells us where the current boundaries of AI models lie. [[Intro in Chinese](https://zhuanlan.zhihu.com/p/529863941)]

## Contents

- [awesome-huge-models](#awesome-huge-models-)
  - [Contents](#contents)
  - [Survey](#survey)
  - [Models](#models)
    - [Language Model](#language-model)
    - [Vision Models](#vision-models)
    - [Reinforcement Learning](#reinforcement-learning)
    - [Speech](#speech)
    - [Science](#science)
  - [Open LLM Training Dataset](#open-llm-training-dataset)
  - [Distributed Training Framework](#distributed-training-framework)
    - [PyTorch Ecosystem](#pytorch-ecosystem)
    - [XLA Ecosystem](#xla-ecosystem)
    - [Other Frameworks](#other-frameworks)
    - [Inference Frameworks](#inference-frameworks)
    - [Recommendation Training Framework](#recommendation-training-framework)
  - [Keys Explanations](#keys-explanations)

## Survey


*Figure: Big models in NLP*

- [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) [2023.03]
- [A Dive into Vision-Language Models](https://huggingface.co/blog/vision_language_pretraining) [2023.02]
- [Compute Trends Across Three Eras of Machine Learning](https://arxiv.org/abs/2202.05924) [[chart](https://ourworldindata.org/grapher/ai-training-computation)] [2022.02]
- [Vision-and-Language Pretrained Models: A Survey](https://arxiv.org/abs/2204.07356) [2022.04]
- [A Roadmap to Big Model](https://arxiv.org/abs/2203.14101) [2022.03]
- [A Survey of Vision-Language Pre-trained Models](https://arxiv.org/abs/2202.10936) [2022.02]
- [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) [2022.01]
- [On the Opportunities and Risk of Foundation Models](https://arxiv.org/abs/2108.07258) [2021.08]
- [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [2021.06]

Resource lists:

- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM)
- [Open-LLM](https://github.com/eugeneyan/open-llms)
- [LLMDataHub](https://github.com/Zjh-819/LLMDataHub)

## Models

### Language Model


*Figure: LLM evolutionary tree*

- **Baichuan** [[Baichuan]]() Jun. 2023 [[open]](https://github.com/baichuan-inc/baichuan-7B)

```yaml
Field: Language
Params: 7B
Training Data: 1.2T tokens (English, Chinese, Private)
License: Apache 2.0
Context Length: 4096
```

- **Falcon** [[TII]]() Jun. 2023 [[open]](https://huggingface.co/tiiuae/falcon-40b)

```yaml
Field: Language
Params: 40B
Training Data: 1T tokens (RefinedWeb)
License: Apache 2.0
Context Length: 2048
```

- **OpenLLaMA** [[OpenLM]]() May. 2023 [[open]](https://github.com/openlm-research/open_llama)

```yaml
Field: Language
Params: 13B, 7B, 3B
Training Data: 1T tokens (RedPajama)
License: Apache 2.0
Context Length: 2048
```

- **RedPajama-INCITE** [[Together]](https://github.com/togethercomputer/RedPajama-Data) May. 2023 [[open]](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1)

```yaml
Field: Language
Params: 7B, 3B
Training Data: 1T tokens (Redpajama)
License: Apache 2.0
Context Length: 2048
```

- **MPT** [[MosaicML]](https://www.mosaicml.com/blog/mpt-7b) May. 2023 [[open]](https://github.com/mosaicml/llm-foundry)

```yaml
Field: Language
Params: 30B, 7B
Training Data: 1T tokens (Private)
License: Apache 2.0, CC BY-SA-3.0
Context Length: 84k
```

- **Stable-LM** [[Stability-AI]](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) Apr. 2023 [[open]](https://github.com/Stability-AI/StableLM#stablelm-alpha)

```yaml
Field: Language
Params: 7B, 3B
Training Data: 1.5T tokens
License: CC BY-SA-4.0
```

- **Lit-LLaMA** [[Lightning-AI]]() Apr. 2023 [[open]](https://github.com/Lightning-AI/lit-llama)

```yaml
Field: Language
Params: 13B, 7B
Training Data: 1.2T tokens (Redpajama)
License: Apache 2.0
```

- **h2oGPT** [[H2O.ai]](https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/) Jun. 2023 [[open]](https://github.com/h2oai/h2ogpt)
[h2oGPT: Democratizing Large Language Models](https://arxiv.org/pdf/2306.08161.pdf)

```yaml
Field: Language
Params: 13B, 7B
Training Data: 1.0T tokens
License: Apache 2.0
Context Length: 2048
```

- **Cerebras-GPT** [[Cerebras]]() Mar. 2023 [[open]](https://huggingface.co/cerebras/Cerebras-GPT-13B)
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster [[Preprint]](https://arxiv.org/abs/2304.03208)

```yaml
Field: Language
Params: 13B
Training Data: 371B tokens (Redpajama)
License: Apache 2.0
Context Length: 2048
```

- **Claude** [[Anthropic]](https://www.anthropic.com/index/introducing-claude) Mar. 2023 [close]

```yaml
Field: Language-Vision
```

- **GPT-4** [[OpenAI]](https://openai.com/product/gpt-4) Mar. 2023 [close]
GPT-4 Technical Report [[Preprint]](https://cdn.openai.com/papers/gpt-4.pdf)

```yaml
Field: Language-Vision
Params: 1.7T
Architecture: De, MoE
```

- **Bard** [[Google]](https://blog.google/technology/ai/bard-google-ai-search-updates/) [close]

```yaml
Field: Language-Vision
```

- **LLaMA** [[Meta]]() Feb. 2023 [[open]](https://github.com/facebookresearch/llama)
LLaMA: Open and Efficient Foundation Language Models [[Preprint]](https://arxiv.org/pdf/2302.13971v1.pdf)

```yaml
Field: Language
Params: 65B, 33B, 13B, 7B
Training Data: 4TB (1.4T tokens)
Training Cost: 1,022,362 A100 GPU hours (2048 80G-A100 x 21 days)
Training Power Consumption: 449 MWh
Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc.
License: GPL
```

- **RWKV-4** [[Personal]]() Dec. 2022 [[open]](https://github.com/BlinkDL/RWKV-LM)

```yaml
Field: Language
Params: 14B, 7B, 3B, 1.5B
Training Data: 332B tokens
Architecture: De, RNN
License: Apache 2.0
```

- **AnthropicLM** [[Anthropic]]() Dec. 2022 [close]
Constitutional AI: Harmlessness from AI Feedback

```yaml
Field: Language
Params: 52B
```

- **BLOOM** [[BigScience]]() Nov. 2022 [[open]](https://huggingface.co/bigscience/bloom)
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [[Preprint]](https://arxiv.org/pdf/2211.05100.pdf)

```yaml
Field: Language
Params: 176B
Training Data: 174GB (336B tokens)
Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months
Training Power Consumption: 475 MWh
Training Framework: Megatron + Deepspeed
Instruction-tuned Variants: BLOOMZ
License: OpenRAIL-M v1
Context Length: 2048
```

- **Galactica** [[Meta]]() Nov. 2022 [[open]](https://huggingface.co/facebook/galactica-1.3b)
Galactica: A Large Language Model for Science (trained on over 48 million scientific texts) [[Preprint]](https://arxiv.org/pdf/2211.09085.pdf)

```yaml
Field: Language
Params: 125M, 1.3B, 6.7B, 30B, 120B
```

- **Pythia** [[EleutherAI]]() Oct. 2022 [[open]](https://github.com/EleutherAI/pythia)

```yaml
Field: Language
Params: 12B
Instruction-tuned Variants: Dolly 2.0
License: Apache 2.0
Context Length: 2048
```

- **GLM-130B** [[BAAI]](https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/) Oct. 2022 [[open]](https://github.com/THUDM/GLM-130B)
GLM-130B: An Open Bilingual Pre-trained Model [[ICLR'23]](https://arxiv.org/pdf/2210.02414.pdf)

```yaml
Field: Language
Params: 130B
Training Data: (400B tokens)
Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days
Training Framework: Megatron + Deepspeed
```

- **UL2** [[Google]]() May 2022 [[open]](https://huggingface.co/google/ul2)
Unifying Language Learning Paradigms [[Preprint]](https://arxiv.org/abs/2205.05131)

```yaml
Field: Language
Params: 20B
Training Data: 800GB (1T tokens)
Architecture: En-De
Training Framework: Jax + T5x
License: Apache 2.0
Instruction-tuned Variants: Flan-UL2
Context Length: 2048
```

- **OPT** [[Meta]](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) May 2022 [[open]](https://github.com/facebookresearch/metaseq)
OPT: Open Pre-trained Transformer Language Models [[Preprint]](https://arxiv.org/abs/2205.01068)

```yaml
Field: Language
Params: 175B
Training Data: 800GB (180B tokens)
Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days
Training Power Consumption: 356 MWh
Architecture: De
Training Framework: Megatron + Fairscale
```

- **PaLM** [[Google]](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) Apr. 2022 [close]
PaLM: Scaling Language Modeling with Pathways [[Preprint]](https://arxiv.org/abs/2204.02311)

```yaml
Field: Language
Params: 540B
Training Data: 3TB (780B tokens)
Training Cost: $10M (16,809,984 TPUv4 core-hours, 64 days)
Training petaFLOPs: 2.5B
Architecture: De
Training Framework: Jax + T5x
```

- **GPT-NeoX** [[EleutherAI]](https://blog.eleuther.ai/announcing-20b/) Apr. 2022 [[open]](https://github.com/EleutherAI/gpt-neox)
GPT-NeoX-20B: An Open-Source Autoregressive Language Model [[Preprint]](https://arxiv.org/abs/2204.06745)

```yaml
Field: Language
Params: 20B
Training Data: 525GiB
Training petaFLOPs: 93B
Architecture: De
Training Framework: Megatron + DeepSpeed
License: Apache 2.0
Context Length: 2048
```

- **InstructGPT** [[OpenAI]]() Mar. 2022 [close]
Training language models to follow instructions with human feedback [[Preprint]](https://arxiv.org/abs/2203.02155)

```yaml
Field: Language
Params: 175B
```

- **Chinchilla** [[DeepMind]](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training) Mar. 2022 [close]
Training Compute-Optimal Large Language Models [[Preprint]](https://arxiv.org/abs/2203.15556)

```yaml
Field: Language
Params: 70B
Training Data: 5.2TB (1.4T tokens)
Training petaFLOPs: 580M
Architecture: De
```

- **EVA 2.0** [[BAAI]](https://wudaoai.cn/model/detail/EVA) Mar. 2022 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [[Preprint]](https://arxiv.org/abs/2203.09313)

```yaml
Field: Language (Dialogue)
Params: 2.8B
Training Data: 180G (1.4B samples, Chinese)
```

- **AlphaCode** [[DeepMind]](https://www.deepmind.com/blog/competitive-programming-with-alphacode) Mar. 2022 [close]
Competition-Level Code Generation with AlphaCode [[Preprint]](https://arxiv.org/abs/2203.07814)

```yaml
Field: Code Generation
Params: 41B
Training Data: (967B tokens)
Architecture: De
```

- **ST-MoE** [[Google]]() Feb. 2022 [close]
ST-MoE: Designing Stable and Transferable Sparse Expert Models [[Preprint]](https://arxiv.org/abs/2202.08906)

```yaml
Field: Language
Params: 296B
Architecture: En-De, MoE
```

- **LaMDA** [[Google]](https://arxiv.org/abs/2201.08239) Jan. 2022 [close]
LaMDA: Language Models for Dialog Applications [[Preprint]](https://arxiv.org/abs/2201.08239)

```yaml
Field: Language (Dialogue)
Params: 137B
Training Data: (1.56T words)
Training petaFLOPs: 360M
Architecture: De
```

- **GLaM** [[Google]](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) Dec. 2021 [close]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2112.06905)

```yaml
Field: Language
Params: 1.2T
Architecture: De, MoE
```

- **Gopher** [[DeepMind]](https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval) Dec. 2021 [close]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher [[Preprint]](https://arxiv.org/abs/2112.11446)

```yaml
Field: Language
Params: 280B
Training Data: 1.3TB (300B tokens)
Training petaFLOPs: 630M
Architecture: De
```

- **Yuan 1.0** [[Inspur]](https://air.inspur.com/home) Oct. 2021 [close]
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2110.04725)

```yaml
Field: Language
Params: 245B
Training Data: 5TB (180B tokens, Chinese)
Training petaFLOPs: 410M
Architecture: De, MoE
```

- **MT-NLG** [[Microsoft, Nvidia]](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) Oct. 2021 [close]
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [[Preprint]](https://arxiv.org/abs/2201.11990)

```yaml
Field: Language
Params: 530B
Training Data: 339B tokens
Training petaFLOPs: 1.4B
Architecture: De
```

- **Plato-XL** [[Baidu]](http://research.baidu.com/Blog/index-view?id=163) Sept. 2021 [close]
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [[Preprint]](https://arxiv.org/abs/2109.09519)

```yaml
Field: Language (Dialogue)
Params: 11B
Training Data: (1.2B samples)
```

- **GPT-J** [[EleutherAI]](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) Aug. 2021 [[open]](https://github.com/kingoflolz/mesh-transformer-jax)

```yaml
Field: Language
Params: 6B
Programming Language: Jax
```

- **Jurassic-1** [[AI21 Labs]](https://www.zdnet.com/article/watch-out-gpt-3-here-comes-ai21s-jurassic-language-model/) Aug. 2021 [close]
Jurassic-1: Technical Details and Evaluation [[Preprint]](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)

```yaml
Field: Language
Params: 178B
Training petaFLOPs: 370M
Architecture: De
```

- **Codex** [[OpenAI]](https://openai.com/blog/openai-codex/) July 2021 [close]
Evaluating Large Language Models Trained on Code [[Preprint]](https://arxiv.org/abs/2107.03374)

```yaml
Field: Code Generation
Params: 12B
Training Data: 159GB
Architecture: De
```

- **ERNIE 3.0** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie) July 2021 [close]
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [[Preprint]](https://arxiv.org/abs/2107.02137)

```yaml
Field: Language
Params: 10B
Training Data: 4TB (375B tokens, with knowledge graph)
Architecture: En
Objective: MLM
```

- **CPM-2** [[BAAI]]() June 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
CPM-2: Large-scale Cost-effective Pre-trained Language Models [[Preprint]](https://arxiv.org/abs/2106.10715)

```yaml
Field: Language
Params: 198B
Training Data: 2.6TB (Chinese 2.3TB, English 300GB)
Architecture: En-De
Objective: MLM
```

- **HyperClova** [[Naver]](https://www.navercorp.com/promotion/pressReleasesView/30546) May 2021 [close]
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [[Preprint]](https://arxiv.org/abs/2109.04650v1)

```yaml
Field: Language
Params: 82B
Training Data: 562B tokens (Korean)
Training petaFLOPs: 63B
Architecture: De
```

- **ByT5** [[Google]]() May 2021 [[open]](https://github.com/google-research/byt5)
ByT5: Towards a token-free future with pre-trained byte-to-byte models [[TACL'22]](https://arxiv.org/abs/2105.13626)

```yaml
Field: Language
Params: 13B
Training Data: (101 languages)
Architecture: En-De
```

- **PanGu-α** [[Huawei]]() Apr. 2021 [close]
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [[Preprint]](https://arxiv.org/abs/2104.12369)

```yaml
Field: Language
Params: 200B
Training Data: 1.1TB (Chinese)
Training petaFLOPs: 58M
Architecture: De
```

- **mT5** [[Google]]() Mar. 2021 [[open]](https://github.com/google-research/multilingual-t5)
mT5: A massively multilingual pre-trained text-to-text transformer [[Preprint]](https://arxiv.org/abs/2010.11934)

```yaml
Field: Language
Params: 13B
Training Data: (101 languages)
Architecture: En-De
```

- **WuDao-WenHui** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/Transformer-XL)

```yaml
Field: Language
Params: 2.9B
Training Data: 303GB (Chinese)
```

- **GLM** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/GLM)
GLM: General Language Model Pretraining with Autoregressive Blank Infilling [[Preprint]](https://arxiv.org/abs/2103.10360)

```yaml
Field: Language
Params: 10B
Architecture: De
```

- **Switch Transformer** [[Google]]() Jan. 2021 [[open]](https://github.com/google-research/t5x)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [[Preprint]](https://arxiv.org/abs/2101.03961)

```yaml
Field: Language
Params: 1.6T
Training Data: 750GB
Training petaFLOPs: 82M
Architecture: En-De, MoE
Objective: MLM
```

- **CPM** [[BAAI]]() Dec. 2020 [[open]](https://github.com/TsinghuaAI/CPM)
CPM: A Large-scale Generative Chinese Pre-trained Language Model [[Preprint]](https://arxiv.org/abs/2012.00413)

```yaml
Field: Language
Params: 2.6B
Training Data: 100G (Chinese)
Training petaFLOPs: 1.8M
Architecture: De
Objective: LTR
```

- **GPT-3** [[OpenAI]](https://openai.com/api/) May 2020 [close]
Language Models are Few-Shot Learners [[NeurIPS'20]](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)

```yaml
Field: Language
Params: 175B
Training Data: 45TB (680B Tokens)
Training Time: 95 A100 GPU years (835584 A100 GPU hours, 355 V100 GPU years)
Training Cost: $4.6M
Training petaFLOPs: 310M
Architecture: De
Objective: LTR
Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
```

- **Blender** [[Meta]](https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/) Apr. 2020 [[close]](https://huggingface.co/facebook/blenderbot-90M?text=Hey+my+name+is+Thomas%21+How+are+you%3F)
Recipes for building an open-domain chatbot [[Preprint]](https://arxiv.org/abs/2004.13637)

```yaml
Field: Language (Dialogue)
Params: 9.4B
```

- **T-NLG** [[Microsoft]](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) Feb. 2020 [close]

```yaml
Field: Language
Params: 17B
Training petaFLOPs: 16M
Architecture: De
Objective: LTR
```

- **Meena** [[Google]](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html) Jan. 2020 [close]
Towards a Human-like Open-Domain Chatbot [[Preprint]](https://arxiv.org/abs/2001.09977)

```yaml
Field: Language (Dialogue)
Params: 2.6B
Training Data: 341GB (40B words)
Training petaFLOPs: 110M
```

- **DialoGPT** [[Microsoft]](https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/) Nov. 2019 [[open]](https://github.com/microsoft/DialoGPT)
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [[ACL'20]](https://arxiv.org/abs/1911.00536)

```yaml
Field: Language (Dialogue)
Params: 762M
Training Data: (147M conversations)
Architecture: De
```

- **T5** [[Google]](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) Oct. 2019 [[open]](https://github.com/google-research/text-to-text-transfer-transformer)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [[JMLR'19]](https://arxiv.org/abs/1910.10683)

```yaml
Field: Language
Params: 11B
Training Data: 800GB
Training Cost: $1.5M
Training petaFLOPs: 41M
Architecture: En-De
Objective: MLM
License: Apache 2.0
Instruction-tuned Variants: Flan-T5
Context Length: 512
```

- **Megatron-LM** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)

```yaml
Field: Language
Params: 8.3B
Training Data: 174GB
Training petaFLOPs: 9.1M
Architecture: De
Objective: LTR
Training Framework: Megatron
```

- **Megatron-BERT** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)

```yaml
Field: Language
Params: 3.9B
Training Data: 174GB
Training petaFLOPs: 57M
Architecture: En
Objective: MLM
Training Framework: Megatron
```

- **RoBERTa** [[Meta]](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) July 2019 [[open]](https://github.com/facebookresearch/fairseq)
RoBERTa: A Robustly Optimized BERT Pretraining Approach [[Preprint]](https://arxiv.org/abs/1907.11692)

```yaml
Field: Language
Params: 354M
Training Data: 160GB
Training Time: 1024 V100 GPU days
Architecture: En
Objective: MLM
```

- **XLNet** [[Google]]() June 2019 [[open]](https://github.com/zihangdai/xlnet)
XLNet: Generalized Autoregressive Pretraining for Language Understanding [[NeurIPS'19]](https://papers.nips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html)

```yaml
Field: Language
Params: 340M
Training Data: 113GB (33B words)
Training Time: 1280 TPUv3 days
Training Cost: $245k
Architecture: En
Objective: PLM
```

- **GPT-2** [[OpenAI]](https://openai.com/blog/better-language-models/) Feb. 2019 [[open]](https://github.com/openai/gpt-2)
Language Models are Unsupervised Multitask Learners [[Preprint]](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

```yaml
Field: Language
Params: 1.5B
Training Data: 40GB (8M web pages)
Training Cost: $43k
Training petaFLOPs: 1.5M
Architecture: De
Objective: LTR
```

- **BERT** [[Google]]() Oct. 2018 [[open]](https://github.com/google-research/bert)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [[NAACL'19]](https://arxiv.org/abs/1810.04805)

```yaml
Field: Language
Params: 330M
Training Data: 16GB (3.3B words)
Training Time: 64 TPUv2 days (280 V100 GPU days)
Training Cost: $7k
Training petaFLOPs: 290k
Architecture: En
Objective: MLM, NSP
```

- **GPT** [[OpenAI]](https://openai.com/blog/language-unsupervised/) June 2018 [open]
Improving Language Understanding by Generative Pre-Training [[Preprint]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

```yaml
Field: Language
Params: 117M
Training Data: 1GB (7k books)
Training petaFLOPs: 18k
Architecture: De
Objective: LTR
```

### Vision Models

- **Eva02-E** [[BAAI]]() Mar. 2023 [[open]](https://github.com/huggingface/pytorch-image-models/tree/main)
EVA-02: A Visual Representation for Neon Genesis [[Preprint]](https://arxiv.org/abs/2303.11331v2)

```yaml
Field: Vision-Language
Params: 5B
Training Data: 2B image-text pairs
Architecture: Transformer
Objective: MIM, CLIP Contrastive
```

- **MAE->WSP-2B** [[Meta]]() Mar. 2023 [close]
The effectiveness of MAE pre-pretraining for billion-scale pretraining [[Preprint]](https://arxiv.org/abs/2303.13496)

```yaml
Field: Vision
Params: 6.5B
Training Data: 3B images
Architecture: Transformer
Objective: MAE, Weakly-Supervised
```

- **OpenCLIP G/14** [[LAION]]() Mar. 2023 [[open]](https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K)

```yaml
Field: Vision-Language
Params: 2.5B
Training Data: 2B images
```

- **ViT-22B** [[Google]]() Feb. 2023 [close]
Scaling Vision Transformers to 22 Billion Parameters [[Preprint]](https://arxiv.org/abs/2302.05442)

```yaml
Field: Vision
Params: 22B
Training Data: 4B images
Architecture: Transformer
Objective: Supervised
```

- **InternImage-G** [[Shanghai AI Lab]](https://github.com/OpenGVLab/InternImage) Nov. 2022 [[open]](https://github.com/OpenGVLab/InternImage)
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [[CVPR'23 Highlight]](https://arxiv.org/abs/2211.05778)

```yaml
Field: Vision
Params: 3B
Architecture: CNN
Core Operator: Deformable Convolution v3
```

- **Stable Diffusion** [[Stability AI]]() Aug. 2022 [[open]](https://github.com/CompVis/stable-diffusion)

```yaml
Field: Image Generation (text to image)
Params: 890M
Training Data: 5B images
Architecture: Latent Diffusion (U-Net denoiser, CLIP text encoder)
```

- **Imagen** [[Google]](https://imagen.research.google/) May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [[Preprint]](https://arxiv.org/abs/2205.11487)

```yaml
Field: Image Generation (text to image)
Text Encoder: T5
Image Decoder: Diffusion, Upsampler
```

- **Flamingo** [[DeepMind]]() Apr. 2022 [close]
Flamingo: a Visual Language Model for Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2204.14198)

```yaml
Field: Vision-Language
Params: 80B
```

- **DALL·E 2** [[OpenAI]](https://openai.com/dall-e-2/) Apr. 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents [[Preprint]](https://cdn.openai.com/papers/dall-e-2.pdf)

```yaml
Field: Image Generation (text to image)
Text Encoder: GPT2 (CLIP)
Image Encoder: ViT (CLIP)
Image Decoder: Diffusion, Upsampler
```

- **BaGuaLu** [[BAAI, Alibaba]]() Apr. 2022
BaGuaLu: targeting brain scale pretrained models with over 37 million cores [[PPoPP'22]](https://keg.cs.tsinghua.edu.cn/jietang/publications/PPOPP22-Ma%20et%20al.-BaGuaLu%20Targeting%20Brain%20Scale%20Pretrained%20Models%20w.pdf)

```yaml
Field: Vision-Language
Params: 174T
Architecture: M6
```

- **SEER** [[Meta]]() Feb. 2022 [[open]](https://github.com/facebookresearch/vissl)
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [[Preprint]](https://arxiv.org/abs/2202.08360v2)

```yaml
Field: Vision
Params: 10B
Training Data: 1B images
Architecture: Convolution
Objective: SwAV
```

- **ERNIE-ViLG** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie-vilg) Dec. 2021
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [[Preprint]](https://arxiv.org/abs/2112.15283)

```yaml
Field: Image Generation (text to image)
Params: 10B
Training Data: 145M text-image pairs
Architecture: Transformer, dVAE + De
```

- **NUWA** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/NUWA)
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [[Preprint]](https://arxiv.org/abs/2111.12417)

```yaml
Field: Vision-Language
Generation: Image, Video
Params: 870M
```

- **SwinV2-G** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/Swin-Transformer)
Swin Transformer V2: Scaling Up Capacity and Resolution [[CVPR'22]](https://arxiv.org/abs/2111.09883v2)

```yaml
Field: Vision
Params: 3B
Training Data: 70M images
Architecture: Transformer
Objective: Supervised
```

- **Zidongtaichu** [[CASIA]](http://www.ia.cas.cn/xwzx/kydt/202109/t20210927_6215538.html) Sept. 2021 [close]

```yaml
Field: Image, Video, Language, Speech
Params: 100B
```

- **ViT-G/14** [[Google]]() June 2021
Scaling Vision Transformers [[Preprint]](https://arxiv.org/abs/2106.04560)

```yaml
Field: Vision
Params: 1.8B
Training Data: 300M images
Training petaFLOPs: 3.4M
Architecture: Transformer
Objective: Supervised
```

- **CoAtNet** [[Google]](https://ai.googleblog.com/2021/09/toward-fast-and-accurate-neural.html) June 2021 [[open]](https://github.com/chinhsuanwu/coatnet-pytorch)
CoAtNet: Marrying Convolution and Attention for All Data Sizes [[NeurIPS'21]](https://arxiv.org/abs/2106.04803)

```yaml
Field: Vision
Params: 2.4B
Training Data: 300M images
Architecture: Transformer, Convolution
Objective: Supervised
```

- **V-MoE** [[Google]](https://ai.googleblog.com/2022/01/scaling-vision-with-sparse-mixture-of.html) June 2021
Scaling Vision with Sparse Mixture of Experts [[NeurIPS'21]](https://proceedings.neurips.cc//paper/2021/file/48237d9f2dea8c74c2a72126cf63d933-Paper.pdf)

```yaml
Field: Vision
Params: 15B
Training Data: 300M images
Training Time: 16.8k TPUv3 days
Training petaFLOPs: 33.9M
Architecture: Transformer, MoE
Objective: Supervised
```

- **CogView** [[BAAI, Alibaba]](https://wudao.aminer.cn/CogView/index.html) May 2021 [[open]](https://github.com/THUDM/CogView)
CogView: Mastering Text-to-Image Generation via Transformers [[NeurIPS'21]](https://arxiv.org/abs/2105.13290)

```yaml
Field: Vision-Language
Params: 4B
Training Data: 30M text-image pairs
Training petaFLOPs: 27M
Image Encoder: VAE
Text Encoder & Image Decoder: GPT2
```

- **M6** [[Alibaba]](https://m6.aliyun.com/#/) Mar. 2021
M6: A Chinese Multimodal Pretrainer [[Preprint]](https://arxiv.org/abs/2103.00823)

```yaml
Field: Vision-Language
Params: 10T
Training Data: 300G Texts + 2TB Images
Training petaFLOPs: 5.5M
Fusion: Single-stream
Objective: MLM, IC
```

- **DALL·E** [[OpenAI]](https://openai.com/blog/dall-e/) Feb. 2021
Zero-Shot Text-to-Image Generation [[ICML'21]](https://arxiv.org/abs/2102.12092)

```yaml
Field: Image Generation (text to image)
Params: 12B
Training Data: 250M text-image pairs
Training petaFLOPs: 47M
Image Encoder: dVAE
Text Encoder & Image Decoder: GPT2
```

- **CLIP** [[OpenAI]](https://openai.com/blog/clip/) Jan. 2021
Learning Transferable Visual Models From Natural Language Supervision [[ICML'21]](https://arxiv.org/abs/2103.00020)

```yaml
Field: Vision-Language
Training Data: 400M text-image pairs
Training petaFLOPs: 11M
Image Encoder: ViT
Text Encoder: GPT-2
Fusion: Dual Encoder
Objective: CMCL
```

- **ViT-H/14** [[Google]](https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html) Oct. 2020 [[open]](https://github.com/google-research/vision_transformer)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[ICLR'21]](https://arxiv.org/abs/2010.11929)

```yaml
Field: Vision
Params: 632M
Training Data: 300M images
Training petaFLOPs: 13M
Architecture: Transformer
Objective: Supervised
```

- **iGPT-XL** [[OpenAI]](https://openai.com/blog/image-gpt/) June 2020 [[open]](https://github.com/openai/image-gpt)
Generative Pretraining From Pixels [[ICML'20]](https://proceedings.mlr.press/v119/chen20s.html)

```yaml
Field: Image Generation
Params: 6.8B
Training Data: 1M images
Training petaFLOPs: 33M
Architecture: Transformer, De
```

- **BigGAN-deep** [[DeepMind]]() Sept. 2018 [[open]](https://github.com/ajbrock/BigGAN-PyTorch)
Large Scale GAN Training for High Fidelity Natural Image Synthesis [[ICLR'19]](https://arxiv.org/abs/1809.11096)

```yaml
Field: Image Generation
Params: 158M
Training Data: 300M images
Training petaFLOPs: 3M
Architecture: Convolution, GAN
Resolution: 512x512
```

### Reinforcement Learning

- **PaLM-E** [[Google]](https://palm-e.github.io/) March 2023 [close]
PaLM-E: An Embodied Multimodal Language Model [[Preprint]](https://palm-e.github.io/assets/palm-e.pdf)

```yaml
Field: Reinforcement Learning
Params: 562B (540B LLM + 22B ViT)
```

- **Gato** [[DeepMind]](https://www.deepmind.com/publications/a-generalist-agent) May 2022 [close]
A Generalist Agent [[Preprint]](https://arxiv.org/abs/2205.06175)

```yaml
Field: Reinforcement Learning
Params: 1.2B
Training Data: (604 Tasks)
Objective: Supervised
```

### Speech

- **USM** [[Google]](https://sites.research.google/usm/) Mar. 2023 [close]
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [[Preprint]](https://arxiv.org/pdf/2303.01037v2.pdf)

```yaml
Field: Speech
Params: 2B
Training Data: 12,000,000 hours
```

- **Whisper** [[OpenAI]](https://openai.com/research/whisper) Sept. 2022 [[open]](https://github.com/openai/whisper)
Robust Speech Recognition via Large-Scale Weak Supervision [[Preprint]](https://arxiv.org/pdf/2212.04356.pdf)

```yaml
Field: Speech
Params: 1.55B
Training Data: 680,000 hours
Objective: Weakly Supervised
```

- **HuBERT** [[Meta]](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) June 2021 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert)
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [[Preprint]](https://arxiv.org/abs/2106.07447)

```yaml
Field: Speech
Params: 1B
Training Data: 60,000 hours
Objective: MLM
```

- **wav2vec 2.0** [[Meta]]() Oct. 2020 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec)
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [[NeurIPS'20]](https://arxiv.org/abs/2006.11477)

```yaml
Field: Speech
Params: 317M
Training Data: 50,000 hours
Training petaFLOPs: 430M
Objective: MLM
```

- **DeepSpeech 2** [[Baidu]]() Dec. 2015 [[open]](https://github.com/PaddlePaddle/PaddleSpeech)
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [[ICML'16]](https://arxiv.org/pdf/1512.02595.pdf)

```yaml
Field: Speech
Params: 300M
Training Data: 21,340 hours
```

### Science

- **AlphaFold 2** [[DeepMind]](https://www.deepmind.com/research/highlighted-research/alphafold) July 2021 [[open]](https://github.com/deepmind/alphafold)
Highly accurate protein structure prediction with AlphaFold [[Nature]](https://www.nature.com/articles/s41586-021-03819-2)

```yaml
Field: Biology
Params: 21B
Training petaFLOPs: 100k
```

## Open LLM Training Dataset

This section will be reorganized. For now, since LLMs prevail and data quality is key to their performance, we track open pretraining datasets here; a minimal streaming-load sketch follows the list.

- [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B): 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
- [MNBVC](https://github.com/esbatmop/MNBVC): 5TB (on-going, target 40TB), Chinese, MIT License
- [The Pile](https://pile.eleuther.ai/): 825GB, English
- [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T): 1.2T tokens

## Distributed Training Framework

> Deep Learning frameworks supporting distributed training are marked with \*.

### PyTorch Ecosystem

- **Accelerate** [[Huggingface]]() Oct. 2020 [[open]](https://github.com/huggingface/accelerate) (minimal usage sketch after this list)
- **Hivemind** Aug. 2020 [[open]](https://github.com/learning-at-home/hivemind)
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2002.04013)
- **FairScale** [[Meta]]() July 2020 [[open]](https://github.com/facebookresearch/fairscale)
- **DeepSpeed** [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [[SC'20]](https://arxiv.org/abs/1910.02054)
- **Megatron** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
- **PyTorch\*** [[Meta]](https://pytorch.org/) Sept. 2016 [[open]](https://github.com/pytorch/pytorch)
PyTorch: An Imperative Style, High-Performance Deep Learning Library [[NeurIPS'19]](http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)

### XLA Ecosystem

- **T5x** [[Google]]() Mar. 2022 [[open]](https://github.com/google-research/t5x)
Scaling Up Models and Data with 𝚝𝟻𝚡 and 𝚜𝚎𝚚𝚒𝚘 [[Preprint]](https://arxiv.org/abs/2203.17189)
- **Alpa** [[Google]]() Jan. 2022 [[open]](https://github.com/alpa-projects/alpa)
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [[OSDI'22]](https://arxiv.org/pdf/2201.12023.pdf)
- **Pathways** [[Google]](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/) Mar. 2021 [close]
Pathways: Asynchronous Distributed Dataflow for ML [[Preprint]](https://arxiv.org/abs/2203.12533)
- **Colossal-AI** [[HPC-AI TECH]](https://colossalai.org/) Nov. 2021 [[open]](https://github.com/hpcaitech/ColossalAI)
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [[Preprint]](https://arxiv.org/abs/2110.14883)
- **GShard** [[Google]](https://arxiv.org/abs/2006.16668) June 2020
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [[Preprint]](https://arxiv.org/abs/2006.16668)
- **Jax\*** [[Google]]() Oct 2019 [[open]](https://github.com/google/jax)
- **Mesh Tensorflow** [[Google]]() Nov. 2018 [[open]](https://github.com/tensorflow/mesh)
- **Horovod** [[Uber]](https://horovod.ai/) Feb. 2018 [[open]](https://github.com/horovod/horovod)
Horovod: fast and easy distributed deep learning in TensorFlow [[Preprint]](https://arxiv.org/abs/1802.05799)
- **Tensorflow\*** [[Google]](https://www.tensorflow.org/) Nov. 2015 [[open]](https://github.com/tensorflow/tensorflow)
TensorFlow: A system for large-scale machine learning [[OSDI'16]](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)

### Other Frameworks

- **OneFlow\*** [[OneFlow]](https://docs.oneflow.org/master/index.html) July 2020 [[open]](https://github.com/OneFlow-Inc/oneflow)
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [[Preprint]](https://arxiv.org/abs/2110.15032)
- **MindSpore\*** [[Huawei]](https://e.huawei.com/en/products/cloud-computing-dc/atlas/mindspore) Mar. 2020 [[open]](https://github.com/mindspore-ai/mindspore)
- **PaddlePaddle\*** [[Baidu]](https://www.paddlepaddle.org.cn/) Nov. 2018 [[open]](https://github.com/PaddlePaddle/Paddle)
End-to-end Adaptive Distributed Training on PaddlePaddle [[Preprint]](https://arxiv.org/abs/2112.02752)
- **Ray** [[Berkeley]]() Dec. 2017 [[open]](https://github.com/ray-project/ray)
Ray: A Distributed Framework for Emerging AI Applications [[OSDI'17]](https://arxiv.org/pdf/1712.05889.pdf)

### Inference Frameworks

- Petals [[BigScience]]() Dec. 2022 [[open]](https://github.com/bigscience-workshop/petals)
- FlexGen [[Stanford, Berkeley, CMU, etc.]]() May 2022 [[open]](https://github.com/FMInference/FlexGen)
- FasterTransformer [[NVIDIA]]() Apr. 2021 [[open]](https://github.com/NVIDIA/FasterTransformer)
- MegEngine [[MegEngine]](https://www.megengine.org.cn/) Mar. 2020
- DeepSpeed-Inference [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
- MediaPipe [[Google]](https://google.github.io/mediapipe/) July 2019 [[open]](https://github.com/google/mediapipe)
- TensorRT [[Nvidia]]() Jun 2019 [[open]](https://github.com/NVIDIA/TensorRT)
- MNN [[Alibaba]]() May 2019 [[open]](https://github.com/alibaba/MNN)
- OpenVINO [[Intel]](https://docs.openvino.ai/latest/index.html) Oct. 2019 [[open]](https://github.com/openvinotoolkit/openvino)
- ONNX [[Linux Foundation]](https://onnx.ai/) Sept. 2017 [[open]](https://github.com/onnx/onnx) (see the export sketch after this list)
- ncnn [[Tencent]]() July 2017 [[open]](https://github.com/Tencent/ncnn)

### Recommendation Training Framework

- **HET** [[Tencent]]() Dec. 2021
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [[VLDB'22]](https://arxiv.org/abs/2112.07221)
- **Persia** [[Kuaishou]]() Nov. 2021
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [[Preprint]](https://arxiv.org/abs/2111.05897)

```yaml
Embeddings Params: 100T
```

- **ZionEX** [[Meta]]() Apr. 2021
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [[ISCA'21]](https://arxiv.org/abs/2104.05158)

```yaml
Embeddings Params: 10T
```

- **ScaleFreeCTR** [[Huawei]]() Apr. 2021
ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [[SIGIR'21]](https://arxiv.org/abs/2104.08542)
- **Kraken** [[Kuaishou]]() Nov. 2020
Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [[SC'20]](http://storage.cs.tsinghua.edu.cn/papers/sc20-kraken.pdf/)
- **TensorNet** [[Qihoo360]]() Sept. 2020 [[open]](https://github.com/Qihoo360/tensornet)
- **HierPS** [[Baidu]]() Mar. 2020
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [[MLSys'20]](https://arxiv.org/abs/2003.05622)
- **AIBox** [[Baidu]]() Oct. 2019
AIBox: CTR Prediction Model Training on a Single Node [[CIKM'20]](https://dl.acm.org/doi/pdf/10.1145/3357384.3358045)

```yaml
Embeddings Params: 0.1T
```

- **XDL** [[Alibaba]]() Aug. 2019
XDL: an industrial deep learning framework for high-dimensional sparse data [[DLP-KDD'21]](https://dlp-kdd.github.io/dlp-kdd2019/assets/pdf/a6-jiang.pdf)

```yaml
Embeddings Params: 0.01T
```

## Keys Explanations

- Company tags: the primary company or lab behind the work. Other institutes may also be involved.
- Params: number of parameters of the largest model
- Training data size, training cost, and training petaFLOPs are estimates and carry some uncertainty.
- Training cost reference rates (see the cost sketch at the end of this section):
  - TPUv2 hour: $4.5
  - TPUv3 hour: $8
  - V100 GPU hour: $0.55 (2022)
  - A100 GPU hour: $1.10 (2022)
- Architecture
  - En: Encoder-based Language Model
  - De: Decoder-based Language Model
  - En-De: Encoder-Decoder-based Language Model
  - The three architectures above are all Transformer-based.
  - MoE: Mixture of Experts
- Objective (see the explanation in sections 6–8 of [this paper](https://arxiv.org/pdf/2203.14101v3.pdf))
  - MLM: Masked Language Modeling
  - LTR: Left-To-Right Language Modeling
  - NSP: Next Sentence Prediction
  - PLM: Permuted Language Modeling
  - IC: Image Captioning
  - VLM: Vision-Language Matching
  - CMCL: Cross-Modal Contrastive Learning
- FLOPs: number of floating-point operations [[explanation]](https://openai.com/blog/ai-and-compute/)
  - 1 petaFLOPs = 1e15 FLOPs