---
author: Adam G
title: MoE Paper Experimental Setups
---

# Mixture of Experts (MoE) Paper Experimental Setups

This repository serves as a collection of notable papers' experimental
setups. Note that some entries may be incomplete or erroneous for certain
metrics; if so, feel free to raise an issue and I will amend it as soon
as possible.

Major tasks examined across these papers:

1. Machine Translation (MT), evaluated mainly on datasets such as WMT
(English to French) using BLEU scores (a minimal scoring sketch follows this list)
2. Masked Language Modelling (MLM)
3. Language Modelling (LM)
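
For reference, the BLEU evaluation used by the MT papers can be reproduced with the `sacrebleu` library; the snippet below is a minimal sketch with made-up sentences, not an evaluation taken from any of the papers.

```python
# Minimal BLEU sketch using sacrebleu (pip install sacrebleu).
# The hypothesis and reference sentences are invented purely for illustration.
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
# One inner list per reference stream: references[0][i] pairs with hypotheses[i].
references = [["the cat is sitting on the mat", "a dog is in the garden"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```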

## Model Sizes of Paper Implementations

| Paper | Year | Expert Size | Total Size | Num Exp (per layer) | Num Layers |
|-------|------|-------------|------------|---------------------|------------|
| [Megablocks](https://arxiv.org/abs/2211.15841) | 11/2022 | N/A | 839M-13B | 64 | 3/6/12 |
| [Deepspeed-MoE](https://arxiv.org/abs/2201.05596) | 01/2022 | 1.3/2.4/8/24/47B | 52/107/349/1064.9/2024B | 128 | 24/16/30/40/58 |
| [Expert Choice Routing](https://arxiv.org/abs/2202.09368) | 02/2022 | 0.145/9.8B | 1.9/143B | 64 | 16 |
| [Task-Level MoE](https://arxiv.org/abs/2110.03742) | 09/2022 | 4096 FFN Size | 533M/13B | 32/128 | 11 |
| [Hash Layers (vs Switch)](https://arxiv.org/abs/2106.04426) | 06/2021 | 4096 FFN Size | 751M/852M/1.28B | 64/16/128 | 1/5/1 |
| [Hash Layers (vs BASE)](https://arxiv.org/abs/2106.04426) | 06/2021 | 100M/33M | 4.5B | 32/3x32 | 1/3 |
| [GShard](https://arxiv.org/abs/2006.16668) | 06/2020 | 8192 FFN Size | 37/150/600B | 128/512/2048 | 12/36 (for each num exp) |
| [FasterMoE](https://dl.acm.org/doi/pdf/10.1145/3503221.3508418) | 03/2022 | 1024/2048/4096 FFN Size | 13.1/13.7/27.4B | 64/16/16 | 12/12/24 |
| [ST-MoE](https://arxiv.org/abs/2202.08906) | 02/2022 | 2816/20480 | 4.1/269B | 32/64 | 6/6 (every 4) |
| [Random Routing](https://openreview.net/pdf?id=w1hwFUb_81) | 09/2022 | | 20M-200M | 8/16 | 4/12 |
| [Gating Dropout](https://arxiv.org/abs/2205.14336) | 05/2022 | | 5.6/10B | 128/64 | 12/24 |
| [BASE Layers](https://arxiv.org/abs/2103.16716) | 03/2021 | 135/335/911M | 1.5/44/117B | 128? | 1 (BASE Layer) |
| [Switch Transformer](https://arxiv.org/abs/2101.03961) | 01/2021 | 768/1024/4096 FFN Size | 7/26/395/1571B | 128/128/64/2048 | 12/24/24/15 (Every other) |
| [Evo MoE](https://arxiv.org/abs/2112.14397) | 12/2021 | 335M(MT/MLM/LM) | 1.5(MT)/1.8(MLM LM) | 4(MT)/16(MLM LM) | 6(MT)/12(MLM LM) |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (LM) | 04/2022 | 3072/4096 FFN Size | 454M/3.22B | 32/64 | 1/1 |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (MT) | 04/2022 | 2048 FFN Size | 480M | 32 | 2 |
| [Outrageously Large MoEs](https://arxiv.org/abs/1701.06538v1) (LM) | 01/2017 | 1M(dims=1024x512) | 0.8/0.9/1.1/1.1/1.9/5.1 | 4/32/256/256/1024/4096 | 1 |
| Outrageously Large MoEs (LM-Large) | 01/2017 | 1M | 0.1/0.4/1.2/4.4/17.3/68.9/137.7 | 32 & 256/1024/4096/16384/65536/131072-h | 1 |
| Outrageously Large MoEs (MT) | 01/2017 | 2M | 8.7B | 32 & 512/2048-h | 2 (one between stacked encoder and decoder) |
| Outrageously Large MoEs (MTMT) | 01/2017 | 8192 FFN Size | 8.7B | 512 | 2 |
| [NLLB](https://arxiv.org/abs/2207.04672) | 07/2022 | 8192 FFN Size/33.6M | 54.5B/51.6B Expert Size | 128 | 6 Exp Layers |
| [Memory Efficient NLLB](https://arxiv.org/abs/2212.09811) | 12/2022 | 8192 FFN Size/33.6M | ~10.32B assuming 80% pruning | ~24 per layer, 288 overall | 6 Exp Layers |
| [GLaM](https://arxiv.org/abs/2112.06905) | 12/2021 | 8192 & 16384 & 32768 FFN Size | 20/27/53 & 105/143B & 1.2T | 32/64/128 & 256/64 & 64 | 24 & 32 & 64 (every other layer) |
| Amazon SageMaker | | | | | |
| [M6-T Sparse Experts](https://arxiv.org/abs/2105.15082) | 05/2021 | 1024x4096 & 1024x21248 | 1.4 & 10.8 & 103.2 & 1002.7B | 32 & 128 & 512 & 960 (Total) | 5 & 10 & 24 & 24 |

\* = values that are unconfirmed or inferred from the papers' experiments.
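
To make the size columns concrete, here is a rough parameter-count sketch (my own illustrative calculation, not taken from any paper): an FFN expert's parameter count scales with the model dimension and FFN hidden size, and the expert total scales with the number of experts and MoE layers.

```python
# Rough expert-parameter sketch (illustrative assumptions, not paper-exact).
# An FFN expert with model dimension d_model and hidden size d_ff has two
# projection matrices (d_model -> d_ff and d_ff -> d_model), biases ignored.

def expert_params(d_model: int, d_ff: int) -> int:
    return 2 * d_model * d_ff

def total_expert_params(d_model: int, d_ff: int, n_experts: int, n_moe_layers: int) -> int:
    # Only expert weights are counted; the shared dense parameters are excluded.
    return expert_params(d_model, d_ff) * n_experts * n_moe_layers

# Example: d_ff = 8192 with an assumed d_model = 2048 gives ~33.6M per expert,
# in the same ballpark as the NLLB expert size listed above.
print(expert_params(2048, 8192) / 1e6, "M per expert")
print(total_expert_params(2048, 8192, 128, 6) / 1e9, "B expert parameters")
```

The totals in the table also include each model's shared dense parameters, so they do not reduce to this product alone.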

## Experimental Setups of Baselines and Hardware

For hardware requirements, slashes denote different configurations.

| Paper | Baseline | Hardware Requirements | Memory | Top-K | Capacity |
|-------|----------|-----------------------|--------|-------|----------|
| [Megablocks](https://arxiv.org/abs/2211.15841) | Transformer-Base to GPT3-XL (46M to 1.3B) | 8x A100 80GB | | 1 | 1/1.5/2x |
| [Deepspeed-MoE](https://arxiv.org/abs/2201.05596) | Scalable MoE | 128x A100 80GB | | 2\* | 2 |
| [Expert Choice Routing](https://arxiv.org/abs/2202.09368) | GShard | 512x TPU V4 | | N/A\* | 2\* |
| [Task-Level MoE](https://arxiv.org/abs/2110.03742) | Transformer Base (142M)/Token/Sentence MoE | 32x TPU V3 | | 1 | |
| [Hash Layers (vs Switch)](https://arxiv.org/abs/2106.04426) | Transformer-Base(225/755M)/Switch Transformer | 8 32GB V100 | | \*1 | |
| [Hash Layers (vs BASE)](https://arxiv.org/abs/2106.04426) | BASE Layers | 16 32GB V100 | | \*1 | |
| [GShard](https://arxiv.org/abs/2006.16668) | GPipe/Base Transformer | 128/512/2048x TPU V3 | | 2 | 2 |
| [FasterMoE](https://dl.acm.org/doi/pdf/10.1145/3503221.3508418) | FastMoE/ GShard/ BASE | 16/64x V100 | | 2 | |
| [ST-MoE](https://arxiv.org/abs/2202.08906) | Dense-L/ T5 XXL/ Switch XXL | TPU | | 2 | 1.25 Cap factor |
| [Random Routing](https://openreview.net/pdf?id=w1hwFUb_81) | Thor/Transformer Dense | 8x V100 | | 1/2/4/8/16 | |
| [Gating Dropout](https://arxiv.org/abs/2205.14336) | Scalable MoE | 16/64x of V100/A100 | | 1 | 1/2(train/test) |
| [BASE Layers](https://arxiv.org/abs/2103.16716) | SMoE and Switch (52B) | 8/32/128 32GB V100 | | | |
| [Switch Transformer](https://arxiv.org/abs/2101.03961) | T5(223M Base/ 739M Large) | 32x TPUv3 | | 1 | |
| [Evo MoE](https://arxiv.org/abs/2112.14397) | Switch/Hash Layers/BASE/StableMoE | 8x A100 | | 1 | |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (LM) | Switch Transformer/BASE Layer/Hash Layer/Transformer-Base | ?x V100 GPUs | | 1 | 1 (from Switch) |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (MT) | Transformer-Base and Large/BASE Layer/Hash Layer/Switch | ?x V100 GPUs | | 1 | 1 |
| [Outrageously Large MoEs](https://arxiv.org/abs/1701.06538v1) (LM) | MoE-1 Wide & Deep/ 4xLSTM-512/LSTM-2048 & 8192 | 4-16x k40s | | 4 or 2 for MoE-h | |
| Outrageously Large MoEs (LM-Large) | MoE-1 Wide & Deep/ 4xLSTM-512/LSTM-2048 & 8192 | 32/64/128x k40s | | 4 or 2 for MoE-h | |
| Outrageously Large MoEs (MT) | GNMT/PBMT/LSTM-6/DeepAtt | 64 k40s | | 4 or 2 for MoE-h | |
| Outrageously Large MoEs (MTMT) | GNMT-Mono/GNMT-Multi | 64 k40s | | 2 | |
| [NLLB](https://arxiv.org/abs/2207.04672) | | | 101.6GiB/ each GPU holds one expert | | |
| [Memory Efficient NLLB](https://arxiv.org/abs/2212.09811) | 3.3B NLLB-Dense/NLLB-200 54.5B | 1/4x V100 GPUs | | | |
| [GLaM](https://arxiv.org/abs/2112.06905) | Switch/GPT-3/KG-FiD/Megatron-NLG | 1024x TPU v4 (largest) | For the largest model, experts do not fit on a single TPU | 2 | 2\* |
| Amazon SageMaker | | | | | |
| [M6-T Sparse Experts](https://arxiv.org/abs/2105.15082) | Their own comparisons with different Top-K | 480 V100 32Gb | | | |
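
The Top-K and Capacity columns describe the router: each token is sent to its top-k experts, and the capacity factor caps how many tokens an expert may accept before the rest overflow. Below is a minimal gating sketch in the style of GShard/Switch-like routers; the function name, shapes, and numbers are my own illustration, and real implementations add load-balancing losses and distributed dispatch.

```python
# Minimal top-k gating sketch with a capacity factor (illustrative only).
import torch
import torch.nn.functional as F

def route(x: torch.Tensor, gate_w: torch.Tensor, k: int = 2, capacity_factor: float = 1.25):
    """x: [num_tokens, d_model], gate_w: [d_model, num_experts]."""
    num_tokens = x.shape[0]
    num_experts = gate_w.shape[1]
    logits = x @ gate_w                           # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # each token picks its top-k experts
    # Expert capacity: tokens routed to an expert beyond this bound are dropped
    # (or passed through the residual connection) in capacity-based routers.
    capacity = int(capacity_factor * k * num_tokens / num_experts)
    return topk_probs, topk_idx, capacity

# Usage with made-up shapes: 8 tokens, d_model = 16, 4 experts.
probs, idx, cap = route(torch.randn(8, 16), torch.randn(16, 4))
print(probs.shape, idx.shape, cap)  # torch.Size([8, 2]) torch.Size([8, 2]) 5
```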

## Datasets, Citations and Open Source

The highest citation count across Google Scholar and Semantic Scholar is
reported.

| Paper | Dataset | Batch Size | Open Source | Citations | Notes |
|-------|---------|------------|-------------|-----------|-------|
| [Megablocks](https://arxiv.org/abs/2211.15841) | The Pile | 512 | N | 0 | |
| [Deepspeed-MoE](https://arxiv.org/abs/2201.05596) | Lambada/PIQA/BoolQ/RACE-h/Trivia-QA/WebQS | 256/512 | Y | 15/36 | |
| [Expert Choice Routing](https://arxiv.org/abs/2202.09368) | GLaM | N/A | N | 6 | |
| [Task-Level MoE](https://arxiv.org/abs/2110.03742) | WMT | N/A | N | 13 | |
| [Hash Layers (vs Switch)](https://arxiv.org/abs/2106.04426) | Pushshift.io/RoBERTa/Wikitext-103/BST | 40 | Y (partly) | 43 | |
| [Hash Layers (vs BASE)](https://arxiv.org/abs/2106.04426) | Pushshift.io/RoBERTa/Wikitext-103/BST | 2 | Y (partly) | 43 | |
| [GShard](https://arxiv.org/abs/2006.16668) | Custom Dataset | 4M | Y (TPU Only) | 305 | |
| [FasterMoE](https://dl.acm.org/doi/pdf/10.1145/3503221.3508418) | Wiki Text | | Y | 22 | |
| [ST-MoE](https://arxiv.org/abs/2202.08906) | C4 1.5T | 1M | Y | 26 | |
| [Random Routing](https://openreview.net/pdf?id=w1hwFUb_81) | enwik8/Bookcorpus | 128/256 | Under Review | Under Review | |
| [Gating Dropout](https://arxiv.org/abs/2205.14336) | WMT/Web-50 | 435K | N | 1/5 | |
| [BASE Layers](https://arxiv.org/abs/2103.16716) | RoBERTa corpus and CC100 | | Y | 64/79 | |
| [Switch Transformer](https://arxiv.org/abs/2101.03961) | Large C4 Corpus (180B) | 1M | Y | 525 | |
| [Evo MoE](https://arxiv.org/abs/2112.14397) | WMT(MT)/OpenWebText(LM MLM)/Wikipedia/OpenWebText | N/A | Y | 11 | |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (LM) | RoBERTa and CC100 | 512K | Y | 9 | |
| [Stable-MoE](https://arxiv.org/abs/2204.08396) (MT) | WMT | 512K | Y | 9 | |
| [Outrageously Large MoEs](https://arxiv.org/abs/1701.06538v1) (LM) | 1B word benchmark | ? | N(but has been recreated) | 1117/1050 | Uses an MoE layer between two LSTM layers. 8.4/37.8/272.9/1079/4303M. |
| Outrageously Large MoEs (LM-Large) | 100 Billion Google Corpus | 2.5M | "" | "" | Fit up to 1 Billion parameters per GPU. The 64 and 128 GPU tests are for the last two expert models |
| Outrageously Large MoEs (MT) | WMT | ? | "" | "" | Fit up to 1 Billion parameters per GPU. |
| Outrageously Large MoEs (MTMT) | CoRR | 1M(16K per GPU) | "" | "" | |
| [NLLB](https://arxiv.org/abs/2207.04672) | Flores-200(Eval)/LID curated data/Paracrawl and CommonCrawl (Monolingual) | 16K | Y | 26/49 | Every fourth layer is an MoE layer. |
| [Memory Efficient NLLB](https://arxiv.org/abs/2212.09811) | Flores-200(Eval) | 16K | N | 0 | Releases some results such as which experts are pruned. Every fourth FFN sublayer is replaced with an MoE layer. NLLB-200 requires 4x32 V100s to run. This uses the 80% pruned model. |
| [GLaM](https://arxiv.org/abs/2112.06905) | GLaM Custom dataset of webpages/wikipedia/forums etc | 1M | N | 59/84 | |
| Amazon SageMaker | | | | | |
| [M6-T Sparse Experts](https://arxiv.org/abs/2105.15082) | | | | | |