moe-paper-models
A summary of MoE experimental setups across a number of different papers.
https://github.com/AdamG012/moe-paper-models
Datasets, Citations and Open Source
- Memory Efficient NLLB - 200(Eval) | 16K | N | 0 | Releases some results, such as pruned experts. Every fourth FFN sublayer is replaced with an MoE layer (see the sketch after this list). NLLB-200 requires 4x32 V100s to run. This uses the 80% pruned model. |
- Megablocks
- GShard
- NLLB - 200(Eval)/LID curated data/Paracrawl and CommonCrawl (Monolingual) | 16K | Y | 26/49 | Every fourth layer is an MoE layer. |
- Task-Level MoE
- Switch Transformer
- Evo MoE
- Random Routing
- Gating Dropout - 50 | 435K | N | 1/5 | |
- BASE Layers
- Stable-MoE
- Outrageously Large MoEs
- GLaM
- ST-MoE
- Expert Choice Routing
- Hash Layers (vs BASE) - 103/BST | 2 | Y (partly) | 43 | |
- Deepspeed-MoE - h/Trivia-QA/WebQS | 256/512 | Y | 15/36 | |
- M6-T Sparse Experts
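Several of the rows above (e.g. the NLLB-200 entries) note that every fourth FFN sublayer is replaced with an MoE layer. The snippet below is a minimal PyTorch sketch of that placement pattern only; the expert count, hidden sizes, and top-1 gating are illustrative assumptions, not settings taken from any of the listed papers.

```python
# Minimal sketch: a Transformer FFN stack where every fourth sublayer is an
# MoE layer. Expert count, dimensions and top-1 routing are assumptions made
# for illustration, not values from any specific paper in this list.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFFN(nn.Module):
    """Sparsely gated FFN: a linear router sends each token to its top-1 expert."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); route each token to its highest-scoring expert.
        gate = F.softmax(self.router(x), dim=-1)   # (tokens, num_experts)
        weight, idx = gate.max(dim=-1)             # top-1 gate value and expert index
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out


def build_ffn_stack(n_layers: int, d_model: int = 512, d_ff: int = 2048) -> nn.ModuleList:
    """Dense FFN sublayers, except every fourth one is replaced with an MoE layer."""
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % 4 == 0:  # sublayers 4, 8, 12, ... become MoE layers
            layers.append(MoEFFN(d_model, d_ff))
        else:
            layers.append(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            )
    return layers


if __name__ == "__main__":
    stack = build_ffn_stack(n_layers=12)
    tokens = torch.randn(16, 512)        # 16 tokens, d_model = 512
    for layer in stack:
        tokens = tokens + layer(tokens)  # residual connection around each sublayer
    print(tokens.shape)                  # torch.Size([16, 512])
```

Real implementations additionally use layer normalization, expert capacity limits, and an auxiliary load-balancing loss, all of which this sketch omits.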