moe-paper-models
A summary of MoE experimental setups across a number of different papers.
https://github.com/AdamG012/moe-paper-models
| Paper | Model size | # Experts | |
|---|---|---|---|
| Megablocks | 13B | 64 | 3/6/12 |
| Deepspeed-MoE | | | |
| Deepspeed-MoE | | | |
| Expert Choice Routing | | | |
| Task-Level MoE | | | |
| Hash Layers (vs Switch) | | | |
| Hash Layers (vs BASE) | | | |
| GShard | | | |
| FasterMoE | | | |
| ST-MoE | | | |
| Random Routing | 200M | 8/16 | 4/12 |
| Gating Dropout | | | |
| BASE Layers | | | |
| Switch Transformer | | | |
| Evo MoE | | | |
| Stable-MoE | | | |
| Stable-MoE | | | |
| Outrageously Large MoEs | | | |
| NLLB | | | |
| Memory Efficient NLLB | | | |
| GLaM | | | |
| M6-T Sparse Experts | | | |
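For context on the model-size and expert-count columns: in an MoE Transformer each MoE layer holds one FFN per expert, so total parameter count grows roughly linearly with the expert count while per-token compute stays close to the dense model's. A minimal back-of-the-envelope sketch, assuming a decoder-style model and hypothetical dimensions (`d_model`, `d_ff`, vocabulary size) that are not taken from any paper in the table above:

```python
def moe_param_count(d_model, d_ff, n_layers, n_moe_layers, n_experts, vocab=32000):
    """Back-of-the-envelope parameter count for a Transformer in which
    n_moe_layers of the n_layers FFN sublayers are MoE layers with
    n_experts experts each. All dimensions are illustrative assumptions."""
    attn = 4 * d_model * d_model                  # Q, K, V and output projections
    ffn = 2 * d_model * d_ff                      # up- and down-projection of one FFN
    dense_ffn = (n_layers - n_moe_layers) * ffn   # ordinary FFN sublayers
    moe_ffn = n_moe_layers * n_experts * ffn      # one FFN per expert per MoE layer
    return n_layers * attn + dense_ffn + moe_ffn + vocab * d_model


# Example (hypothetical config): parameters grow roughly n_experts-fold
# in every FFN sublayer that is converted to an MoE layer.
print(moe_param_count(d_model=1024, d_ff=4096, n_layers=24,
                      n_moe_layers=12, n_experts=64))
```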
| Paper | Models / baselines | Hardware | Notes | Top-k | Capacity factor |
|---|---|---|---|---|---|
| Megablocks | Base to GPT3-XL (46M to 1.3B) | 8x A100 80GB | | 1 | 1/1.5/2x |
| Deepspeed-MoE | | | | | |
| Expert Choice Routing | | | | | |
| Task-Level MoE | | | | | |
| Hash Layers (vs Switch) | Base (225/755M)/Switch Transformer | 8x 32GB V100 | | \*1 | |
| Hash Layers (vs BASE) | | | | | |
| GShard | | | | | |
| FasterMoE | | | | | |
| ST-MoE | L/T5 XXL/Switch XXL | TPU | | 2 | 1.25 |
| Random Routing | | | | | |
| Gating Dropout | | | | | |
| BASE Layers | | | | | |
| Switch Transformer | | | | | |
| Evo MoE | | | | | |
| Stable-MoE | Base | ?x V100 GPUs | | 1 | 1 (from Switch) |
| Stable-MoE | Base and Large/BASE Layer/Hash Layer/Switch | ?x V100 GPUs | | 1 | 1 |
| Outrageously Large MoEs | 1 Wide & Deep/4x LSTM-512/LSTM-2048 & 8192 | 4-16x K40s | | 4 or 2 for MoE-h | |
| NLLB | | | | | |
| Memory Efficient NLLB | Dense/NLLB-200 54.5B | 1/4x V100 GPUs | | | |
| GLaM | 3/KG-FiD/Megatron-NLG | 1024x TPU v4 (largest) | For the largest model, the experts do not fit on a single TPU | 2 | 2\* |
| M6-T Sparse Experts | K | 480x 32GB V100 | | | |
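The top-k and capacity-factor columns above describe the router: each token is sent to its k highest-scoring experts, and each expert accepts at most roughly `capacity_factor * k * tokens / num_experts` tokens per batch (exact definitions vary between papers), with overflow tokens dropped to the residual path. A minimal sketch in PyTorch, assuming a greedy in-order drop policy and hypothetical shapes; it is not the routing code of any listed paper:

```python
import torch
import torch.nn.functional as F

def top_k_route(router_logits, k, capacity_factor):
    """Toy top-k router with a capacity factor (illustrative sketch only).

    router_logits: [num_tokens, num_experts]
    Returns expert indices, gate weights, and a keep-mask after capacity drops.
    """
    num_tokens, num_experts = router_logits.shape
    # Each expert processes at most `capacity` tokens; overflow tokens are
    # dropped and only flow through the residual connection.
    capacity = max(1, int(capacity_factor * k * num_tokens / num_experts))

    probs = F.softmax(router_logits, dim=-1)          # routing probabilities
    gate_vals, expert_idx = probs.topk(k, dim=-1)     # [num_tokens, k]

    keep = torch.zeros_like(gate_vals, dtype=torch.bool)
    load = torch.zeros(num_experts, dtype=torch.long)
    for t in range(num_tokens):                       # greedy, in token order
        for j in range(k):
            e = expert_idx[t, j]
            if load[e] < capacity:
                keep[t, j] = True
                load[e] += 1
    return expert_idx, gate_vals, keep


# Example: 8 tokens, 4 experts, top-2 routing, capacity factor 1.25
idx, gates, kept = top_k_route(torch.randn(8, 4), k=2, capacity_factor=1.25)
```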
| Paper | Datasets | | | | Notes |
|---|---|---|---|---|---|
| Megablocks | | | | | |
| Deepspeed-MoE | h/Trivia-QA/WebQS | 256/512 | Y | 15/36 | |
| Expert Choice Routing | | | | | |
| Task-Level MoE | | | | | |
| Hash Layers (vs Switch) | 103/BST | 40 | Y (partly) | 43 | |
| Hash Layers (vs BASE) | 103/BST | 2 | Y (partly) | 43 | |
| GShard | | | | | |
| FasterMoE | | | | | |
| ST-MoE | | | | | |
| Random Routing | | | | | |
| Gating Dropout | 50 | 435K | N | 1/5 | |
| BASE Layers | | | | | |
| Switch Transformer | | | | | |
| Evo MoE | | | | | |
| Stable-MoE | | | | | |
| Stable-MoE | | | | | |
| Outrageously Large MoEs | | | | | |
| NLLB | 200(Eval)/LID curated data/Paracrawl and CommonCrawl (Monolingual) | 16K | Y | 26/49 | Every fourth layer is an MoE layer. |
| Memory Efficient NLLB | 200(Eval) | 16K | N | 0 | Releases some results, e.g. with experts pruned. Every fourth FFN sublayer is replaced with an MoE layer. NLLB-200 requires 4x 32GB V100s to run. This uses the 80%-pruned model. |
| GLaM | | | | | |
| M6-T Sparse Experts | | | | | |
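The NLLB rows note that only every fourth FFN sublayer is an MoE layer, a common way to limit parameter count and all-to-all communication. A minimal sketch of that placement pattern, assuming hypothetical layer sizes and omitting the router (see the top-k sketch above) and the attention sublayers:

```python
import torch.nn as nn

def build_ffn_stack(n_layers, d_model, d_ff, n_experts, moe_every=4):
    """Sketch of the 'every fourth FFN sublayer is an MoE layer' placement.
    Only the placement pattern is the point; module names and sizes are
    illustrative assumptions, not the NLLB implementation."""
    def dense_ffn():
        return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                             nn.Linear(d_ff, d_model))

    layers = []
    for i in range(1, n_layers + 1):
        if i % moe_every == 0:
            # MoE FFN sublayer: one expert FFN per expert, chosen by a router
            layers.append(nn.ModuleList([dense_ffn() for _ in range(n_experts)]))
        else:
            layers.append(dense_ffn())
    return nn.ModuleList(layers)


# Example: a 24-layer stack where layers 4, 8, 12, ... hold 64 experts each
stack = build_ffn_stack(n_layers=24, d_model=1024, d_ff=4096, n_experts=64)
```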