Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Survey: A collection of AWESOME papers and resources on the latest research in Mixture of Experts.

List: awesome-mixture-of-experts-papers
https://github.com/arpita8/awesome-mixture-of-experts-papers

Topics: awesome, computer-vision, deep-learning, large-language-models, llm, machine-learning, mixture-of-experts, rec, recsys

Last synced: 3 months ago

README


# Mixture-of-Experts-Papers [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
A curated list of exceptional papers and resources on Mixture of Experts and related topics.

***News: Our Mixture of Experts survey has been released.***
[The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs](https://www.researchgate.net/publication/382916607_THE_EVOLUTION_OF_MIXTURE_OF_EXPERTS_A_SURVEY_FROM_BASICS_TO_BREAKTHROUGHS)



## Links
[Mendeley](https://www.mendeley.com/reference-manager/reader-v2/6d1f5496-9cee-3fdf-b081-4eaf2b6c9eb7/3903bd7b-e6a3-d6bd-340b-c3187d52e179) | [ResearchGate](https://www.researchgate.net/publication/382916607_THE_EVOLUTION_OF_MIXTURE_OF_EXPERTS_A_SURVEY_FROM_BASICS_TO_BREAKTHROUGHS) | [PDF](https://github.com/arpita8/Awesome-Mixture-of-Experts-Papers/blob/main/Mixture_of_Experts_Survey_Paper.pdf)
If our work has been of assistance to you, please feel free to cite our survey. Thank you.
```bibtex
@article{vats2024evolutionmoe,
  author = {Vats, Arpita and Raja, Rahul and Jain, Vinija and Chadha, Aman},
  year = {2024},
  month = {08},
  pages = {12},
  title = {THE EVOLUTION OF MIXTURE OF EXPERTS: A SURVEY FROM BASICS TO BREAKTHROUGHS}
}
```

# Table of Contents
- [Evolution in Sparse Mixture of Experts](#evolution-in-sparse-mixture-of-experts)
- [Collection of Recent MoE Papers](#collection-of-recent-moe-papers)
  - [MoE in Visual Domain](#moe-in-visual-domain)
  - [MoE in LLMs](#moe-in-llms)
  - [MoE for Scaling LLMs](#moe-for-scaling-llms)
  - [MoE: Enhancing System Performance and Efficiency](#moe-enhancing-system-performance-and-efficiency)
  - [Integrating Mixture of Experts into Recommendation Algorithms](#integrating-mixture-of-experts-into-recommendation-algorithms)
  - [Python Libraries for MoE](#python-libraries-for-moe)

## Evolution in Sparse Mixture of Experts



| **Name** | **Paper** | **Venue**| **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| The Sparsely-Gated Mixture-of-Experts Layer | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) | arXiv | 2017 |
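
The core idea behind the paper above is easiest to see in code: a router scores every expert per token, but only the top-k experts actually run. Below is a minimal, illustrative PyTorch sketch of such a sparsely-gated layer; all sizes are arbitrary placeholders, and the noisy gating and load-balancing losses of the original paper are omitted.

```python
# Minimal sketch of a sparsely-gated (top-k) MoE layer.
# Illustrative only; not the exact implementation of Shazeer et al. (2017).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)            # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.gate(x)                                  # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)    # keep k best experts per token
        weights = F.softmax(top_vals, dim=-1)                  # renormalize over the k kept
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # dispatch tokens expert by expert
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(SparseMoE()(tokens).shape)   # torch.Size([16, 64])
```

Because only k experts run per token, total parameter count can grow with the number of experts while per-token compute stays roughly constant.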

## Collection of Recent MoE Papers
### MoE in Visual Domain
| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoE-FFD | [MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection](https://arxiv.org/abs/2404.08452) | arXiv | 2024 |
| MoE-LLaVA | [MoE-LLaVA: Mixture of Experts for Large Vision-Language Models](https://arxiv.org/abs/2401.15947) | arXiv | 2024 |
| MOVA | [MoVA: Adapting Mixture of Vision Experts to Multimodal Context](https://arxiv.org/abs/2404.13046) | arXiv | 2024 |
| MetaBEV | [MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation](https://arxiv.org/abs/2304.09801) | arXiv | 2023 |
| AdaMV-MoE | [AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_AdaMV-MoE_Adaptive_Multi-Task_Vision_Mixture-of-Experts_ICCV_2023_paper.pdf) | ICCV | 2023 |
| ERNIE-ViLG 2.0 | [ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts](https://arxiv.org/abs/2210.15257) | arXiv | 2023 |
| M³ViT | [Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design](https://arxiv.org/abs/2210.14793) | arXiv | 2022 |
| LIMoE | [Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts](https://arxiv.org/abs/2206.02770) | arXiv | 2022 |
| MoEBERT | [MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation](https://arxiv.org/abs/2204.07675) | arXiv | 2022 |
| VLMo | [VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts](https://arxiv.org/abs/2111.02358) | arXiv | 2022 |
| DeepSpeed MoE | [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596) | arXiv | 2022 |
| V-MoE | [Scaling Vision with Sparse Mixture of Experts](https://arxiv.org/abs/2106.05974) | arXiv | 2021 |
| DSelect-k | [DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/abs/2106.03760) | arXiv | 2021 |
| MMoE | [Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007) | ACM | 2018 |
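
Several multi-task entries above (MMoE, DSelect-k, AdaMV-MoE) share one pool of experts across tasks while giving each task its own gate. The sketch below shows that multi-gate pattern, loosely following the MMoE paper; the layer sizes and per-task tower heads are placeholder assumptions, not the paper's configuration.

```python
# Minimal sketch of multi-gate MoE (MMoE-style) for two tasks: all tasks share
# the experts, but each task mixes them with its own softmax gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGateMoE(nn.Module):
    def __init__(self, d_in=32, d_expert=16, num_experts=4, num_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU())
            for _ in range(num_experts)
        ])
        self.gates = nn.ModuleList([nn.Linear(d_in, num_experts) for _ in range(num_tasks)])
        self.towers = nn.ModuleList([nn.Linear(d_expert, 1) for _ in range(num_tasks)])

    def forward(self, x):                                                # x: (batch, d_in)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, d_expert)
        task_outs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1).unsqueeze(-1)                 # (batch, E, 1) per-task mixing
            task_outs.append(tower((w * expert_outs).sum(dim=1)))        # (batch, 1)
        return task_outs

x = torch.randn(8, 32)
print([t.shape for t in MultiGateMoE()(x)])  # [torch.Size([8, 1]), torch.Size([8, 1])]
```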

### MoE in LLMs

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| LoRAMoE | [LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin](https://arxiv.org/abs/2312.09979) | arXiv | 2024 |
| Flan-MoE | [Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models](https://arxiv.org/abs/2305.14705) | ICLR | 2024 |
| RAPHAEL | [RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths](https://arxiv.org/abs/2305.18295) | arXiv | 2023 |
| Branch-Train-MiX | [Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM](https://arxiv.org/abs/2403.07816) | arXiv | 2024 |
| Self-MoE | [Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts](https://arxiv.org/abs/2406.12034) | arXiv | 2024 |
| CuMo | [CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949) | arXiv | 2024 |
| MOELoRA | [MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2402.12851) | arXiv | 2024 |
| Mistral | [Mistral 7B](https://arxiv.org/abs/2310.06825) | arXiv | 2023 |
| HetuMoE| [HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System](https://arxiv.org/abs/2203.14685) | arXiv | 2022 |
| GLaM | [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905) | arXiv | 2022 |
| eDiff-I| [eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers](https://arxiv.org/abs/2211.01324) | arXiv | 2022 |

### MoE for Scaling LLMs

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| u-LLaVA | [u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model](https://arxiv.org/abs/2311.05348) | arXiv | 2024 |
| Lory | [Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training](https://arxiv.org/abs/2405.03133) | arXiv | 2024 |
| Uni-MoE | [Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts](https://arxiv.org/abs/2405.11273) | arXiv | 2024 |
| MH-MoE | [Multi-Head Mixture-of-Experts](https://arxiv.org/abs/2404.15045) | arXiv | 2024 |
| DeepSeekMoE | [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066) | arXiv | 2024 |
| Mini-Gemini | [Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models](https://arxiv.org/abs/2403.18814) | arXiv | 2024 |
| OpenMoE | [OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models](https://arxiv.org/abs/2402.01739) | arXiv | 2024 |
| TUTEL | [Tutel: Adaptive Mixture-of-Experts at Scale](https://arxiv.org/abs/2206.03382) | arXiv | 2023 |
| QMoE | [QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models](https://arxiv.org/abs/2310.16795) | arXiv | 2023 |
| Switch-NeRF | [Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields](https://openreview.net/forum?id=PQ2zoIZqvm) | ICLR | 2023 |
| SaMoE | [SaMoE: Parameter Efficient MoE Language Models via Self-Adaptive Expert Combination ](https://openreview.net/forum?id=HO2q49XYRC) | ICLR | 2023 |
| JetMoE | [JetMoE: Reaching Llama2 Performance with 0.1M Dollars](https://arxiv.org/abs/2404.07413) | arXiv | 2024 |
| MegaBlocks | [MegaBlocks: Efficient Sparse Training with Mixture-of-Experts](https://arxiv.org/abs/2211.15841) | arXiv | 2022 |
| ST-MoE | [ST-MoE: Designing Stable and Transferable Sparse Expert Models](https://arxiv.org/abs/2202.08906) | arXiv | 2022 |
| Uni-Perceiver-MoE | [Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs ](https://openreview.net/forum?id=agJEk7FhvKL) | NeurIPS | 2022 |
| SpeechMoE | [SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts](https://arxiv.org/abs/2105.03036) | arXiv | 2021 |
| Fully-Differentiable Sparse Transformer | [Sparse is Enough in Scaling Transformers](https://arxiv.org/pdf/2111.12763) | arXiv | 2021 |

### MoE: Enhancing System Performance and Efficiency

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| pMoE | [PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning](https://arxiv.org/abs/2407.21571) | arXiv | 2024 |
| HyperMoE | [HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts](https://arxiv.org/abs/2402.12656) | arXiv | 2024 |
| BlackMamba | [BlackMamba: Mixture of Experts for State-Space Models](https://arxiv.org/abs/2402.01771) | arXiv | 2024 |
| ScheMoE | [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling](https://dl.acm.org/doi/10.1145/3627703.3650083) | ACM | 2024 |
| Pre-gated MoE | [Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/pdf/2308.12066) | arXiv | 2024 |
| MoE-Mamba | [MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts](https://arxiv.org/abs/2401.04081) | arXiv | 2024 |
| Parameter-efficient MoEs | [Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning](https://arxiv.org/abs/2309.05444) | arXiv | 2023 |
| SMoE-Dropout | [Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers](https://arxiv.org/abs/2303.01610) | arXiv | 2023 |
| StableMoE | [StableMoE: Stable Routing Strategy for Mixture of Experts](https://arxiv.org/abs/2204.08396) | arXiv | 2022 |
| Alpa | [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://arxiv.org/abs/2201.12023) | arXiv | 2022 |
| BaGuaLu | [BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores](https://dl.acm.org/doi/abs/10.1145/3503221.3508417) | ACM | 2022 |
| MEFT | [MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter](https://arxiv.org/abs/2406.04984) | arXiv | 2024 |
| EdgeMoE | [EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models](https://arxiv.org/abs/2308.14352) | arXiv | 2023 |
| SE-MoE | [SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System](https://arxiv.org/abs/2205.10034) | arXiv | 2022 |
| NLLB | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | arXiv | 2022 |
| EvoMoE | [EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate](https://arxiv.org/abs/2112.14397) | arXiv | 2022 |
| FastMoE | [FastMoE: A Fast Mixture-of-Expert Training System](https://arxiv.org/abs/2103.13262) | arXiv | 2021 |
| ACE | [ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot](https://openaccess.thecvf.com/content/ICCV2021/papers/Cai_ACE_Ally_Complementary_Experts_for_Solving_Long-Tailed_Recognition_in_One-Shot_ICCV_2021_paper.pdf) | ICCV | 2021 |
| M6-10T | [M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining](https://arxiv.org/abs/2110.03888) | arXiv | 2021 |
| GShard | [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668) | arXiv | 2020 |
| PAD-Net | [PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing](https://arxiv.org/abs/1805.04409) | arXiv | 2018 |
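
A recurring systems trick in the papers above (GShard-style expert parallelism in particular) is to give each expert a fixed token capacity per batch and drop, or pass through on the residual path, whatever overflows. The sketch below is a rough illustration of capacity-limited top-1 routing; the capacity factor of 1.25 and the score-based priority rule are assumptions for illustration, not the exact algorithm of any listed system.

```python
# Sketch of capacity-limited top-1 routing: each expert accepts at most
# `capacity` tokens; the rest are marked -1 (dropped / residual-only).
import torch

def route_with_capacity(scores: torch.Tensor, capacity_factor: float = 1.25):
    """scores: (tokens, experts) router logits -> assigned expert id (or -1) per token."""
    num_tokens, num_experts = scores.shape
    capacity = int(capacity_factor * num_tokens / num_experts)
    choice = scores.argmax(dim=-1)                      # top-1 expert per token
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)
    load = torch.zeros(num_experts, dtype=torch.long)
    # serve tokens in priority order (highest router score first)
    order = scores.max(dim=-1).values.argsort(descending=True)
    for t in order.tolist():
        e = int(choice[t])
        if load[e] < capacity:                          # room left: keep the token
            assignment[t] = e
            load[e] += 1                                # otherwise the token overflows
    return assignment

scores = torch.randn(16, 4)
print(route_with_capacity(scores))                      # -1 marks dropped (overflow) tokens
```

The fixed capacity keeps per-expert work (and communication buffers) statically shaped, which is what makes expert-parallel all-to-all dispatch practical at scale.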

### Integrating Mixture of Experts into Recommendation Algorithms

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoME | [MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models](https://arxiv.org/abs/2407.12709) | arXiv | 2024 |
| CAME | [CAME: Competitively Learning a Mixture-of-Experts Model for First-stage Retrieval](https://dl.acm.org/doi/pdf/10.1145/3678880) | ACM | 2024 |
| SummaReranker | [SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization](https://arxiv.org/abs/2203.06569) | arXiv | 2022 |
| MDFEND | [MDFEND: Multi-domain Fake News Detection](https://arxiv.org/abs/2201.00987) | arXiv | 2022 |
| PLE | [Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/10.1145/3383313.3412236) | RecSys | 2020 |

### Python Libraries for MoE

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoE-Infinity | [MoE-Infinity: Offloading-Efficient MoE Model Serving](https://arxiv.org/abs/2401.14361) | arXiv | 2024 |
| SMT 2.0 | [SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes](https://arxiv.org/abs/2305.13998) | arXiv | 2023 |



We hope our survey, together with this collection of recent MoE papers, can help your work.