Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Survey: A collection of AWESOME papers and resources on the latest research in Mixture of Experts.

List: awesome-mixture-of-experts-papers
https://github.com/arpita8/awesome-mixture-of-experts-papers

Topics: awesome, computer-vision, deep-learning, large-language-models, llm, machine-learning, mixture-of-experts, rec, recsys

Last synced: 3 months ago

README


# Mixture-of-Experts-Papers [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
A curated list of exceptional papers and resources on Mixture of Experts and related topics.

***News: Our Mixture of Experts survey has been released.***
[The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs](https://www.researchgate.net/publication/382916607_THE_EVOLUTION_OF_MIXTURE_OF_EXPERTS_A_SURVEY_FROM_BASICS_TO_BREAKTHROUGHS)



## Links
[Mendeley](https://www.mendeley.com/reference-manager/reader-v2/6d1f5496-9cee-3fdf-b081-4eaf2b6c9eb7/3903bd7b-e6a3-d6bd-340b-c3187d52e179) | [ResearchGate](https://www.researchgate.net/publication/382916607_THE_EVOLUTION_OF_MIXTURE_OF_EXPERTS_A_SURVEY_FROM_BASICS_TO_BREAKTHROUGHS) | [PDF](https://github.com/arpita8/Awesome-Mixture-of-Experts-Papers/blob/main/Mixture_of_Experts_Survey_Paper.pdf)
If our work has been of assistance to you, please feel free to cite our survey. Thank you.
```bibtex
@article{vats2024evolutionmoe,
  author = {Vats, Arpita and Raja, Rahul and Jain, Vinija and Chadha, Aman},
  year = {2024},
  month = {08},
  pages = {12},
  title = {THE EVOLUTION OF MIXTURE OF EXPERTS: A SURVEY FROM BASICS TO BREAKTHROUGHS}
}
```

# Table of Contents
- [Evolution in Sparse Mixture of Experts](#evolution-in-sparse-mixture-of-experts)
- [Collection of Recent MoE Papers](#collection-of-recent-moe-papers)
  - [MoE in Visual Domain](#moe-in-visual-domain)
  - [MoE in LLMs](#moe-in-llms)
  - [MoE for Scaling LLMs](#moe-for-scaling-llms)
  - [MoE: Enhancing System Performance and Efficiency](#moe-enhancing-system-performance-and-efficiency)
  - [Integrating Mixture of Experts into Recommendation Algorithms](#integrating-mixture-of-experts-into-recommendation-algorithms)
  - [Python Libraries for MoE](#python-libraries-for-moe)

## Evolution in Sparse Mixture of Experts



| **Name** | **Paper** | **Venue**| **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| The Sparsely-Gated Mixture-of-Experts Layer | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) | arXiv | 2017 |
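
The core idea behind the paper above is easiest to see in code: a router scores every expert per token, but only the top-k experts actually run. Below is a minimal, illustrative PyTorch sketch of such a sparsely-gated layer; all sizes are arbitrary placeholders, and the noisy gating and load-balancing losses of the original paper are omitted.

```python
# Minimal sketch of a sparsely-gated (top-k) MoE layer.
# Illustrative only; not the exact implementation of Shazeer et al. (2017).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)            # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.gate(x)                                  # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)    # keep k best experts per token
        weights = F.softmax(top_vals, dim=-1)                  # renormalize over the k kept
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # dispatch tokens expert by expert
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(SparseMoE()(tokens).shape)   # torch.Size([16, 64])
```

Because only k experts run per token, total parameter count can grow with the number of experts while per-token compute stays roughly constant.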

## Collection of Recent MoE Papers
### MoE in Visual Domain
| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoE-FFD | [MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection](https://arxiv.org/abs/2404.08452) | arXiv | 2024 |
| MoE-LLaVA | [MoE-LLaVA: Mixture of Experts for Large Vision-Language Models](https://arxiv.org/abs/2401.15947) | arXiv | 2024 |
| MOVA | [MoVA: Adapting Mixture of Vision Experts to Multimodal Context](https://arxiv.org/abs/2404.13046) | arXiv | 2024 |
| MetaBEV | [MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation](https://arxiv.org/abs/2304.09801) | arXiv | 2023 |
| AdaMV-MoE | [AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_AdaMV-MoE_Adaptive_Multi-Task_Vision_Mixture-of-Experts_ICCV_2023_paper.pdf) | ICCV | 2023 |
| ERNIE-ViLG 2.0 | [ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts](https://arxiv.org/abs/2210.15257) | arXiv | 2023 |
| M³ViT | [Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design](https://arxiv.org/abs/2210.14793) | arXiv | 2022 |
| LIMoE | [Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts](https://arxiv.org/abs/2206.02770) | arXiv | 2022 |
| MoEBERT | [MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation](https://arxiv.org/abs/2204.07675) | arXiv | 2022 |
| VLMo | [VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts](https://arxiv.org/abs/2111.02358) | arXiv | 2022 |
| DeepSpeed MoE | [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596) | arXiv | 2022 |
| V-MoE | [Scaling Vision with Sparse Mixture of Experts](https://arxiv.org/abs/2106.05974) | arXiv | 2021 |
| DSelect-k | [DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/abs/2106.03760) | arXiv | 2021 |
| MMoE | [Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007) | ACM | 2018 |
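
Several multi-task entries above (MMoE, DSelect-k, AdaMV-MoE) share one pool of experts across tasks while giving each task its own gate. The sketch below shows that multi-gate pattern, loosely following the MMoE paper; the layer sizes and per-task tower heads are placeholder assumptions, not the paper's configuration.

```python
# Minimal sketch of multi-gate MoE (MMoE-style) for two tasks: all tasks share
# the experts, but each task mixes them with its own softmax gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGateMoE(nn.Module):
    def __init__(self, d_in=32, d_expert=16, num_experts=4, num_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU())
            for _ in range(num_experts)
        ])
        self.gates = nn.ModuleList([nn.Linear(d_in, num_experts) for _ in range(num_tasks)])
        self.towers = nn.ModuleList([nn.Linear(d_expert, 1) for _ in range(num_tasks)])

    def forward(self, x):                                                # x: (batch, d_in)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, d_expert)
        task_outs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1).unsqueeze(-1)                 # (batch, E, 1) per-task mixing
            task_outs.append(tower((w * expert_outs).sum(dim=1)))        # (batch, 1)
        return task_outs

x = torch.randn(8, 32)
print([t.shape for t in MultiGateMoE()(x)])  # [torch.Size([8, 1]), torch.Size([8, 1])]
```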

### MoE in LLMs

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| LoRAMoE | [LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin](https://arxiv.org/abs/2312.09979) | arXiv | 2024 |
| Flan-MoE | [Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models](https://arxiv.org/abs/2305.14705) | ICLR | 2024 |
| RAPHAEL | [RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths](https://arxiv.org/abs/2305.18295) | arXiv | 2023 |
| Branch-Train-MiX | [Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM](https://arxiv.org/abs/2403.07816) | arXiv | 2024 |
| Self-MoE | [Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts](https://arxiv.org/abs/2406.12034) | arXiv | 2024 |
| CuMo | [CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949) | arXiv | 2024 |
| MOELoRA | [MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2402.12851) | arXiv | 2024 |
| Mistral | [Mistral 7B](https://arxiv.org/abs/2310.06825) | arXiv | 2023 |
| HetuMoE| [HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System](https://arxiv.org/abs/2203.14685) | arXiv | 2022 |
| GLaM | [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905) | arXiv | 2022 |
| eDiff-I| [eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers](https://arxiv.org/abs/2211.01324) | arXiv | 2022 |

### MoE for Scaling LLMs

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| u-LLaVA | [u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model](https://arxiv.org/abs/2311.05348) | arXiv | 2024 |
| Lory | [Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training](https://arxiv.org/abs/2405.03133) | arXiv | 2024 |
| Uni-MoE | [Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts](https://arxiv.org/abs/2405.11273) | arXiv | 2024 |
| MH-MoE | [Multi-Head Mixture-of-Experts](https://arxiv.org/abs/2404.15045) | arXiv | 2024 |
| DeepSeekMoE | [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066) | arXiv | 2024 |
| Mini-Gemini | [Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models](https://arxiv.org/abs/2403.18814) | arXiv | 2024 |
| OpenMoE | [OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models](https://arxiv.org/abs/2402.01739) | arXiv | 2024 |
| TUTEL | [Tutel: Adaptive Mixture-of-Experts at Scale](https://arxiv.org/abs/2206.03382) | arXiv | 2023 |
| QMoE | [QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models](https://arxiv.org/abs/2310.16795) | arXiv | 2023 |
| Switch-NeRF | [Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields](https://openreview.net/forum?id=PQ2zoIZqvm) | ICLR | 2023 |
| SaMoE | [SaMoE: Parameter Efficient MoE Language Models via Self-Adaptive Expert Combination ](https://openreview.net/forum?id=HO2q49XYRC) | ICLR | 2023 |
| JetMoE | [JetMoE: Reaching Llama2 Performance with 0.1M Dollars](https://arxiv.org/abs/2404.07413) | arXiv | 2024 |
| MegaBlocks | [MegaBlocks: Efficient Sparse Training with Mixture-of-Experts](https://arxiv.org/abs/2211.15841) | arXiv | 2022 |
| ST-MoE | [ST-MoE: Designing Stable and Transferable Sparse Expert Models](https://arxiv.org/abs/2202.08906) | arXiv | 2022 |
| Uni-Perceiver-MoE | [Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs ](https://openreview.net/forum?id=agJEk7FhvKL) | NeurIPS | 2022 |
| SpeechMoE | [SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts](https://arxiv.org/abs/2105.03036) | arXiv | 2021 |
| Fully-Differentiable Sparse Transformer | [Sparse is Enough in Scaling Transformers](https://arxiv.org/pdf/2111.12763) | arXiv | 2021 |

### MoE: Enhancing System Performance and Efficiency

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| pMoE | [PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning](https://arxiv.org/abs/2407.21571) | arXiv | 2024 |
| HyperMoE | [HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts](https://arxiv.org/abs/2402.12656) | arXiv | 2024 |
| BlackMamba | [BlackMamba: Mixture of Experts for State-Space Models](https://arxiv.org/abs/2402.01771) | arXiv | 2024 |
| ScheMoE | [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling](https://dl.acm.org/doi/10.1145/3627703.3650083) | ACM | 2024 |
| Pre-gated MoE | [Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/pdf/2308.12066) | arXiv | 2024 |
| MoE-Mamba | [MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts](https://arxiv.org/abs/2401.04081) | arXiv | 2024 |
| Parameter-efficient MoEs | [Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning](https://arxiv.org/abs/2309.05444) | arXiv | 2023 |
| SMoE-Dropout | [Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers](https://arxiv.org/abs/2303.01610) | arXiv | 2023 |
| StableMoE | [StableMoE: Stable Routing Strategy for Mixture of Experts](https://arxiv.org/abs/2204.08396) | arXiv | 2022 |
| Alpa | [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://arxiv.org/abs/2201.12023) | arXiv | 2022 |
| BaGuaLu | [BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores](https://dl.acm.org/doi/abs/10.1145/3503221.3508417) | ACM | 2022 |
| MEFT | [MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter](https://arxiv.org/abs/2406.04984) | arXiv | 2024 |
| EdgeMoE | [EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models](https://arxiv.org/abs/2308.14352) | arXiv | 2023 |
| SE-MoE | [SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System](https://arxiv.org/abs/2205.10034) | arXiv | 2022 |
| NLLB | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | arXiv | 2022 |
| EvoMoE | [EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate](https://arxiv.org/abs/2112.14397) | arXiv | 2022 |
| FastMoE | [FastMoE: A Fast Mixture-of-Expert Training System](https://arxiv.org/abs/2103.13262) | arXiv | 2021 |
| ACE | [ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot](https://openaccess.thecvf.com/content/ICCV2021/papers/Cai_ACE_Ally_Complementary_Experts_for_Solving_Long-Tailed_Recognition_in_One-Shot_ICCV_2021_paper.pdf) | ICCV | 2021 |
| M6-10T | [M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining](https://arxiv.org/abs/2110.03888) | arXiv | 2021 |
| GShard | [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668) | arXiv | 2020 |
| PAD-Net | [PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing](https://arxiv.org/abs/1805.04409) | arXiv | 2018 |
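
A recurring systems trick in the papers above (GShard-style expert parallelism in particular) is to give each expert a fixed token capacity per batch and drop, or pass through on the residual path, whatever overflows. The sketch below is a rough illustration of capacity-limited top-1 routing; the capacity factor of 1.25 and the score-based priority rule are assumptions for illustration, not the exact algorithm of any listed system.

```python
# Sketch of capacity-limited top-1 routing: each expert accepts at most
# `capacity` tokens; the rest are marked -1 (dropped / residual-only).
import torch

def route_with_capacity(scores: torch.Tensor, capacity_factor: float = 1.25):
    """scores: (tokens, experts) router logits -> assigned expert id (or -1) per token."""
    num_tokens, num_experts = scores.shape
    capacity = int(capacity_factor * num_tokens / num_experts)
    choice = scores.argmax(dim=-1)                      # top-1 expert per token
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)
    load = torch.zeros(num_experts, dtype=torch.long)
    # serve tokens in priority order (highest router score first)
    order = scores.max(dim=-1).values.argsort(descending=True)
    for t in order.tolist():
        e = int(choice[t])
        if load[e] < capacity:                          # room left: keep the token
            assignment[t] = e
            load[e] += 1                                # otherwise the token overflows
    return assignment

scores = torch.randn(16, 4)
print(route_with_capacity(scores))                      # -1 marks dropped (overflow) tokens
```

The fixed capacity keeps per-expert work (and communication buffers) statically shaped, which is what makes expert-parallel all-to-all dispatch practical at scale.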

### Integrating Mixture of Experts into Recommendation Algorithms

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoME | [MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models](https://arxiv.org/abs/2407.12709) | arXiv | 2024 |
| CAME | [CAME: Competitively Learning a Mixture-of-Experts Model for First-stage Retrieval](https://dl.acm.org/doi/pdf/10.1145/3678880) | ACM | 2024 |
| SummaReranker | [SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization](https://arxiv.org/abs/2203.06569) | arXiv | 2022 |
| MDFEND | [MDFEND: Multi-domain Fake News Detection](https://arxiv.org/abs/2201.00987) | arXiv | 2022 |
| PLE | [Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/10.1145/3383313.3412236) | RecSys | 2020 |

### Python Libraries for MoE

| **Name** | **Paper** | **Venue** | **Year** |
| -------- | ------------------------------------------------------------ | --------- | -------- |
| MoE-Infinity | [MoE-Infinity: Offloading-Efficient MoE Model Serving](https://arxiv.org/abs/2401.14361) | arXiv | 2024 |
| SMT 2.0 | [SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes](https://arxiv.org/abs/2305.13998) | arXiv | 2023 |



We hope our survey, together with this collection of recent MoE papers, can help your work.