https://github.com/lucidrains/soft-moe-pytorch

Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch

artificial-intelligence deep-learning mixture-of-experts transformers

Last synced: over 1 year ago

Host: GitHub
URL: https://github.com/lucidrains/soft-moe-pytorch
Owner: lucidrains
License: mit
Created: 2023-08-04T23:46:54.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-04-24T15:23:45.000Z (about 2 years ago)
Last Synced: 2025-03-29T11:04:27.944Z (over 1 year ago)
Topics: artificial-intelligence, deep-learning, mixture-of-experts, transformers
Language: Python
Homepage:
Size: 1.37 MB
Stars: 271
Watchers: 11
Forks: 8
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

## Soft MoE - Pytorch

Implementation of Soft MoE (Mixture of Experts), proposed by Brain's Vision team, in Pytorch.

This MoE has only been made to work with non-autoregressive encoders. However, some recent text-to-image models have started using MoE with great results, so it may be a fit there.

If anyone has any ideas for how to make it work for autoregressive models, let me know (through email or discussions). I meditated on it but can't think of a good way. The other issue with the slot scheme is that the routing cost grows quadratically as sequence length increases (much like attention).
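To make the slot scheme concrete, here is a minimal sketch of the soft routing described in the paper: every slot is a weighted average of all tokens, and every output token is a weighted average of all slot outputs. The function name `soft_moe_sketch`, the plain `nn.Linear` experts, and the random `slot_embeds` tensor are illustrative assumptions, not this library's internals.

```python
import torch
from torch import nn

def soft_moe_sketch(x, slot_embeds, experts):
    """x: (batch, n, d); slot_embeds: (e, s, d); experts: e modules mapping d -> d."""
    # one logit per (token, slot) pair; this n x (e * s) tensor is what grows
    # quadratically once the total slot count tracks the sequence length
    logits = torch.einsum('bnd,esd->bnes', x, slot_embeds)
    b, n, e, s = logits.shape

    # dispatch weights: softmax over tokens, so each slot is a convex mix of tokens
    dispatch = logits.softmax(dim=1)
    # combine weights: softmax over all slots, so each token is a convex mix of slot outputs
    combine = logits.reshape(b, n, e * s).softmax(dim=-1).reshape(b, n, e, s)

    slots = torch.einsum('bnd,bnes->besd', x, dispatch)      # (b, e, s, d)
    outs = torch.stack(
        [expert(slots[:, i]) for i, expert in enumerate(experts)], dim=1
    )                                                        # (b, e, s, d)
    y = torch.einsum('besd,bnes->bnd', outs, combine)        # (b, n, d)
    return y, dispatch, combine

# toy sizes, chosen only for illustration
torch.manual_seed(0)
n, d, e = 16, 8, 4
s = n // e                                   # slots per expert
x = torch.randn(2, n, d)
slot_embeds = torch.randn(e, s, d)           # would be learned parameters in practice
experts = [nn.Linear(d, d) for _ in range(e)]

y, dispatch, combine = soft_moe_sketch(x, slot_embeds, experts)
```

Because both softmaxes look at every token in the sequence, a token's output depends on tokens that come after it, which is one way to see why the scheme does not carry over directly to autoregressive decoding.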

## Appreciation

- StabilityAI for the generous sponsorship, as well as my other sponsors out there

- Einops for making my life easy

## Install

```bash
$ pip install soft-moe-pytorch
```

## Usage

```python
import torch
from soft_moe_pytorch import SoftMoE

moe = SoftMoE(
    dim = 512,         # model dimensions
    seq_len = 1024,    # max sequence length (will automatically calculate number of slots as seq_len // num_experts) - you can also set num_slots directly
    num_experts = 4    # number of experts - (they suggest number of experts should be high enough that each of them get only 1 slot. wonder if that is the weakness of the paper?)
)

x = torch.randn(1, 1024, 512)

out = moe(x) + x # (1, 1024, 512) - add in a transformer in place of a feedforward at a certain layer (here showing the residual too)
```
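The slot arithmetic mentioned in the comments above can be worked through by hand. The helper `slots_per_expert` below is a hypothetical name used only for this illustration; it simply applies the `seq_len // num_experts` rule stated in the usage comment.

```python
def slots_per_expert(seq_len: int, num_experts: int) -> int:
    # rule from the usage comment: number of slots = seq_len // num_experts
    return seq_len // num_experts

seq_len, num_experts = 1024, 4

per_expert = slots_per_expert(seq_len, num_experts)   # 256 slots for each expert
total_slots = per_expert * num_experts                # 1024 slots overall
routing_entries = seq_len * total_slots               # one logit per (token, slot) pair

# the paper's suggested regime: enough experts that each one gets a single slot
single_slot = slots_per_expert(seq_len, num_experts=1024)

# doubling the sequence length quadruples the routing tensor
doubled_entries = 2048 * (slots_per_expert(2048, num_experts) * num_experts)
```

With these numbers the routing tensor has 1024 x 1024 entries, which is the quadratic growth noted earlier.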

For an improvised variant that uses dynamic slots, so that the number of slots ~= sequence length, just import `DynamicSlotsSoftMoE` instead

```python
import torch
from soft_moe_pytorch import DynamicSlotsSoftMoE

# sequence length or number of slots need not be specified

moe = DynamicSlotsSoftMoE(
    dim = 512,         # model dimensions
    num_experts = 4,   # number of experts
    geglu = True       # use the GEGLU activation (Shazeer, 2020) in the experts
)

x = torch.randn(1, 1023, 512)

out = moe(x) + x # (1, 1023, 512)
```
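Both snippets add the MoE output back onto the residual stream, which is how it would stand in for the feedforward of a transformer layer. The block below is a rough sketch of that placement under stated assumptions: `EncoderBlock` is a hypothetical class, the attention is PyTorch's stock `nn.MultiheadAttention` without a causal mask (matching the non-autoregressive setting), and a plain MLP is used as a stand-in so the example runs without this package installed.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """Hypothetical encoder layer: attention, then a swappable feedforward."""

    def __init__(self, dim: int, heads: int, ff: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.ff = ff  # any module mapping (batch, seq, dim) -> (batch, seq, dim)

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights = False)  # no causal mask
        x = x + attn_out
        return self.ff(x) + x  # same residual pattern as `moe(x) + x`

# stand-in feedforward; an instance of SoftMoE or DynamicSlotsSoftMoE built as in
# the usage examples could be passed here instead
dense_ff = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

block = EncoderBlock(dim = 64, heads = 4, ff = dense_ff)

x = torch.randn(2, 10, 64)
out = block(x)  # same shape as the input
```

Taking the feedforward as a constructor argument keeps the layer agnostic, so dense and MoE layers can be mixed across depth without changing the block itself.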

## Todo

- [x] address the limitation of number of slots being fixed. think about a way to make dynamic number of slots based on sequence length

- [ ] once variable sequence length is handled in distributed, add to dynamic soft moe

- [ ] the dispatch and combine tensors can also be split and moved into the `Experts` class to better distribute work

## Citations

```bibtex
@misc{puigcerver2023sparse,
    title   = {From Sparse to Soft Mixtures of Experts},
    author  = {Joan Puigcerver and Carlos Riquelme and Basil Mustafa and Neil Houlsby},
    year    = {2023},
    eprint  = {2308.00951},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```

```bibtex
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
```
