{"id":13993986,"url":"https://github.com/lucidrains/st-moe-pytorch","last_synced_at":"2025-04-04T20:04:18.268Z","repository":{"id":177897143,"uuid":"619198044","full_name":"lucidrains/st-moe-pytorch","owner":"lucidrains","description":"Implementation of ST-Moe, the latest incarnation of MoE after years of research at Brain, in Pytorch","archived":false,"fork":false,"pushed_at":"2024-06-17T00:48:47.000Z","size":182,"stargazers_count":323,"open_issues_count":4,"forks_count":28,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-28T19:05:39.378Z","etag":null,"topics":["artificial-intelligence","conditional-computation","deep-learning","mixture-of-experts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-26T15:00:16.000Z","updated_at":"2025-03-27T11:44:14.000Z","dependencies_parsed_at":"2024-01-18T04:52:40.190Z","dependency_job_id":"46c89443-edb0-454b-bc7a-017679345d5a","html_url":"https://github.com/lucidrains/st-moe-pytorch","commit_stats":null,"previous_names":["lucidrains/st-moe-pytorch"],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fst-moe-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fst-moe-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fst-moe-pytorch/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fst-moe-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/st-moe-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242669,"owners_count":20907133,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","conditional-computation","deep-learning","mixture-of-experts"],"created_at":"2024-08-09T14:02:39.422Z","updated_at":"2025-04-04T20:04:18.222Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"./st-moe.png\" width=\"450px\"\u003e\u003c/img\u003e\n\n## ST-MoE - Pytorch\n\nImplementation of \u003ca href=\"https://arxiv.org/abs/2202.08906\"\u003eST-MoE\u003c/a\u003e, the latest incarnation of mixture of experts after years of research at Brain, in Pytorch. Will be largely a transcription of the \u003ca href=\"https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py\"\u003eofficial Mesh Tensorflow implementation\u003c/a\u003e. If you have any papers you think should be added, while I have my attention on mixture of experts, please open an issue.\n\nThis should be SOTA for mixture-of-experts for autoregressive transformers. 
It is rumored that GPT4 is using 16 experts with top2 gating.\n\nFor non-autoregressive, would recommend going with the simpler and better \u003ca href=\"https://github.com/lucidrains/soft-moe-pytorch\"\u003eSoft MoE\u003c/a\u003e.\n\n## Install\n\n```bash\n$ pip install st-moe-pytorch\n```\n\n## Appreciation\n\n- \u003ca href=\"https://stability.ai/\"\u003eStabilityAI\u003c/a\u003e for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.\n\n- \u003ca href=\"https://github.com/arankomat\"\u003eAran Komatsuzaki\u003c/a\u003e for consultation on mixture-of-experts, for removal of 2-level MoE and simplifications to code\n\n## Usage\n\n```python\nimport torch\nfrom st_moe_pytorch import MoE\n\nmoe = MoE(\n    dim = 512,\n    num_experts = 16,               # increase the experts (# parameters) of your model without increasing computation\n    gating_top_n = 2,               # default to top 2 gating, but can also be more (3 was tested in the paper with a lower threshold)\n    threshold_train = 0.2,          # at what threshold to accept a token to be routed to second expert and beyond - 0.2 was optimal for 2 expert routing, and apparently should be lower for 3\n    threshold_eval = 0.2,\n    capacity_factor_train = 1.25,   # experts have fixed capacity per batch. 
we need some extra capacity in case gating is not perfectly balanced.\n    capacity_factor_eval = 2.,      # capacity_factor_* should be set to a value \u003e=1\n    balance_loss_coef = 1e-2,       # multiplier on the auxiliary expert balancing loss\n    router_z_loss_coef = 1e-3,      # loss weight for router z-loss\n)\n\ninputs = torch.randn(4, 1024, 512)\nout, total_aux_loss, balance_loss, router_z_loss = moe(inputs) # (4, 1024, 512), (1,), (1,), (1,)\n\n# for the entire mixture of experts block, in context of transformer\n\nfrom st_moe_pytorch import SparseMoEBlock\n\nmoe_block = SparseMoEBlock(\n    moe,\n    add_ff_before = True,\n    add_ff_after = True\n)\n\nout, total_aux_loss, balance_loss, router_z_loss = moe_block(inputs) # (4, 1024, 512), (1,), (1,), (1,)\n\n# the total auxiliary loss will need to be summed and then added to the main loss\n\n# the other two losses are the unweighted breakdown for logging purposes\n```\n\n## Todo\n\n- [x] add the router z-loss proposed in paper\n- [x] add the geglu expert with multiplicative gating\n- [x] add an entire sparse moe block, complete with rmsnorm + residual as well as the ability to specify a feedforward before or after for stability\n- [x] double check equation for router z-loss for experts inner in hierarchical moe\n- [x] redo all the transcribed code from google with einops, as it is not very clear\n- [x] consult some MoE experts in the open source community; question why hierarchical MoE is needed, in light of results from soft-MoE\n- [x] offer top-n gating generalization, as it seems top3 (with smaller threshold) can work even better\n- [x] figure out if there was an error in \u003ca href=\"https://github.com/lucidrains/mixture-of-experts/blob/master/mixture_of_experts/mixture_of_experts.py#L210\"\u003ea previous transcription\u003c/a\u003e - no there was not an error\n- [x] allow for different thresholds for second vs third routed expert\n- [x] add coordinate descent based routing\n- [x] make 
first naive non-optimized attempt at distributed code for mixture of experts\n\n- [ ] distributed\n    - [x] handle any world size less than number of experts\n    - [x] handle any world size greater than number of experts - for now, just have remainder machines do nothing\n    - [x] support variable batch sizes\n    - [x] support variable seq lengths\n    - [ ] figure out how to move assert.py to pytests\n    - [ ] simplify the variable sequence length test code from another folder and move in so other researchers gain confidence\n    - [ ] optimize\n    - [ ] figure out what is faster, all gather, or broadcast with async followed by barrier\n    - [ ] make all distributed code pluggable, for different strategies\n    - [ ] figure out why there is tiny error in gradients\n\n- [ ] improvise a `Top2GatingWithCoordinateDescent` for `MoE` without `importance`\n\n## Citations\n\n```bibtex\n@inproceedings{Zoph2022STMoEDS,\n    title   = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},\n    author  = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},\n    year    = {2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fst-moe-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fst-moe-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fst-moe-pytorch/lists"}