{"id":13604654,"url":"https://github.com/MachineLearningSystem/fastmoe-thu","last_synced_at":"2025-04-12T02:31:09.809Z","repository":{"id":185461763,"uuid":"612456385","full_name":"MachineLearningSystem/fastmoe-thu","owner":"MachineLearningSystem","description":"A fast MoE impl for PyTorch","archived":false,"fork":true,"pushed_at":"2023-02-13T09:03:20.000Z","size":843,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-08-02T19:35:25.133Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://fastmoe.ai","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"laekov/fastmoe","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-03-11T02:03:06.000Z","updated_at":"2023-03-11T02:02:50.000Z","dependencies_parsed_at":"2023-08-02T03:16:49.474Z","dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/fastmoe-thu","commit_stats":null,"previous_names":["machinelearningsystem/fastmoe-thu"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Ffastmoe-thu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Ffastmoe-thu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Ffastmoe-thu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Ffastmoe-thu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/fastmoe-thu/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489604,"owners_count":17153785,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:49.798Z","updated_at":"2024-11-07T09:30:40.878Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"\u003cimg height='60px' src='doc/logo/rect.png'/\u003e\n\n[Release note](doc/release-note.md)\n| [中文文档](doc/readme-cn.md)\n| [Slack workspace](https://join.slack.com/t/fastmoe/shared_invite/zt-mz0ai6ol-ggov75D62YsgHfzShw8KYw)\n\n## Introduction\n\nAn easy-to-use and efficient system to support the Mixture of Experts (MoE) \nmodel for PyTorch. \n\n## Installation\n\n### Prerequisites\n\nPyTorch with CUDA is required. The repository is currently tested with PyTorch\nv1.10.0 and CUDA 11.3, with designed compatibility to older and newer versions.\n\nThe minimum version of supported PyTorch is `1.7.2` with CUDA `10`. However,\nthere are a few known issues that requires manual modification of FastMoE's\ncode with specific older dependents.\n\nIf the distributed expert feature is enabled, NCCL with P2P communication\nsupport, typically versions `\u003e=2.7.5`, is needed. \n\n### Installing\n\nFastMoE contains a set of PyTorch customized opearators, including both C and\nPython components. Use `python setup.py install` to easily install and enjoy\nusing FastMoE for training.\n\nThe distributed expert feature is enabled by default. If you want to disable\nit, pass environment variable `USE_NCCL=0` to the setup script.\n\nNote that an extra NCCL developer package is needed, which has to be consistent\nwith your PyTorch's NCCL version, which can be inspected by running\n`torch.cuda.nccl.version()`. The \n[official PyTorch docker image](https://hub.docker.com/r/pytorch/pytorch) is\nrecommended, as the environment is well-setup there. Otherwise, you can access\nthe [download link of all NCCL\nversions](https://developer.nvidia.com/nccl/nccl-legacy-downloads) to download\nthe NCCL package that is suitable for you.\n\n## Usage \n\n### FMoEfy a Transformer model\n\nTransformer is currently one of the most popular models to be extended by MoE. Using\nFastMoE, a Transformer-based model can be extended as MoE by an one-key plugin\nshown as follow.\n\nFor example, when using [Megatron-LM](https://github.com/nvidia/megatron-lm),\nusing the following lines can help you easily scale up the MLP layers to\nmultiple experts.\n\n```python\nmodel = ...\n\nfrom fmoe.megatron import fmoefy\nmodel = fmoefy(model, num_experts=\u003cnumber of experts per worker\u003e)\n\ntrain(model, ...)\n```\n\nA detailed tutorial to _moefy_ Megatron-LM can be found\n[here](examples/megatron).\n\n### Using FastMoE as a PyTorch module\n\nAn example MoE transformer model can be seen in the\n[Transformer-XL](examples/transformer-xl) example. The easist way is to replace\nthe MLP layer by the `FMoE` layers.\n\n### Using FastMoE in Parallel\n\nFastMoE supports both data parallel and model parallel. \n\n#### Data Parallel\n\nIn FastMoE's data parallel mode, both the gate and the experts are replicated on each worker. \nThe following figure shows the forward pass of a 3-expert MoE with 2-way data parallel.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"doc/fastmoe_data_parallel.png\" width=\"600\"\u003e\n\u003c/p\u003e\n\nFor data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's `DataParallel` or `DistributedDataParallel`.\nThe only drawback of data parallel is that the number of experts is constrained by each worker's memory.\n\n#### Model Parallel\n\nIn FastMoE's model parallel mode, the gate network is still replicated on each worker but\nexperts are placed separately across workers.\nThus, by introducing additional communication cost, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.\n\nThe following figure shows the forward pass of a 6-expert MoE with 2-way model parallel. Note that experts 1-3 are located in worker 1 while experts 4-6 are located in worker 2.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"doc/fastmoe_model_parallel.png\" width=\"600\"\u003e\n\u003c/p\u003e\n\nFastMoE's model parallel requires sophiscated parallel strategies that neither PyTorch nor\nMegatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is\nintroduced to replace PyTorch's DDP module.\n\n#### Faster Performance Features\n\nFrom a PPoPP'22 paper, _FasterMoE: modeling and optimizing training of\nlarge-scale dynamic pre-trained models_, we have adopted techniques to make\nFastMoE's model parallel much more efficient.\n\nThese optimizations are named as **Faster Performance Features**, and can be\nenabled via several environment variables. Their usage and constraints are\ndetailed in [a separate document](doc/fastermoe).\n\n## Citation\n\nFor the core FastMoE system.\n\n```\n@article{he2021fastmoe,\n      title={FastMoE: A Fast Mixture-of-Expert Training System}, \n      author={Jiaao He and Jiezhong Qiu and Aohan Zeng and Zhilin Yang and Jidong Zhai and Jie Tang},\n      journal={arXiv preprint arXiv:2103.13262},\n      year={2021}\n}\n```\n\nFor the [faster performance features](doc/fastermoe).\n\n```\n@inproceedings{he2022fastermoe,\n    author = {He, Jiaao and Zhai, Jidong and Antunes, Tiago and Wang, Haojie and Luo, Fuwen and Shi, Shangfeng and Li, Qin},\n    title = {FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models},\n    year = {2022},\n    isbn = {9781450392044},\n    publisher = {Association for Computing Machinery},\n    address = {New York, NY, USA},\n    url = {https://doi.org/10.1145/3503221.3508418},\n    doi = {10.1145/3503221.3508418},\n    booktitle = {Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},\n    pages = {120–134},\n    numpages = {15},\n    keywords = {parallelism, distributed deep learning, performance modeling},\n    location = {Seoul, Republic of Korea},\n    series = {PPoPP '22}\n}\n```\n\n## Troubleshootings / Discussion\n\nIf you have any problem using FastMoE, or you are interested in getting involved in developing FastMoE, feel free to join [our slack channel](https://join.slack.com/t/fastmoe/shared_invite/zt-mz0ai6ol-ggov75D62YsgHfzShw8KYw).\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["MoE"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Ffastmoe-thu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Ffastmoe-thu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Ffastmoe-thu/lists"}