{"id":21864075,"url":"https://github.com/opensparsellms/llama-moe-v2","last_synced_at":"2025-08-11T21:20:26.377Z","repository":{"id":264822632,"uuid":"894237728","full_name":"OpenSparseLLMs/LLaMA-MoE-v2","owner":"OpenSparseLLMs","description":"🚀 LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training","archived":false,"fork":false,"pushed_at":"2024-12-03T07:26:19.000Z","size":2318,"stargazers_count":78,"open_issues_count":3,"forks_count":11,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-07T13:05:11.162Z","etag":null,"topics":["attention","fine-tuning","instruction-tuning","llama","llama3","mixture-of-experts","moe","sft","sparsity"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2411.15708","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenSparseLLMs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-26T02:07:00.000Z","updated_at":"2025-04-07T10:41:10.000Z","dependencies_parsed_at":"2025-01-10T12:42:06.520Z","dependency_job_id":"c5fa0e9a-aeea-4c9f-8b60-71ed45a7433a","html_url":"https://github.com/OpenSparseLLMs/LLaMA-MoE-v2","commit_stats":null,"previous_names":["opensparsellms/llama-moe-v2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSparseLLMs%2FLLaMA-MoE-v2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSparseLLMs%2FLLaMA-MoE-v2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/OpenSparseLLMs%2FLLaMA-MoE-v2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSparseLLMs%2FLLaMA-MoE-v2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenSparseLLMs","download_url":"https://codeload.github.com/OpenSparseLLMs/LLaMA-MoE-v2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657276,"owners_count":20974344,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","fine-tuning","instruction-tuning","llama","llama3","mixture-of-experts","moe","sft","sparsity"],"created_at":"2024-11-28T04:07:21.957Z","updated_at":"2025-04-07T13:05:17.332Z","avatar_url":"https://github.com/OpenSparseLLMs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003eLLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training\u003c/h1\u003e\n  \u003cimg src=\"docs/imgs/title-favicon.png\" width=\"200\" alt=\"LLaMA-MoE favicon\" style=\"border-radius: 5%;\"\u003e\u003cbr /\u003e\n  \u003cspan style=\"color:red\"\u003e📢 \u003cstrong\u003e\u003ci\u003eA SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!\u003c/i\u003e\u003c/strong\u003e\u003c/span\u003e\n  \u003cdiv\u003e\n    \u003ca href=\"https://huggingface.co/LLaMA-MoE-v2\" target=\"_blank\"\u003e🤗 Model Weights\u003c/a\u003e | \u003ca href=\"#quick-start\"\u003e🚀 Quick Start\u003c/a\u003e | \u003ca href=\"#installation\"\u003e⚙️ Installation Guide\u003c/a\u003e | \u003ca 
href=\"#expert-construction\"\u003e🚧 Expert Construction\u003c/a\u003e | \u003ca href=\"#sft\"\u003e💬 Supervised Fine-Tuning (SFT)\u003c/a\u003e | \u003ca href=\"#performance\"\u003e💎 Evaluation\u003c/a\u003e  \u003cbr /\u003e \n    \u003ca href=\"https://arxiv.org/pdf/2411.15708\" target=\"_blank\" style=\"display: inline-block; margin-top: 10px;\"\u003e 📃 Technical Report \u003c/a\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\n\n\n\u003ch2 id=\"updates\"\u003e🚀 Updates\u003c/h2\u003e\n\n📆[2024-12-03] 🎈 We scale the training data to 8.4B tokens and release the new MLP-MoE (8top2) model. The new model achieves 59.6 on GSM8K and 57.1 on HumanEval.  \n\n\n\u003ch2 id=\"llama-moe\"\u003e🎉 Introduction\u003c/h2\u003e\n\nLLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on [LLaMA3](https://github.com/facebookresearch/llama).\nWe build LLaMA-MoE-v2 with the following two steps:\n1. **Partition** LLaMA's FFN layers or Attention layers into sparse experts and insert a top-K gate for each layer of experts.\n2. **Fine-tune** the constructed MoE models on open-source data with two-stage supervised fine-tuning (SFT).\n\n![Overall Framework](./docs/imgs/llama_moev2.jpg )\n\n\u003ch2 id=\"features\"\u003e🔥 Features\u003c/h2\u003e\n\n1. **Support building Attention MoE and MLP MoE**:\n   1. build Attention MoE models with attention layers\n   2. build MLP MoE models with MLP layers\n2. **Multiple Expert Construction Methods**:\n   1. random MLP MoE construction (vanilla)\n   2. residual MLP MoE construction (residual)\n3. **Packed Padding Training**\n4. **Support training with megablocks**\n4. 
**Two-stage \u0026 Open-source data for SFT**:\n    \u003cdetails\u003e\n    \u003csummary\u003eFirst-stage\u003c/summary\u003e\n\n     - [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)\n     - [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca)\n     - [sharegpt_gpt4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4)\n     - [lima](https://huggingface.co/datasets/GAIR/lima)\n     - [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)\n     - [Llama-3-Magpie-Air-3M-v0.1](https://huggingface.co/datasets/Magpie-Align/Llama-3-Magpie-Air-3M-v0.1)\n\n     \u003c/details\u003e\n     \u003cdetails\u003e\n    \u003csummary\u003eSecond-stage\u003c/summary\u003e\n\n     - [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)\n     - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)\n\n     \u003c/details\u003e\n\n5. **Support building MoE for different Models**\n\n    \u003cdetails\u003e\n    \u003csummary\u003emodels\u003c/summary\u003e\n\n    - [Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)\n\n    \u003c/details\u003e\n\n\n\u003ch2 id=\"quick-start\"\u003e🚀 QuickStart\u003c/h2\u003e\n\n```python\n# python\u003e=3.10\n\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# trust_remote_code=True loads the model's custom modeling code from the Hub\nmodel_dir = \"LLaMA-MoE-v2/LLaMA-MoE-v2-3_5B-2_8\"\ntokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)\nmodel.eval()\nmodel.to(\"cuda:0\")\n\ninput_text = \"Suzhou is famous for?\"\n\n# Wrap the prompt in the LLaMA-3 chat format\ninput_text = f\"\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\\n\\n{input_text}\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\\n\\n\"\n\ninputs = tokenizer(input_text, return_tensors=\"pt\")\ninputs = inputs.to(\"cuda:0\")\n\n# Greedy decoding; max_length counts the prompt tokens as well\npred = model.generate(**inputs, max_length=50, do_sample=False)\nprint(tokenizer.decode(pred.cpu()[0], 
skip_special_tokens=True))\n```\n\n\u003ch2 id=\"installation\"\u003e⚙️ Installation\u003c/h2\u003e\n\n1. Prepare the conda environment: `conda create -n smoe python=3.11` (if your environment name is not `smoe`, you may need to change the environment name in the launch scripts).\n2. Add the correct environment variables to `~/.bashrc` (`gcc` is set to a newer version for installing `flash-attn`), e.g.:\n    ```bash\n    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH\n    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH\n    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH\n    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH\n    ```\n3. Apply the variables: `source ~/.bashrc`\n4. Install PyTorch (CUDA-11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`\n5. Install dependencies: `pip install -r requirements.txt`\n6. Install `flash-attn`: `pip install flash-attn==2.6.1 --no-build-isolation`. You may need to follow the [flash-attn installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to avoid some errors.\n7. Install the latest Git: `conda install git`\n8. Clone the repo: `git clone git@github.com:LLaMA-MoE/LLaMA-MoE-v2.git` (If you have not set up an SSH key for GitHub, you may not be able to clone over SSH. See the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) for details.)\n9. Change the current directory: `cd LLaMA-MoE-v2`\n10. Install `smoe` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-e): `pip install -e .[dev]`\n11. 
Set up `pre-commit` hooks: `pre-commit install`\n\n\n\u003ch2 id=\"performance\"\u003e📊 Model Performance\u003c/h2\u003e\n\n| Model                     | \\#Activated Experts | \\#Experts | \\#Activated Params |                      SFT Model                                  |\n| :------------------------ | :-----------------: | :-------: | :----------------: | :------------------------------------: |\n| **LLaMA-MLP-MoE (2/8)**  |          2          |     8     |        3.8B        | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft)    |\n| **LLaMA-MLP-MoE (1+1/7)**|          2          |     8     |        3.8B        | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-residual-sft)  |\n\n\n\n| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) |\n|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| [LLaMA3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |\n| [INCITE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1) | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |\n| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT) | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |\n| [Gemma-2-2b](https://huggingface.co/google/gemma-2-2b-it) | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |\n| [Salamandra-2b](https://huggingface.co/BSC-LT/salamandra-2b-instruct) | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |\n| [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |\n| [OpenMoE-3B-9B](https://huggingface.co/OrionZheng/openmoe-8b-chat) | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 
65.7 | 33.3 | 40.5 | 56.5 |\n| [LLaMA-MoE-3B-7B](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |\n| [OLMoE-1B-7B](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT) | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |\n| **MLP-MoE (8top2)** | **7B** | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |\n| **MLP-MoE (8top2)** | **8.4B** | 41.0 | **59.6** | **57.1** | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |\n| **MLP-MoE (1+7top1)** | **7B** | 42.7 | 55.0 | 51.2 | **36.0** | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |\n\n\n\n\n\u003ch2 id=\"expert-construction\"\u003e🚧 Expert Construction for MLP MoE\u003c/h2\u003e\n\n- Vanilla LLaMA-MoE-v2: `sbatch scripts/expert_construction/convert/convert_mixtral_v2.sh`\n- Residual LLaMA-MoE-v2: `sbatch scripts/expert_construction/convert/convert_mixtral_residual_v2.sh`\n\nFor more information, please refer to [Expert Construction docs](docs/expert_construction/README.md).\n\n\n\u003ch2 id=\"sft\"\u003e💬 Supervised Fine-Tuning (SFT)\u003c/h2\u003e\n\n- **NOTICE:** Please create `logs/` folder manually: `mkdir -p logs`\n\n  We provide simple examples of SFT to build chatbots. 
Please refer to [SFT docs](docs/supervised_fine_tuning/LLaMA-MoE-v2.md) for more details.\n\n\n\n\n\u003ch2 id=\"citation\"\u003e📑 Citation\u003c/h2\u003e\n\n```bibtex\n@misc{llama-moe-v2,\n  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},\n  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},\n  year={2024},\n  month={Nov},\n  url={https://arxiv.org/abs/2411.15708}\n}\n```\n\n\u003chr\u003e\n\u003cp align=\"center\"\u003eLLaMA-MoE Team w/ ❤️\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensparsellms%2Fllama-moe-v2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopensparsellms%2Fllama-moe-v2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensparsellms%2Fllama-moe-v2/lists"}