Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://github.com/pjlab-sys4nlp/llama-moe
continual-pre-training expert-partition llama llm mixture-of-experts moe
Repository metadata:
- Host: GitHub
- URL: https://github.com/pjlab-sys4nlp/llama-moe
- Owner: pjlab-sys4nlp
- License: apache-2.0
- Created: 2023-07-24T06:15:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-25T02:33:49.000Z (6 months ago)
- Last Synced: 2024-11-15T22:08:14.435Z (27 days ago)
- Topics: continual-pre-training, expert-partition, llama, llm, mixture-of-experts, moe
- Language: Python
- Homepage: https://arxiv.org/abs/2406.16554
- Size: 1.66 MB
- Stars: 883
- Watchers: 8
- Forks: 46
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-mixture-of-experts - Dec 2023
- awesome-ai-papers - llama-moe, listed alongside [PEER-pytorch](https://github.com/lucidrains/PEER-pytorch), [GRIN-MoE](https://github.com/microsoft/GRIN-MoE), [MoE-plus-plus](https://github.com/SkyworkAI/MoE-plus-plus), and [MoH](https://github.com/SkyworkAI/MoH) (NLP / 3. Pretraining)
- StarryDivineSky - pjlab-sys4nlp/llama-moe - MoE: partitions LLaMA's FFNs into sparse experts and inserts a top-K gate for each layer of experts; the initialized MoE model is then continually pre-trained with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama. (A01_Text Generation_Text Dialogue / large language dialogue models and data)
README
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!
🤗 Model Weights | 🚀 Quick Start | ⚙️ Installation Guide | 🚧 Expert Construction | 🚅 Continual Pre-training | 💎 Evaluation | 💬 Supervised Fine-Tuning (SFT)
📃 Technical Report

🎉 Introduction
LLaMA-MoE is a series of open-sourced Mixture-of-Expert (MoE) models based on [LLaMA](https://github.com/facebookresearch/llama) and [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama).
We build LLaMA-MoE with the following two steps:
1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts.
2. Continually pre-train the initialized MoE model with optimized data sampling weights from [Sheared LLaMA](https://arxiv.org/abs/2310.06694) and filtered datasets from [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama). A minimal sketch of the resulting MoE layer is shown below.

![MoE Routing](./docs/imgs/MoE-Routing.gif)
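To make step 1 concrete, here is a minimal, self-contained sketch of what a LLaMA-style SwiGLU FFN looks like after it has been split into experts with a top-K gate. The class names, the default sizes (8 experts, top-2 routing over a 4096/11008 LLaMA-7B-style FFN), and the simple routing loop are illustrative assumptions for readability, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One expert: a SwiGLU FFN built on a slice of the original intermediate neurons."""

    def __init__(self, hidden_size: int, expert_intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, expert_intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, expert_intermediate_size, bias=False)
        self.down_proj = nn.Linear(expert_intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class TopKMoEFFN(nn.Module):
    """The dense FFN replaced by `num_experts` slices; each token is routed to `top_k` of them."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            [ExpertFFN(hidden_size, intermediate_size // num_experts) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)                 # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topk_idx[:, k] == expert_id
                if mask.any():  # weight each selected expert's output by its gate score
                    out[mask] += topk_scores[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


layer = TopKMoEFFN(hidden_size=4096, intermediate_size=11008)
print(layer(torch.randn(5, 4096)).shape)  # torch.Size([5, 4096])
```

Routing each token to only 2 of 8 expert slices activates roughly a quarter of the original FFN parameters per token, which is how the activated-parameter counts reported below drop to 3.0~3.5B.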
🔥 Features
1. **Lightweight Models**: Only 3.0~3.5B parameters are activated per token, which makes the models friendly for deployment and research use.
2. **Multiple Expert Construction Methods**:
1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient ([Zhang et al., 2022](http://arxiv.org/abs/2110.01786), [Zuo et al., 2022](http://arxiv.org/abs/2204.07675))
2. Neuron-Sharing: Inner, Inter (residual)
3. **Multiple MoE Gating Strategies**:
1. TopK Noisy Gate ([Shazeer et al., 2017](http://arxiv.org/abs/1701.06538))
2. Switch Gating ([Fedus et al., 2022](http://arxiv.org/abs/2101.03961))
4. **Fast Continual Pre-training**:
1. FlashAttention-v2 integrated ([Dao, 2023](https://github.com/Dao-AILab/flash-attention))
2. Fast streaming dataset loading
5. **Abundant Monitor Items**:
1. Gate load, gate importance
2. Loss on steps, loss on tokens, balance loss
3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
4. Other visualization utilities
6. **Dynamic Weight Sampling**:
1. Self-defined static sampling weights
2. Sheared LLaMA's dynamic batch loading ([Xia et al., 2023](http://arxiv.org/abs/2310.06694)); see the sketch below
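The dynamic batch loading mentioned in item 6 adjusts per-domain sampling weights during training so that domains whose loss still exceeds a reference loss get sampled more often. The multiplicative update, step size, and loss values below are an illustrative sketch of that idea, not the exact procedure used in this repository.

```python
import numpy as np

DOMAINS = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]


def update_sampling_weights(weights, current_loss, reference_loss, step_size=1.0):
    """Domains with larger excess loss (current - reference) receive a larger sampling share."""
    excess = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    new_weights = np.asarray(weights) * np.exp(step_size * excess)
    return new_weights / new_weights.sum()


# Start from uniform weights; pretend en_cc is lagging behind its reference loss.
weights = np.full(len(DOMAINS), 1.0 / len(DOMAINS))
current = [1.10, 1.05, 1.00, 1.40, 0.95, 1.00, 0.90]
reference = [1.10, 1.05, 1.00, 1.20, 0.95, 1.00, 0.90]
weights = update_sampling_weights(weights, current, reference)
print(dict(zip(DOMAINS, weights.round(3))))  # en_cc's share grows; the others shrink proportionally
```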
🚀 QuickStart

```python
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three
```

⚙️ Installation
1. Prepare the conda environment: `conda create -n smoe python=3.11` (If your environment name is not `smoe`, you may need to change it in the launching scripts.)
2. Add the correct environment variables to `~/.bashrc` (a newer `gcc` is required to install `flash-attn`), e.g.:
```bash
export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
```
3. Make the variables take effect: `source ~/.bashrc`
4. Install PyTorch (CUDA-11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
5. Install dependencies: `pip install -r requirements.txt`
6. Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the [flash-attn installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to avoid some errors. A quick sanity check for this step is shown after this list.
7. Install the latest Git: `conda install git`
8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone over SSH. See the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) on adding one.)
9. Change current directory: `cd llama-moe`
10. Install `smoe` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-e): `pip install -e .[dev]`
11. Set up `pre-commit` hooks: `pre-commit install`
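After finishing the steps above, a quick sanity check (referenced in step 6) is to confirm that the CUDA build of PyTorch and `flash-attn` import cleanly. This snippet is an illustrative addition, not part of the repository:

```python
# Illustrative post-install check: verify the CUDA build of PyTorch and the flash-attn install.
import torch
import flash_attn

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```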
📊 Model Performance

| Model | \#Activated Experts | \#Experts | \#Activated Params | Foundation Model | SFT Model |
| :------------------------ | :-----------------: | :-------: | :----------------: | :---------------------------------------------------------------: | :------------------------------------------------------------------: |
| **LLaMA-MoE-3.0B** | 2 | 16 | 3.0B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_0B-2_16) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_0B-2_16-sft) |
| **LLaMA-MoE-3.5B (4/16)** | 4 | 16 | 3.5B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-4_16) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-4_16-sft) |
| **LLaMA-MoE-3.5B (2/8)** | 2 | 8 | 3.5B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) |

- Foundation models
| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :------------------------------------------------------------------------------------ | :------: | :------: | :------: | :--------: | :------: | :--------: | :------------: | :------: | :--------: | :------: | :------: | :------: |
| [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b) | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| [Pythia-2.8B](https://huggingface.co/EleutherAI/pythia-2.8b) | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| [INCITE-BASE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| [Open-LLaMA-3B-v2](https://huggingface.co/openlm-research/open_llama_3b_v2) | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B) | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | **27.3** |
| **LLaMA-MoE-3.0B** | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | **30.6** | 71.9 | 66.6 | 17.0 | 26.8 |
| **LLaMA-MoE-3.5B (4/16)** | **57.7** | 87.6 | **77.9** | 65.5 | **65.6** | **44.2** | **73.3** | 29.7 | **75.0** | **69.5** | **20.3** | 26.8 |
| **LLaMA-MoE-3.5B (2/8)** | 57.6 | **88.4** | 77.6 | **66.7** | 65.3 | 43.1 | **73.3** | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

- SFT models
| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :------------------------------------- | :---: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction
- Neuron-Independent
- IndependentRandom: `bash ./scripts/expert_construction/split/run_split_random.sh`
- IndependentClustering: `bash ./scripts/expert_construction/split/run_split_clustering.sh`
- Neuron-Sharing
- SharingInner: `bash ./scripts/expert_construction/split/run_split_gradient.sh`
- SharingInter: `bash ./scripts/expert_construction/split/run_split_gradient_residual.sh`

For more information, please refer to the [Expert Construction docs](docs/expert_construction/README.md).
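As a concrete picture of the neuron-independent `Random` split, the sketch below shuffles the dense FFN's intermediate-neuron indices and gives each expert an equal share. The sizes (a LLaMA-7B-style 11008-dimensional FFN, 8 experts) and the helper function are illustrative assumptions, not the repository's implementation.

```python
import torch


def random_neuron_partition(intermediate_size: int, num_experts: int, seed: int = 0):
    """Shuffle the dense FFN's intermediate-neuron indices and split them evenly across experts."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=generator)
    return list(perm.chunk(num_experts))


# An 11008-dimensional FFN split into 8 experts of 1376 neurons each.
expert_neuron_ids = random_neuron_partition(intermediate_size=11008, num_experts=8)
print([ids.numel() for ids in expert_neuron_ids])  # [1376, 1376, 1376, 1376, 1376, 1376, 1376, 1376]
```

Each expert then keeps only the rows of `gate_proj`/`up_proj` and the columns of `down_proj` that correspond to its indices, so the experts together cover the original FFN exactly once.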
🚅 Continual Pre-training
### Tokenization
Download [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) into `/path_to_data` and put data from different domains into separate folders:
- `/path_to_data/en_arxiv`
- `/path_to_data/en_book`
- `/path_to_data/en_c4`
- `/path_to_data/en_cc`
- `/path_to_data/en_stack`
- `/path_to_data/en_wikipedia`
- `/path_to_data/github`

Each file should end with `*.jsonl`, and each line should look like:
```
{"id": "id-info", "content": "raw text to be tokenized"}
```

Run the following command to tokenize the data in each folder:
```bash
python -m smoe.utils.tokenize \
-f jsonl \
-t /path_to_tokenizer \
-i /path_to_data/en_arxiv \
-o /path_to_data_tokenized/en_arxiv
```
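To tokenize all domain folders in one pass, a small driver like the one below can wrap the command above. The loop itself is an illustrative addition; the paths are the same placeholders used above, and the flags match the documented command.

```python
import subprocess
from pathlib import Path

# Placeholder paths matching the layout described above.
DATA_DIR = Path("/path_to_data")
OUT_DIR = Path("/path_to_data_tokenized")
TOKENIZER = "/path_to_tokenizer"
DOMAINS = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]

for domain in DOMAINS:
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", TOKENIZER,
            "-i", str(DATA_DIR / domain),
            "-o", str(OUT_DIR / domain),
        ],
        check=True,  # stop immediately if tokenization of any domain fails
    )
```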
### Continual Pre-training (CPT)

- **NOTICE:** Please create the `logs/` folder manually: `mkdir -p logs`
- To run the continual pre-training, please check the [CPT docs](docs/continual_pretraining/README.md).

💎 Evaluation
- For evaluation on Natural Questions (NQ), please refer to [opencompass](https://github.com/Spico197/opencompass/tree/main).
- For other tasks, please refer to [lm-eval-harness](https://github.com/spico197/smoe-eval).

💬 Supervised Fine-Tuning (SFT)
We provide simple examples of SFT to build chatbots.
Please refer to the [SFT docs](docs/supervised_fine_tuning/SFT.md) and `scripts/sft` for more details.

📑 Citation
```bibtex
@article{llama-moe,
title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
journal={arXiv preprint arXiv:2406.16554},
year={2024},
url={https://arxiv.org/abs/2406.16554},
}
```
LLaMA-MoE Team w/ ❤️