Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://github.com/pjlab-sys4nlp/llama-moe
continual-pre-training expert-partition llama llm mixture-of-experts moe
Repository metadata:
- Host: GitHub
- URL: https://github.com/pjlab-sys4nlp/llama-moe
- Owner: pjlab-sys4nlp
- License: apache-2.0
- Created: 2023-07-24T06:15:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-25T02:33:49.000Z (6 months ago)
- Last Synced: 2024-11-15T22:08:14.435Z (27 days ago)
- Topics: continual-pre-training, expert-partition, llama, llm, mixture-of-experts, moe
- Language: Python
- Homepage: https://arxiv.org/abs/2406.16554
- Size: 1.66 MB
- Stars: 883
- Watchers: 8
- Forks: 46
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-mixture-of-experts - Dec 2023
- awesome-ai-papers - llama-moe, listed alongside [PEER-pytorch](https://github.com/lucidrains/PEER-pytorch), [GRIN-MoE](https://github.com/microsoft/GRIN-MoE), [MoE-plus-plus](https://github.com/SkyworkAI/MoE-plus-plus), and [MoH](https://github.com/SkyworkAI/MoH) (NLP / 3. Pretraining)
- StarryDivineSky - pjlab-sys4nlp/llama-moe - MoE: partitions LLaMA's FFNs into sparse experts and inserts a top-K gate for each layer of experts; the initialized MoE model is then continually pre-trained with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama. (A01_Text Generation_Text Dialogue / large language dialogue models and data)
README
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!
🤗 Model Weights | 🚀 Quick Start | ⚙️ Installation Guide | 🚧 Expert Construction | 🚅 Continual Pre-training | 💎 Evaluation | 💬 Supervised Fine-Tuning (SFT)
📃 Technical Report

🎉 Introduction
LLaMA-MoE is a series of open-sourced Mixture-of-Expert (MoE) models based on [LLaMA](https://github.com/facebookresearch/llama) and [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama).
We build LLaMA-MoE with the following two steps:
1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts.
2. Continually pre-train the initialized MoE model with optimized data sampling weights from [Sheared LLaMA](https://arxiv.org/abs/2310.06694) and filtered datasets from [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama). A minimal sketch of the resulting MoE layer is shown below.

![MoE Routing](./docs/imgs/MoE-Routing.gif)
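To make step 1 concrete, here is a minimal, self-contained sketch of what a LLaMA-style SwiGLU FFN looks like after it has been split into experts with a top-K gate. The class names, the default sizes (8 experts, top-2 routing over a 4096/11008 LLaMA-7B-style FFN), and the simple routing loop are illustrative assumptions for readability, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One expert: a SwiGLU FFN built on a slice of the original intermediate neurons."""

    def __init__(self, hidden_size: int, expert_intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, expert_intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, expert_intermediate_size, bias=False)
        self.down_proj = nn.Linear(expert_intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class TopKMoEFFN(nn.Module):
    """The dense FFN replaced by `num_experts` slices; each token is routed to `top_k` of them."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            [ExpertFFN(hidden_size, intermediate_size // num_experts) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)                 # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topk_idx[:, k] == expert_id
                if mask.any():  # weight each selected expert's output by its gate score
                    out[mask] += topk_scores[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


layer = TopKMoEFFN(hidden_size=4096, intermediate_size=11008)
print(layer(torch.randn(5, 4096)).shape)  # torch.Size([5, 4096])
```

Routing each token to only 2 of 8 expert slices activates roughly a quarter of the original FFN parameters per token, which is how the activated-parameter counts reported below drop to 3.0~3.5B.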
🔥 Features
1. **Lightweight Models**: Only 3.0~3.5B parameters are activated per token, which makes the models friendly for deployment and research use.
2. **Multiple Expert Construction Methods**:
1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient ([Zhang et al., 2022](http://arxiv.org/abs/2110.01786), [Zuo et al., 2022](http://arxiv.org/abs/2204.07675))
2. Neuron-Sharing: Inner, Inter (residual)
3. **Multiple MoE Gating Strategies**:
1. TopK Noisy Gate ([Shazeer et al., 2017](http://arxiv.org/abs/1701.06538))
2. Switch Gating ([Fedus et al., 2022](http://arxiv.org/abs/2101.03961))
4. **Fast Continual Pre-training**:
1. FlashAttention-v2 integrated ([Dao, 2023](https://github.com/Dao-AILab/flash-attention))
2. Fast streaming dataset loading
5. **Abundant Monitor Items**:
1. Gate load, gate importance
2. Loss on steps, loss on tokens, balance loss
3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
4. Other visualization utilities
6. **Dynamic Weight Sampling**:
1. Self-defined static sampling weights
2. Sheared LLaMA's dynamic batch loading ([Xia et al., 2023](http://arxiv.org/abs/2310.06694)); see the sketch below
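The dynamic batch loading mentioned in item 6 adjusts per-domain sampling weights during training so that domains whose loss still exceeds a reference loss get sampled more often. The multiplicative update, step size, and loss values below are an illustrative sketch of that idea, not the exact procedure used in this repository.

```python
import numpy as np

DOMAINS = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]


def update_sampling_weights(weights, current_loss, reference_loss, step_size=1.0):
    """Domains with larger excess loss (current - reference) receive a larger sampling share."""
    excess = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    new_weights = np.asarray(weights) * np.exp(step_size * excess)
    return new_weights / new_weights.sum()


# Start from uniform weights; pretend en_cc is lagging behind its reference loss.
weights = np.full(len(DOMAINS), 1.0 / len(DOMAINS))
current = [1.10, 1.05, 1.00, 1.40, 0.95, 1.00, 0.90]
reference = [1.10, 1.05, 1.00, 1.20, 0.95, 1.00, 0.90]
weights = update_sampling_weights(weights, current, reference)
print(dict(zip(DOMAINS, weights.round(3))))  # en_cc's share grows; the others shrink proportionally
```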
🚀 QuickStart

```python
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three
```

⚙️ Installation
1. Prepare the conda environment: `conda create -n smoe python=3.11` (If your environment name is not `smoe`, you may need to change it in the launching scripts.)
2. Add the correct environment variables to `~/.bashrc` (a newer `gcc` is required to install `flash-attn`), e.g.:
```bash
export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
```
3. Make the variables take effect: `source ~/.bashrc`
4. Install PyTorch (CUDA-11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
5. Install dependencies: `pip install -r requirements.txt`
6. Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the [flash-attn installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to avoid some errors. A quick sanity check for this step is shown after this list.
7. Install the latest Git: `conda install git`
8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone over SSH. See the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) on adding one.)
9. Change current directory: `cd llama-moe`
10. Install `smoe` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-e): `pip install -e .[dev]`
11. Set up `pre-commit` hooks: `pre-commit install`
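After finishing the steps above, a quick sanity check (referenced in step 6) is to confirm that the CUDA build of PyTorch and `flash-attn` import cleanly. This snippet is an illustrative addition, not part of the repository:

```python
# Illustrative post-install check: verify the CUDA build of PyTorch and the flash-attn install.
import torch
import flash_attn

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```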
📊 Model Performance

| Model | \#Activated Experts | \#Experts | \#Activated Params | Foundation Model | SFT Model |
| :------------------------ | :-----------------: | :-------: | :----------------: | :---------------------------------------------------------------: | :------------------------------------------------------------------: |
| **LLaMA-MoE-3.0B** | 2 | 16 | 3.0B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_0B-2_16) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_0B-2_16-sft) |
| **LLaMA-MoE-3.5B (4/16)** | 4 | 16 | 3.5B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-4_16) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-4_16-sft) |
| **LLaMA-MoE-3.5B (2/8)** | 2 | 8 | 3.5B | [🤗 base](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8) | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) |

- Foundation models
| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :------------------------------------------------------------------------------------ | :------: | :------: | :------: | :--------: | :------: | :--------: | :------------: | :------: | :--------: | :------: | :------: | :------: |
| [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b) | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| [Pythia-2.8B](https://huggingface.co/EleutherAI/pythia-2.8b) | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| [INCITE-BASE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| [Open-LLaMA-3B-v2](https://huggingface.co/openlm-research/open_llama_3b_v2) | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B) | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | **27.3** |
| **LLaMA-MoE-3.0B** | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | **30.6** | 71.9 | 66.6 | 17.0 | 26.8 |
| **LLaMA-MoE-3.5B (4/16)** | **57.7** | 87.6 | **77.9** | 65.5 | **65.6** | **44.2** | **73.3** | 29.7 | **75.0** | **69.5** | **20.3** | 26.8 |
| **LLaMA-MoE-3.5B (2/8)** | 57.6 | **88.4** | 77.6 | **66.7** | 65.3 | 43.1 | **73.3** | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

- SFT models
| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :------------------------------------- | :---: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction
- Neuron-Independent
- IndependentRandom: `bash ./scripts/expert_construction/split/run_split_random.sh`
- IndependentClustering: `bash ./scripts/expert_construction/split/run_split_clustering.sh`
- Neuron-Sharing
- SharingInner: `bash ./scripts/expert_construction/split/run_split_gradient.sh`
- SharingInter: `bash ./scripts/expert_construction/split/run_split_gradient_residual.sh`

For more information, please refer to the [Expert Construction docs](docs/expert_construction/README.md).
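As a concrete picture of the neuron-independent `Random` split, the sketch below shuffles the dense FFN's intermediate-neuron indices and gives each expert an equal share. The sizes (a LLaMA-7B-style 11008-dimensional FFN, 8 experts) and the helper function are illustrative assumptions, not the repository's implementation.

```python
import torch


def random_neuron_partition(intermediate_size: int, num_experts: int, seed: int = 0):
    """Shuffle the dense FFN's intermediate-neuron indices and split them evenly across experts."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=generator)
    return list(perm.chunk(num_experts))


# An 11008-dimensional FFN split into 8 experts of 1376 neurons each.
expert_neuron_ids = random_neuron_partition(intermediate_size=11008, num_experts=8)
print([ids.numel() for ids in expert_neuron_ids])  # [1376, 1376, 1376, 1376, 1376, 1376, 1376, 1376]
```

Each expert then keeps only the rows of `gate_proj`/`up_proj` and the columns of `down_proj` that correspond to its indices, so the experts together cover the original FFN exactly once.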
🚅 Continual Pre-training
### Tokenization
Download [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) into `/path_to_data` and put data from different domains into separate folders:
- `/path_to_data/en_arxiv`
- `/path_to_data/en_book`
- `/path_to_data/en_c4`
- `/path_to_data/en_cc`
- `/path_to_data/en_stack`
- `/path_to_data/en_wikipedia`
- `/path_to_data/github`

Each file should end with `*.jsonl`, and each line should look like:
```
{"id": "id-info", "content": "raw text to be tokenized"}
```

Run the following command to tokenize the data in each folder:
```bash
python -m smoe.utils.tokenize \
-f jsonl \
-t /path_to_tokenizer \
-i /path_to_data/en_arxiv \
-o /path_to_data_tokenized/en_arxiv
```
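To tokenize all domain folders in one pass, a small driver like the one below can wrap the command above. The loop itself is an illustrative addition; the paths are the same placeholders used above, and the flags match the documented command.

```python
import subprocess
from pathlib import Path

# Placeholder paths matching the layout described above.
DATA_DIR = Path("/path_to_data")
OUT_DIR = Path("/path_to_data_tokenized")
TOKENIZER = "/path_to_tokenizer"
DOMAINS = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]

for domain in DOMAINS:
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", TOKENIZER,
            "-i", str(DATA_DIR / domain),
            "-o", str(OUT_DIR / domain),
        ],
        check=True,  # stop immediately if tokenization of any domain fails
    )
```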
### Continual Pre-training (CPT)

- **NOTICE:** Please create the `logs/` folder manually: `mkdir -p logs`
- To run the continual pre-training, please check the [CPT docs](docs/continual_pretraining/README.md).

💎 Evaluation
- For evaluation on Natural Questions (NQ), please refer to [opencompass](https://github.com/Spico197/opencompass/tree/main).
- For other tasks, please refer to [lm-eval-harness](https://github.com/spico197/smoe-eval).

💬 Supervised Fine-Tuning (SFT)
We provide simple examples of SFT to build chatbots.
Please refer to the [SFT docs](docs/supervised_fine_tuning/SFT.md) and `scripts/sft` for more details.

📑 Citation
```bibtex
@article{llama-moe,
title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
journal={arXiv preprint arXiv:2406.16554},
year={2024},
url={https://arxiv.org/abs/2406.16554},
}
```
LLaMA-MoE Team w/ ❤️