Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
https://github.com/SHI-Labs/CuMo
- Host: GitHub
- URL: https://github.com/SHI-Labs/CuMo
- Owner: SHI-Labs
- License: apache-2.0
- Created: 2024-05-08T05:11:08.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-08T06:04:21.000Z (5 months ago)
- Last Synced: 2024-07-30T05:37:30.705Z (3 months ago)
- Language: Python
- Homepage:
- Size: 7.65 MB
- Stars: 120
- Watchers: 2
- Forks: 8
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - SHI-Labs/CuMo - Integrates co-upcycled Top-K sparsely-gated mixture-of-experts blocks into the vision encoder and MLP connector, enhancing the capabilities of multimodal LLMs. A three-stage training approach with auxiliary losses further stabilizes training and keeps the expert load balanced. CuMo is trained exclusively on open-source datasets and achieves performance comparable to other state-of-the-art multimodal LLMs on multiple VQA and visual instruction-following benchmarks. (Multimodal large models / Web services, other)
README
# CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
[Jiachen Li](https://chrisjuniorli.github.io/),
[Xinyao Wang](),
[Sijie Zhu](https://jeff-zilence.github.io/),
[Chia-wen Kuo](https://sites.google.com/view/chiawen-kuo/home),
[Lu Xu](),
[Fan Chen](),
[Jitesh Jain](https://praeclarumjj3.github.io/),
[Humphrey Shi](https://www.humphreyshi.com/home),
[Longyin Wen](https://scholar.google.com/citations?user=PO9WFl0AAAAJ&hl=en)

## Release
- [06/07] We released checkpoints of CuMo after the pre-training and pre-finetuning stages at [CuMo-misc](https://huggingface.co/shi-labs/CuMo-misc).
- [05/10] Check out the [Demo](https://huggingface.co/spaces/shi-labs/CuMo-7b-zero) built on a Gradio ZeroGPU Space.
- [05/09] Check out the [arXiv](https://arxiv.org/abs/2405.05949) version of the paper!
- [05/08] We released **CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts** with a [project page](https://chrisjuniorli.github.io/project/CuMo/) and [code](https://github.com/SHI-Labs/CuMo).

## Contents
- [Release](#release)
- [Contents](#contents)
- [Overview](#overview)
- [Installation](#installation)
- [Model Zoo](#model-zoo)
- [Demo setup](#demo-setup)
- [Gradio Web UI](#gradio-web-ui)
- [CLI Inference](#cli-inference)
- [Getting Started](#getting-started)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
- [License](#license)

## Overview
In this project, we explore the usage and training recipe of Mixture-of-Experts (MoE) in multimodal LLMs. We propose __CuMo__, which incorporates co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into the vision encoder and the MLP connector, thereby enhancing the capabilities of multimodal LLMs. We further adopt a three-stage training approach with auxiliary losses to stabilize the training process and maintain a balanced load across experts.
CuMo is exclusively trained on open-sourced datasets and achieves comparable performance to other state-of-the-art multimodal LLMs on multiple VQA and visual-instruction-following benchmarks.
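To make the routing concrete, below is a minimal, hedged sketch of a Top-K sparsely-gated MoE block with a Switch-Transformer-style load-balancing auxiliary loss. It is illustrative only: the `TopKSparseMoE` name, expert count, hidden width, and loss form are assumptions, not CuMo's actual implementation, and real co-upcycling would initialize every expert from the corresponding pretrained MLP weights rather than from scratch.

```python
# Illustrative sketch only; not CuMo's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseMoE(nn.Module):
    """Top-K sparsely-gated MoE over MLP experts, with a load-balancing
    auxiliary loss. Expert count and widths here are assumptions."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Co-upcycling would copy one pretrained MLP into every expert;
        # here the experts are simply freshly initialized.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim); route each token to its top-k experts.
        logits = self.gate(x)                                # (tokens, experts)
        probs = logits.softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)      # (tokens, k)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize gates

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += topk_p[token_idx, slot, None] * expert(x[token_idx])

        # Load-balancing auxiliary loss: fraction of tokens routed to each
        # expert times its mean gate probability, summed over experts.
        counts = F.one_hot(topk_i, self.num_experts).float().sum(dim=(0, 1))
        frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * (frac_tokens * frac_probs).sum()
        return out, aux_loss

# Usage: y, aux = TopKSparseMoE(dim=64)(torch.randn(10, 64))
```

In CuMo, blocks of this kind replace the MLPs in the vision encoder and the vision-language connector, and the auxiliary loss terms are added to the training objective to keep the expert load balanced.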
## Installation
1. Clone this repo.
```bash
git clone https://github.com/SHI-Labs/CuMo.git
cd CuMo
```

2. Install dependencies.
*We used a Python 3.9 venv for all experiments; Python 3.9 or 3.10 under Anaconda should also work if you prefer conda.*
```bash
# venv:
python -m venv /path/to/new/virtual/cumo
source /path/to/new/virtual/cumo/bin/activate

# anaconda:
conda create -n cumo python=3.9 -y
conda activate cumo

pip install --upgrade pip
pip install -e .
```

3. Install additional packages for training CuMo.
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
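Optionally, you can sanity-check the install before training. This snippet is not part of the official setup, just a minimal sketch: it assumes the editable install exposes the `cumo` package (the demo commands below run `python -m cumo.serve...`) and that `flash_attn` built successfully.

```python
# Not part of the official setup: a minimal post-install sanity check.
import torch
import flash_attn
import cumo  # the package installed by `pip install -e .`

print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```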
## Model Zoo

The CuMo model weights are open-sourced on Hugging Face:
| Model | Base LLM | Vision Encoder | MLP Connector | Download |
|----------|----------|----------|----------|----------------|
| CuMo-7B | Mistral-7B-Instruct-v0.2 | CLIP-MoE | MLP-MoE | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-mistral-7b) |
| CuMo-8x7B | Mixtral-8x7B-Instruct-v0.1 | CLIP-MoE | MLP-MoE | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-mixtral-8x7b) |

The intermediate checkpoints after pre-training and pre-finetuning are also released on Hugging Face:
| Model | Base LLM | Stage | Download |
|----------|----------|----------|--------------|
| CuMo-7B | Mistral-7B-Instruct-v0.2 | Pre-Training | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-misc/tree/main/cumo-mistral-7b) |
| CuMo-8x7B | Mixtral-8x7B-Instruct-v0.1 | Pre-Finetuning | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-misc/tree/main/cumo-mixtral-8x7b) |
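To fetch released weights into the local path the demo commands below expect, one option (an assumption, not the only way) is `huggingface_hub`:

```python
# Hedged sketch: download the CuMo-7B weights into the checkpoints/
# directory referenced by the demo commands below.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="shi-labs/CuMo-mistral-7b",       # CuMo-7B release
    local_dir="checkpoints/CuMo-mistral-7b",  # path used by the demos
)
```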
## Demo setup

### Gradio Web UI
We provide a [demo](https://huggingface.co/spaces/shi-labs/CuMo-7b-zero) based on a Gradio web UI. You can also set up the demo locally with:
```bash
CUDA_VISIBLE_DEVICES=0 python -m cumo.serve.app \
--model-path checkpoints/CuMo-mistral-7b
```
You can add `--bits 8` or `--bits 4` to reduce GPU memory usage.

### CLI Inference
If you prefer to run a demo without a web UI, you can use the following command to chat with CuMo-Mistral-7B in your terminal:
```Shell
CUDA_VISIBLE_DEVICES=0 python -m cumo.serve.cli \
--model-path checkpoints/CuMo-mistral-7b \
--image-file cumo/serve/examples/waterview.jpg
```
You can add `--load-4bit` or `--load-8bit` to reduce GPU memory usage.

## Getting Started
Please refer to [Getting Started](docs/getting_started.md) for dataset preparation, training, and inference details of CuMo.
## Citation
```
@article{li2024cumo,
title={CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts},
author={Li, Jiachen and Wang, Xinyao and Zhu, Sijie and Kuo, Chia-wen and Xu, Lu and Chen, Fan and Jain, Jitesh and Shi, Humphrey and Wen, Longyin},
  journal={arXiv preprint arXiv:2405.05949},
year={2024}
}
```

## Acknowledgement
We thank the authors of [LLaVA](https://github.com/haotian-liu/LLaVA), [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA), [S^2](https://github.com/bfshi/scaling_on_scales),
[st-moe-pytorch](https://github.com/lucidrains/st-moe-pytorch), and [mistral-src](https://github.com/mistralai/mistral-src) for releasing their source code.

## License
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](LICENSE)
[![Weight License](https://img.shields.io/badge/Weight%20License-CC%20By%20NC%204.0-red)](WEIGHT_LICENSE)

The checkpoint weights are licensed under CC BY-NC 4.0 for non-commercial use. The codebase is licensed under Apache 2.0. This project uses certain datasets and checkpoints that are subject to their respective original licenses; users must comply with all terms and conditions of those licenses.
Output from any version of CuMo is influenced by uncontrollable factors such as randomness, so this project cannot guarantee the accuracy of its output. This project accepts no legal liability for the content of the model's output, nor does it assume responsibility for any losses incurred from the use of the associated resources and outputs.