Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
https://github.com/SHI-Labs/CuMo
- Host: GitHub
- URL: https://github.com/SHI-Labs/CuMo
- Owner: SHI-Labs
- License: apache-2.0
- Created: 2024-05-08T05:11:08.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-08T06:04:21.000Z (5 months ago)
- Last Synced: 2024-07-30T05:37:30.705Z (3 months ago)
- Language: Python
- Homepage:
- Size: 7.65 MB
- Stars: 120
- Watchers: 2
- Forks: 8
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - SHI-Labs/CuMo - Integrates co-upcycled Top-K sparsely-gated mixture-of-experts blocks into the vision encoder and MLP connector, enhancing the capabilities of multimodal LLMs. A three-stage training approach with auxiliary losses further stabilizes training and keeps the expert load balanced. CuMo is trained exclusively on open-source datasets and achieves performance comparable to other state-of-the-art multimodal LLMs on multiple VQA and visual instruction-following benchmarks. (Multimodal large models / Web services, other)
README
# CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
[Jiachen Li](https://chrisjuniorli.github.io/),
[Xinyao Wang](),
[Sijie Zhu](https://jeff-zilence.github.io/),
[Chia-wen Kuo](https://sites.google.com/view/chiawen-kuo/home),
[Lu Xu](),
[Fan Chen](),
[Jitesh Jain](https://praeclarumjj3.github.io/),
[Humphrey Shi](https://www.humphreyshi.com/home),
[Longyin Wen](https://scholar.google.com/citations?user=PO9WFl0AAAAJ&hl=en)

## Release
- [06/07] We released checkpoints of CuMo after the pre-training and pre-finetuning stages at [CuMo-misc](https://huggingface.co/shi-labs/CuMo-misc).
- [05/10] Check out the [Demo](https://huggingface.co/spaces/shi-labs/CuMo-7b-zero) built on a Gradio ZeroGPU Space.
- [05/09] Check out the [arXiv](https://arxiv.org/abs/2405.05949) version of the paper!
- [05/08] We released **CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts** with a [project page](https://chrisjuniorli.github.io/project/CuMo/) and [code](https://github.com/SHI-Labs/CuMo).

## Contents
- [Release](#release)
- [Contents](#contents)
- [Overview](#overview)
- [Installation](#installation)
- [Model Zoo](#model-zoo)
- [Demo setup](#demo-setup)
- [Gradio Web UI](#gradio-web-ui)
- [CLI Inference](#cli-inference)
- [Getting Started](#getting-started)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
- [License](#license)

## Overview
In this project, we explore the usage and training recipe of Mixture-of-Experts (MoE) in multimodal LLMs. We propose __CuMo__, which incorporates co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into the vision encoder and the MLP connector, thereby enhancing the capabilities of multimodal LLMs. We further adopt a three-stage training approach with auxiliary losses to stabilize the training process and maintain a balanced load across experts.
CuMo is exclusively trained on open-sourced datasets and achieves comparable performance to other state-of-the-art multimodal LLMs on multiple VQA and visual-instruction-following benchmarks.
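To make the routing concrete, below is a minimal, hedged sketch of a Top-K sparsely-gated MoE block with a Switch-Transformer-style load-balancing auxiliary loss. It is illustrative only: the `TopKSparseMoE` name, expert count, hidden width, and loss form are assumptions, not CuMo's actual implementation, and real co-upcycling would initialize every expert from the corresponding pretrained MLP weights rather than from scratch.

```python
# Illustrative sketch only; not CuMo's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseMoE(nn.Module):
    """Top-K sparsely-gated MoE over MLP experts, with a load-balancing
    auxiliary loss. Expert count and widths here are assumptions."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Co-upcycling would copy one pretrained MLP into every expert;
        # here the experts are simply freshly initialized.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim); route each token to its top-k experts.
        logits = self.gate(x)                                # (tokens, experts)
        probs = logits.softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)      # (tokens, k)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize gates

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += topk_p[token_idx, slot, None] * expert(x[token_idx])

        # Load-balancing auxiliary loss: fraction of tokens routed to each
        # expert times its mean gate probability, summed over experts.
        counts = F.one_hot(topk_i, self.num_experts).float().sum(dim=(0, 1))
        frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * (frac_tokens * frac_probs).sum()
        return out, aux_loss

# Usage: y, aux = TopKSparseMoE(dim=64)(torch.randn(10, 64))
```

In CuMo, blocks of this kind replace the MLPs in the vision encoder and the vision-language connector, and the auxiliary loss terms are added to the training objective to keep the expert load balanced.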
## Installation
1. Clone this repo.
```bash
git clone https://github.com/SHI-Labs/CuMo.git
cd CuMo
```

2. Install dependencies.
*We used a Python 3.9 venv for all experiments; Python 3.9 or 3.10 under Anaconda should also work if you prefer conda.*
```bash
# venv:
python -m venv /path/to/new/virtual/cumo
source /path/to/new/virtual/cumo/bin/activate

# anaconda:
conda create -n cumo python=3.9 -y
conda activate cumo

pip install --upgrade pip
pip install -e .
```

3. Install additional packages for training CuMo.
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
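Optionally, you can sanity-check the install before training. This snippet is not part of the official setup, just a minimal sketch: it assumes the editable install exposes the `cumo` package (the demo commands below run `python -m cumo.serve...`) and that `flash_attn` built successfully.

```python
# Not part of the official setup: a minimal post-install sanity check.
import torch
import flash_attn
import cumo  # the package installed by `pip install -e .`

print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```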
## Model Zoo

The CuMo model weights are open-sourced on Hugging Face:
| Model | Base LLM | Vision Encoder | MLP Connector | Download |
|----------|----------|----------|----------|----------------|
| CuMo-7B | Mistral-7B-Instruct-v0.2 | CLIP-MoE | MLP-MoE | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-mistral-7b) |
| CuMo-8x7B | Mixtral-8x7B-Instruct-v0.1 | CLIP-MoE | MLP-MoE | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-mixtral-8x7b) |

The intermediate checkpoints after pre-training and pre-finetuning are also released on Hugging Face:
| Model | Base LLM | Stage | Download |
|----------|----------|----------|--------------|
| CuMo-7B | Mistral-7B-Instruct-v0.2 | Pre-Training | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-misc/tree/main/cumo-mistral-7b) |
| CuMo-8x7B | Mixtral-8x7B-Instruct-v0.1 | Pre-Finetuning | 🤗 [HF ckpt](https://huggingface.co/shi-labs/CuMo-misc/tree/main/cumo-mixtral-8x7b) |
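To fetch released weights into the local path the demo commands below expect, one option (an assumption, not the only way) is `huggingface_hub`:

```python
# Hedged sketch: download the CuMo-7B weights into the checkpoints/
# directory referenced by the demo commands below.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="shi-labs/CuMo-mistral-7b",       # CuMo-7B release
    local_dir="checkpoints/CuMo-mistral-7b",  # path used by the demos
)
```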
## Demo setup

### Gradio Web UI
We provide a [demo](https://huggingface.co/spaces/shi-labs/CuMo-7b-zero) based on a Gradio web UI. You can also set up the demo locally with:
```bash
CUDA_VISIBLE_DEVICES=0 python -m cumo.serve.app \
--model-path checkpoints/CuMo-mistral-7b
```
You can add `--bits 8` or `--bits 4` to reduce GPU memory usage.

### CLI Inference
If you prefer to run a demo without a web UI, you can use the following command to chat with CuMo-Mistral-7B in your terminal:
```Shell
CUDA_VISIBLE_DEVICES=0 python -m cumo.serve.cli \
--model-path checkpoints/CuMo-mistral-7b \
--image-file cumo/serve/examples/waterview.jpg
```
You can add `--load-4bit` or `--load-8bit` to reduce GPU memory usage.

## Getting Started
Please refer to [Getting Started](docs/getting_started.md) for dataset preparation, training, and inference details of CuMo.
## Citation
```
@article{li2024cumo,
title={CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts},
author={Li, Jiachen and Wang, Xinyao and Zhu, Sijie and Kuo, Chia-wen and Xu, Lu and Chen, Fan and Jain, Jitesh and Shi, Humphrey and Wen, Longyin},
  journal={arXiv preprint arXiv:2405.05949},
year={2024}
}
```

## Acknowledgement
We thank the authors of [LLaVA](https://github.com/haotian-liu/LLaVA), [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA), [S^2](https://github.com/bfshi/scaling_on_scales),
[st-moe-pytorch](https://github.com/lucidrains/st-moe-pytorch), and [mistral-src](https://github.com/mistralai/mistral-src) for releasing their source code.

## License
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](LICENSE)
[![Weight License](https://img.shields.io/badge/Weight%20License-CC%20By%20NC%204.0-red)](WEIGHT_LICENSE)

The checkpoint weights are licensed under CC BY-NC 4.0 for non-commercial use. The codebase is licensed under Apache 2.0. This project uses certain datasets and checkpoints that are subject to their respective original licenses; users must comply with all terms and conditions of those licenses.
Output from any version of CuMo is influenced by uncontrollable factors such as randomness, so this project cannot guarantee the accuracy of its output. This project accepts no legal liability for the content of the model's output, nor does it assume responsibility for any losses incurred from the use of the associated resources and outputs.