# Open-LLaVA-NeXT

An open-source implementation of **LLaVA-NeXT** series for facilitating the large multi-modal model community.

**Resources:** [[🤗HuggingFace](https://huggingface.co/collections/Lin-Chen/open-llava-next-665051533fa1a30553fcee8d)]

## 💡 Highlights

- 🔥 All training data and checkpoints at each stage are open-sourced, friendly for research usage.
- 🔥 Able to reproduce the results of **[LLaVA-NeXT](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/)**.
- 🔥 Based on the **[LLaVA](https://github.com/haotian-liu/LLaVA)** codebase with minimal modification, easy to follow.

## 🤖 Model Zoo

See more details in [ModelZoo.md](docs/ModelZoo.md).

| Name | ViT | LLM | Weights | MME | SEED | SQA | MMB | MMB-CN | TextVQA | GQA |
|---|---|---|---|---|---|---|---|---|---|---|
| llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | [SFT](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) | 1519 | 70.2 | 70.1 | 67.4 | 60.6 | 64.9 | 64.2 |
| open-llava-next-vicuna-7b| CLIP-L-336 | Vicuna-7B | [PT](https://huggingface.co/Lin-Chen/open-llava-next-vicuna-7b/tree/main/pretrain), [SFT](https://huggingface.co/Lin-Chen/open-llava-next-vicuna-7b) | 1540 | 71.1 | 70.7 | 68.5 | 60.7 | 67.2 | 64.3 |
| llava-next-llama3-8b| CLIP-L-336 | LLaMA3-8B | [SFT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) | 1591 | 72.7 | 73.4 | 72.6 | 69.0 | 65.0 | 65.5 |
| open-llava-next-llama3-8b| CLIP-L-336 | LLaMA3-8B | [PT](https://huggingface.co/Lin-Chen/open-llava-next-llama3-8b), [SFT](https://huggingface.co/Lin-Chen/open-llava-next-llama3-8b) | 1552 | 74.4 | 77.3 | 74.4 | 70.4 | 69.8 | 65.9 |
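
As a quick sanity check for a downloaded SFT checkpoint, single-image inference can be run with the CLI inherited from the upstream LLaVA codebase. This is a sketch under that assumption; the image path and prompt are placeholders:

```bash
# Single-image inference (flags follow upstream LLaVA's llava.eval.run_llava).
python -m llava.eval.run_llava \
    --model-path Lin-Chen/open-llava-next-vicuna-7b \
    --image-file ./images/demo.jpg \
    --query "Describe this image in detail."
```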

## 👨‍💻 ToDo

- [x] Reproduce LLaVA-Next-LLaMA3-8B
- [ ] Integrate [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for convenient evaluation

## 🔧 Install

1. Clone this repository and navigate to the Open-LLaVA-NeXT folder
```bash
git clone https://github.com/xiaoachen98/Open-LLaVA-NeXT.git
cd Open-LLaVA-NeXT
```

2. Install the package
```Shell
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Data Preparation

Follow the instructions in **[Data.md](docs/Data.md)** to prepare and manage the training datasets.

## Training Overview

Open-LLaVA-NeXT training consists of two stages: (1) feature alignment stage: use a 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: finetune the entire model with 1M **completely open-source** data. Detailed data statistics are provided in [Visual Instruction Tuning](https://github.com/xiaoachen98/Open-LLaVA-NeXT?tab=readme-ov-file#visual-instruction-tuning). We take the Vicuna-v1.5-7B variant as an example to present the training and evaluation details.

The Open-LLaVA-NeXT series is trained on A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly; utilizing DeepSpeed ZeRO-3 can further reduce the memory requirements. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
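
For example, to keep the 7B finetuning global batch size at 128 when scaling down the number of GPUs, the two flags combine roughly as follows (the per-device values are illustrative, not taken from the released scripts):

```bash
# Global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
# 16 x A100: --per_device_train_batch_size 8 --gradient_accumulation_steps 1   # 8 x 1 x 16 = 128
#  8 x A100: --per_device_train_batch_size 8 --gradient_accumulation_steps 2   # 8 x 2 x 8  = 128
#  4 x A100: --per_device_train_batch_size 8 --gradient_accumulation_steps 4   # 8 x 4 x 4  = 128
```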

### Hyperparameters
We use the same set of hyperparameters as LLaVA for finetuning. The hyperparameters used in both pretraining and finetuning are provided below.

1. Pretraining

| Hyperparameter | Global Batch Size | Projector lr | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| Open-LLaVA-NeXT-7B | 256 | 1e-3 | 1 | 4096 | 0 |

2. Finetuning

| Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Open-LLaVA-NeXT-7B | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |

### Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
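
If you prefer the command line, the same dataset can be fetched with `huggingface-cli` (the local directory below is an arbitrary choice, not mandated by the training scripts):

```bash
# Download the 558K LAION-CC-SBU subset with BLIP captions.
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir ./playground/data/LLaVA-Pretrain
```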

Pretraining takes around 5 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](scripts/v1_6/train/7b/pretrain.sh).

- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
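
For orientation, a minimal sketch of the DeepSpeed launch inside `pretrain.sh` is given below. Argument values follow the hyperparameter table above; the data paths, output directory, and any flags not discussed in this README are assumptions carried over from the upstream LLaVA codebase, so treat the released script as authoritative.

```bash
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --num_train_epochs 1 \
    --learning_rate 1e-3 \
    --model_max_length 4096 \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b-pretrain
```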

### Visual Instruction Tuning

1. Prepare data

   Follow the instructions for data preparation in [Data.md](docs/Data.md).

2. Prepare MLP projectors

   Download our pretrained projectors from the [Model Zoo](docs/ModelZoo.md), or specify your own MLP projector after pretraining.

3. Start training

   Visual instruction tuning takes around 20 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: [`finetune.sh`](scripts/v1_6/train/7b/finetune.sh).

New options to note:

- `--unfreeze_mm_vision_tower True`: finetune the vision tower.
- `--mm_vision_tower_lr 2e-6`: learning rate of the vision tower.
- `--image_aspect_ratio anyres`: process input images at variable resolutions.
- `--mm_patch_merge_type spatial_unpad`: unpads the tensor of the padded and resized image and inserts learnable newline vectors between rows of image tokens, making the model aware of two-dimensional spatial structure when processing image tokens.
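
Putting these options together, a minimal sketch of the finetuning launch is shown below. The instruction-data JSON name, image folder, and projector path are placeholders; consult `finetune.sh` for the authoritative argument list.

```bash
# Data path and projector path are placeholders; adjust to your local setup.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_next_instruct_mix.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/open-llava-next-vicuna-7b-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --unfreeze_mm_vision_tower True \
    --mm_vision_tower_lr 2e-6 \
    --image_aspect_ratio anyres \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --model_max_length 4096 \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b
```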

## Evaluation

See [Evaluation.md](docs/Evaluation.md).

## Citation

If you find this project useful in your research, please consider citing:

```bibtex
@misc{chen2024open,
title={Open-LLaVA-NeXT: An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.},
author={Chen, Lin and Xing, Long},
howpublished = {\url{https://github.com/xiaoachen98/Open-LLaVA-NeXT}},
year={2024},
doi={10.5281/zenodo.13935471}
}
```

## ❤️ Acknowledgments

- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V): Thanks for their code for finetuning the vision tower.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!