Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xmu-xiaoma666/vmllm
The official repository for “vMLLM: Boosting Multi-modal Large Language Model with Enhanced Visual Features”.
- Host: GitHub
- URL: https://github.com/xmu-xiaoma666/vmllm
- Owner: xmu-xiaoma666
- License: apache-2.0
- Created: 2024-11-09T12:24:38.000Z (about 1 month ago)
- Default Branch: master
- Last Pushed: 2024-11-09T13:16:37.000Z (about 1 month ago)
- Last Synced: 2024-11-09T14:22:40.219Z (about 1 month ago)
- Language: Python
- Size: 3.39 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# vMLLM: Boosting Multi-modal Large Language Model with Enhanced Visual Features
This repository contains the reference code for the paper "vMLLM: Boosting Multi-modal Large Language Model with Enhanced Visual Features".
![](images/vMLLM.png)
## Experiment setup
```
# create the conda environment and install the required packages
bash running_script/install_envs.sh
```
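The exact contents of `running_script/install_envs.sh` are not reproduced here. For orientation only, a manual setup for a LLaVA-style codebase usually looks roughly like the sketch below; the environment name, Python version, and editable install are assumptions, so prefer the provided script.
```
# Hypothetical manual alternative to running_script/install_envs.sh.
# The environment name, Python version, and editable install are assumptions.
conda create -n vmllm python=3.10 -y
conda activate vmllm
pip install --upgrade pip
pip install -e .   # assumes the repo is pip-installable, as in LLaVA-style projects
```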
## Data preparation

- The images can be downloaded by following the READMEs of the corresponding repos: please refer to [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main), [MG-LLaVA](https://github.com/PhoenixZ810/MG-LLaVA), and [MGM](https://github.com/dvlab-research/MGM).
- The merged annotation file can be downloaded from [vMLLM_data](https://huggingface.co/datasets/xmu-xiaoma666/vMLLM_data/tree/main); see the download sketch after this list.
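One possible way to fetch the merged annotations is via the Hugging Face CLI. This is a sketch only; the local target directory `data/vMLLM_data` is an arbitrary choice.
```
# Sketch: download the merged annotation files from the vMLLM_data dataset repo.
# data/vMLLM_data is an arbitrary example path.
pip install -U "huggingface_hub[cli]"
huggingface-cli download xmu-xiaoma666/vMLLM_data --repo-type dataset --local-dir data/vMLLM_data
```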
## Training

Training using CLIP-base vision encoder:
```
bash running_script/running_vMLLM_clip_base.sh
```
Training using CLIP-large vision encoder:
```
bash running_script/running_vMLLM_clip_large.sh
```
Training using SigLIP-base vision encoder:
```
bash running_script/running_vMLLM_siglip_base.sh
```
Training using SigLIP-SO vision encoder:
```
bash running_script/running_vMLLM_siglip_so.sh
```
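Each script launches a full training run with the settings defined in `running_script/`. As a small usage note (standard CUDA behavior, not something specific to this repo), a run can be restricted to particular GPUs by prefixing the command:
```
# Example only: limit the SigLIP-SO run to four GPUs.
# The launcher and training arguments themselves live in running_script/.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash running_script/running_vMLLM_siglip_so.sh
```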
## Chat Ability

![](images/ability.png)
## Performance
| **Method** | **VE** | **LLM** | **Res.** | **MMB** | **MM-Vet** | **MathVista** | **GQA** | **AI2D** | **SQA**$^{I}$ | **Seed**$^{I}$ | **VizWiz** |
|------------|----------|-----------------|----------|---------|------------|---------------|---------|----------|---------------|---------------|------------|
| vMLLM | SigLIP-SO | Vicuna-7B | 384 | 71.4 | 39.1 | 34.7 | 63.3 | 70.6 | 71.5 | 70.9 | 63.5 |
| vMLLM | SigLIP-SO | Vicuna-13B | 384 | 75.3 | 43.3 | 36.4 | 64.5 | 75.8 | 76.1 | 72.5 | 64.8 |
| vMLLM | SigLIP-SO | LLaMA3-8B | 384 | 77.1 | 43.4 | 41.0 | 65.3 | 77.1 | 75.6 | 74.0 | 65.1 |
## Model Zoo

| LLM | SFT | Pretrain |
|-------|------|----------|
| Vicuna-7B | [vMLLM_7B_sft](https://huggingface.co/xmu-xiaoma666/vMLLM_7B_sft)| [vMLLM_7B_pretrain](https://huggingface.co/xmu-xiaoma666/vMLLM_7B_pretrain) |
| LLaMA3-8B | [vMLLM_8B_sft](https://huggingface.co/xmu-xiaoma666/vMLLM_8B_sft) | [vMLLM_8B_pretrain](https://huggingface.co/xmu-xiaoma666/vMLLM_8B_pretrain) |
| Vicuna-13B | [vMLLM_13B_sft](https://huggingface.co/xmu-xiaoma666/vMLLM_13B_sft) | [vMLLM_13B_pretrain](https://huggingface.co/xmu-xiaoma666/vMLLM_13B_pretrain) |
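As a sketch of how a released checkpoint might be fetched (any repo from the table can be substituted; the local path is an arbitrary example):
```
# Sketch: download the Vicuna-7B SFT checkpoint from the Hugging Face Hub.
# checkpoints/vMLLM_7B_sft is an arbitrary local path.
huggingface-cli download xmu-xiaoma666/vMLLM_7B_sft --local-dir checkpoints/vMLLM_7B_sft
```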
## Evaluation

We use the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit for multi-benchmark evaluation.
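For orientation, lmms-eval is typically driven from the command line roughly as in the sketch below. The model adapter name, checkpoint path, and task identifiers are placeholders rather than values taken from this repository; consult the lmms-eval documentation (and any vMLLM-specific adapter) for the exact invocation.
```
# Hypothetical lmms-eval run; adapter name, tasks, and paths are placeholders.
accelerate launch -m lmms_eval \
    --model llava \
    --model_args pretrained=checkpoints/vMLLM_7B_sft \
    --tasks mmbench_en,mmvet,gqa \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```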