https://github.com/opendatalab/mllm-dataengine
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
- Host: GitHub
- URL: https://github.com/opendatalab/mllm-dataengine
- Owner: opendatalab
- License: apache-2.0
- Created: 2023-08-30T07:10:30.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-05-24T05:03:28.000Z (over 1 year ago)
- Last Synced: 2025-03-22T03:51:15.094Z (10 months ago)
- Language: Python
- Size: 25.7 MB
- Stars: 45
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# MLLM-DataEngine: Closing the Loop of Instruction Tuning Data Generation 
**Shanghai Artificial Intelligence Laboratory**
[[Paper]](https://arxiv.org/pdf/2308.13566) [[Data(huggingface)]](https://huggingface.co/collections/juliozhao/mllm-dataengine-v2-65fa4baf180d06d37c7eda08) [[Data(opendatalab)]](https://openxlab.org.cn/datasets/zzy8782180/DataEngine-InstData) [[Model]](https://huggingface.co/collections/juliozhao/mllm-dataengine-v2-65fa4baf180d06d37c7eda08)
## Introduction
We propose MLLM-DataEngine, a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop iteration, MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results, then generates a targeted incremental dataset for the next training iteration, iteratively enhancing the model's capabilities.

Compared with previous instruction fine-tuning dataset collection methods, which are separate from benchmarking, MLLM-DataEngine targets model weaknesses more precisely, generates higher-quality data, and improves MLLMs' capabilities more effectively.
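The weakness-analysis step of the loop can be illustrated with a small, self-contained sketch: given per-category evaluation scores, pick the lowest-scoring categories as the targets for the next round of data generation. The category names, scores, and the `weakest_categories` helper below are illustrative assumptions, not part of the actual MLLM-DataEngine implementation.

```python
# Toy sketch of the "analyze weakness" step of the closed loop:
# rank capability categories by evaluation score, lowest first.

def weakest_categories(scores, k=2):
    """Return the k lowest-scoring categories from an evaluation run."""
    return sorted(scores, key=scores.get)[:k]

# Hypothetical per-category scores from one evaluation iteration.
scores = {
    "Scene Understanding": 71.2,
    "Instance Location": 54.8,
    "Spatial Relation": 49.5,
    "Visual Reasoning": 63.0,
}

targets = weakest_categories(scores)
# The next loop iteration would generate incremental SFT data
# focused on these categories before retraining the model.
print(targets)  # ['Spatial Relation', 'Instance Location']
```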

## News and Updates
- `2024.05` 🎉🎉🎉 MLLM-DataEngine-v2 is publicly available! Compared to the previous version ([version1.0](https://github.com/opendatalab/MLLM-DataEngine/tree/version1.0)), the instruction fine-tuning (SFT) data generated by MLLM-DataEngine-v2 is **larger in scale, higher in quality, and more diverse**. Meanwhile, MLLM-DataEngine-v2 supports **SOTA open-source models** ([LLaVA-1.5](https://github.com/haotian-liu/LLaVA) and [MiniGPT4-v2](https://github.com/Vision-CAIR/MiniGPT-4)) and shows significant improvements on various public benchmarks.
- `2023.09` 🎉🎉🎉 MLLM-DataEngine is publicly available, supporting [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4) and achieving a **greatly improved score on MMBench** (see the [paper](https://arxiv.org/pdf/2308.13566)).
## Dataset Format
Each sample generated by MLLM-DataEngine contains a clear, concise instruction and a corresponding answer. In addition, each instruction-answer pair is reformatted into a multiple-choice question-answering format. The generated data is organized as follows:
```json
[
    {
        "instruction": "Where is the man wearing a black backpack positioned in the picture?",
        "answer": "The man wearing a black backpack is located at the left side of the image, roughly in the middle between top and bottom",
        "short_answer": "Left middle",
        "options": ["Top right", "Bottom right", "Bottom left", "Left middle"],
        "choice_answer": "D",
        "image": "vg/VG_100K_2/2404787.jpg",
        "qtype": 4
    }
]
```
```instruction```: a clear, concise instruction
```answer```: the full answer to the instruction
```short_answer```: a short-form answer to the instruction
```options```: four candidate options for the instruction
```choice_answer```: the letter of the correct option
```image```: the Visual Genome image path
```qtype```: the question type in SEED-Bench, enumerated as follows:
```json
{
    "1": "Scene Understanding",
    "2": "Instance Identity",
    "3": "Instance Attributes",
    "4": "Instance Location",
    "5": "Instances Counting",
    "6": "Spatial Relation",
    "7": "Instance Interaction",
    "8": "Visual Reasoning",
    "9": "Text Understanding"
}
```
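The dataset format above can be exercised with a short, self-contained snippet that renders one sample as a multiple-choice prompt. The field names match the dataset format; the prompt template and the `to_multiple_choice` helper are assumptions for illustration, not the project's actual preprocessing code.

```python
# Render one generated sample as a lettered multiple-choice question.

# qtype-to-name mapping, as listed in the SEED-Bench table above.
QTYPE_NAMES = {
    1: "Scene Understanding", 2: "Instance Identity", 3: "Instance Attributes",
    4: "Instance Location", 5: "Instances Counting", 6: "Spatial Relation",
    7: "Instance Interaction", 8: "Visual Reasoning", 9: "Text Understanding",
}

def to_multiple_choice(sample):
    """Format a sample as a question with A-D options; return (prompt, answer letter)."""
    lines = [sample["instruction"]]
    lines += [f"{letter}. {opt}" for letter, opt in zip("ABCD", sample["options"])]
    return "\n".join(lines), sample["choice_answer"]

sample = {
    "instruction": "Where is the man wearing a black backpack positioned in the picture?",
    "options": ["Top right", "Bottom right", "Bottom left", "Left middle"],
    "choice_answer": "D",
    "qtype": 4,
}

prompt, answer = to_multiple_choice(sample)
print(QTYPE_NAMES[sample["qtype"]])  # Instance Location
```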
## Main Results
### LLaVA-1.5-lora-7b
| Incremental Dataset | Data Amount | SEED | MMB | MME | GQA | VQAv2 | ScienceQA |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| None (baseline) | - | 66.04 | 66.66 | 1475/290(1765) | 57.27 | 77.56 | 70.67/68.27 |
| MLLM-DataEngine | 220k | **68.57** | **67.18** | **1511/303(1814)** | **58.02** | **78.18** | **73.17/71.15** |
### MiniGPT4-v2
| Incremental Dataset | Data Amount | SEED | MMB | OKVQA | VizWiz | VSR |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| None (baseline) | - | 49.21 | 38.83 | 56.03 | 53.08 | 61.37 |
| MLLM-DataEngine | 270k | **63.83** | **52.92** | **56.87** | **54.39** | **62.43** |
## Model Training and Evaluation
| MiniGPT4-v2 | LLaVA-1.5 |
| :--: | :--: |
| [doc](MiniGPT-4/README.md) | [doc](LLaVA/README.md) |
## Acknowledgement
- [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4). The MiniGPT-4 part of MLLM-DataEngine is based on the official MiniGPT-4 implementation.
- [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). The LLaVA-1.5 part of MLLM-DataEngine is based on the official LLaVA-1.5 implementation, a great open-source work on LVLMs.
## Citation
If you're using MLLM-DataEngine in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhao2023mllmdataengine,
    title={MLLM-DataEngine: An Iterative Refinement Approach for MLLM},
    author={Zhiyuan Zhao and Linke Ouyang and Bin Wang and Siyuan Huang and Pan Zhang and Xiaoyi Dong and Jiaqi Wang and Conghui He},
    year={2023},
    eprint={2308.13566},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
## Contact us
If you have any questions, comments or suggestions, please do not hesitate to contact us at zhaozhiyuan@pjlab.org.cn.
## License
[Apache License 2.0](LICENSE.txt)