https://github.com/jingyi0000/awesome-visual-instruction-tuning

## Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

This is the repository of **Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey**, a systematic review of visual instruction tuning. For details, please refer to:

**Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey**
[[Paper](https://arxiv.org/abs/2312.16602)]

[![arXiv](https://img.shields.io/badge/arXiv-2312.16602-b31b1b.svg)](https://arxiv.org/abs/2312.16602)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/jingyi0000/awesome-visual-instruction-tuning/graphs/commit-activity)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)

## Abstract

Traditional computer vision generally solves each task independently with a dedicated model whose task instruction is implicitly designed into the model architecture, which gives rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies across diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address them, Visual Instruction Tuning (VIT) has been intensively studied recently. VIT finetunes a large vision model with language as task instructions, aiming to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) the commonly used datasets in visual instruction tuning and evaluation; (4) the review of existing VIT methods that categorizes them with a taxonomy according to both the studied vision task and the method design and highlights their major contributions, strengths, and shortcomings; (5) the comparison and discussion of VIT methods over various instruction-following benchmarks; (6) several challenges, open directions, and possible future work in visual instruction tuning research.
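To make the idea of "vision tasks described by language instructions" concrete, here is a minimal sketch of how annotations from different tasks might be mapped to one shared instruction format. It is an illustration only, not the data pipeline of the survey or of any specific method: the `InstructionSample` class, the `TASK_TEMPLATES` wording, the file paths, and the `<image>` / `USER:` / `ASSISTANT:` prompt layout are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class InstructionSample:
    """One example: an image, a language instruction, and the target response."""
    image_path: str   # hypothetical path, for illustration only
    instruction: str  # the task, stated in natural language
    response: str     # the text the model would be trained to produce


# Hypothetical templates: each vision task is expressed as a language instruction.
TASK_TEMPLATES = {
    "classification": "What is the main object in this image?",
    "captioning": "Describe this image in one sentence.",
    "vqa": "{question}",
}


def to_instruction_sample(task, image_path, label, question=""):
    """Convert a task-specific annotation into the shared instruction format."""
    instruction = TASK_TEMPLATES[task].format(question=question)
    return InstructionSample(image_path, instruction, label)


def build_prompt(sample):
    """Serialize a sample into one string; <image> marks where visual tokens would go."""
    return f"<image>\nUSER: {sample.instruction}\nASSISTANT: {sample.response}"


# Three different tasks, one interface.
samples = [
    to_instruction_sample("classification", "images/0001.jpg", "a dog"),
    to_instruction_sample("captioning", "images/0002.jpg", "A dog runs on a beach."),
    to_instruction_sample("vqa", "images/0003.jpg", "two", question="How many people are there?"),
]

for sample in samples:
    print(build_prompt(sample))
    print("---")
```

Because every task ends up as the same (image, instruction, response) triplet, a single model can in principle be trained on all of them together, which is the contrast with task-specific models drawn above.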

## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@article{huang2023visual,
  title={Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey},
  author={Huang, Jiaxing and Zhang, Jingyi and Jiang, Kai and Qiu, Han and Lu, Shijian},
  journal={arXiv preprint arXiv:2312.16602},
  year={2023}
}
```

## Menu
- [Datasets](#datasets)
  - [Datasets for Visual Instruction Tuning](#datasets-for-visual-instruction-tuning)
  - [Datasets for Instruction-tuned Model Evaluation](#datasets-for-instruction-tuned-model-evaluation)
- [Visual Instruction Tuning Methods](#visual-instruction-tuning-methods)
  - [Instruction-based Image Learning](#instruction-based-image-learning)
    - [Instruction-based Image Learning for Discriminative Tasks](#instruction-based-image-learning-for-discriminative-tasks)
    - [Instruction-based Image Learning for Generative Tasks](#instruction-based-image-learning-for-generative-tasks)
    - [Instruction-based Image Learning for Complex Reasoning Tasks](#instruction-based-image-learning-for-complex-reasoning-tasks)
  - [Instruction-based Video Learning](#instruction-based-video-learning)
  - [Instruction-based 3D Vision Learning](#instruction-based-3d-vision-learning)
  - [Instruction-based Medical Vision Learning](#instruction-based-medical-vision-learning)
  - [Instruction-based Document Vision Learning](#instruction-based-document-vision-learning)

## Datasets

### Datasets for Visual Instruction Tuning

### Datasets for Instruction-tuned Model Evaluation

## Visual Instruction Tuning Methods

### Instruction-based Image Learning

#### Instruction-based Image Learning for Discriminative Tasks
#### Instruction-based Image Learning for Generative Tasks
#### Instruction-based Image Learning for Complex Reasoning Tasks

### Instruction-based Video Learning

### Instruction-based 3D Vision Learning

### Instruction-based Medical Vision Learning

### Instruction-based Document Vision Learning