# Awesome-Multimodal-LLM [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
A curated list of papers related to multi-modal machine learning, especially multi-modal large language models (LLMs).

## Table of Contents
- [Tutorials](#tutorials)
- [Datasets](#datasets)
- [Research Papers](#research-papers)
  - [Survey Papers](#survey-papers)
  - [Core Areas](#core-areas)
    - [Multimodal Understanding](#multimodal-understanding)
    - [Vision-Centric Understanding](#vision-centric-understanding)
    - [Embodied-Centric Understanding](#embodied-centric-understanding)
    - [Domain-Specific Models](#domain-specific-models)
    - [Multimodal Evaluation](#multimodal-evaluation)

# Tutorials

[Recent Advances in Vision Foundation Models](https://vlp-tutorial.github.io/), CVPR 2023 Workshop [[pdf]](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)

# Datasets

[M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning](https://arxiv.org/abs/2306.04387), arxiv 2023 [[data]](https://huggingface.co/datasets/MMInstruction/M3IT)

[LLaVA Instruction 150K](https://llava-vl.github.io/), arxiv 2023 [[data]](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)

[Youku-mPLUG 10M](https://arxiv.org/abs/2306.04362), arxiv 2023 [[data]](https://github.com/X-PLUG/Youku-mPLUG)

[MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning](https://arxiv.org/abs/2212.10773), ACL 2023 [[data]](https://github.com/VT-NLP/MultiInstruct)
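
The instruction-tuning corpora above are hosted on the Hugging Face Hub or GitHub. As a minimal sketch (assuming the `huggingface_hub` package is installed; the exact file layout inside each repo may vary), the raw files of a Hub-hosted dataset such as LLaVA Instruction 150K can be fetched like this:

```python
# Minimal sketch: download the raw files of a Hub-hosted instruction-tuning
# dataset listed above. The repo id comes from the dataset URL; no assumptions
# are made here about the file names inside the repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",
    repo_type="dataset",
)
print("Dataset files downloaded to:", local_dir)
```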

# Research Papers

## Survey Papers

[A Survey on Multimodal Large Language Models](https://arxiv.org/abs/2306.13549), arxiv 2023 [[project page]](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)

[Vision-Language Models for Vision Tasks: A Survey](https://arxiv.org/abs/2304.00685), arxiv 2023

## Core Areas

### Multimodal Understanding

[Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic](https://arxiv.org/abs/2306.15195), arxiv 2023 [[code]](https://github.com/shikras/shikra)

[PandaGPT: One Model To Instruction-Follow Them All](http://arxiv.org/abs/2305.16355), arxiv 2023 [[code]](https://github.com/yxuansu/PandaGPT)

[MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726), arxiv 2023 [[code]](https://github.com/Luodian/Otter)

[LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding](https://arxiv.org/abs/2306.17107), arxiv 2023 [[code]](https://github.com/SALT-NLP/LLaVAR)

[MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models](https://arxiv.org/abs/2306.01311), arxiv 2023

[mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178), arxiv 2023 [[code]](https://github.com/X-PLUG/mPLUG-Owl)

[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500), arxiv 2023 [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)

[BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597), ICML 2023 [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)

[Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models](https://arxiv.org/abs/2305.15023), arxiv 2023 [[code]](https://github.com/luogen1996/LaVIN)

[MultiModal-GPT: A Vision and Language Model for Dialogue with Humans](https://arxiv.org/abs/2305.04790), arxiv 2023 [[code]](https://github.com/open-mmlab/Multimodal-GPT)

[LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model](https://arxiv.org/abs/2304.15010), arxiv 2023 [[code]](https://github.com/ZrrSkywalker/LLaMA-Adapter)

[Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045v2), arxiv 2023 [[code]](https://github.com/microsoft/unilm)

[ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities](http://arxiv.org/abs/2305.11172), arxiv 2023 [[code]](https://github.com/OFA-Sys/ONE-PEACE)

[X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages](https://arxiv.org/abs/2305.04160), arxiv 2023 [[code]](https://github.com/phellonchen/X-LLM)

[Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), arxiv 2023 [[code]](https://github.com/haotian-liu/LLaVA)

[Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models](https://arxiv.org/abs/2303.04671), arxiv 2023 [[code]](https://github.com/microsoft/TaskMatrix)

[PaLI: A Jointly-Scaled Multilingual Language-Image Model](http://arxiv.org/abs/2209.06794), ICLR 2023 [[blog]](https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html)

[Grounding Language Models to Images for Multimodal Inputs and Outputs](https://arxiv.org/abs/2301.13823), ICML 2023 [[code]](https://github.com/kohjingyu/fromage)

[OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](https://arxiv.org/abs/2202.03052), ICML 2022 [[code]](https://github.com/OFA-Sys/OFA)

[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198), NeurIPS 2022
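
Several of the entries above (e.g. BLIP-2 and InstructBLIP) release checkpoints that are also exposed through Hugging Face `transformers`. A minimal inference sketch, assuming a recent `transformers` release with the BLIP-2 classes and the publicly released `Salesforce/blip2-opt-2.7b` checkpoint (the papers' official code lives in LAVIS, linked above):

```python
# Minimal BLIP-2 visual question answering sketch via Hugging Face transformers.
# Runs on CPU as written; in practice you would load the model in fp16 on a GPU.
import requests
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Any RGB image works; this COCO validation image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```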

### Vision-Centric Understanding

[LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692), arxiv 2023 [[code]](https://github.com/dvlab-research/LISA)

[Contextual Object Detection with Multimodal Large Language Models](https://arxiv.org/abs/2305.18279), arxiv 2023 [[code]](https://github.com/yuhangzang/ContextDET)

[KOSMOS-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824), arxiv 2023 [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2)

[Fast Segment Anything](https://arxiv.org/abs/2306.12156), arxiv 2023 [[code]](https://github.com/CASIA-IVA-Lab/FastSAM)

[Multi-Modal Classifiers for Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.05493), ICML 2023 [[code]](https://github.com/prannaykaul/mm-ovod)

[Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT](https://arxiv.org/abs/2305.00201), arxiv 2023

[Images Speak in Images: A Generalist Painter for In-Context Visual Learning](https://arxiv.org/abs/2212.02499), arxiv 2023 [[code]](https://github.com/baaivision/Painter)

[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858), arxiv 2023 [[code]](https://github.com/DAMO-NLP-SG/Video-LLaMA)

[SegGPT: Segmenting Everything In Context](http://arxiv.org/abs/2304.03284), arxiv 2023 [[code]](https://github.com/baaivision/Painter)

[VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks](http://arxiv.org/abs/2305.11175), arxiv 2023 [[code]](https://github.com/OpenGVLab/VisionLLM)

[Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching](https://arxiv.org/abs/2305.13310), arxiv 2023

[Personalize Segment Anything Model with One Shot](https://arxiv.org/abs/2305.03048), arxiv 2023 [[code]](https://github.com/ZrrSkywalker/Personalize-SAM)

[Segment Anything](https://arxiv.org/abs/2304.02643), arxiv 2023 [[code]](https://github.com/facebookresearch/segment-anything)

[Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks](https://arxiv.org/abs/2211.09808), CVPR 2023 [[code]](https://github.com/fundamentalvision/Uni-Perceiver)

[A Generalist Framework for Panoptic Segmentation of Images and Videos](https://arxiv.org/abs/2210.06366), arxiv 2022

[A Unified Sequence Interface for Vision Tasks](http://arxiv.org/abs/2206.07669), NeurIPS 2022 [[code]](https://github.com/google-research/pix2seq)

[Pix2seq: A Language Modeling Framework for Object Detection](https://arxiv.org/abs/2109.10852), ICLR 2022 [[code]](https://github.com/google-research/pix2seq)
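
Several of the segmentation entries above (FastSAM, Personalize-SAM, Matcher) build on the promptable-mask interface of Segment Anything. A minimal sketch of point-prompted prediction, assuming the `segment-anything` package and a downloaded ViT-H checkpoint (the checkpoint path and the dummy image below are placeholders):

```python
# Minimal point-prompted mask prediction with Segment Anything (SAM).
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# "sam_vit_h_4b8939.pth" is the released ViT-H checkpoint file; adjust the path.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)

# One foreground point prompt at pixel (x=256, y=256); label 1 marks foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # boolean masks of shape (3, 512, 512) with confidence scores
```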

### Embodied-Centric Understanding

[Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition](https://arxiv.org/abs/2307.14535), arxiv 2023 [[code]](https://github.com/columbia-ai-robotics/scalingup)

[RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https://robotics-transformer2.github.io/assets/rt2.pdf), preprint [[project page]](https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action)

[VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models](https://arxiv.org/abs/2307.05973), arxiv 2023 [[project page]](https://voxposer.github.io/)

[MotionGPT: Human Motion as a Foreign Language](https://arxiv.org/abs/2306.14795), arxiv 2023 [[code]](https://github.com/OpenMotionLab/MotionGPT)

[Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model](https://arxiv.org/abs/2305.11176), arxiv 2023 [[code]](https://github.com/OpenGVLab/Instruct2Act)

[PaLM-E: An Embodied Multimodal Language Model](https://arxiv.org/abs/2303.03378), arxiv 2023 [[blog]](https://palm-e.github.io/)

[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442), arxiv 2023

[Vision-Language Models as Success Detectors](https://arxiv.org/abs/2303.07280), arxiv 2023

[TidyBot: Personalized Robot Assistance with Large Language Models](https://arxiv.org/abs/2305.05658), arxiv 2023 [[code]](https://github.com/jimmyyhwu/tidybot)

[LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action](https://arxiv.org/abs/2207.04429), CoRL 2022 [[blog]](https://sites.google.com/view/lmnav) [[code]](https://github.com/blazejosinski/lm_nav)

### Domain-Specific Models

[LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](http://arxiv.org/abs/2306.00890), arxiv 2023 [[code]](https://github.com/microsoft/LLaVA-Med)

### Multimodal Evaluation

[LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https://arxiv.org/abs/2306.06687), arxiv 2023 [[code]](https://github.com/OpenLAMM/LAMM)

[LOVM: Language-Only Vision Model Selection](https://arxiv.org/abs/2306.08893), arxiv 2023 [[code]](https://github.com/orrzohar/LOVM)

[Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation](https://arxiv.org/abs/2303.05983), arxiv 2023 [[project page]](https://matrix-alpha.github.io/)