Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vincentlux/Awesome-Multimodal-LLM
Reading list for Multimodal Large Language Models
List: Awesome-Multimodal-LLM
- Host: GitHub
- URL: https://github.com/vincentlux/Awesome-Multimodal-LLM
- Owner: vincentlux
- License: MIT
- Created: 2023-06-06T06:31:23.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-08-17T02:26:26.000Z (over 1 year ago)
- Last Synced: 2024-12-02T06:02:27.271Z (19 days ago)
- Topics: awesome-list, computer-vision, large-language-models, machine-learning, multimodal-large-language-models, multimodal-machine-learning, natural-language-processing, paper-list, vision-language-model
- Homepage:
- Size: 110 KB
- Stars: 66
- Watchers: 3
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
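The repository metadata above is mirrored by the ecosyste.ms API (the "open API service" this page is generated from). Below is a minimal sketch of fetching it programmatically, assuming a `projects/lookup` endpoint under `/api/v1`; the exact path and field names are assumptions and should be checked against the API documentation.

```python
import requests

# Assumed base URL and endpoint; verify against the ecosyste.ms API docs.
API_BASE = "https://awesome.ecosyste.ms/api/v1"
REPO_URL = "https://github.com/vincentlux/Awesome-Multimodal-LLM"

# Look up the project record for this repository.
resp = requests.get(
    f"{API_BASE}/projects/lookup",
    params={"url": REPO_URL},
    timeout=30,
)
resp.raise_for_status()
project = resp.json()

# Field names are assumptions based on the metadata shown above.
print(project.get("description"))
print(project.get("stargazers_count") or project.get("stars"))
```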
Awesome Lists containing this project
- ultimate-awesome - Awesome-Multimodal-LLM - Reading list for Multimodal Large Language Models. (Other Lists / Monkey C Lists)
- awesome-of-multimodal-dialogue-models - Awesome-Multimodal-LLM (Awesome Surveys / Previous Venues)
README
# Awesome-Multimodal-LLM [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
A curated list of papers related to multi-modal machine learning, especially multi-modal large language models (LLMs).

## Table of Contents
- [Tutorials](#tutorials)
- [Datasets](#datasets)
- [Research Papers](#research-papers)
  - [Survey Papers](#survey-papers)
  - [Core Areas](#core-areas)
    - [Multimodal Understanding](#multimodal-understanding)
    - [Vision-Centric Understanding](#vision-centric-understanding)
    - [Embodied-Centric Understanding](#embodied-centric-understanding)
    - [Domain-Specific Models](#domain-specific-models)
    - [Multimodal Evaluation](#multimodal-evaluation)

# Tutorials
[Recent Advances in Vision Foundation Models](https://vlp-tutorial.github.io/), CVPR 2023 Workshop [[pdf]](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)
# Datasets
[M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning](https://arxiv.org/abs/2306.04387), arxiv 2023 [[data]](https://huggingface.co/datasets/MMInstruction/M3IT)
[LLaVA Instruction 150K](https://llava-vl.github.io/), arxiv 2023 [[data]](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
[Youku-mPLUG 10M](https://arxiv.org/abs/2306.04362), arxiv 2023 [[data]](https://github.com/X-PLUG/Youku-mPLUG)
[MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning](https://arxiv.org/abs/2212.10773), ACL 2023 [[data]](https://github.com/VT-NLP/MultiInstruct)
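Several of the datasets above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal sketch, using the repo ids from the links above; the M3IT config name ("coco") is an assumption taken from the dataset card and may differ.

```python
from datasets import load_dataset

# M3IT is organized into per-task configs; "coco" is an assumed example config
# name taken from the dataset card. Script-based datasets may additionally need
# trust_remote_code=True on recent versions of the `datasets` library.
m3it = load_dataset("MMInstruction/M3IT", "coco", split="train")
print(m3it[0].keys())

# LLaVA-Instruct-150K is distributed as JSON files; depending on the repo
# layout you may need to select a specific file via data_files instead of
# loading the whole repo in one call.
llava = load_dataset("liuhaotian/LLaVA-Instruct-150K")
print(llava)
```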
# Research Papers
## Survey Papers
[A Survey on Multimodal Large Language Models](https://arxiv.org/abs/2306.13549), arxiv 2023 [[project page]](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)
[Vision-Language Models for Vision Tasks: A Survey](https://arxiv.org/abs/2304.00685), arxiv 2023
## Core Areas
### Multimodal Understanding
[Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic](https://arxiv.org/abs/2306.15195), arxiv 2023 [[code]](https://github.com/shikras/shikra)
[PandaGPT: One Model To Instruction-Follow Them All](http://arxiv.org/abs/2305.16355), arxiv 2023 [[code]](https://github.com/yxuansu/PandaGPT)
[MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726), arxiv 2023 [[code]](https://github.com/Luodian/Otter)
[LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding](https://arxiv.org/abs/2306.17107), arxiv 2023 [[code]](https://github.com/SALT-NLP/LLaVAR)
[MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models](https://arxiv.org/abs/2306.01311), arxiv 2023
[mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178), arxiv 2023 [[code]](https://github.com/X-PLUG/mPLUG-Owl)
[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500), arxiv 2023 [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
[BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597), ICML 2023 [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
[Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models](https://arxiv.org/abs/2305.15023), arxiv 2023 [[code]](https://github.com/luogen1996/LaVIN)
[MultiModal-GPT: A Vision and Language Model for Dialogue with Humans](https://arxiv.org/abs/2305.04790), arxiv 2023 [[code]](https://github.com/open-mmlab/Multimodal-GPT)
[LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model](https://arxiv.org/abs/2304.15010), arxiv 2023 [[code]](https://github.com/ZrrSkywalker/LLaMA-Adapter)
[Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045v2), arxiv 2023 [[code]](https://github.com/microsoft/unilm)
[ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities](http://arxiv.org/abs/2305.11172), arxiv 2023 [[code]](https://github.com/OFA-Sys/ONE-PEACE)
[X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages](https://arxiv.org/abs/2305.04160), arxiv 2023 [[code]](https://github.com/phellonchen/X-LLM)
[Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), arxiv 2023 [[code]](https://github.com/haotian-liu/LLaVA)
[Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models](https://arxiv.org/abs/2303.04671), arxiv 2023 [[code]](https://github.com/microsoft/TaskMatrix)
[PaLI: A Jointly-Scaled Multilingual Language-Image Model](http://arxiv.org/abs/2209.06794), ICLR 2023 [[blog]](https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html)
[Grounding Language Models to Images for Multimodal Inputs and Outputs](https://arxiv.org/abs/2301.13823), ICML 2023 [[code]](https://github.com/kohjingyu/fromage)
[OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](https://arxiv.org/abs/2202.03052), ICML 2022 [[code]](https://github.com/OFA-Sys/OFA)
[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198), NeurIPS 2022
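Many of the instruction-tuned models above ship public checkpoints. As one illustrative example, BLIP-2 can be prompted through the Hugging Face `transformers` integration; a minimal sketch, assuming the `Salesforce/blip2-opt-2.7b` checkpoint from the BLIP-2 model card and a local example image (the image path and prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint id taken from the BLIP-2 model card; swap in another BLIP-2 /
# InstructBLIP checkpoint as needed.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
prompt = "Question: what is shown in the image? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```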
### Vision-Centric Understanding
[LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692), arxiv 2023 [[code]](https://github.com/dvlab-research/LISA)
[Contextual Object Detection with Multimodal Large Language Models](https://arxiv.org/abs/2305.18279), arxiv 2023 [[code]](https://github.com/yuhangzang/ContextDET)
[KOSMOS-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824), arxiv 2023 [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2)
[Fast Segment Anything](https://arxiv.org/abs/2306.12156), arxiv 2023 [[code]](https://github.com/CASIA-IVA-Lab/FastSAM)
[Multi-Modal Classifiers for Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.05493), ICML 2023 [[code]](https://github.com/prannaykaul/mm-ovod)
[Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT](https://arxiv.org/abs/2305.00201), arxiv 2023
[Images Speak in Images: A Generalist Painter for In-Context Visual Learning](https://arxiv.org/abs/2212.02499), arxiv 2023 [[code]](https://github.com/baaivision/Painter)
[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858), arxiv 2023 [[code]](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[SegGPT: Segmenting Everything In Context](http://arxiv.org/abs/2304.03284), arxiv 2023 [[code]](https://github.com/baaivision/Painter)
[VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks](http://arxiv.org/abs/2305.11175), arxiv 2023 [[code]](https://github.com/OpenGVLab/VisionLLM)
[Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching](https://arxiv.org/abs/2305.13310), arxiv 2023
[Personalize Segment Anything Model with One Shot](https://arxiv.org/abs/2305.03048), arxiv 2023 [[code]](https://github.com/ZrrSkywalker/Personalize-SAM)
[Segment Anything](https://arxiv.org/abs/2304.02643), arxiv 2023 [[code]](https://github.com/facebookresearch/segment-anything)
[Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks](https://arxiv.org/abs/2211.09808), CVPR 2023 [[code]](https://github.com/fundamentalvision/Uni-Perceiver)
[A Generalist Framework for Panoptic Segmentation of Images and Videos](https://arxiv.org/abs/2210.06366), arxiv 2022
[A Unified Sequence Interface for Vision Tasks](http://arxiv.org/abs/2206.07669), NeurIPS 2022 [[code]](https://github.com/google-research/pix2seq)
[Pix2seq: A language modeling framework for object detection](https://arxiv.org/abs/2109.10852), ICLR 2022 [[code]](https://github.com/google-research/pix2seq)
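For the promptable-segmentation entries above (Segment Anything and its one-shot and fast variants), the official `segment_anything` package exposes a simple predictor interface. A minimal point-prompt sketch, assuming a downloaded ViT-H checkpoint and a local image; the file paths and click coordinates are illustrative:

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint from the official Segment Anything release (path is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The predictor expects an HxWx3 uint8 RGB array.
image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# One foreground click (label 1) at an illustrative pixel location.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # (3, H, W) candidate masks with confidence scores
```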
### Embodied-Centric Understanding
[Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition](https://arxiv.org/abs/2307.14535), arxiv 2023 [[code]](https://github.com/columbia-ai-robotics/scalingup)
[RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https://robotics-transformer2.github.io/assets/rt2.pdf), preprint [[project page]](https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action)
[VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models](https://arxiv.org/abs/2307.05973), arxiv 2023 [[project page]](https://voxposer.github.io/)
[MotionGPT: Human Motion as a Foreign Language](https://arxiv.org/abs/2306.14795), arxiv 2023 [[code]](https://github.com/OpenMotionLab/MotionGPT)
[Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model](https://arxiv.org/abs/2305.11176), arxiv 2023 [[code]](https://github.com/OpenGVLab/Instruct2Act)
[PaLM-E: An Embodied Multimodal Language Model](https://arxiv.org/abs/2303.03378), arxiv 2023 [[blog]](https://palm-e.github.io/)
[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442), arxiv 2023
[Vision-Language Models as Success Detectors](https://arxiv.org/abs/2303.07280), arxiv 2023
[TidyBot: Personalized Robot Assistance with Large Language Models](https://arxiv.org/abs/2305.05658), arxiv 2023 [[code]](https://github.com/jimmyyhwu/tidybot)
[LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action](https://arxiv.org/abs/2207.04429), CoRL 2022 [[blog]](https://sites.google.com/view/lmnav) [[code]](https://github.com/blazejosinski/lm_nav)
### Domain-Specific Models
[LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](http://arxiv.org/abs/2306.00890), arxiv 2023 [[code]](https://github.com/microsoft/LLaVA-Med)
### Multimodal Evaluation
[LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https://arxiv.org/abs/2306.06687), arxiv 2023 [[code]](https://github.com/OpenLAMM/LAMM)
[LOVM: Language-Only Vision Model Selection](https://arxiv.org/abs/2306.08893), arxiv 2023 [[code]](https://github.com/orrzohar/LOVM)
[Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation](https://arxiv.org/abs/2303.05983), arxiv 2023 [[project page]](https://matrix-alpha.github.io/)