Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Awesome-Multimodal-Large-Language-Models
:sparkles::sparkles:Latest Advances on Multimodal Large Language Models
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Multimodal Hallucination
- **IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding** - 02-28 | - | - |
- [**MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations**](https://arxiv.org/pdf/2312.03631.pdf) <br> | arXiv | 2023-12-06 | [Github](https://github.com/assafbk/mocha_code) | - |
- [**Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites**](https://arxiv.org/pdf/2312.01701.pdf) <br> | arXiv | 2023-12-04 | [Github](https://github.com/Anonymousanoy/FOHE) | - |
- [**OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation**](https://arxiv.org/pdf/2311.17911.pdf) <br> | arXiv | 2023-11-29 | [Github](https://github.com/shikiw/OPERA) | - |
- [**Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/pdf/2311.16922.pdf) <br> | arXiv | 2023-11-28 | [Github](https://github.com/DAMO-NLP-SG/VCD) | - |
- **Mitigating Hallucination in Visual Language Models with Visual Supervision** - 11-27 | - | - |
- [**Analyzing and Mitigating Object Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2310.00754.pdf) <br> | arXiv | 2023-10-01 | [Github](https://github.com/YiyangZhou/LURE) | - |
- **Evaluation and Mitigation of Agnosia in Multimodal Large Language Models** - 09-07 | - | - |
- **CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning** - 09-05 | - | - |
- **Evaluation and Analysis of Hallucination in Large Vision-Language Models** - 08-29 | - | - |
- Star - 08-24 | [Github](https://github.com/opendatalab/VIGC) | [Demo](https://opendatalab.github.io/VIGC) |
- [**Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning**](https://arxiv.org/pdf/2306.14565.pdf) <br> | arXiv | 2023-06-26 | [Github](https://github.com/FuxiaoLiu/LRV-Instruction) | [Demo](https://7b6590ed039a06475d.gradio.live/) |
- [**Evaluating Object Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2305.10355.pdf) <br> | EMNLP | 2023-05-17 | [Github](https://github.com/RUCAIBox/POPE) | - |
- **HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data** - 11-22 | [Github](https://github.com/Yuqifan1117/HalluciDoctor) | - |
- Star - Language Models**](https://arxiv.org/pdf/2311.01477.pdf) <br> | arXiv | 2023-11-02 | [Github](https://github.com/bcdnlp/FAITHSCORE) | - |
- **Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models** - 10-09 | - | - |
- **HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption** - 10-03 | - | - |
- **Woodpecker: Hallucination Correction for Multimodal Large Language Models** - 10-24 | [Github](https://github.com/BradyFU/Woodpecker) | [Demo](https://deb6a97bae6fab67ae.gradio.live/) |
- **Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models** - 01-18 | - | - |
- **Hallucination Augmented Contrastive Learning for Multimodal Large Language Model** - 12-12 | - | - |
- [**AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation**](https://arxiv.org/pdf/2311.07397.pdf) <br> | arXiv | 2023-11-13 | [Github](https://github.com/junyangwang0410/AMBER) | - |
- **A Survey on Hallucination in Large Vision-Language Models** - 02-01 | - | - |
- **FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs** - 09-20 | [Link](https://anonymous.4open.science/r/FIHA-45BB) | - |
- **Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation** - 08-01 | - | - |
- **Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs** - 07-31 | [Coming soon]() | - |
- **Hallucination Augmented Contrastive Learning for Multimodal Large Language Model** - 12-12 | [Github](https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl) | - |
- [**HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption**](https://arxiv.org/pdf/2310.01779.pdf) <br> | arXiv | 2023-10-03 | [Github](https://github.com/bronyayang/HallE_Switch) | - |
- [**Evaluation and Analysis of Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2308.15126.pdf) <br> | arXiv | 2023-08-29 | [Github](https://github.com/junyangwang0410/HaELM) | - |
- Star - 06-24 | [Github](https://github.com/mrwu-mac/R-Bench) | - |
- **CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models** - 06-04 | [Coming soon]() | - |
- Star - 02-22 | [Github](https://github.com/yuezih/less-is-more) | - |
- Star - Language Models**](https://arxiv.org/pdf/2402.11622.pdf) <br> | arXiv | 2024-02-18 | [Github](https://github.com/Hyperwjf/LogicCheckGPT) | - |
- Star - 02-05 | [Github](https://github.com/OpenKG-ORG/EasyDetect) | - |
- **Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization** - 11-28 | [Github](https://github.com/opendatalab/HA-DPO) | [Coming soon]() |
- **Detecting and Preventing Hallucinations in Large Vision Language Models** - 08-11 | - | - |
- **Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding** - 03-27 | - | - |
- Star - 03-20 | [Github](https://github.com/IVY-LVLM/Counterfactual-Inception) | - |
- [**Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs**](https://arxiv.org/pdf/2407.21771) <br> | ECCV | 2024-07-31 | [Github](https://github.com/LALBJ/PAI) | - |
- **Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization** - 03-13 | - | - |
- Star - 02-06 | [Github](https://github.com/MasaiahHan/CorrelationQA) | - |
- **VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap** - 05-24 | [Coming soon]() | - |
- [**HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding**](https://arxiv.org/pdf/2403.00425.pdf) <br> | arXiv | 2024-03-01 | [Github](https://github.com/BillChan226/HALC) | - |
- [**AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention**](https://arxiv.org/pdf/2406.12718) <br> | arXiv | 2024-06-18 | [Github](https://github.com/Lackel/AGLA) | - |
- Star - 03-08 | [Github](https://github.com/yfzhang114/LLaVA-Align) | - |
- [**Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models**](https://arxiv.org/pdf/2410.03577) <br> | arXiv | 2024-10-04 | [Github](https://github.com/1zhou-Wang/MemVR) | - |
- [**Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations**](https://arxiv.org/pdf/2410.02762) <br> | arXiv | 2024-10-03 | [Github](https://github.com/nickjiang2378/vl-interp/) | - |
- **Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback** - 04-22 | - | - |
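Several of the mitigation papers above (e.g. VCD, HALC, IBD) share a contrastive-decoding recipe: compare the model's next-token logits given the real image against logits from a distorted or biased view, and penalize tokens the language prior produces regardless of the visual evidence. A minimal sketch of that idea, with toy logit tensors standing in for real LVLM outputs (not any specific paper's exact algorithm):

```python
# Minimal sketch of contrastive decoding for hallucination mitigation.
# The tensors are toy placeholders; in practice `logits_clean` and
# `logits_distorted` come from the same LVLM conditioned on the original
# image and on a noised/blurred copy of it.
import torch

def contrastive_logits(logits_clean: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Boost tokens supported by the clean image and damp tokens that the
    language prior emits even when the visual evidence is destroyed."""
    return (1.0 + alpha) * logits_clean - alpha * logits_distorted

# Toy vocabulary of five tokens.
clean = torch.tensor([2.0, 1.8, 0.5, -1.0, 0.0])
distorted = torch.tensor([1.9, 0.2, 0.4, -1.0, 0.1])
# clean.argmax() is token 0, but token 0 scores high even without the image,
# so the contrastive score prefers the visually grounded token 1.
print(contrastive_logits(clean, distorted).argmax().item())  # prints 1
```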
Our MLLM works
- Read our new version of the survey
- Freeze-Omni: 🍎 Project Page (VITA-MLLM/Freeze-Omni)
- VITA: 🍎 Project Page (VITA-MLLM/VITA) | [🤗 Hugging Face](https://huggingface.co/VITA-MLLM) | [💬 WeChat (微信)](https://github.com/VITA-MLLM/VITA/blob/main/asset/wechat_5.jpg)
- Video-MME: **[Dataset](https://github.com/BradyFU/Video-MME?tab=readme-ov-file#-dataset)** | **[Leaderboard](https://video-mme.github.io/home_page.html#leaderboard)**
Datasets of Pre-Training for Alignment
- Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Image-Text |
- Kosmos-2: Grounding Multimodal Large Language Models to the World | Image-Text-Bounding-Box |
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Video-Text |
- Microsoft COCO: Common Objects in Context | Image-Text |
- Im2Text: Describing Images Using 1 Million Captioned Photographs | Image-Text |
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Image-Text |
- LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Image-Text |
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Image-Text |
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Image-Text |
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Image-Text |
- Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Image-Text |
- AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Image-Text |
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Audio-Text |
- AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | Audio-Text |
- AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | Audio-Text |
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Video-Text |
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Video-Text |
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Video-Text |
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Video-Text |
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Audio-Text |
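The alignment datasets listed above are consumed in essentially the same form: an image (or video/audio clip) paired with a caption, later wrapped in a conversation template for the connector pre-training stage. A minimal sketch of such a record; the field names and template are illustrative, not any specific model's format:

```python
# Minimal sketch of an image-text alignment sample as used in MLLM
# pre-training. Field names and the chat template are illustrative only.
from dataclasses import dataclass

@dataclass
class AlignmentSample:
    image_path: str   # path or URL of the raw image
    caption: str      # paired natural-language description
    source: str       # originating dataset, e.g. "coco" or "laion400m"

def to_training_record(sample: AlignmentSample) -> dict:
    """Wrap the pair in a simple instruction-style conversation."""
    return {
        "image": sample.image_path,
        "conversations": [
            {"role": "user", "content": "<image>\nDescribe the image."},
            {"role": "assistant", "content": sample.caption},
        ],
    }

record = to_training_record(
    AlignmentSample("000000039769.jpg", "Two cats sleeping on a couch.", "coco"))
print(record["conversations"][1]["content"])
```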
Multimodal Chain-of-Thought
- **Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction** - 05-23 | - | - |
- **DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models** - 10-25 | [Github](https://toneyaya.github.io/ddcot/) | - |
- [**Multimodal Chain-of-Thought Reasoning in Language Models**](https://arxiv.org/pdf/2302.00923.pdf) <br> | arXiv | 2023-02-02 | [Github](https://github.com/amazon-science/mm-cot) | - |
- Star - Paper-Conference.pdf) <br> | NeurIPS | 2022-09-20 | [Github](https://github.com/lupantech/ScienceQA) | - |
- Star - Language Pre-Training via Embodied Chain of Thought**](https://arxiv.org/pdf/2305.15021.pdf) <br> | arXiv | 2023-05-24 | [Github](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch) | - |
- Star - 06-27 | [Github](https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning) | - |
- **T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering** - 05-05 | - | - |
- **Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings** - 05-03 | [Coming soon](https://github.com/dannyrose30/VCOT) | - |
- **Chain of Thought Prompt Tuning in Vision Language Models** - 04-16 | [Coming soon]() | - |
- Star - 06-27 | [Github](https://github.com/shikras/shikra) | [Demo](http://demo.zhaozhang.net:7860/) |
- Star - Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models**](https://arxiv.org/pdf/2310.16436.pdf) <br> | NeurIPS | 2023-10-25 | [Github](https://github.com/SooLab/DDCOT) | - |
- Star - of-Thought Reasoning in Multi-Modal Language Models**](https://arxiv.org/pdf/2403.16999.pdf) <br> | arXiv | 2024-03-25 | [Github](https://github.com/deepcs233/Visual-CoT) | Local Demo |
- Star - of-Thought of MLLM**](https://arxiv.org/pdf/2404.16033.pdf) <br> | arXiv | 2024-04-24 | [Github](https://github.com/ggg0919/cantor) | Local Demo |
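Most of the multimodal chain-of-thought works above follow a two-stage pattern: first elicit a rationale grounded in the image, then condition the final answer on that rationale. A minimal sketch of the prompting flow; `ask_mllm` is a stand-in callable for whatever model you use, not a real API:

```python
# Minimal sketch of two-stage multimodal chain-of-thought prompting.
# `ask_mllm` is a placeholder callable, not a real library API.
from typing import Callable

def cot_answer(question: str, ask_mllm: Callable[[str], str]) -> str:
    # Stage 1: elicit a rationale grounded in the visual evidence.
    rationale = ask_mllm(
        f"<image>\nQuestion: {question}\n"
        "Describe the relevant visual evidence step by step before answering.")
    # Stage 2: condition the final answer on the generated rationale.
    return ask_mllm(
        f"<image>\nQuestion: {question}\nRationale: {rationale}\n"
        "Give the final answer only.")

# Toy stand-in model so the sketch runs end to end.
print(cot_answer("How many cats are on the couch?", lambda prompt: "stub response"))
```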
LLM-Aided Visual Reasoning
- Star - Modal Models**](https://arxiv.org/pdf/2403.18252.pdf) <br> | arXiv | 2024-03-27 | [Github](https://github.com/LaVi-Lab/Visual-Table) | - |
- Star - 10-26 | [Github](https://github.com/OpenGVLab/ControlLLM) | - |
- Star - 12-21 | [Github](https://github.com/penghao-wu/vstar) | Local Demo |
- Star - Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing**](https://arxiv.org/pdf/2311.00571.pdf) <br> | arXiv | 2023-11-01 | [Github](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo) | [Demo](https://6dd3-20-163-117-69.ngrok-free.app/) |
- **MM-VID: Advancing Video Understanding with GPT-4V(vision)** - 10-30 | - | - |
- Star - 09-18 | [Github](https://github.com/mindagent/mindagent) | - |
- Star - 08-01 | [Github](https://github.com/dvlab-research/LISA) | [Demo](http://103.170.5.190:7860/) |
- Star - 06-28 | [Github](https://github.com/ContextualAI/lens) | [Demo](https://lens.contextual.ai/) |
- **Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models** - 06-15 | - | - |
- Star - modal Assistant that can Plan, Execute, Inspect, and Learn**](https://arxiv.org/pdf/2306.08640.pdf) <br> | arXiv | 2023-06-14 | [Github](https://github.com/showlab/assistgpt) | - |
- **Mindstorms in Natural Language-Based Societies of Mind** - 05-26 | - | - |
- Star - 05-24 | [Github](https://github.com/weixi-feng/LayoutGPT) | - |
- Star - Visual Chat Learns to Reject Human Instructions in Image Re-creation**](https://arxiv.org/pdf/2303.05983.pdf) <br> | arXiv | 2023-05-10 | [Github](https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat) | - |
- Star - 03-14 | [Github](https://github.com/cvlab-columbia/viper) | Local Demo |
- Star - 05-24 | [Github](https://github.com/Hxyou/IdealGPT) | Local Demo |
- Star - X: Training-Free Name-Only Transfer of Vision-Language Models**](https://arxiv.org/pdf/2211.16198.pdf) <br> | arXiv | 2022-11-28 | [Github](https://github.com/vishaal27/SuS-X) | - |
- Star - 2 Answers: Automatic Questioning Towards Enriched Visual Descriptions**](https://arxiv.org/pdf/2303.06594.pdf) <br> | arXiv | 2023-03-12 | [Github](https://github.com/Vision-CAIR/ChatCaptioner) | Local Demo |
- **ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction** - 03-09 | - | - |
- Star - shot Learners**](https://arxiv.org/pdf/2303.02151.pdf) <br> | CVPR | 2023-03-03 | [Github](https://github.com/ZrrSkywalker/CaFo) | - |
- Star - world Learning**](https://arxiv.org/pdf/2211.11682.pdf) <br> | CVPR | 2022-11-21 | [Github](https://github.com/yangyangyang127/PointCLIP_V2) | - |
- Star - Shot Multimodal Reasoning with Language**](https://arxiv.org/pdf/2204.00598.pdf) <br> | arXiv | 2022-04-01 | [Github](https://github.com/google-research/google-research/tree/master/socraticmodels) | - |
- Star - 05-04 | [Github](https://github.com/ttengwang/Caption-Anything) | [Demo](https://huggingface.co/spaces/TencentARC/Caption-Anything) |
- Star - 11-18 | [Github](https://github.com/allenai/visprog) | Local Demo |
- Star - instruction**](https://arxiv.org/pdf/2305.18752.pdf) <br> | arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://c60eb7e9400930f31b.gradio.live/) |
- Star - REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https://arxiv.org/pdf/2303.11381.pdf) <br> | arXiv | 2023-03-20 | [Github](https://github.com/microsoft/MM-REACT) | [Demo](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react) |
- Star - and-Play Compositional Reasoning with Large Language Models**](https://arxiv.org/pdf/2304.09842.pdf) <br> | arXiv | 2023-04-19 | [Github](https://github.com/lupantech/chameleon-llm) | [Demo](https://chameleon-llm.github.io/) |
- Star - 03-30 | [Github](https://github.com/microsoft/JARVIS) | [Demo](https://huggingface.co/spaces/microsoft/HuggingGPT) |
- Star - 03-08 | [Github](https://github.com/microsoft/TaskMatrix) | [Demo](https://huggingface.co/spaces/microsoft/visual_chatgpt) |
- **AVIS: Autonomous Visual Information Seeking with Large Language Models** - 06-13 | - | - |
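The systems in this section mostly share one control flow: an LLM acts as the controller, picks a vision tool (captioner, detector, OCR, ...), runs it, and reasons over the observation, as in MM-REACT, HuggingGPT, or Visual ChatGPT. A minimal sketch of that dispatch loop with stub tools; the tool names and the rule-based planner are illustrative only:

```python
# Minimal sketch of the LLM-as-controller / tool-dispatch pattern.
# Tool bodies are stubs and `plan` replaces what is normally a prompted
# LLM decision; names and signatures are illustrative only.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": lambda image: "a dog chasing a ball on grass",
    "detect":  lambda image: "dog (0.93), sports ball (0.88)",
    "ocr":     lambda image: "",
}

def plan(question: str) -> str:
    """Stand-in for the controller's tool-selection step."""
    return "detect" if "how many" in question.lower() else "caption"

def answer(question: str, image: str) -> str:
    tool = plan(question)
    observation = TOOLS[tool](image)
    # Real systems feed the observation back to the LLM for a final answer;
    # here we just surface it so the control flow stays visible.
    return f"[{tool}] {observation}"

print(answer("How many dogs are in the picture?", "photo.jpg"))
```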
Multimodal Instruction Tuning
- **MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training** - 03-14 | - | - |
- Star - 12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - |
- Star - 12-15 | [Github](https://github.com/CircleRadon/Osprey) | [Demo](http://111.0.123.204:8000/) |
- **Pixel Aligned Language Models** - 12-14 | [Coming soon]() | - |
- **See, Say, and Segment: Teaching LMMs to Overcome False Premises** - 12-13 | [Coming soon]() | - |
- Star - Language Models**](https://arxiv.org/pdf/2312.06109.pdf) <br> | arXiv | 2023-12-11 | [Github](https://github.com/Ucas-HaoranWei/Vary) | [Demo](http://region-31.seetacloud.com:22701/) |
- Star - enhanced Projector for Multimodal LLM**](https://arxiv.org/pdf/2312.06742.pdf) <br> | arXiv | 2023-12-11 | [Github](https://github.com/kakaobrain/honeybee) | - |
- Star - 12-06 | [Github](https://github.com/csuhan/OneLLM) | [Demo](https://huggingface.co/spaces/csuhan/OneLLM) |
- Star - 12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - |
- Star - 12-01 | [Github](https://github.com/vlm-driver/Dolphins) | - |
- Star - 3D Understanding, Reasoning, and Planning**](https://arxiv.org/pdf/2311.18651.pdf) <br> | arXiv | 2023-11-30 | [Github](https://github.com/Open3DA/LL3DA) | [Coming soon]() |
- Star - 11-30 | [Github](https://github.com/huangb23/VTimeLLM/) | Local Demo |
- Star - VID: An Image is Worth 2 Tokens in Large Language Models**](https://arxiv.org/pdf/2311.17043.pdf) <br> | arXiv | 2023-11-28 | [Github](https://github.com/dvlab-research/LLaMA-VID) | [Coming soon]() |
- Star - 11-27 | [Github](https://github.com/dvlab-research/LLMGA) | [Demo](https://baa55ef8590b623f18.gradio.live/) |
- Star - Level Visual Knowledge**](https://arxiv.org/pdf/2311.11860.pdf) <br> | arXiv | 2023-11-20 | [Github](https://github.com/rshaojimmy/JiuTian) | - |
- Star - 11-18 | [Github](https://github.com/embodied-generalist/embodied-generalist) | [Demo](https://www.youtube.com/watch?v=mlnjz4eSjB4) |
- Star - 4V for Better Visual Instruction Tuning**](https://arxiv.org/pdf/2311.07574.pdf) <br> | arXiv | 2023-11-13 | [Github](https://github.com/X2FD/LVIS-INSTRUCT4V) | - |
- Star - modal Large Language Models**](https://arxiv.org/pdf/2311.07575.pdf) <br> | arXiv | 2023-11-13 | [Github](https://github.com/Alpha-VLLM/LLaMA2-Accessory) | [Demo](http://imagebind-llm.opengvlab.com/) |
- Star - modal Models**](https://arxiv.org/abs/2311.06607.pdf) <br> | arXiv | 2023-11-11 | [Github](https://github.com/Yuliang-Liu/Monkey) | [Demo](http://27.17.184.224:7681/) |
- **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents** - 11-09 | [Coming soon]() | [Demo](https://llavaplus.ngrok.io/) |
- Star - Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf) <br> | arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo |
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding** - 11-06 | [Coming soon]() | - |
- Star - 11-06 | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Demo](https://glamm.mbzuai-oryx.ngrok.app/) |
- Star - 11-02| [Github](https://github.com/RUCAIBox/ComVint) | - |
- Star - 10-11 | [Github](https://github.com/apple/ml-ferret) | - |
- Star - 10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) |
- Star - modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - |
- Star - Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf) <br> | arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo |
- **AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model** - 09-27 | - | - |
- Star - 09-20 | [Github](https://github.com/RunpeiDong/DreamLLM) | [Coming soon]() |
- **An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models** - 09-18 | [Coming soon]() | - |
- Star - XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https://arxiv.org/pdf/2309.15112.pdf) <br> | arXiv | 2023-09-26 | [Github](https://github.com/InternLM/InternLM-XComposer) | Local Demo |
- Star - Modal LLMs**](https://arxiv.org/pdf/2307.08581.pdf) <br> | arXiv | 2023-07-17 | [Github](https://github.com/magic-research/bubogpt) | [Demo](https://huggingface.co/spaces/magicr/BuboGPT) |
- Star - 07-09 | [Github](https://github.com/BAAI-DCAI/Visual-Instruction-Tuning) | - |
- Star - of-Interest**](https://arxiv.org/pdf/2307.03601.pdf) <br> | arXiv | 2023-07-07 | [Github](https://github.com/jshilong/GPT4RoI) | [Demo](http://139.196.83.164:7000/) |
- Star - Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf) <br> | arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm) | - |
- Star - DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf) <br> | arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) |
- Star - 07-03 | [Github](https://github.com/ChenDelong1999/polite_flamingo) | [Demo](http://clever_flamingo.xiaoice.com/) |
- Star - Rich Image Understanding**](https://arxiv.org/pdf/2306.17107.pdf) <br> | arXiv | 2023-06-29 | [Github](https://github.com/SALT-NLP/LLaVAR) | [Demo](https://eba470c07c805702b8.gradio.live/) |
- Star - LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/pdf/2306.09093.pdf) <br> | arXiv | 2023-06-15 | [Github](https://github.com/lyuchenyang/Macaw-LLM) | [Coming soon]() |
- Star - ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/pdf/2306.05424.pdf) <br> | arXiv | 2023-06-08 | [Github](https://github.com/mbzuai-oryx/Video-ChatGPT) | [Demo](https://www.ival-mbzuai.com/video-chatgpt) |
- Star - Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https://arxiv.org/pdf/2306.00890.pdf) <br> | arXiv | 2023-06-01 | [Github](https://github.com/microsoft/LLaVA-Med) | - |
- Star - Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf) <br> | arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) |
- Star - 05-25 | [Github](https://github.com/joez17/ChatBridge) | - |
- Star - Language Instruction Tuning for Large Language Models**](https://arxiv.org/pdf/2305.15023.pdf) <br> | arXiv | 2023-05-24 | [Github](https://github.com/luogen1996/LaVIN) | Local Demo |
- Star - 05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) |
- Star - 05-19 | [Github](https://github.com/microsoft/Pengi) | - |
- Star - Ended Decoder for Vision-Centric Tasks**](https://arxiv.org/pdf/2305.11175.pdf) <br> | arXiv | 2023-05-18 | [Github](https://github.com/OpenGVLab/VisionLLM) | - |
- Star - VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf) <br> | arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - |
- Star - turn Interleaved Multimodal Instruction-following**](https://arxiv.org/pdf/2309.08637.pdf) <br> | arXiv | 2023-09-14 | [Github](https://github.com/SihengLi99/TextBind) | [Demo](https://ailabnlp.tencent.com/research_demos/textbind/) |
- Star - GPT: Any-to-Any Multimodal LLM**](https://arxiv.org/pdf/2309.05519.pdf) <br> | arXiv | 2023-09-11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | [Demo](https://fc7a82a1c76b336b6f.gradio.live/) |
- Star - Modal Training Enhances LLMs in Truthfulness and Ethics**](https://arxiv.org/pdf/2309.07120.pdf) <br> | arXiv | 2023-09-13 | [Github](https://github.com/UCSC-VLAA/Sight-Beyond-Text) | - |
- **Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning** - 09-05 | - | - |
- Star - 08-31 | [Github](https://github.com/OpenRobotLab/PointLLM) | [Demo](http://101.230.144.196/) |
- Star - DataEngine: An Iterative Refinement Approach for MLLM**](https://arxiv.org/pdf/2308.13566.pdf) <br> | arXiv | 2023-08-25 | [Github](https://github.com/opendatalab/MLLM-DataEngine) | - |
- Star - Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf) <br> | arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) |
- Star - VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf) <br> | arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) |
- Star - Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf) <br> | arXiv | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) |
- Star - Dialogue Data**](https://arxiv.org/pdf/2308.10253.pdf) <br> | arXiv | 2023-08-20 | [Github](https://github.com/icoz69/StableLLAVA) | - |
- Star - rich Visual Questions**](https://arxiv.org/pdf/2308.09936.pdf) <br> | arXiv | 2023-08-19 | [Github](https://github.com/mlpc-ucsd/BLIVA) | [Demo](https://huggingface.co/spaces/mlpc-lab/BLIVA) |
- Star - tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https://arxiv.org/pdf/2308.04152.pdf) <br> | arXiv | 2023-08-08 | [Github](https://github.com/DCDmllm/Cheetah) | - |
- Star - Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf) <br> | arXiv | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) |
- Star - 07-31 | [Github](https://github.com/rese1f/MovieChat) | Local Demo |
- Star - LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf) <br> | arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - |
- Star - 06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - |
- Star - LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/pdf/2306.02858.pdf) <br> | arXiv | 2023-06-05 | [Github](https://github.com/DAMO-NLP-SG/Video-LLaMA) | [Demo](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) |
- Star - 05-18 | [Github](https://github.com/YuanGongND/ltu) | [Demo](https://github.com/YuanGongND/ltu) |
- Star - 6B** <br> | - | 2023-05-17 | [Github](https://github.com/THUDM/VisualGLM-6B) | Local Demo |
- Star - purpose Vision-Language Models with Instruction Tuning**](https://arxiv.org/pdf/2305.06500.pdf) <br> | arXiv | 2023-05-11 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | Local Demo |
- Star - LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf) <br> | arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - |
- Star - Centric Video Understanding**](https://arxiv.org/pdf/2305.06355.pdf) <br> | arXiv | 2023-05-10 | [Github](https://github.com/OpenGVLab/Ask-Anything) | [Demo](https://ask.opengvlab.com/) |
- Star - GPT: A Vision and Language Model for Dialogue with Humans**](https://arxiv.org/pdf/2305.04790.pdf) <br> | arXiv | 2023-05-08 | [Github](https://github.com/open-mmlab/Multimodal-GPT) | [Demo](https://mmgpt.openmmlab.org.cn/) |
- Star - 05-05 | [Github](https://github.com/YunxinLi/LingCloud) | Local Demo |
- Star - Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf) <br> | ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - |
- Star - 04-17 | [GitHub](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) |
- Star - 4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https://arxiv.org/pdf/2304.10592.pdf) <br> | arXiv | 2023-04-20 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | - |
- Star - Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https://arxiv.org/pdf/2303.16199.pdf) <br> | arXiv | 2023-03-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](https://huggingface.co/spaces/csuhan/LLaMA-Adapter) |
- Star - Owl: Modularization Empowers Large Language Models with Multimodality**](https://arxiv.org/pdf/2304.14178.pdf) <br> | arXiv | 2023-04-27 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | [Demo](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl) |
- Star - LLaVA: Learning United Visual Representation by Alignment Before Projection**](https://arxiv.org/pdf/2311.10122.pdf) <br> | arXiv | 2023-11-16 | [Github](https://github.com/PKU-YuanGroup/Video-LLaVA) | [Demo](https://huggingface.co/spaces/LanguageBind/Video-LLaVA) |
- Star - Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf) <br> | ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) |
- Star - sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf) <br> | arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo |
- Star - 11-27 | [Github](https://github.com/tingxueronghua/ChartLlama-code) | - |
- Star - 12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) |
- Star - VL**](https://github.com/01-ai/Yi/tree/main/VL) <br> | - | 2024-01-23 | [Github](https://github.com/01-ai/Yi/tree/main/VL) | Local Demo |
- Star - Linguistic Tasks**](https://arxiv.org/pdf/2312.14238.pdf) <br> | arXiv | 2023-12-21 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |
- **SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities** - 01-22 | - | - |
- **ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning** - 07-18 | - | [Demo](https://chatspot.streamlit.app/) |
- Star - Free Vision-Language Models**](https://arxiv.org/pdf/2406.11832) <br> | arXiv | 2024-06-17 | [Github](https://github.com/baaivision/EVE) | Local Demo |
- Star - 06-04 | [Github](https://github.com/AIDC-AI/Parrot) | - |
- **Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study** - 01-31 | [Coming soon]() | - |
- Star - Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https://arxiv.org/pdf/2410.07167) <br> | arXiv | 2024-10-09 | [Github](https://github.com/shikiw/Modality-Integration-Rate) | - |
- Star - to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384) <br> | ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo |
- Star - Language Graph Reasoning**](https://arxiv.org/pdf/2402.02130) <br> | NeurIPS | 2024-02-03 | [Github](https://github.com/WEIYanbin1999/GITA/) | - |
- Star - 06-24 | [Github](https://github.com/EvolvingLMMs-Lab/LongVA) | Local Demo |
- Star - OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326) <br> | arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) |
- Star - V: A GPT-4V Level MLLM on Your Phone**](https://arxiv.org/pdf/2408.01800) <br> | arXiv | 2024-08-03 | [Github](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) |
- **VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding** - 12-04 | - | - |
- Star - Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/pdf/2406.07476) <br> | arXiv | 2024-06-11 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | Local Demo |
- **GROUNDHOG: Grounding Large Language Models to Holistic Segmentation** - 02-26 | Coming soon | Coming soon |
- Star - 02-17 | [Github](https://github.com/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |
- Star - Language Models Diving into Details through Chain of Manipulations**](https://arxiv.org/pdf/2402.04236.pdf) <br> | arXiv | 2024-02-06 | [Github](https://github.com/THUDM/CogCoM) | - |
- **M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning** - 06-07 | - | - |
- Star - Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211) <br> | arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - |
- Star - Gemini: Mining the Potential of Multi-modality Vision Language Models**](https://arxiv.org/pdf/2403.18814.pdf) <br> | arXiv | 2024-03-27 | [Github](https://github.com/dvlab-research/MiniGemini) | [Demo](http://103.170.5.190:7860) |
- Star - 02-19 | [Github](https://github.com/OpenMOSS/AnyGPT) | - |
- **CoLLaVO: Crayon Large Language and Vision mOdel** - 02-17 | - | - |
- Github - VL) |
- Star - 08-28 | [Github](https://github.com/NVlabs/Eagle) | [Demo](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat) |
- Star - Language Understanding**](https://arxiv.org/pdf/2410.17434) <br> | arXiv | 2024-10-22 | [Github](https://github.com/Vision-CAIR/LongVU) | [Demo](https://huggingface.co/spaces/Vision-CAIR/LongVU) |
- Star - 08-01 | [Github](https://github.com/dvlab-research/LISA) | [Demo](http://103.170.5.190:7860) |
- Star - 03-12 | [Github](https://github.com/ByungKwanLee/MoAI) | Local Demo |
- Star - 05-24 | [Github](https://github.com/alibaba/conv-llava) | - |
- Star - based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574) <br> | arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | Local Demo |
- Star - VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https://arxiv.org/pdf/2409.12191) <br> | arXiv | 2024-09-18 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2-VL) |
- Star - synthesized Data for A Lite Vision-Language Model**](https://arxiv.org/pdf/2402.11684.pdf) <br> | arXiv | 2024-02-18 | [Github](https://github.com/FreedomIntelligence/ALLaVA) | [Demo](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) |
- Star - Grained Temporal Reasoning**](https://arxiv.org/pdf/2402.11435.pdf) <br> | arXiv | 2024-02-18 | [Github](https://github.com/DCDmllm/Momentor) | - |
- Star - modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https://arxiv.org/pdf/2409.02889) <br> | arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - |
- Star - HD: Diving into High-Resolution Large Multimodal Models**](https://arxiv.org/pdf/2406.08487) <br> | arXiv | 2024-06-12 | [Github](https://github.com/yfzhang114/SliME) | - |
- Star - 05-31 | [Github](https://github.com/AIDC-AI/Ovis/) | - |
- **Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs** - 04-08 | - | - |
- Star - LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https://arxiv.org/pdf/2404.05726.pdf) <br> | CVPR | 2024-04-08 | [Github](https://github.com/boheumd/MA-LMM) | - |
- Star - UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https://arxiv.org/pdf/2311.08046) <br> | CVPR | 2023-11-14 | [Github](https://github.com/PKU-YuanGroup/Chat-UniVi) | - |
- **Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models** - 09-25 | [Huggingface](https://huggingface.co/allenai/MolmoE-1B-0924) | [Demo](https://molmo.allenai.org) |
- Star - LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**](https://arxiv.org/pdf/2406.19389) <br> | arXiv | 2024-06-27 | [Github](https://github.com/lxtGH/OMG-Seg) | Local Demo |
- Star - 1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs**](https://arxiv.org/pdf/2406.16860) <br> | arXiv | 2024-06-24 | [Github](https://github.com/cambrian-mllm/cambrian) | Local Demo |
- Star - 04-22 | [Github](https://github.com/graphic-design-ai/graphist) | - |
- Star - Upcycled Mixture-of-Experts**](https://arxiv.org/pdf/2405.05949) <br> | arXiv | 2024-05-09 | [Github](https://github.com/SHI-Labs/CuMo) | Local Demo |
- Star - Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721) <br> | CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) |
- Star - VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model**](https://arxiv.org/pdf/2407.07577) <br> | arXiv | 2024-07-10 | [Github](https://github.com/jiyt17/IDA-VLM) | - |
- **Parrot: Multilingual Visual Instruction Tuning** - 06-04 | [Coming soon]() | - |
- **Ovis: Structural Embedding Alignment for Multimodal Large Language Model** - 05-31 | [Coming soon]() | - |
- Star - Language Models**](https://arxiv.org/pdf/2405.19315) <br> | arXiv | 2024-05-29 | [Github](https://github.com/gordonhu608/MQT-LLaVA) | [Demo](https://huggingface.co/spaces/gordonhu/MQT-LLaVA) |
- Star - Seeing Project V2: Towards General Relation Comprehension of the Open World**](https://arxiv.org/pdf/2402.19474.pdf) | arXiv | 2024-02-29 | [Github](https://github.com/OpenGVLab/all-seeing) | - |
- **VILA^2: VILA Augmented VILA** - 07-24 | - | - |
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** - 07-22 | - | - |
- **EVLM: An Efficient Vision-Language Model for Visual Understanding** - 07-19 | - | - |
- Star - 05-16 | [Github](https://github.com/YifanXu74/Libra) | Local Demo |
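A recurring implementation detail behind the instruction-tuning works above (LLaVA-style training in particular) is label masking: the cross-entropy loss is applied only to the assistant response, while image placeholder and instruction tokens are masked with an ignore index. A minimal sketch with toy token ids; the tokenization itself is assumed:

```python
# Minimal sketch of response-only label masking for visual instruction
# tuning. Token ids are toy values; tokenization is assumed.
IGNORE_INDEX = -100  # ignore value used by typical cross-entropy losses

def build_labels(prompt_ids: list[int], response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

prompt_ids = [101, 7, 8, 9]      # e.g. "<image> Describe the image."
response_ids = [42, 43, 44, 2]   # e.g. "Two cats ... </s>"
input_ids, labels = build_labels(prompt_ids, response_ids)
assert len(input_ids) == len(labels)
print(labels)  # [-100, -100, -100, -100, 42, 43, 44, 2]
```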
Benchmarks for Evaluation
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding - tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks. |
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
- BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning - annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark - Anything/blob/main/video_chat2/MVBENCH.md) | A comprehensive multimodal benchmark for video understanding |
- OtterHD: A High-Resolution Multi-modality Model - AI/MagnifierBench) | A benchmark designed to probe models' ability of fine-grained perception |
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models - lab/HallusionBench) |An image-context reasoning benchmark for evaluation of hallucination |
- Aligning Large Multimodal Models with Factually Augmented RLHF - Bench) | A benchmark for hallucination evaluation |
- MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V - Bench) | GPT-4V evaluation with per-sample criteria |
- MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
- Link-Context Learning for Multimodal LLMs - Portal) | A benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning |
- Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions - language instructions |
- SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs - scale chart-visual question-answering dataset |
- MMBench: Is Your Multi-modal Model an All-around Player? - compass/MMBench) | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models|
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? - llm#prepare-data) | A comprehensive evaluation benchmark including both image and video tasks |
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models - Multimodal-Large-Language-Models/tree/Evaluation) | A comprehensive MLLM Evaluation benchmark |
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models - Modality-Arena) | An evaluation platform for MLLMs |
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities - Vet) | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension - CVC/SEED-Bench) | A benchmark for evaluation of generative comprehension in MLLMs |
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models - NLP-SG/M3Exam) | A multilingual, multimodal, multilevel benchmark for evaluating MLLM |
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality - PLUG/mPLUG-Owl/tree/main/OwlEval) | Dataset for evaluation on multiple capabilities |
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning - Instruction#evaluationgavie) | A benchmark to evaluate the hallucination and instruction following ability |
- ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models - assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria. |
- Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond - icler/PCA-EVAL) | A benchmark for evaluating multi-domain embodied decision-making. |
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models - grained temporal understanding |
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models - YuanGroup/Video-Bench) | A benchmark for video-MLLM evaluation |
- Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset - cuhk/MathVision) | A diverse mathematical reasoning benchmark |
- CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark - Benchmark/CMMMU) | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |
- Benchmarking Large Multimodal Models against Common Corruptions - sg/MMCBench) | A benchmark for examining self-consistency under common corruptions |
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark - benchmark) | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
- Visually Dehallucinative Instruction Generation: Know What You Don't Know
- VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning - zong/VL-ICL) | A benchmark for M-ICL evaluation, covering a wide spectrum of tasks |
- Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs - roberts1/charting-new-territories) | A benchmark for evaluating geographic and geospatial capabilities |
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? - RealWorld) | A challenging benchmark that involves real-life scenarios |
- Can MLLMs Perform Text-to-Image In-Context Learning? - to-image ICL |
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts - centric benchmark |
- OtterHD: A High-Resolution Multi-modality Model - AI/MagnifierBench) | A benchmark designed to probe models' fine-grained perception ability |
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models - lab/HallusionBench) | An image-context reasoning benchmark for evaluating hallucination |
- Detecting and Preventing Hallucinations in Large Vision Language Models
- Aligning Large Multimodal Models with Factually Augmented RLHF - Bench) | A benchmark for hallucination evaluation |
- MMBench: Is Your Multi-modal Model an All-around Player? - compass/MMBench) | A systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models |
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs - nlp/CharXiv) | Chart understanding benchmark curated by human experts |
- TempCompass: Do Video LLMs Really Understand Videos?
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning - annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark - Anything/blob/main/video_chat2/MVBENCH.md) | A comprehensive multimodal benchmark for video understanding |
- Making Large Multimodal Models Understand Arbitrary Visual Prompts - Bench) | A benchmark for visual prompts |
- BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? - llm#prepare-data) | A comprehensive evaluation benchmark including both image and video tasks |
- OmniBench: Towards The Future of Universal Omni-Language Models - a-p/OmniBench) | A benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously |
- MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V - Bench) | GPT-4V evaluation with per-sample criteria |
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis - MME) | A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis |
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
- MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
- Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions - language instructions |
-
Foundation Models
- Star - 2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https://arxiv.org/pdf/2301.12597.pdf) <br> | arXiv | 2023-01-30 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb) |
- **Fuyu-8B: A Multimodal Architecture for AI Agents** - 10-17 | [Huggingface](https://huggingface.co/adept/fuyu-8b) | [Demo](https://huggingface.co/adept/fuyu-8b)
- Star - 07-30 | [Github](https://github.com/mshukor/UnIVAL) | [Demo](https://huggingface.co/spaces/mshukor/UnIVAL) |
- **PaLI-3 Vision Language Models: Smaller, Faster, Stronger** - 10-13 | - | - |
- **GPT-4V(ision) System Card** - 09-25 | - | - |
- Star - Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization**](https://arxiv.org/pdf/2309.04669.pdf) <br> | arXiv | 2023-09-09 | [Github](https://github.com/jy0205/LaVIT) | - |
- **Multimodal Foundation Models: From Specialists to General-Purpose Assistants** - 09-18 | - | - |
- Star - Language Learning with Decoupled Language Pre-training**](https://arxiv.org/pdf/2307.07063.pdf) <br> | NeurIPS | 2023-07-13 | [Github](https://github.com/yiren-jian/BLIText) | - |
- Star - 05-02 | [Github](https://github.com/VPGTrans/VPGTrans) | [Demo](https://3fc7715dbc44234a7f.gradio.live/) |
- **GPT-4 Technical Report** - 03-15 | - | - |
- **PaLM-E: An Embodied Multimodal Language Model** - 03-06 | - | [Demo](https://palm-e.github.io/#demo) |
- Star - Language Model with An Ensemble of Experts**](https://arxiv.org/pdf/2303.02506.pdf) <br> | arXiv | 2023-03-04 | [Github](https://github.com/NVlabs/prismer) | [Demo](https://huggingface.co/spaces/lorenmt/prismer) |
- Star - 10-06 | [Github](https://github.com/vimalabs/VIMA) | Local Demo |
- Star - Ended Embodied Agents with Internet-Scale Knowledge**](https://arxiv.org/pdf/2206.08853.pdf) <br> | NeurIPS | 2022-06-17 | [Github](https://github.com/MineDojo/MineDojo) | - |
- Star - Language Models are Unified Modal Learners**](https://arxiv.org/pdf/2206.07699.pdf) <br> | ICLR | 2022-06-15 | [Github](https://github.com/shizhediao/DaVinci) | - |
- Star - 07-11 | [Github](https://github.com/baaivision/Emu) | [Demo](http://218.91.113.230:9002/) |
- Star - Purpose Interfaces**](https://arxiv.org/pdf/2206.06336.pdf) <br> | arXiv | 2022-06-13 | [Github](https://github.com/microsoft/unilm) | - |
- **Gemini: A Family of Highly Capable Multimodal Models** - 12-06 | - | - |
- **The Llama 3 Herd of Models** - 07-31 | - | - |
- **Chameleon: Mixed-Modal Early-Fusion Foundation Models** - 05-16 | - | - |
- **Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context** - 02-15 | - | - |
- **Pixtral-12B** - 09-17 | - | - |
- **The Claude 3 Model Family: Opus, Sonnet, Haiku** - 03-04 | - | - |
- **Hello GPT-4o** - 05-13 | - | - |
- **Llama 3.2: Revolutionizing edge AI and vision with open, customizable models** - 09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
-
Multimodal In-Context Learning
- Star - Context Configurations for Image Captioning**](https://arxiv.org/pdf/2305.14800.pdf) <br> | NeurIPS | 2023-05-24 | [Github](https://github.com/yongliang-wu/ExploreCfg) | - |
- Star - D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction**](https://arxiv.org/pdf/2303.05063.pdf) <br> | ICCV | 2023-03-09 | [Github](https://github.com/MAEHCM/ICL-D3IE) | - |
- Star - based Visual Question Answering**](https://arxiv.org/pdf/2303.01903.pdf) <br> | CVPR | 2023-03-03 | [Github](https://github.com/MILVLG/prophet) | - |
- **Hijacking Context in Large Multi-modal Models** - 12-07 | - | - |
- **Towards More Unified In-context Visual Understanding** - 12-05 | - | - |
- Star - language Model with Multi-Modal In-Context Learning**](https://arxiv.org/pdf/2309.07915.pdf) <br> | arXiv | 2023-09-14 | [Github](https://github.com/HaozheZhao/MIC) | [Demo](https://8904cdd23621858859.gradio.live/) |
- Star - Context Learning for Multimodal LLMs**](https://arxiv.org/pdf/2308.07891.pdf) <br> | arXiv | 2023-08-15 | [Github](https://github.com/isekai-portal/Link-Context-Learning) | [Demo](http://117.144.81.99:20488/) |
- Star - Flamingo: a Multimodal Medical Few-shot Learner**](https://arxiv.org/pdf/2307.15189.pdf) <br> | arXiv | 2023-07-27 | [Github](https://github.com/snap-stanford/med-flamingo) | Local Demo |
- Star - 3 for Few-Shot Knowledge-Based VQA**](https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974) <br> | AAAI | 2022-06-28 | [Github](https://github.com/microsoft/PICa) | - |
- **Multimodal Few-Shot Learning with Frozen Language Models** - 06-25 | - | - |
- Star - IT: Multi-Modal In-Context Instruction Tuning**](https://arxiv.org/pdf/2306.05425.pdf) <br> | arXiv | 2023-06-08 | [Github](https://github.com/Luodian/Otter) | [Demo](https://otter.cliangyu.com/) |
- Star - Shot Learning**](https://arxiv.org/pdf/2204.14198.pdf) <br> | NeurIPS | 2022-04-29 | [Github](https://github.com/mlfoundations/open_flamingo) | [Demo](https://huggingface.co/spaces/dhansmair/flamingo-mini-cap) |
- **Visual In-Context Learning for Large Vision-Language Models** - 02-18 | - | - |
- Star - to-Image In-Context Learning?**](https://arxiv.org/pdf/2402.01293.pdf) <br> | arXiv | 2024-02-02 | [Github](https://github.com/UW-Madison-Lee-Lab/CoBSAT) | - |
-
Evaluation
- Star - 4V, Bard, and Other Large Multimodal Models**](https://arxiv.org/pdf/2310.02255.pdf) <br> | arXiv | 2023-10-03 | [Github](https://github.com/lupantech/MathVista) |
- **A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging** - 10-31 | - |
- Star - 10-02 | [Github](https://github.com/ys-zong/FoolyourVLLMs) |
- Star - Context Learning**](https://arxiv.org/pdf/2310.00647.pdf) <br> | arXiv | 2023-10-01 | [Github](https://github.com/mshukor/EvALign-ICL) |
- Star - 10-12 | [Github](https://github.com/zjunlp/EasyEdit) |
- Stars - 4V? Early Explorations of Gemini in Visual Expertise**](https://arxiv.org/pdf/2312.12436.pdf) <br> | arXiv | 2023-12-19 | [Github](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models) |
- Stars - style Visual Capability of Large Multimodal Models**](https://arxiv.org/pdf/2312.02896.pdf) <br> | arXiv | 2023-12-05 | [Github](https://github.com/AIFEG/BenchLMM) |
- Star - 11-27 | [Github](https://github.com/UCSC-VLAA/vllm-safety-benchmark) |
- Star - Bench, Evaluating Multi-modal LLMs using GPT-4V**](https://arxiv.org/pdf/2311.13951) <br> | arXiv | 2023-11-23 | [Github](https://github.com/FreedomIntelligence/MLLM-Bench) |
- **VLM-Eval: A General Evaluation on Video Large Language Models** - 11-20 | [Coming soon]() |
- Star - 4V(ision): Bias and Interference Challenges**](https://arxiv.org/pdf/2311.03287.pdf) <br> | arXiv | 2023-11-06 | [Github](https://github.com/gzcch/Bingo) |
- Star - 4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving**](https://arxiv.org/pdf/2311.05332.pdf) <br> | arXiv | 2023-11-09 | [Github](https://github.com/PJLab-ADG/GPT4V-AD-Exploration) |
- **Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead** - 11-05 | - |
- Star - 4V(ision)**](https://arxiv.org/pdf/2310.16534.pdf) <br> | arXiv | 2023-10-25 | [Github](https://github.com/albertwy/GPT-4V-Evaluation) |
- Star - 4V(ision) : A Quantitative and In-depth Evaluation**](https://arxiv.org/pdf/2310.16809.pdf) <br> | arXiv | 2023-10-25 | [Github](https://github.com/SCUT-DLVCLab/GPT-4V_OCR) |
- Star - Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models**](https://arxiv.org/pdf/2310.14566.pdf) <br> | arXiv | 2023-10-23 | [Github](https://github.com/tianyi-lab/HallusionBench) |
- Star - modal Model an All-around Player?**](https://arxiv.org/pdf/2307.06281.pdf) <br> | arXiv | 2023-07-12 | [Github](https://github.com/open-compass/MMBench) |
- Star - 05-13 | [Github](https://github.com/Yuliang-Liu/MultimodalOCR) |
- Star - LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets**](https://arxiv.org/pdf/2310.06594.pdf) <br> | arXiv | 2023-10-10 | [Github](https://github.com/liaoning97/REVO-LION) |
- **The Dawn of LMMs: Preliminary Explorations with GPT-4V(vision)** - 09-29 | - |
- Star - Language Models by Language Models**](https://arxiv.org/pdf/2308.16890.pdf) <br>| arXiv | 2023-08-31 | [Github](https://github.com/OFA-Sys/TouchStone) |
- Star - Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs**](https://arxiv.org/pdf/2308.03349.pdf) <br> | arXiv | 2023-08-07 | [Github](https://github.com/findalexli/SciGraphQA) |
- Star - Vet: Evaluating Large Multimodal Models for Integrated Capabilities**](https://arxiv.org/pdf/2308.02490.pdf) <br> | arXiv | 2023-08-04 | [Github](https://github.com/yuweihao/MM-Vet) |
- Star - Bench: Benchmarking Multimodal LLMs with Generative Comprehension**](https://arxiv.org/pdf/2307.16125.pdf) <br> | arXiv | 2023-07-30 | [Github](https://github.com/AILab-CVC/SEED-Bench) |
- Star - 06-08 | [Github](https://github.com/DAMO-NLP-SG/M3Exam) |
- Star - eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models**](https://arxiv.org/pdf/2306.09265.pdf) <br> | arXiv | 2023-06-15 | [Github](https://github.com/OpenGVLab/Multi-Modality-Arena) |
- Star - Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf) <br> | arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM#lamm-benchmark) |
- Star - Following Models**](https://arxiv.org/pdf/2308.16463.pdf) <br> | arXiv | 2023-08-31 | [Github](https://github.com/HYPJUDY/Sparkles#sparkleseval) |
- Stars - 01-11 | [Github](https://github.com/tsb0601/MMVP) |
- Stars - 01-22 | [Github](https://github.com/sail-sg/MMCBench) |
- Stars - VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models**](https://arxiv.org/pdf/2310.10942) <br> | TPAMI | 2023-10-17 | [Github](https://github.com/guoyang9/UNK-VQA) |
- Stars - RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?**](https://arxiv.org/pdf/2408.13257) <br> | arXiv | 2024-08-23 | [Github](https://github.com/yfzhang114/MME-RealWorld) |
- Stars - Modal Reasoning Capability via Chart-to-Code Generation**](https://arxiv.org/pdf/2406.09961) <br> | arXiv | 2024-04-15 | [Github](https://github.com/ChartMimic/ChartMimic) |
- Stars - 06-26 | [Github](https://github.com/princeton-nlp/CharXiv) |
- Stars - Language Models**](https://arxiv.org/pdf/2409.15272) <br> | arXiv | 2024-09-23 | [Github](https://github.com/multimodal-art-projection/OmniBench) |
- Stars - MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**](https://arxiv.org/pdf/2405.21075) <br> | arXiv | 2024-05-31 | [Github](https://github.com/BradyFU/Video-MME) |
- Stars - 06-29 | [Github](https://github.com/chenllliang/MMEvalPro) |
- Stars - scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs**](https://arxiv.org/pdf/2406.20098) <br> | arXiv | 2024-06-28 | [Github](https://github.com/MBZUAI-LLM/web2code) |
-
Others
- Star - 07-16 | [Github](https://github.com/AILab-CVC/SEED) |
- Star - 12-21 | [Github](https://github.com/SHI-Labs/VCoder) | Local Demo |
- Star - Modal LLMs**](https://arxiv.org/pdf/2312.04302.pdf) <br> | arXiv | 2023-12-07 | [Github](https://github.com/dvlab-research/Prompt-Highlighter) | - |
- Star - trained Models Help Vision Models on Perception Tasks?**](https://arxiv.org/pdf/2306.00693.pdf) <br> | arXiv | 2023-06-01 | [Github](https://github.com/huawei-noah/Efficient-Computing/tree/master/GPT4Image/) | - |
- Star - 05-29 | [Github](https://github.com/yuhangzang/ContextDET) | [Demo](https://huggingface.co/spaces/yuhangzang/ContextDet-Demo) |
- Star - 05-26 | [Github](https://github.com/kohjingyu/gill) | - |
- Star - Language Models**](https://arxiv.org/pdf/2305.16934.pdf) <br> | arXiv | 2023-05-26 | [Github](https://github.com/yunqing-me/AttackVLM) | - |
- Star - 01-31 | [Github](https://github.com/kohjingyu/fromage) | [Demo](https://huggingface.co/spaces/jykoh/fromage) |
- IMAD: IMage-Augmented multi-modal Dialogue
- Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? - vision-language.github.io/infoseek/) | A VQA dataset that focuses on asking information-seeking questions |
- Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities - vision-language.github.io/oven/) | A dataset that focuses on recognizing the Visual Entity on the Wikipedia, from images in the wild |
- Star - Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models**](https://arxiv.org/pdf/2402.02207.pdf) <br> | arXiv | 2024-02-03 | [Github](https://github.com/ys-zong/VLGuard) | - |
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models - oryx/Video-ChatGPT#quantitative-evaluation-bar_chart) | A quantitative evaluation framework for video-based dialogue models |
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation - tuning dataset for learning to reject instructions |
-
Multimodal RLHF
- Star - 12-17 | [Github](https://github.com/vlf-silkie/VLFeedback) | - |
- Star - V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback**](https://arxiv.org/pdf/2312.00849.pdf) <br> | arXiv | 2023-12-01 | [Github](https://github.com/RLHF-V/RLHF-V) | [Demo](http://120.92.209.146:8081/) |
- Star - 09-25 | [Github](https://github.com/llava-rlhf/LLaVA-RLHF) | [Demo](http://pitt.lti.cs.cmu.edu:7890/) |
- **Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization** - 10-09 | - | - |
-
Datasets of Multimodal Chain-of-Thought
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering - download-the-dataset) | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought - scale embodied planning dataset |
- Explainable Multimodal Emotion Reasoning - Multimodal-Emotion-Reasoning) | A benchmark dataset for explainable emotion reasoning task |
- Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction - time dataset that can be used to evaluate VideoCOT |
-
Datasets of Multimodal Instruction Tuning
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning - Instruct4V) | A visual instruction dataset via self-instruction from GPT-4V |
- What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning - data) | A synthetic instruction dataset for complex visual reasoning |
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding - PLUG/mPLUG-DocOwl/tree/main/DocLLM) | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
- Visual Instruction Tuning with Polite Flamingo - 1M/tree/main) | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding - tuning dataset for Text-rich Image Understanding |
- MotionGPT: Human Motion as a Foreign Language - tuning dataset including multiple human motion-related tasks |
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration - LLM/tree/main/data) | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - CAIR/cc_sbu_align) | A multimodal aligned dataset for improving the model's usability and generation fluency |
- Visual Instruction Tuning - Instruct-150K) | Multimodal instruction-following data generated by GPT |
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs - quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
- SVIT: Scaling up Visual Instruction Tuning - scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day - Med#llava-med-dataset) | A large-scale, broad-coverage biomedical instruction-following dataset |
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction - related instruction datasets |
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst - chatbridge.github.io/) | Multimodal instruction tuning dataset covering 16 multimodal tasks |
- DetGPT: Detect What You Need via Reasoning - tuning dataset with 5000 images and around 30,000 query-answer pairs |
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering - zhang.github.io/PMC-VQA/) | Large-scale medical visual question-answering dataset |
- VideoChat: Chat-Centric Video Understanding - centric multimodal instruction dataset |
- LMEye: An Interactive Perception Network for Large Language Models - modal instruction-tuning dataset |
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning - NLP/MultiInstruct) | The first multimodal instruction tuning benchmark dataset |
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation - Dataset) | A multi-modal instruction-tuning dataset for chart understanding and generation |
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages - LLM) | Chinese multimodal instruction dataset |
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models - oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder) | 100K high-quality video instruction dataset |
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models - VQA) | A dataset designed to teach models to refrain from answering unanswerable questions |
- ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model - 4V) | Vision and language caption and instruction dataset generated by GPT4V |
- Visually Dehallucinative Instruction Generation - aligned visual instruction dataset |
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts - scale 3D instruction tuning dataset |
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
- M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning - scale, broad-coverage multimodal instruction tuning dataset |
- ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
-
Datasets of Multimodal RLHF
- Silkie: Preference Distillation for Large Visual Language Models - language feedback dataset annotated by AI |
-
Datasets of In-Context Learning
- MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning - image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning - it/README.md) | Multimodal in-context instruction dataset|
Categories
- Multimodal Instruction Tuning (155)
- Datasets of Pre-Training for Alignment (102)
- Benchmarks for Evaluation (85)
- Multimodal Hallucination (59)
- Datasets of Multimodal Instruction Tuning (56)
- Evaluation (44)
- LLM-Aided Visual Reasoning (40)
- Foundation Models (28)
- Multimodal In-Context Learning (21)
- Others (21)
- Multimodal Chain-of-Thought (16)
- Our MLLM works (7)
- Datasets of Multimodal Chain-of-Thought (6)
- Multimodal RLHF (4)
- Datasets of In-Context Learning (4)
- Datasets of Multimodal RLHF (3)