{"id":13429445,"url":"https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models","last_synced_at":"2025-03-16T03:31:34.069Z","repository":{"id":167543040,"uuid":"642642539","full_name":"BradyFU/Awesome-Multimodal-Large-Language-Models","owner":"BradyFU","description":":sparkles::sparkles:Latest Advances on Multimodal Large Language Models","archived":false,"fork":false,"pushed_at":"2025-03-14T04:44:57.000Z","size":84241,"stargazers_count":14265,"open_issues_count":72,"forks_count":918,"subscribers_count":270,"default_branch":"main","last_synced_at":"2025-03-15T08:47:13.880Z","etag":null,"topics":["chain-of-thought","in-context-learning","instruction-following","instruction-tuning","large-language-models","large-vision-language-model","large-vision-language-models","multi-modality","multimodal-chain-of-thought","multimodal-in-context-learning","multimodal-instruction-tuning","multimodal-large-language-models","visual-instruction-tuning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BradyFU.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-19T03:02:29.000Z","updated_at":"2025-03-15T07:52:34.000Z","dependencies_parsed_at":"2023-10-15T04:45:58.692Z","dependency_job_id":"bdfd9cf1-7be4-4c70-ac77-4007b65179d6","html_url":"https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models","commit_stats":null,"previous_names":["bradyfu/awesome-visual-instruction-tuning"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/BradyFU%2FAwesome-Multimodal-Large-Language-Models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BradyFU%2FAwesome-Multimodal-Large-Language-Models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BradyFU%2FAwesome-Multimodal-Large-Language-Models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BradyFU%2FAwesome-Multimodal-Large-Language-Models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BradyFU","download_url":"https://codeload.github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822312,"owners_count":20353496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chain-of-thought","in-context-learning","instruction-following","instruction-tuning","large-language-models","large-vision-language-model","large-vision-language-models","multi-modality","multimodal-chain-of-thought","multimodal-in-context-learning","multimodal-instruction-tuning","multimodal-large-language-models","visual-instruction-tuning"],"created_at":"2024-07-31T02:00:39.133Z","updated_at":"2025-03-16T03:31:34.055Z","avatar_url":"https://github.com/BradyFU.png","language":null,"funding_links":[],"categories":["3 Reasoning Tasks","Github resource","Join the Awesome Video Large Language Models Community 🎓🤝","🎭 Multi-modal Testing","🌟 Awesome Lists and Resource Hubs","Articles and Resources","1 Relevant Surveys and 
Links","Other-Awesome-Lists","Others","Repos","Related LLM/LM/FM Resources","NLP","Grounding Datasets","Inbox: Speech-to-text (STT) and spoken content analysis","多模态大模型","Summary","Other Lists","Related Projects","Topics","Multi-modal learning","7. Resources","Machine Learning \u0026 AI","Multimodal, Vision-Language, and Generative AI","Related Awesome Lists","Awesome Surveys"],"sub_categories":["3.7 Multimodal Reasoning","Action Recognition","Popular-LLM","Methods with Potential for DOD","Creative Uses of Generative AI Image Synthesis Tools","网络服务_其他","TeX Lists","🎨Application","Multi-modal LLM","7.1 Related Awesome Lists","Multimodal and Vision-Language Models","Previous Venues"],"readme":"# Awesome-Multimodal-Large-Language-Models\n\n\n## Our MLLM works\n\n🔥🔥🔥 **A Survey on Multimodal Large Language Models**  \n**[Project Page [This Page]](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)** | **[Paper](https://arxiv.org/pdf/2306.13549.pdf)** | :black_nib: **[Citation](./images/bib_survey.txt)** | **[💬 WeChat (MLLM微信交流群，欢迎加入)](./images/wechat-group.png)**\n\nThe first comprehensive survey for Multimodal Large Language Models (MLLMs). :sparkles:  \n\n---\n\n🔥🔥🔥 **VITA: Towards Open-Source Interactive Omni Multimodal LLM**  \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/vita-1.5.jpg\" width=\"60%\" height=\"60%\"\u003e\n\u003c/p\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e [[📽 VITA-1.5 Demo Show! Here We Go! 
🔥](https://youtu.be/tyi6SVFT5mM?si=fkMQCrwa5fVnmEe7)] \u003c/div\u003e\u003c/font\u003e  \n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e [[📖 VITA-1.5 Paper](https://arxiv.org/pdf/2501.01957)] [[🌟 GitHub](https://github.com/VITA-MLLM/VITA)] [[🤖 Basic Demo](https://modelscope.cn/studios/modelscope/VITA1.5_demo)] [[🍎 VITA-1.0](https://vita-home.github.io/)] [[💬 WeChat (微信)](https://github.com/VITA-MLLM/VITA/blob/main/asset/wechat-group.jpg)]\u003c/div\u003e\u003c/font\u003e  \n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e We are excited to introduce the **VITA-1.5**, a more powerful and more real-time version. ✨ \u003c/div\u003e\u003c/font\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e**All codes of VITA-1.5 have been released**! :star2: \u003c/div\u003e\u003c/font\u003e  \n\nYou can experience our [Basic Demo](https://modelscope.cn/studios/modelscope/VITA1.5_demo) on ModelScope directly. The Real-Time Interactive Demo needs to be configured according to the [instructions](https://github.com/VITA-MLLM/VITA?tab=readme-ov-file#-real-time-interactive-demo).\n\n\n---\n\n🔥🔥🔥 **Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**  \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/longvita.jpg\" width=\"80%\" height=\"80%\"\u003e\n\u003c/p\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e [[📖 arXiv Paper](https://arxiv.org/pdf/2502.05177)] [[🌟 GitHub](https://github.com/VITA-MLLM/Long-VITA)]\u003c/div\u003e\u003c/font\u003e  \n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e Process more than **4K frames** or over **1M visual tokens**. State-of-the-art on Video-MME under 20B models!  
✨ \u003c/div\u003e\u003c/font\u003e\n\n\n---\n\n🔥🔥🔥 **MM-RLHF: The Next Step Forward in Multimodal LLM Alignment**  \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/mm-rlhf.jpg\" width=\"60%\" height=\"60%\"\u003e\n\u003c/p\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e [[📖 arXiv Paper](https://arxiv.org/pdf/2502.10391)] [[🌟 GitHub](https://github.com/Kwai-YuanQi/MM-RLHF)] [[📊 MM-RLHF Data](https://huggingface.co/datasets/yifanzhang114/MM-RLHF)] \u003c/div\u003e\u003c/font\u003e  \n\nAlign MLLMs with human preference, including a high-quality dataset, a strong reward model, a new alignment algorithm, and two new benchmarks.✨\n\n\n---\n\n🔥🔥🔥 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs**  \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/mme-survey.jpg\" width=\"60%\" height=\"60%\"\u003e\n\u003c/p\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e [[🍎 Project Page](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Benchmarks)] [[📖 arXiv Paper](https://arxiv.org/pdf/2411.15296)] \u003c/div\u003e\u003c/font\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center' \u003e Jointly introduced by **MME**, **MMBench**, and **LLaVA** teams. ✨ \u003c/div\u003e\u003c/font\u003e\n\n---\n\n🔥🔥🔥 **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**  \n**[Project Page](https://video-mme.github.io/)** | **[Paper](https://arxiv.org/pdf/2405.21075)** | **[GitHub](https://github.com/BradyFU/Video-MME)** | **[Dataset](https://github.com/BradyFU/Video-MME?tab=readme-ov-file#-dataset)** | **[Leaderboard](https://video-mme.github.io/home_page.html#leaderboard)**\n\nWe are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 
🌟  \n\nIt includes short- (\u003c 2min), medium- (4min\\~15min), and long-term (30min\\~60min) videos, ranging from \u003cb\u003e11 seconds to 1 hour\u003c/b\u003e. All data are newly collected and annotated by humans, not from any existing video dataset. ✨ \n\n---\n\n🔥🔥🔥 **MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models**  \n**[Paper](https://arxiv.org/pdf/2306.13394.pdf)** | **[Download](https://huggingface.co/datasets/darkyarding/MME/blob/main/MME_Benchmark_release_version.zip)** | **[Eval Tool](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/tools/eval_tool.zip)** | :black_nib: **[Citation](./images/bib_mme.txt)**\n\nA representative evaluation benchmark for MLLMs. :sparkles:  \n\n---\n\n🔥🔥🔥 **Woodpecker: Hallucination Correction for Multimodal Large Language Models**  \n**[Paper](https://arxiv.org/pdf/2310.16045)** | **[GitHub](https://github.com/BradyFU/Woodpecker)**\n\nThis is the first work to correct hallucination in multimodal large language models. 
:sparkles:  \n\n\n---\n\n\u003cfont size=5\u003e\u003ccenter\u003e\u003cb\u003e Table of Contents \u003c/b\u003e \u003c/center\u003e\u003c/font\u003e\n- [Awesome Papers](#awesome-papers)\n  - [Multimodal Instruction Tuning](#multimodal-instruction-tuning)\n  - [Multimodal Hallucination](#multimodal-hallucination)\n  - [Multimodal In-Context Learning](#multimodal-in-context-learning)\n  - [Multimodal Chain-of-Thought](#multimodal-chain-of-thought)\n  - [LLM-Aided Visual Reasoning](#llm-aided-visual-reasoning)\n  - [Foundation Models](#foundation-models)\n  - [Evaluation](#evaluation)\n  - [Multimodal RLHF](#multimodal-rlhf)\n  - [Others](#others)\n- [Awesome Datasets](#awesome-datasets)\n  - [Datasets of Pre-Training for Alignment](#datasets-of-pre-training-for-alignment)\n  - [Datasets of Multimodal Instruction Tuning](#datasets-of-multimodal-instruction-tuning)\n  - [Datasets of In-Context Learning](#datasets-of-in-context-learning)\n  - [Datasets of Multimodal Chain-of-Thought](#datasets-of-multimodal-chain-of-thought)\n  - [Datasets of Multimodal RLHF](#datasets-of-multimodal-rlhf)\n  - [Benchmarks for Evaluation](#benchmarks-for-evaluation)\n  - [Others](#others-1)\n---\n\n# Awesome Papers\n\n## Multimodal Instruction Tuning\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https://arxiv.org/pdf/2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [Demo](https://huggingface.co/spaces/microsoft/phi-4-multimodal) | \n| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2.5-VL.svg?style=social\u0026label=Star) \u003cbr\u003e [**Qwen2.5-VL Technical Report**](https://arxiv.org/pdf/2502.13923) \u003cbr\u003e | arXiv | 2025-02-19 | [Github](https://github.com/QwenLM/Qwen2.5-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2.5-VL) |\n| 
![Star](https://img.shields.io/github/stars/VITA-MLLM/Long-VITA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**](https://arxiv.org/pdf/2502.05177) \u003cbr\u003e | arXiv | 2025-02-07 | [Github](https://github.com/VITA-MLLM/Long-VITA) | - |\n| ![Star](https://img.shields.io/github/stars/baichuan-inc/Baichuan-Omni-1.5.svg?style=social\u0026label=Star) \u003cbr\u003e [**Baichuan-Omni-1.5 Technical Report**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/blob/main/baichuan_omni_1_5.pdf) \u003cbr\u003e | Tech Report | 2025-01-26 | [Github](https://github.com/baichuan-inc/Baichuan-Omni-1.5) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/mbzuai-oryx/LlamaV-o1.svg?style=social\u0026label=Star) \u003cbr\u003e [**LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs**](https://arxiv.org/pdf/2501.06186) \u003cbr\u003e | arXiv | 2025-01-10 | [Github](https://github.com/mbzuai-oryx/LlamaV-o1) | - |\n| ![Star](https://img.shields.io/github/stars/VITA-MLLM/VITA.svg?style=social\u0026label=Star) \u003cbr\u003e [**VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**](https://arxiv.org/pdf/2501.01957) \u003cbr\u003e | arXiv | 2025-01-03 | [Github](https://github.com/VITA-MLLM/VITA) | - |\n| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2-VL.svg?style=social\u0026label=Star) \u003cbr\u003e [**QVQ: To See the World with Wisdom**](https://qwenlm.github.io/blog/qvq-72b-preview/) \u003cbr\u003e | Qwen | 2024-12-25 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://qwenlm.github.io/blog/qvq-72b-preview/) |\n| ![Star](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-VL2.svg?style=social\u0026label=Star) \u003cbr\u003e [**DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**](https://arxiv.org/pdf/2412.10302) \u003cbr\u003e | arXiv | 2024-12-13 | 
[Github](https://github.com/deepseek-ai/DeepSeek-VL2) | - |\n| [**Apollo: An Exploration of Video Understanding in Large Multimodal Models**](https://arxiv.org/pdf/2412.10360) | arXiv | 2024-12-13 | - | - |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions**](https://arxiv.org/pdf/2412.09596) \u003cbr\u003e | arXiv | 2024-12-12 | [Github](https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-OmniLive) | Local Demo |\n| [**StreamChat: Chatting with Streaming Video**](https://arxiv.org/pdf/2412.08646) | arXiv | 2024-12-11 | Coming soon | - |\n| [**CompCap: Improving Multimodal Large Language Models with Composite Captions**](https://arxiv.org/pdf/2412.05243) | arXiv | 2024-12-06 | - | - |\n| ![Star](https://img.shields.io/github/stars/gls0425/LinVT.svg?style=social\u0026label=Star) \u003cbr\u003e [**LinVT: Empower Your Image-level Large Language Model to Understand Videos**](https://arxiv.org/pdf/2412.05185) \u003cbr\u003e | arXiv | 2024-12-06 | [Github](https://github.com/gls0425/LinVT) | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social\u0026label=Star) \u003cbr\u003e [**Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**](https://arxiv.org/pdf/2412.05271) \u003cbr\u003e | arXiv | 2024-12-06 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |\n| ![Star](https://img.shields.io/github/stars/NVlabs/VILA.svg?style=social\u0026label=Star) \u003cbr\u003e [**NVILA: Efficient Frontier Visual Language Models**](https://arxiv.org/pdf/2412.04468) \u003cbr\u003e | arXiv | 2024-12-05 | [Github](https://github.com/NVlabs/VILA) | [Demo](https://vila.mit.edu) |\n| 
![Star](https://img.shields.io/github/stars/inst-it/inst-it.svg?style=social\u0026label=Star) \u003cbr\u003e [**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https://arxiv.org/pdf/2412.03565) \u003cbr\u003e | arXiv | 2024-12-04 | [Github](https://github.com/inst-it/inst-it) | - |\n| ![Star](https://img.shields.io/github/stars/VITA-MLLM/Sparrow.svg?style=social\u0026label=Star) \u003cbr\u003e [**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https://arxiv.org/pdf/2411.19951) \u003cbr\u003e | arXiv | 2024-11-29 | [Github](https://github.com/VITA-MLLM/Sparrow) | - |\n| ![Star](https://img.shields.io/github/stars/TimeMarker-LLM/TimeMarker.svg?style=social\u0026label=Star) \u003cbr\u003e [**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https://arxiv.org/pdf/2411.18211) \u003cbr\u003e | arXiv | 2024-11-27 | [Github](https://github.com/TimeMarker-LLM/TimeMarker/) | - |\n| ![Star](https://img.shields.io/github/stars/IDEA-Research/ChatRex.svg?style=social\u0026label=Star) \u003cbr\u003e [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363) \u003cbr\u003e | arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/Vision-CAIR/LongVU.svg?style=social\u0026label=Star) \u003cbr\u003e [**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https://arxiv.org/pdf/2410.17434) \u003cbr\u003e | arXiv | 2024-10-22 | [Github](https://github.com/Vision-CAIR/LongVU) | [Demo](https://huggingface.co/spaces/Vision-CAIR/LongVU) |\n| ![Star](https://img.shields.io/github/stars/shikiw/Modality-Integration-Rate.svg?style=social\u0026label=Star) \u003cbr\u003e [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration 
Rate**](https://arxiv.org/pdf/2410.07167) \u003cbr\u003e | arXiv | 2024-10-09 | [Github](https://github.com/shikiw/Modality-Integration-Rate) | - |\n| ![Star](https://img.shields.io/github/stars/rese1f/aurora.svg?style=social\u0026label=Star) \u003cbr\u003e [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https://arxiv.org/pdf/2410.03051) \u003cbr\u003e | arXiv | 2024-10-04 | [Github](https://github.com/rese1f/aurora) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/emova-ollm/EMOVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https://arxiv.org/pdf/2409.18042) \u003cbr\u003e | CVPR | 2024-09-26 | [Github](https://github.com/emova-ollm/EMOVA) | [Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) | \n| [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https://arxiv.org/pdf/2409.17146) | arXiv | 2024-09-25 | [Huggingface](https://huggingface.co/allenai/MolmoE-1B-0924) | [Demo](https://molmo.allenai.org) |\n| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2-VL.svg?style=social\u0026label=Star) \u003cbr\u003e [**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https://arxiv.org/pdf/2409.12191) \u003cbr\u003e | arXiv | 2024-09-18 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2-VL) |\n| ![Star](https://img.shields.io/github/stars/IDEA-FinAI/ChartMoE.svg?style=social\u0026label=Star) \u003cbr\u003e [**ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding**](https://arxiv.org/pdf/2409.03277) \u003cbr\u003e | ICLR | 2024-09-05 | [Github](https://github.com/IDEA-FinAI/ChartMoE) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/FreedomIntelligence/LongLLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via 
Hybrid Architecture**](https://arxiv.org/pdf/2409.02889) \u003cbr\u003e | arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - | \n| ![Star](https://img.shields.io/github/stars/NVlabs/Eagle.svg?style=social\u0026label=Star) \u003cbr\u003e [**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https://arxiv.org/pdf/2408.15998) \u003cbr\u003e | arXiv | 2024-08-28 | [Github](https://github.com/NVlabs/Eagle) | [Demo](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat) |\n| ![Star](https://img.shields.io/github/stars/shufangxun/LLaVA-MoD.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https://arxiv.org/pdf/2408.15881) \u003cbr\u003e | arXiv | 2024-08-28 | [Github](https://github.com/shufangxun/LLaVA-MoD) | - |\n| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social\u0026label=Star) \u003cbr\u003e [**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https://www.arxiv.org/pdf/2408.04840) \u003cbr\u003e | arXiv | 2024-08-09 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | - |\n| ![Star](https://img.shields.io/github/stars/VITA-MLLM/VITA.svg?style=social\u0026label=Star) \u003cbr\u003e [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211) \u003cbr\u003e | arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - | \n| ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-NeXT.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326) \u003cbr\u003e | arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) | \n| ![Star](https://img.shields.io/github/stars/OpenBMB/MiniCPM-V.svg?style=social\u0026label=Star) \u003cbr\u003e [**MiniCPM-V: A GPT-4V Level MLLM on Your 
Phone**](https://arxiv.org/pdf/2408.01800) \u003cbr\u003e | arXiv | 2024-08-03 | [Github](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) |\n| [**VILA^2: VILA Augmented VILA**](https://arxiv.org/pdf/2407.17453) | arXiv | 2024-07-24 | - | - |\n| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https://arxiv.org/pdf/2407.15841) | arXiv | 2024-07-22 | - | - |\n| [**EVLM: An Efficient Vision-Language Model for Visual Understanding**](https://arxiv.org/pdf/2407.14177) | arXiv | 2024-07-19 | - | - |\n| ![Star](https://img.shields.io/github/stars/jiyt17/IDA-VLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model**](https://arxiv.org/pdf/2407.07577) \u003cbr\u003e | arXiv | 2024-07-10 | [Github](https://github.com/jiyt17/IDA-VLM) | - |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**](https://arxiv.org/pdf/2407.03320) \u003cbr\u003e | arXiv | 2024-07-03 | [Github](https://github.com/InternLM/InternLM-XComposer) | [Demo](https://openxlab.org.cn/apps/detail/WillowBreeze/InternLM-XComposer) |\n| ![Star](https://img.shields.io/github/stars/lxtGH/OMG-Seg.svg?style=social\u0026label=Star) \u003cbr\u003e [**OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**](https://arxiv.org/pdf/2406.19389) \u003cbr\u003e | arXiv | 2024-06-27 | [Github](https://github.com/lxtGH/OMG-Seg) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/ZZZHANG-jx/DocKylin.svg?style=social\u0026label=Star) \u003cbr\u003e [**DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming**](https://arxiv.org/pdf/2406.19101) \u003cbr\u003e | AAAI | 2024-06-27 | 
[Github](https://github.com/ZZZHANG-jx/DocKylin) | - |\n| ![Star](https://img.shields.io/github/stars/cambrian-mllm/cambrian.svg?style=social\u0026label=Star) \u003cbr\u003e [**Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs**](https://arxiv.org/pdf/2406.16860) \u003cbr\u003e | arXiv | 2024-06-24 | [Github](https://github.com/cambrian-mllm/cambrian) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/EvolvingLMMs-Lab/LongVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Long Context Transfer from Language to Vision**](https://arxiv.org/pdf/2406.16852) \u003cbr\u003e | arXiv | 2024-06-24 | [Github](https://github.com/EvolvingLMMs-Lab/LongVA) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social\u0026label=Star) \u003cbr\u003e [**video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**](https://arxiv.org/pdf/2406.15704) \u003cbr\u003e | ICML | 2024-06-22 | [Github](https://github.com/bytedance/SALMONN) | - |\n| ![Star](https://img.shields.io/github/stars/ByungKwanLee/TroL.svg?style=social\u0026label=Star) \u003cbr\u003e [**TroL: Traversal of Layers for Large Language and Vision Models**](https://arxiv.org/pdf/2406.12246) \u003cbr\u003e | EMNLP | 2024-06-18 | [Github](https://github.com/ByungKwanLee/TroL) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/baaivision/EVE.svg?style=social\u0026label=Star) \u003cbr\u003e [**Unveiling Encoder-Free Vision-Language Models**](https://arxiv.org/pdf/2406.11832) \u003cbr\u003e | arXiv | 2024-06-17 | [Github](https://github.com/baaivision/EVE) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/showlab/VideoLLM-online.svg?style=social\u0026label=Star) \u003cbr\u003e [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https://arxiv.org/pdf/2406.11816) \u003cbr\u003e | CVPR | 2024-06-17 | [Github](https://github.com/showlab/VideoLLM-online) | Local Demo |\n| 
![Star](https://img.shields.io/github/stars/wentaoyuan/RoboPoint.svg?style=social\u0026label=Star) \u003cbr\u003e [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721) \u003cbr\u003e | CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) | \n| ![Star](https://img.shields.io/github/stars/wlin-at/CaD-VI) \u003cbr\u003e [**Comparison Visual Instruction Tuning**](https://arxiv.org/abs/2406.09240) \u003cbr\u003e | arXiv | 2024-06-13 | [Github](https://wlin-at.github.io/cad_vi) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/yfzhang114/SliME.svg?style=social\u0026label=Star) \u003cbr\u003e [**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https://arxiv.org/pdf/2406.08487) \u003cbr\u003e | arXiv | 2024-06-12 | [Github](https://github.com/yfzhang114/SliME) | - |\n| ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social\u0026label=Star) \u003cbr\u003e [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/pdf/2406.07476) \u003cbr\u003e | arXiv | 2024-06-11 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/AIDC-AI/Parrot.svg?style=social\u0026label=Star) \u003cbr\u003e [**Parrot: Multilingual Visual Instruction Tuning**](https://arxiv.org/pdf/2406.02539) \u003cbr\u003e | arXiv | 2024-06-04 | [Github](https://github.com/AIDC-AI/Parrot) | - |\n| ![Star](https://img.shields.io/github/stars/AIDC-AI/Ovis.svg?style=social\u0026label=Star) \u003cbr\u003e [**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https://arxiv.org/pdf/2405.20797) \u003cbr\u003e | arXiv | 2024-05-31 | [Github](https://github.com/AIDC-AI/Ovis/) | - |\n| ![Star](https://img.shields.io/github/stars/gordonhu608/MQT-LLaVA.svg?style=social\u0026label=Star) 
\u003cbr\u003e [**Matryoshka Query Transformer for Large Vision-Language Models**](https://arxiv.org/pdf/2405.19315) \u003cbr\u003e | arXiv | 2024-05-29 | [Github](https://github.com/gordonhu608/MQT-LLaVA) | [Demo](https://huggingface.co/spaces/gordonhu/MQT-LLaVA) |\n| ![Star](https://img.shields.io/github/stars/alibaba/conv-llava.svg?style=social\u0026label=Star) \u003cbr\u003e [**ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**](https://arxiv.org/pdf/2405.15738) \u003cbr\u003e | arXiv | 2024-05-24 | [Github](https://github.com/alibaba/conv-llava) | - |\n| ![Star](https://img.shields.io/github/stars/ByungKwanLee/Meteor.svg?style=social\u0026label=Star) \u003cbr\u003e [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574) \u003cbr\u003e | arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | [Demo](https://huggingface.co/spaces/BK-Lee/Meteor) | \n| ![Star](https://img.shields.io/github/stars/YifanXu74/Libra.svg?style=social\u0026label=Star) \u003cbr\u003e [**Libra: Building Decoupled Vision System on Large Language Models**](https://arxiv.org/pdf/2405.10140) \u003cbr\u003e | ICML | 2024-05-16 | [Github](https://github.com/YifanXu74/Libra) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/SHI-Labs/CuMo.svg?style=social\u0026label=Star) \u003cbr\u003e [**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https://arxiv.org/pdf/2405.05949) \u003cbr\u003e | arXiv | 2024-05-09 | [Github](https://github.com/SHI-Labs/CuMo) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social\u0026label=Star) \u003cbr\u003e [**How Far Are We to GPT-4V? 
Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https://arxiv.org/pdf/2404.16821) \u003cbr\u003e | arXiv | 2024-04-25 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |\n| ![Star](https://img.shields.io/github/stars/graphic-design-ai/graphist.svg?style=social\u0026label=Star) \u003cbr\u003e [**Graphic Design with Large Multimodal Model**](https://arxiv.org/pdf/2404.14368) \u003cbr\u003e | arXiv | 2024-04-22 | [Github](https://github.com/graphic-design-ai/graphist) | - |\n| [**BRAVE: Broadening the visual encoding of vision-language models**](https://arxiv.org/abs/2404.07204) | ECCV | 2024-04-10 | - | - |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**](https://arxiv.org/pdf/2404.06512.pdf) \u003cbr\u003e | arXiv | 2024-04-09 | [Github](https://github.com/InternLM/InternLM-XComposer) | [Demo](https://huggingface.co/spaces/Willow123/InternLM-XComposer) |\n| [**Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs**](https://arxiv.org/pdf/2404.05719.pdf) | arXiv | 2024-04-08 | - | - |\n| ![Star](https://img.shields.io/github/stars/boheumd/MA-LMM.svg?style=social\u0026label=Star) \u003cbr\u003e [**MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https://arxiv.org/pdf/2404.05726.pdf) \u003cbr\u003e | CVPR | 2024-04-08 | [Github](https://github.com/boheumd/MA-LMM) | - |\n| ![Star](https://img.shields.io/github/stars/SkyworkAI/Vitron.svg?style=social\u0026label=Star) \u003cbr\u003e [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https://haofei.vip/downloads/papers/Skywork_Vitron_2024.pdf) \u003cbr\u003e | NeurIPS | 2024-04-04 | [Github](https://github.com/SkyworkAI/Vitron) | Local Demo |\n| [**TOMGPT: Reliable 
Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model**](https://dl.acm.org/doi/pdf/10.1145/3654674) | ACM TKDD | 2024-03-28 | - | - |\n| ![Star](https://img.shields.io/github/stars/NVlabs/LITA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LITA: Language Instructed Temporal-Localization Assistant**](https://arxiv.org/pdf/2403.19046) | arXiv | 2024-03-27 | [Github](https://github.com/NVlabs/LITA) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/dvlab-research/MiniGemini.svg?style=social\u0026label=Star) \u003cbr\u003e [**Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models**](https://arxiv.org/pdf/2403.18814.pdf) \u003cbr\u003e | arXiv | 2024-03-27 | [Github](https://github.com/dvlab-research/MiniGemini) | [Demo](http://103.170.5.190:7860) |\n| [**MM1: Methods, Analysis \u0026 Insights from Multimodal LLM Pre-training**](https://arxiv.org/pdf/2403.09611.pdf) | arXiv | 2024-03-14 | - | - |\n| ![Star](https://img.shields.io/github/stars/ByungKwanLee/MoAI.svg?style=social\u0026label=Star) \u003cbr\u003e [**MoAI: Mixture of All Intelligence for Large Language and Vision Models**](https://arxiv.org/pdf/2403.07508.pdf) \u003cbr\u003e | arXiv | 2024-03-12 | [Github](https://github.com/ByungKwanLee/MoAI) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-VL.svg?style=social\u0026label=Star) \u003cbr\u003e [**DeepSeek-VL: Towards Real-World Vision-Language Understanding**](https://arxiv.org/pdf/2403.05525) \u003cbr\u003e | arXiv | 2024-03-08 | [Github](https://github.com/deepseek-ai/DeepSeek-VL) | [Demo](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B) |\n| ![Star](https://img.shields.io/github/stars/Yuliang-Liu/Monkey.svg?style=social\u0026label=Star) \u003cbr\u003e [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https://arxiv.org/pdf/2403.04473.pdf) \u003cbr\u003e | arXiv | 2024-03-07 | [Github](https://github.com/Yuliang-Liu/Monkey) | 
[Demo](http://vlrlab-monkey.xyz:7684) |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/all-seeing.svg?style=social\u0026label=Star) \u003cbr\u003e [**The All-Seeing Project V2: Towards General Relation Comprehension of the Open World**](https://arxiv.org/pdf/2402.19474.pdf) | arXiv | 2024-02-29 | [Github](https://github.com/OpenGVLab/all-seeing) | - |\n| [**GROUNDHOG: Grounding Large Language Models to Holistic Segmentation**](https://arxiv.org/pdf/2402.16846.pdf) | CVPR | 2024-02-26 | Coming soon | Coming soon |\n| ![Star](https://img.shields.io/github/stars/OpenMOSS/AnyGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling**](https://arxiv.org/pdf/2402.12226.pdf) \u003cbr\u003e | arXiv | 2024-02-19 | [Github](https://github.com/OpenMOSS/AnyGPT) | - |\n| ![Star](https://img.shields.io/github/stars/DCDmllm/Momentor.svg?style=social\u0026label=Star) \u003cbr\u003e [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https://arxiv.org/pdf/2402.11435.pdf) \u003cbr\u003e | arXiv | 2024-02-18 | [Github](https://github.com/DCDmllm/Momentor) | - |\n| ![Star](https://img.shields.io/github/stars/FreedomIntelligence/ALLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https://arxiv.org/pdf/2402.11684.pdf) \u003cbr\u003e | arXiv | 2024-02-18 | [Github](https://github.com/FreedomIntelligence/ALLaVA) | [Demo](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) |\n| ![Star](https://img.shields.io/github/stars/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel.svg?style=social\u0026label=Star) \u003cbr\u003e [**CoLLaVO: Crayon Large Language and Vision mOdel**](https://arxiv.org/pdf/2402.11248.pdf) \u003cbr\u003e | arXiv | 2024-02-17 | [Github](https://github.com/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |\n| 
![Star](https://img.shields.io/github/stars/TRI-ML/prismatic-vlms.svg?style=social\u0026label=Star) \u003cbr\u003e [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https://arxiv.org/pdf/2402.07865) \u003cbr\u003e | ICML | 2024-02-12 | [Github](https://github.com/TRI-ML/prismatic-vlms) | - | \n| ![Star](https://img.shields.io/github/stars/THUDM/CogCoM.svg?style=social\u0026label=Star) \u003cbr\u003e [**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https://arxiv.org/pdf/2402.04236.pdf) \u003cbr\u003e | arXiv | 2024-02-06 | [Github](https://github.com/THUDM/CogCoM) | - |\n| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**](https://arxiv.org/pdf/2402.03766.pdf) \u003cbr\u003e | arXiv | 2024-02-06 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - |\n| ![Star](https://img.shields.io/github/stars/WEIYanbin1999/GITA.svg?style=social\u0026label=Star) \u003cbr\u003e [**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https://arxiv.org/pdf/2402.02130) \u003cbr\u003e | NeurIPS | 2024-02-03 | [Github](https://github.com/WEIYanbin1999/GITA/) | - |\n| [**Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study**](https://arxiv.org/pdf/2401.17981.pdf) | arXiv | 2024-01-31 | [Coming soon]() | - |\n| ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-NeXT: Improved reasoning, OCR, and world knowledge**](https://llava-vl.github.io/blog/2024-01-30-llava-next/) | Blog | 2024-01-30 | [Github](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc) |\n| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/MoE-LLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**MoE-LLaVA: Mixture of 
Experts for Large Vision-Language Models**](https://arxiv.org/pdf/2401.15947.pdf) \u003cbr\u003e | arXiv | 2024-01-29 | [Github](https://github.com/PKU-YuanGroup/MoE-LLaVA) | [Demo](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA) |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https://arxiv.org/pdf/2401.16420.pdf) \u003cbr\u003e | arXiv | 2024-01-29 | [Github](https://github.com/InternLM/InternLM-XComposer) | [Demo](https://openxlab.org.cn/apps/detail/WillowBreeze/InternLM-XComposer) |\n| ![Star](https://img.shields.io/github/stars/01-ai/Yi.svg?style=social\u0026label=Star) \u003cbr\u003e [**Yi-VL**](https://github.com/01-ai/Yi/tree/main/VL) \u003cbr\u003e | - | 2024-01-23 | [Github](https://github.com/01-ai/Yi/tree/main/VL) | Local Demo |\n| [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https://arxiv.org/pdf/2401.12168.pdf) | arXiv | 2024-01-22 | - | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/ChartAst.svg?style=social\u0026label=Star) \u003cbr\u003e [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384) \u003cbr\u003e | ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https://arxiv.org/pdf/2312.16886.pdf) \u003cbr\u003e | arXiv | 2023-12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - | \n| ![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternVL: Scaling up Vision 
Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https://arxiv.org/pdf/2312.14238.pdf) \u003cbr\u003e | CVPR | 2023-12-21 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |\n| ![Star](https://img.shields.io/github/stars/CircleRadon/Osprey.svg?style=social\u0026label=Star) \u003cbr\u003e [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https://arxiv.org/pdf/2312.10032.pdf) \u003cbr\u003e | CVPR | 2023-12-15 | [Github](https://github.com/CircleRadon/Osprey) | [Demo](http://111.0.123.204:8000/) |\n| ![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**CogAgent: A Visual Language Model for GUI Agents**](https://arxiv.org/pdf/2312.08914.pdf) \u003cbr\u003e | arXiv | 2023-12-14 | [Github](https://github.com/THUDM/CogVLM) | [Coming soon]() |\n| [**Pixel Aligned Language Models**](https://arxiv.org/pdf/2312.09237.pdf) | arXiv | 2023-12-14 | [Coming soon]() | - |\n| ![Star](https://img.shields.io/github/stars/NVlabs/VILA.svg?style=social\u0026label=Star) \u003cbr\u003e [**VILA: On Pre-training for Visual Language Models**](https://arxiv.org/pdf/2312.07533) \u003cbr\u003e | CVPR | 2023-12-13 | [Github](https://github.com/NVlabs/VILA) | Local Demo |\n| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https://arxiv.org/pdf/2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - | \n| ![Star](https://img.shields.io/github/stars/Ucas-HaoranWei/Vary.svg?style=social\u0026label=Star) \u003cbr\u003e [**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https://arxiv.org/pdf/2312.06109.pdf) \u003cbr\u003e | ECCV | 2023-12-11 | [Github](https://github.com/Ucas-HaoranWei/Vary) | [Demo](http://region-31.seetacloud.com:22701/) |\n| ![Star](https://img.shields.io/github/stars/kakaobrain/honeybee.svg?style=social\u0026label=Star) \u003cbr\u003e [**Honeybee: Locality-enhanced Projector for Multimodal 
LLM**](https://arxiv.org/pdf/2312.06742.pdf) \u003cbr\u003e | CVPR | 2023-12-11 | [Github](https://github.com/kakaobrain/honeybee) | - |\n| [**Gemini: A Family of Highly Capable Multimodal Models**](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| ![Star](https://img.shields.io/github/stars/csuhan/OneLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**OneLLM: One Framework to Align All Modalities with Language**](https://arxiv.org/pdf/2312.03700.pdf) \u003cbr\u003e | arXiv | 2023-12-06 | [Github](https://github.com/csuhan/OneLLM) | [Demo](https://huggingface.co/spaces/csuhan/OneLLM) |\n| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/Lenna.svg?style=social\u0026label=Star) \u003cbr\u003e [**Lenna: Language Enhanced Reasoning Detection Assistant**](https://arxiv.org/pdf/2312.02433.pdf) \u003cbr\u003e | arXiv | 2023-12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - | \n| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https://arxiv.org/pdf/2312.02310.pdf) | arXiv | 2023-12-04 | - | - |\n| ![Star](https://img.shields.io/github/stars/RenShuhuai-Andy/TimeChat.svg?style=social\u0026label=Star) \u003cbr\u003e [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf) \u003cbr\u003e | arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/mu-cai/vip-llava.svg?style=social\u0026label=Star) \u003cbr\u003e [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https://arxiv.org/pdf/2312.00784.pdf) \u003cbr\u003e | CVPR | 2023-12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) | \n| ![Star](https://img.shields.io/github/stars/vlm-driver/Dolphins.svg?style=social\u0026label=Star) \u003cbr\u003e [**Dolphins: Multimodal Language Model for 
Driving**](https://arxiv.org/pdf/2312.00438.pdf) \u003cbr\u003e | arXiv | 2023-12-01 | [Github](https://github.com/vlm-driver/Dolphins) | - |\n| ![Star](https://img.shields.io/github/stars/Open3DA/LL3DA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https://arxiv.org/pdf/2311.18651.pdf) \u003cbr\u003e | arXiv | 2023-11-30 | [Github](https://github.com/Open3DA/LL3DA) | [Coming soon]() |\n| ![Star](https://img.shields.io/github/stars/huangb23/VTimeLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/pdf/2311.18445.pdf) \u003cbr\u003e | arXiv | 2023-11-30 | [Github](https://github.com/huangb23/VTimeLLM/) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-DocOwl.svg?style=social\u0026label=Star) \u003cbr\u003e [**mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model**](https://arxiv.org/pdf/2311.18248.pdf) \u003cbr\u003e | arXiv | 2023-11-30 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl) | - |\n| ![Star](https://img.shields.io/github/stars/dvlab-research/LLaMA-VID.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https://arxiv.org/pdf/2311.17043.pdf) \u003cbr\u003e | arXiv | 2023-11-28 | [Github](https://github.com/dvlab-research/LLaMA-VID) | [Coming soon]() |\n| ![Star](https://img.shields.io/github/stars/dvlab-research/LLMGA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLMGA: Multimodal Large Language Model based Generation Assistant**](https://arxiv.org/pdf/2311.16500.pdf) \u003cbr\u003e | arXiv | 2023-11-27 | [Github](https://github.com/dvlab-research/LLMGA) | [Demo](https://baa55ef8590b623f18.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/tingxueronghua/ChartLlama-code.svg?style=social\u0026label=Star) \u003cbr\u003e 
[**ChartLlama: A Multimodal LLM for Chart Understanding and Generation**](https://arxiv.org/pdf/2311.16483.pdf) \u003cbr\u003e | arXiv | 2023-11-27 | [Github](https://github.com/tingxueronghua/ChartLlama-code) | - |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**ShareGPT4V: Improving Large Multi-Modal Models with Better Captions**](https://arxiv.org/pdf/2311.12793.pdf) \u003cbr\u003e | arXiv | 2023-11-21 | [Github](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) | [Demo](https://huggingface.co/spaces/Lin-Chen/ShareGPT4V-7B) |\n| ![Star](https://img.shields.io/github/stars/rshaojimmy/JiuTian.svg?style=social\u0026label=Star) \u003cbr\u003e [**LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge**](https://arxiv.org/pdf/2311.11860.pdf) \u003cbr\u003e | arXiv | 2023-11-20 | [Github](https://github.com/rshaojimmy/JiuTian) | - |\n| ![Star](https://img.shields.io/github/stars/embodied-generalist/embodied-generalist.svg?style=social\u0026label=Star) \u003cbr\u003e [**An Embodied Generalist Agent in 3D World**](https://arxiv.org/pdf/2311.12871.pdf) \u003cbr\u003e | arXiv | 2023-11-18 | [Github](https://github.com/embodied-generalist/embodied-generalist) | [Demo](https://www.youtube.com/watch?v=mlnjz4eSjB4) |\n| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/Video-LLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https://arxiv.org/pdf/2311.10122.pdf) \u003cbr\u003e | arXiv | 2023-11-16 | [Github](https://github.com/PKU-YuanGroup/Video-LLaVA) | [Demo](https://huggingface.co/spaces/LanguageBind/Video-LLaVA) |\n| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/Chat-UniVi.svg?style=social\u0026label=Star) \u003cbr\u003e [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video 
Understanding**](https://arxiv.org/pdf/2311.08046) \u003cbr\u003e | CVPR | 2023-11-14 | [Github](https://github.com/PKU-YuanGroup/Chat-UniVi) | - |\n| ![Star](https://img.shields.io/github/stars/X2FD/LVIS-INSTRUCT4V.svg?style=social\u0026label=Star) \u003cbr\u003e [**To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning**](https://arxiv.org/pdf/2311.07574.pdf) \u003cbr\u003e | arXiv | 2023-11-13 | [Github](https://github.com/X2FD/LVIS-INSTRUCT4V) | - |\n| ![Star](https://img.shields.io/github/stars/Alpha-VLLM/LLaMA2-Accessory.svg?style=social\u0026label=Star) \u003cbr\u003e [**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https://arxiv.org/pdf/2311.07575.pdf) \u003cbr\u003e | arXiv | 2023-11-13 | [Github](https://github.com/Alpha-VLLM/LLaMA2-Accessory) | [Demo](http://imagebind-llm.opengvlab.com/) |\n| ![Star](https://img.shields.io/github/stars/Yuliang-Liu/Monkey.svg?style=social\u0026label=Star) \u003cbr\u003e [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https://arxiv.org/pdf/2311.06607.pdf) \u003cbr\u003e | CVPR | 2023-11-11 | [Github](https://github.com/Yuliang-Liu/Monkey) | [Demo](http://27.17.184.224:7681/) |\n| ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-Plus-Codebase.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https://arxiv.org/pdf/2311.05437.pdf) \u003cbr\u003e | arXiv | 2023-11-09 | [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) | [Demo](https://llavaplus.ngrok.io/) |\n| ![Star](https://img.shields.io/github/stars/NExT-ChatV/NExT-Chat.svg?style=social\u0026label=Star) \u003cbr\u003e [**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf) \u003cbr\u003e | arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo | \n| 
![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social\u0026label=Star) \u003cbr\u003e [**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration**](https://arxiv.org/pdf/2311.04257.pdf) \u003cbr\u003e | arXiv | 2023-11-07 | [Github](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2) | [Demo](https://modelscope.cn/studios/damo/mPLUG-Owl2/summary) |\n| ![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social\u0026label=Star) \u003cbr\u003e [**OtterHD: A High-Resolution Multi-modality Model**](https://arxiv.org/pdf/2311.04219.pdf) \u003cbr\u003e | arXiv | 2023-11-07 | [Github](https://github.com/Luodian/Otter) | - |\n| [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**](https://arxiv.org/pdf/2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - |\n| ![Star](https://img.shields.io/github/stars/mbzuai-oryx/groundingLMM.svg?style=social\u0026label=Star) \u003cbr\u003e [**GLaMM: Pixel Grounding Large Multimodal Model**](https://arxiv.org/pdf/2311.03356.pdf) \u003cbr\u003e | CVPR | 2023-11-06 | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Demo](https://glamm.mbzuai-oryx.ngrok.app/) |\n| ![Star](https://img.shields.io/github/stars/RUCAIBox/ComVint.svg?style=social\u0026label=Star) \u003cbr\u003e [**What Makes for Good Visual Instructions? 
Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https://arxiv.org/pdf/2311.01487.pdf) \u003cbr\u003e | arXiv | 2023-11-02 | [Github](https://github.com/RUCAIBox/ComVint) | - |\n| ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social\u0026label=Star) \u003cbr\u003e [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https://arxiv.org/pdf/2310.09478.pdf) \u003cbr\u003e | arXiv | 2023-10-14 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social\u0026label=Star) \u003cbr\u003e [**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https://arxiv.org/pdf/2310.13289) \u003cbr\u003e | ICLR | 2023-10-20 | [Github](https://github.com/bytedance/SALMONN) | - |\n| ![Star](https://img.shields.io/github/stars/apple/ml-ferret.svg?style=social\u0026label=Star) \u003cbr\u003e [**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https://arxiv.org/pdf/2310.07704.pdf) \u003cbr\u003e | arXiv | 2023-10-11 | [Github](https://github.com/apple/ml-ferret) | - |\n| ![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**CogVLM: Visual Expert For Large Language Models**](https://arxiv.org/pdf/2311.03079.pdf) \u003cbr\u003e | arXiv | 2023-10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) | \n| ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Improved Baselines with Visual Instruction Tuning**](https://arxiv.org/pdf/2310.03744.pdf) \u003cbr\u003e | arXiv | 2023-10-05 | [Github](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) |\n| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/LanguageBind.svg?style=social\u0026label=Star) \u003cbr\u003e [**LanguageBind: Extending 
Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf) \u003cbr\u003e | ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) | \n| ![Star](https://img.shields.io/github/stars/SY-Xuan/Pink.svg?style=social\u0026label=Star) \u003cbr\u003e [**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - |\n| ![Star](https://img.shields.io/github/stars/thunlp/Muffin.svg?style=social\u0026label=Star) \u003cbr\u003e [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf) \u003cbr\u003e | arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo | \n| [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model**](https://arxiv.org/pdf/2309.16058.pdf) | arXiv | 2023-09-27 | - | - |\n| ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social\u0026label=Star) \u003cbr\u003e [**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https://arxiv.org/pdf/2309.15112.pdf) \u003cbr\u003e | arXiv | 2023-09-26 | [Github](https://github.com/InternLM/InternLM-XComposer) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/RunpeiDong/DreamLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https://arxiv.org/pdf/2309.11499.pdf) \u003cbr\u003e | ICLR | 2023-09-20 | [Github](https://github.com/RunpeiDong/DreamLLM) | [Coming soon]() |\n| [**An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models**](https://arxiv.org/pdf/2309.09958.pdf) | arXiv | 2023-09-18 | [Coming soon]() | - |\n| 
![Star](https://img.shields.io/github/stars/SihengLi99/TextBind.svg?style=social\u0026label=Star) \u003cbr\u003e [**TextBind: Multi-turn Interleaved Multimodal Instruction-following**](https://arxiv.org/pdf/2309.08637.pdf) \u003cbr\u003e | arXiv | 2023-09-14 | [Github](https://github.com/SihengLi99/TextBind) | [Demo](https://ailabnlp.tencent.com/research_demos/textbind/) |\n| ![Star](https://img.shields.io/github/stars/NExT-GPT/NExT-GPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**NExT-GPT: Any-to-Any Multimodal LLM**](https://arxiv.org/pdf/2309.05519.pdf) \u003cbr\u003e | arXiv | 2023-09-11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | [Demo](https://fc7a82a1c76b336b6f.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/UCSC-VLAA/Sight-Beyond-Text.svg?style=social\u0026label=Star) \u003cbr\u003e [**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https://arxiv.org/pdf/2309.07120.pdf) \u003cbr\u003e | arXiv | 2023-09-13 | [Github](https://github.com/UCSC-VLAA/Sight-Beyond-Text) | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social\u0026label=Star) \u003cbr\u003e [**ImageBind-LLM: Multi-modality Instruction Tuning**](https://arxiv.org/pdf/2309.03905.pdf) \u003cbr\u003e | arXiv | 2023-09-07 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://imagebind-llm.opengvlab.com/) |\n| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https://arxiv.org/pdf/2309.02591.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https://img.shields.io/github/stars/OpenRobotLab/PointLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https://arxiv.org/pdf/2308.16911.pdf) \u003cbr\u003e | arXiv | 2023-08-31 | [Github](https://github.com/OpenRobotLab/PointLLM) | [Demo](http://101.230.144.196/) |\n| 
![Star](https://img.shields.io/github/stars/HYPJUDY/Sparkles.svg?style=social\u0026label=Star) \u003cbr\u003e [**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https://arxiv.org/pdf/2308.16463.pdf) \u003cbr\u003e | arXiv | 2023-08-31 | [Github](https://github.com/HYPJUDY/Sparkles) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/opendatalab/MLLM-DataEngine.svg?style=social\u0026label=Star) \u003cbr\u003e [**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https://arxiv.org/pdf/2308.13566.pdf) \u003cbr\u003e | arXiv | 2023-08-25 | [Github](https://github.com/opendatalab/MLLM-DataEngine) | - |\n| ![Star](https://img.shields.io/github/stars/PVIT-official/PVIT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf) \u003cbr\u003e | arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) |  \n| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen-VL.svg?style=social\u0026label=Star) \u003cbr\u003e [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf) \u003cbr\u003e | arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) | \n| ![Star](https://img.shields.io/github/stars/OpenBMB/VisCPM.svg?style=social\u0026label=Star) \u003cbr\u003e [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf) \u003cbr\u003e | ICLR | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) | \n| ![Star](https://img.shields.io/github/stars/icoz69/StableLLAVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized 
Image-Dialogue Data**](https://arxiv.org/pdf/2308.10253.pdf) \u003cbr\u003e | arXiv | 2023-08-20 | [Github](https://github.com/icoz69/StableLLAVA) | - |\n| ![Star](https://img.shields.io/github/stars/mlpc-ucsd/BLIVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https://arxiv.org/pdf/2308.09936.pdf) \u003cbr\u003e | arXiv | 2023-08-19 | [Github](https://github.com/mlpc-ucsd/BLIVA) | [Demo](https://huggingface.co/spaces/mlpc-lab/BLIVA) |\n| ![Star](https://img.shields.io/github/stars/DCDmllm/Cheetah.svg?style=social\u0026label=Star) \u003cbr\u003e [**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https://arxiv.org/pdf/2308.04152.pdf) \u003cbr\u003e | arXiv | 2023-08-08 | [Github](https://github.com/DCDmllm/Cheetah) | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/All-Seeing.svg?style=social\u0026label=Star) \u003cbr\u003e [**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf) \u003cbr\u003e | ICLR | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) | \n| ![Star](https://img.shields.io/github/stars/dvlab-research/LISA.svg?style=social\u0026label=Star) \u003cbr\u003e [**LISA: Reasoning Segmentation via Large Language Model**](https://arxiv.org/pdf/2308.00692.pdf) \u003cbr\u003e | arXiv | 2023-08-01 | [Github](https://github.com/dvlab-research/LISA) | [Demo](http://103.170.5.190:7860) |\n| ![Star](https://img.shields.io/github/stars/rese1f/MovieChat.svg?style=social\u0026label=Star) \u003cbr\u003e [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/pdf/2307.16449.pdf) \u003cbr\u003e | arXiv | 2023-07-31 | [Github](https://github.com/rese1f/MovieChat) | Local Demo |\n| 
![Star](https://img.shields.io/github/stars/UMass-Foundation-Model/3D-LLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**3D-LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf) \u003cbr\u003e | arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - | \n| [**ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning**](https://arxiv.org/pdf/2307.09474.pdf) \u003cbr\u003e | arXiv | 2023-07-18 | - | [Demo](https://chatspot.streamlit.app/) |\n| ![Star](https://img.shields.io/github/stars/magic-research/bubogpt.svg?style=social\u0026label=Star) \u003cbr\u003e [**BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs**](https://arxiv.org/pdf/2307.08581.pdf) \u003cbr\u003e | arXiv | 2023-07-17 | [Github](https://github.com/magic-research/bubogpt) | [Demo](https://huggingface.co/spaces/magicr/BuboGPT) |\n| ![Star](https://img.shields.io/github/stars/BAAI-DCAI/Visual-Instruction-Tuning.svg?style=social\u0026label=Star) \u003cbr\u003e [**SVIT: Scaling up Visual Instruction Tuning**](https://arxiv.org/pdf/2307.04087.pdf) \u003cbr\u003e | arXiv | 2023-07-09 | [Github](https://github.com/BAAI-DCAI/Visual-Instruction-Tuning) | - |\n| ![Star](https://img.shields.io/github/stars/jshilong/GPT4RoI.svg?style=social\u0026label=Star) \u003cbr\u003e [**GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest**](https://arxiv.org/pdf/2307.03601.pdf) \u003cbr\u003e | arXiv | 2023-07-07 | [Github](https://github.com/jshilong/GPT4RoI) | [Demo](http://139.196.83.164:7000/) |\n| ![Star](https://img.shields.io/github/stars/bytedance/lynx-llm.svg?style=social\u0026label=Star) \u003cbr\u003e [**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf) \u003cbr\u003e | arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm)  | - | \n| 
![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-DocOwl.svg?style=social\u0026label=Star) \u003cbr\u003e [**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf) \u003cbr\u003e | arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) | \n| ![Star](https://img.shields.io/github/stars/ChenDelong1999/polite_flamingo.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual Instruction Tuning with Polite Flamingo**](https://arxiv.org/pdf/2307.01003.pdf) \u003cbr\u003e | arXiv | 2023-07-03 | [Github](https://github.com/ChenDelong1999/polite_flamingo) | [Demo](http://clever_flamingo.xiaoice.com/) |\n| ![Star](https://img.shields.io/github/stars/SALT-NLP/LLaVAR.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding**](https://arxiv.org/pdf/2306.17107.pdf) \u003cbr\u003e | arXiv | 2023-06-29 | [Github](https://github.com/SALT-NLP/LLaVAR) | [Demo](https://eba470c07c805702b8.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/shikras/shikra.svg?style=social\u0026label=Star) \u003cbr\u003e [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https://arxiv.org/pdf/2306.15195.pdf) \u003cbr\u003e | arXiv | 2023-06-27 | [Github](https://github.com/shikras/shikra) | [Demo](http://demo.zhaozhang.net:7860/) |\n| ![Star](https://img.shields.io/github/stars/OpenMotionLab/MotionGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**MotionGPT: Human Motion as a Foreign Language**](https://arxiv.org/pdf/2306.14795.pdf) \u003cbr\u003e | arXiv | 2023-06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - | \n| ![Star](https://img.shields.io/github/stars/lyuchenyang/Macaw-LLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text 
Integration**](https://arxiv.org/pdf/2306.09093.pdf) \u003cbr\u003e | arXiv | 2023-06-15 | [Github](https://github.com/lyuchenyang/Macaw-LLM) | [Coming soon]() |\n| ![Star](https://img.shields.io/github/stars/OpenLAMM/LAMM.svg?style=social\u0026label=Star) \u003cbr\u003e [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf) \u003cbr\u003e | arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM) | [Demo](https://huggingface.co/spaces/openlamm/LAMM) | \n| ![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-ChatGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/pdf/2306.05424.pdf) \u003cbr\u003e | arXiv | 2023-06-08 | [Github](https://github.com/mbzuai-oryx/Video-ChatGPT) | [Demo](https://www.ival-mbzuai.com/video-chatgpt) |\n| ![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social\u0026label=Star) \u003cbr\u003e [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https://arxiv.org/pdf/2306.05425.pdf) \u003cbr\u003e | arXiv | 2023-06-08 | [Github](https://github.com/Luodian/Otter) | [Demo](https://otter.cliangyu.com/) |\n| [**M\u003csup\u003e3\u003c/sup\u003eIT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https://arxiv.org/pdf/2306.04387.pdf) | arXiv | 2023-06-07 | - | - | \n| ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/pdf/2306.02858.pdf) \u003cbr\u003e | arXiv | 2023-06-05 | [Github](https://github.com/DAMO-NLP-SG/Video-LLaMA) | [Demo](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) |\n| ![Star](https://img.shields.io/github/stars/microsoft/LLaVA-Med.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-Med: 
Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https://arxiv.org/pdf/2306.00890.pdf) \u003cbr\u003e | arXiv | 2023-06-01 | [Github](https://github.com/microsoft/LLaVA-Med) | - |\n| ![Star](https://img.shields.io/github/stars/StevenGrove/GPT4Tools.svg?style=social\u0026label=Star) \u003cbr\u003e [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf) \u003cbr\u003e | arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://huggingface.co/spaces/stevengrove/GPT4Tools) | \n| ![Star](https://img.shields.io/github/stars/yxuansu/PandaGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**PandaGPT: One Model To Instruction-Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf) \u003cbr\u003e | arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) | \n| ![Star](https://img.shields.io/github/stars/joez17/ChatBridge.svg?style=social\u0026label=Star) \u003cbr\u003e [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/pdf/2305.16103.pdf) \u003cbr\u003e | arXiv | 2023-05-25 | [Github](https://github.com/joez17/ChatBridge) | - | \n| ![Star](https://img.shields.io/github/stars/luogen1996/LaVIN.svg?style=social\u0026label=Star) \u003cbr\u003e [**Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**](https://arxiv.org/pdf/2305.15023.pdf) \u003cbr\u003e | arXiv | 2023-05-24 | [Github](https://github.com/luogen1996/LaVIN) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/OptimalScale/DetGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**DetGPT: Detect What You Need via Reasoning**](https://arxiv.org/pdf/2305.14167.pdf) \u003cbr\u003e | arXiv | 2023-05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) | \n| 
![Star](https://img.shields.io/github/stars/microsoft/Pengi.svg?style=social\u0026label=Star) \u003cbr\u003e [**Pengi: An Audio Language Model for Audio Tasks**](https://arxiv.org/pdf/2305.11834.pdf) \u003cbr\u003e | NeurIPS | 2023-05-19 | [Github](https://github.com/microsoft/Pengi) | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/VisionLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks**](https://arxiv.org/pdf/2305.11175.pdf) \u003cbr\u003e | arXiv | 2023-05-18 | [Github](https://github.com/OpenGVLab/VisionLLM) | - |\n| ![Star](https://img.shields.io/github/stars/YuanGongND/ltu.svg?style=social\u0026label=Star) \u003cbr\u003e [**Listen, Think, and Understand**](https://arxiv.org/pdf/2305.10790.pdf) \u003cbr\u003e | arXiv | 2023-05-18 | [Github](https://github.com/YuanGongND/ltu) | [Demo](https://github.com/YuanGongND/ltu) |\n| ![Star](https://img.shields.io/github/stars/THUDM/VisualGLM-6B.svg?style=social\u0026label=Star) \u003cbr\u003e **VisualGLM-6B** \u003cbr\u003e | - | 2023-05-17 | [Github](https://github.com/THUDM/VisualGLM-6B) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/xiaoman-zhang/PMC-VQA.svg?style=social\u0026label=Star) \u003cbr\u003e [**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf) \u003cbr\u003e | arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - | \n| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social\u0026label=Star) \u003cbr\u003e [**InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**](https://arxiv.org/pdf/2305.06500.pdf) \u003cbr\u003e | arXiv | 2023-05-11 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/Ask-Anything.svg?style=social\u0026label=Star) \u003cbr\u003e 
[**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/pdf/2305.06355.pdf) \u003cbr\u003e | arXiv | 2023-05-10 | [Github](https://github.com/OpenGVLab/Ask-Anything) | [Demo](https://ask.opengvlab.com/) |\n| ![Star](https://img.shields.io/github/stars/open-mmlab/Multimodal-GPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**MultiModal-GPT: A Vision and Language Model for Dialogue with Humans**](https://arxiv.org/pdf/2305.04790.pdf) \u003cbr\u003e | arXiv | 2023-05-08 | [Github](https://github.com/open-mmlab/Multimodal-GPT) | [Demo](https://mmgpt.openmmlab.org.cn/) |\n| ![Star](https://img.shields.io/github/stars/phellonchen/X-LLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf) \u003cbr\u003e | arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - | \n| ![Star](https://img.shields.io/github/stars/YunxinLi/LingCloud.svg?style=social\u0026label=Star) \u003cbr\u003e [**LMEye: An Interactive Perception Network for Large Language Models**](https://arxiv.org/pdf/2305.03701.pdf) \u003cbr\u003e | arXiv | 2023-05-05 | [Github](https://github.com/YunxinLi/LingCloud) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https://arxiv.org/pdf/2304.15010.pdf) \u003cbr\u003e | arXiv | 2023-04-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://llama-adapter.opengvlab.com/) | \n| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social\u0026label=Star) \u003cbr\u003e [**mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**](https://arxiv.org/pdf/2304.14178.pdf) \u003cbr\u003e | arXiv | 2023-04-27 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | 
[Demo](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl) |\n| ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social\u0026label=Star) \u003cbr\u003e [**MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https://arxiv.org/pdf/2304.10592.pdf) \u003cbr\u003e | arXiv | 2023-04-20 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | - |\n| ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual Instruction Tuning**](https://arxiv.org/pdf/2304.08485.pdf) \u003cbr\u003e | NeurIPS | 2023-04-17 | [GitHub](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https://arxiv.org/pdf/2303.16199.pdf) \u003cbr\u003e | ICLR | 2023-03-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](https://huggingface.co/spaces/csuhan/LLaMA-Adapter) |\n| ![Star](https://img.shields.io/github/stars/VT-NLP/MultiInstruct.svg?style=social\u0026label=Star) \u003cbr\u003e [**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf) \u003cbr\u003e | ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - | \n\n## Multimodal Hallucination\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https://img.shields.io/github/stars/1zhou-Wang/MemVR.svg?style=social\u0026label=Star) \u003cbr\u003e [**Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models**](https://arxiv.org/pdf/2410.03577) \u003cbr\u003e | arXiv | 2024-10-04 | [Github](https://github.com/1zhou-Wang/MemVR) | - |\n| 
![Star](https://img.shields.io/github/stars/nickjiang2378/vl-interp.svg?style=social\u0026label=Star) \u003cbr\u003e [**Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations**](https://arxiv.org/pdf/2410.02762) \u003cbr\u003e | arXiv | 2024-10-03 | [Github](https://github.com/nickjiang2378/vl-interp/) | - |\n| [**FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs**](https://arxiv.org/pdf/2409.13612) | arXiv | 2024-09-20 | [Link](https://anonymous.4open.science/r/FIHA-45BB) | - | \n| [**Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation**](https://arxiv.org/pdf/2408.00555) | arXiv | 2024-08-01 | - | - |\n| ![Star](https://img.shields.io/github/stars/LALBJ/PAI.svg?style=social\u0026label=Star) \u003cbr\u003e [**Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs**](https://arxiv.org/pdf/2407.21771) \u003cbr\u003e | ECCV | 2024-07-31 | [Github](https://github.com/LALBJ/PAI) | - |\n| ![Star](https://img.shields.io/github/stars/mrwu-mac/R-Bench.svg?style=social\u0026label=Star) \u003cbr\u003e [**Evaluating and Analyzing Relationship Hallucinations in LVLMs**](https://arxiv.org/pdf/2406.16449) \u003cbr\u003e | ICML | 2024-06-24 | [Github](https://github.com/mrwu-mac/R-Bench) | - |\n| ![Star](https://img.shields.io/github/stars/Lackel/AGLA.svg?style=social\u0026label=Star) \u003cbr\u003e [**AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention**](https://arxiv.org/pdf/2406.12718) \u003cbr\u003e | arXiv | 2024-06-18 | [Github](https://github.com/Lackel/AGLA) | - |\n| [**CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models**](https://arxiv.org/pdf/2406.01920) | arXiv | 2024-06-04 | [Coming soon]() | - |\n| [**Mitigating Object Hallucination via Data Augmented Contrastive Tuning**](https://arxiv.org/pdf/2405.18654) | 
arXiv | 2024-05-28 | [Coming soon]() | - |\n| [**VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap**](https://arxiv.org/pdf/2405.15683) | arXiv | 2024-05-24 | [Coming soon]() | - |\n| [**Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback**](https://arxiv.org/pdf/2404.14233.pdf) | arXiv | 2024-04-22 | - | - |\n| [**Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding**](https://arxiv.org/pdf/2403.18715.pdf) | arXiv | 2024-03-27 | - | - |\n| ![Star](https://img.shields.io/github/stars/IVY-LVLM/Counterfactual-Inception.svg?style=social\u0026label=Star) \u003cbr\u003e [**What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models**](https://arxiv.org/pdf/2403.13513.pdf) \u003cbr\u003e | arXiv | 2024-03-20 | [Github](https://github.com/IVY-LVLM/Counterfactual-Inception) | - |\n| [**Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization**](https://arxiv.org/pdf/2403.08730.pdf) | arXiv | 2024-03-13 | - | - |\n| ![Star](https://img.shields.io/github/stars/yfzhang114/LLaVA-Align.svg?style=social\u0026label=Star) \u003cbr\u003e [**Debiasing Multimodal Large Language Models**](https://arxiv.org/pdf/2403.05262) \u003cbr\u003e | arXiv | 2024-03-08 | [Github](https://github.com/yfzhang114/LLaVA-Align) | - |\n| ![Star](https://img.shields.io/github/stars/BillChan226/HALC.svg?style=social\u0026label=Star) \u003cbr\u003e [**HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding**](https://arxiv.org/pdf/2403.00425.pdf) \u003cbr\u003e | arXiv | 2024-03-01 | [Github](https://github.com/BillChan226/HALC) | - |\n| [**IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding**](https://arxiv.org/pdf/2402.18476.pdf) | arXiv | 2024-02-28 | - | - |\n| 
![Star](https://img.shields.io/github/stars/yuezih/less-is-more.svg?style=social\u0026label=Star) \u003cbr\u003e [**Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective**](https://arxiv.org/pdf/2402.14545.pdf) \u003cbr\u003e | arXiv | 2024-02-22 | [Github](https://github.com/yuezih/less-is-more) | - |\n| ![Star](https://img.shields.io/github/stars/Hyperwjf/LogicCheckGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models**](https://arxiv.org/pdf/2402.11622.pdf) \u003cbr\u003e | arXiv | 2024-02-18 | [Github](https://github.com/Hyperwjf/LogicCheckGPT) | - | \n| ![Star](https://img.shields.io/github/stars/MasaiahHan/CorrelationQA.svg?style=social\u0026label=Star) \u003cbr\u003e [**The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs**](https://arxiv.org/pdf/2402.03757.pdf) \u003cbr\u003e | arXiv | 2024-02-06 | [Github](https://github.com/MasaiahHan/CorrelationQA) | - |\n| ![Star](https://img.shields.io/github/stars/OpenKG-ORG/EasyDetect.svg?style=social\u0026label=Star) \u003cbr\u003e [**Unified Hallucination Detection for Multimodal Large Language Models**](https://arxiv.org/pdf/2402.03190.pdf) \u003cbr\u003e | arXiv | 2024-02-05 | [Github](https://github.com/OpenKG-ORG/EasyDetect) | - |\n| [**A Survey on Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2402.00253.pdf) | arXiv | 2024-02-01 | - | - |\n| [**Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models**](https://arxiv.org/pdf/2401.09861.pdf) | arXiv | 2024-01-18 | - | - |\n| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-HalOwl.svg?style=social\u0026label=Star) \u003cbr\u003e [**Hallucination Augmented Contrastive Learning for Multimodal Large Language Model**](https://arxiv.org/pdf/2312.06968.pdf) \u003cbr\u003e | arXiv | 2023-12-12 | [Github](https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl) | - 
|\n| ![Star](https://img.shields.io/github/stars/assafbk/mocha_code.svg?style=social\u0026label=Star) \u003cbr\u003e [**MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations**](https://arxiv.org/pdf/2312.03631.pdf) \u003cbr\u003e | arXiv | 2023-12-06 | [Github](https://github.com/assafbk/mocha_code) | - |\n| ![Star](https://img.shields.io/github/stars/Anonymousanoy/FOHE.svg?style=social\u0026label=Star) \u003cbr\u003e [**Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites**](https://arxiv.org/pdf/2312.01701.pdf) \u003cbr\u003e | arXiv | 2023-12-04 | [Github](https://github.com/Anonymousanoy/FOHE) | - |\n| ![Star](https://img.shields.io/github/stars/RLHF-V/RLHF-V.svg?style=social\u0026label=Star) \u003cbr\u003e [**RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback**](https://arxiv.org/pdf/2312.00849.pdf) \u003cbr\u003e | arXiv | 2023-12-01 | [Github](https://github.com/RLHF-V/RLHF-V) | [Demo](http://120.92.209.146:8081/) |\n| ![Star](https://img.shields.io/github/stars/shikiw/OPERA.svg?style=social\u0026label=Star) \u003cbr\u003e [**OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation**](https://arxiv.org/pdf/2311.17911.pdf) \u003cbr\u003e | CVPR | 2023-11-29 | [Github](https://github.com/shikiw/OPERA) | - |\n| ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VCD.svg?style=social\u0026label=Star) \u003cbr\u003e [**Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/pdf/2311.16922.pdf) \u003cbr\u003e | CVPR | 2023-11-28 | [Github](https://github.com/DAMO-NLP-SG/VCD) | - |\n| [**Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization**](https://arxiv.org/pdf/2311.16839.pdf) | arXiv | 2023-11-28 | [Github](https://github.com/opendatalab/HA-DPO) | [Coming soon]() 
|\n| [**Mitigating Hallucination in Visual Language Models with Visual Supervision**](https://arxiv.org/pdf/2311.16479.pdf) | arXiv | 2023-11-27 | - | - |\n| ![Star](https://img.shields.io/github/stars/Yuqifan1117/HalluciDoctor.svg?style=social\u0026label=Star) \u003cbr\u003e [**HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data**](https://arxiv.org/pdf/2311.13614.pdf) \u003cbr\u003e | arXiv | 2023-11-22 | [Github](https://github.com/Yuqifan1117/HalluciDoctor) | - |\n| ![Star](https://img.shields.io/github/stars/junyangwang0410/AMBER.svg?style=social\u0026label=Star) \u003cbr\u003e [**An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation**](https://arxiv.org/pdf/2311.07397.pdf) \u003cbr\u003e | arXiv | 2023-11-13 | [Github](https://github.com/junyangwang0410/AMBER) | - |\n| ![Star](https://img.shields.io/github/stars/bcdnlp/FAITHSCORE.svg?style=social\u0026label=Star) \u003cbr\u003e [**FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models**](https://arxiv.org/pdf/2311.01477.pdf) \u003cbr\u003e | arXiv | 2023-11-02 | [Github](https://github.com/bcdnlp/FAITHSCORE) | - |\n| ![Star](https://img.shields.io/github/stars/BradyFU/Woodpecker.svg?style=social\u0026label=Star) \u003cbr\u003e [**Woodpecker: Hallucination Correction for Multimodal Large Language Models**](https://arxiv.org/pdf/2310.16045.pdf) \u003cbr\u003e | arXiv | 2023-10-24 | [Github](https://github.com/BradyFU/Woodpecker) | [Demo](https://deb6a97bae6fab67ae.gradio.live/) |\n| [**Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models**](https://arxiv.org/pdf/2310.05338.pdf) | arXiv | 2023-10-09 | - | - |\n| ![Star](https://img.shields.io/github/stars/bronyayang/HallE_Switch.svg?style=social\u0026label=Star) \u003cbr\u003e [**HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption**](https://arxiv.org/pdf/2310.01779.pdf) 
\u003cbr\u003e | arXiv | 2023-10-03 | [Github](https://github.com/bronyayang/HallE_Switch) | - |\n| ![Star](https://img.shields.io/github/stars/YiyangZhou/LURE.svg?style=social\u0026label=Star) \u003cbr\u003e [**Analyzing and Mitigating Object Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2310.00754.pdf) \u003cbr\u003e | ICLR | 2023-10-01 | [Github](https://github.com/YiyangZhou/LURE) | - |\n| ![Star](https://img.shields.io/github/stars/llava-rlhf/LLaVA-RLHF.svg?style=social\u0026label=Star) \u003cbr\u003e [**Aligning Large Multimodal Models with Factually Augmented RLHF**](https://arxiv.org/pdf/2309.14525.pdf) \u003cbr\u003e | arXiv | 2023-09-25 | [Github](https://github.com/llava-rlhf/LLaVA-RLHF) | [Demo](http://pitt.lti.cs.cmu.edu:7890/) |\n| [**Evaluation and Mitigation of Agnosia in Multimodal Large Language Models**](https://arxiv.org/pdf/2309.04041.pdf) | arXiv | 2023-09-07 | - | - |\n| [**CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning**](https://arxiv.org/pdf/2309.02301.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https://img.shields.io/github/stars/junyangwang0410/HaELM.svg?style=social\u0026label=Star) \u003cbr\u003e [**Evaluation and Analysis of Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2308.15126.pdf) \u003cbr\u003e | arXiv | 2023-08-29 | [Github](https://github.com/junyangwang0410/HaELM) | - |\n| ![Star](https://img.shields.io/github/stars/opendatalab/VIGC.svg?style=social\u0026label=Star) \u003cbr\u003e [**VIGC: Visual Instruction Generation and Correction**](https://arxiv.org/pdf/2308.12714.pdf) \u003cbr\u003e | arXiv | 2023-08-24 | [Github](https://github.com/opendatalab/VIGC) | [Demo](https://opendatalab.github.io/VIGC) | \n| [**Detecting and Preventing Hallucinations in Large Vision Language Models**](https://arxiv.org/pdf/2308.06394.pdf) | arXiv | 2023-08-11 | - | - |\n| 
![Star](https://img.shields.io/github/stars/FuxiaoLiu/LRV-Instruction.svg?style=social\u0026label=Star) \u003cbr\u003e [**Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning**](https://arxiv.org/pdf/2306.14565.pdf) \u003cbr\u003e | ICLR | 2023-06-26 | [Github](https://github.com/FuxiaoLiu/LRV-Instruction) | [Demo](https://7b6590ed039a06475d.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/RUCAIBox/POPE.svg?style=social\u0026label=Star) \u003cbr\u003e [**Evaluating Object Hallucination in Large Vision-Language Models**](https://arxiv.org/pdf/2305.10355.pdf) \u003cbr\u003e | EMNLP | 2023-05-17 | [Github](https://github.com/RUCAIBox/POPE) | - |\n\n## Multimodal In-Context Learning\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**Visual In-Context Learning for Large Vision-Language Models**](https://arxiv.org/pdf/2402.11574.pdf) | arXiv | 2024-02-18 | - | - |\n| ![Star](https://img.shields.io/github/stars/YuanJianhao508/RAG-Driver.svg?style=social\u0026label=Star) \u003cbr\u003e [**RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model**](https://arxiv.org/abs/2402.10828) \u003cbr\u003e | RSS | 2024-02-16 | [Github](https://github.com/YuanJianhao508/RAG-Driver) | - |\n| ![Star](https://img.shields.io/github/stars/UW-Madison-Lee-Lab/CoBSAT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Can MLLMs Perform Text-to-Image In-Context Learning?**](https://arxiv.org/pdf/2402.01293.pdf) \u003cbr\u003e | arXiv | 2024-02-02 | [Github](https://github.com/UW-Madison-Lee-Lab/CoBSAT) | - |\n| ![Star](https://img.shields.io/github/stars/baaivision/Emu.svg?style=social\u0026label=Star) \u003cbr\u003e [**Generative Multimodal Models are In-Context Learners**](https://arxiv.org/pdf/2312.13286) \u003cbr\u003e | CVPR | 2023-12-20 | [Github](https://github.com/baaivision/Emu/tree/main/Emu2) | 
[Demo](https://huggingface.co/spaces/BAAI/Emu2) |\n| [**Hijacking Context in Large Multi-modal Models**](https://arxiv.org/pdf/2312.07553.pdf) | arXiv | 2023-12-07 | - | - |\n| [**Towards More Unified In-context Visual Understanding**](https://arxiv.org/pdf/2312.02520.pdf) | arXiv | 2023-12-05 | - | - | \n| ![Star](https://img.shields.io/github/stars/HaozheZhao/MIC.svg?style=social\u0026label=Star) \u003cbr\u003e [**MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning**](https://arxiv.org/pdf/2309.07915.pdf) \u003cbr\u003e | arXiv | 2023-09-14 | [Github](https://github.com/HaozheZhao/MIC) | [Demo](https://8904cdd23621858859.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/isekai-portal/Link-Context-Learning.svg?style=social\u0026label=Star) \u003cbr\u003e [**Link-Context Learning for Multimodal LLMs**](https://arxiv.org/pdf/2308.07891.pdf) \u003cbr\u003e | arXiv | 2023-08-15 | [Github](https://github.com/isekai-portal/Link-Context-Learning) | [Demo](http://117.144.81.99:20488/) | \n| ![Star](https://img.shields.io/github/stars/mlfoundations/open_flamingo.svg?style=social\u0026label=Star) \u003cbr\u003e [**OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models**](https://arxiv.org/pdf/2308.01390.pdf) \u003cbr\u003e | arXiv | 2023-08-02 | [Github](https://github.com/mlfoundations/open_flamingo) | [Demo](https://huggingface.co/spaces/openflamingo/OpenFlamingo) | \n| ![Star](https://img.shields.io/github/stars/snap-stanford/med-flamingo.svg?style=social\u0026label=Star) \u003cbr\u003e [**Med-Flamingo: a Multimodal Medical Few-shot Learner**](https://arxiv.org/pdf/2307.15189.pdf) \u003cbr\u003e | arXiv | 2023-07-27 | [Github](https://github.com/snap-stanford/med-flamingo) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/baaivision/Emu.svg?style=social\u0026label=Star) \u003cbr\u003e [**Generative Pretraining in Multimodality**](https://arxiv.org/pdf/2307.05222.pdf) 
\u003cbr\u003e | ICLR | 2023-07-11 | [Github](https://github.com/baaivision/Emu/tree/main/Emu1) | [Demo](http://218.91.113.230:9002/) |\n| [**AVIS: Autonomous Visual Information Seeking with Large Language Models**](https://arxiv.org/pdf/2306.08129.pdf) | arXiv | 2023-06-13 | - | - |\n| ![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social\u0026label=Star) \u003cbr\u003e [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https://arxiv.org/pdf/2306.05425.pdf) \u003cbr\u003e | arXiv | 2023-06-08 | [Github](https://github.com/Luodian/Otter) | [Demo](https://otter.cliangyu.com/) |\n| ![Star](https://img.shields.io/github/stars/yongliang-wu/ExploreCfg.svg?style=social\u0026label=Star) \u003cbr\u003e [**Exploring Diverse In-Context Configurations for Image Captioning**](https://arxiv.org/pdf/2305.14800.pdf) \u003cbr\u003e | NeurIPS | 2023-05-24 | [Github](https://github.com/yongliang-wu/ExploreCfg) | - |\n| ![Star](https://img.shields.io/github/stars/lupantech/chameleon-llm.svg?style=social\u0026label=Star) \u003cbr\u003e [**Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models**](https://arxiv.org/pdf/2304.09842.pdf) \u003cbr\u003e | arXiv | 2023-04-19 | [Github](https://github.com/lupantech/chameleon-llm) | [Demo](https://chameleon-llm.github.io/) | \n| ![Star](https://img.shields.io/github/stars/microsoft/JARVIS.svg?style=social\u0026label=Star) \u003cbr\u003e [**HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace**](https://arxiv.org/pdf/2303.17580.pdf) \u003cbr\u003e | arXiv | 2023-03-30 | [Github](https://github.com/microsoft/JARVIS) | [Demo](https://huggingface.co/spaces/microsoft/HuggingGPT) | \n| ![Star](https://img.shields.io/github/stars/microsoft/MM-REACT.svg?style=social\u0026label=Star) \u003cbr\u003e [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https://arxiv.org/pdf/2303.11381.pdf) \u003cbr\u003e | arXiv | 2023-03-20 | 
[Github](https://github.com/microsoft/MM-REACT) | [Demo](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react) |\n| ![Star](https://img.shields.io/github/stars/MAEHCM/ICL-D3IE.svg?style=social\u0026label=Star) \u003cbr\u003e [**ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction**](https://arxiv.org/pdf/2303.05063.pdf) \u003cbr\u003e | ICCV | 2023-03-09 | [Github](https://github.com/MAEHCM/ICL-D3IE) | - |\n| ![Star](https://img.shields.io/github/stars/MILVLG/prophet.svg?style=social\u0026label=Star) \u003cbr\u003e [**Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering**](https://arxiv.org/pdf/2303.01903.pdf) \u003cbr\u003e | CVPR | 2023-03-03 | [Github](https://github.com/MILVLG/prophet) | - |\n| ![Star](https://img.shields.io/github/stars/allenai/visprog.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual Programming: Compositional visual reasoning without training**](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003cbr\u003e | CVPR | 2022-11-18 | [Github](https://github.com/allenai/visprog) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/microsoft/PICa.svg?style=social\u0026label=Star) \u003cbr\u003e [**An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA**](https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974) \u003cbr\u003e | AAAI | 2022-06-28 | [Github](https://github.com/microsoft/PICa) | - |\n| ![Star](https://img.shields.io/github/stars/mlfoundations/open_flamingo.svg?style=social\u0026label=Star) \u003cbr\u003e [**Flamingo: a Visual Language Model for Few-Shot Learning**](https://arxiv.org/pdf/2204.14198.pdf) \u003cbr\u003e | NeurIPS | 2022-04-29 | [Github](https://github.com/mlfoundations/open_flamingo) | [Demo](https://huggingface.co/spaces/dhansmair/flamingo-mini-cap) | \n| [**Multimodal Few-Shot 
Learning with Frozen Language Models**](https://arxiv.org/pdf/2106.13884.pdf) | NeurIPS | 2021-06-25 | - | - |\n\n\n## Multimodal Chain-of-Thought\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https://img.shields.io/github/stars/dongyh20/Insight-V.svg?style=social\u0026label=Star) \u003cbr\u003e [**Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models**](https://arxiv.org/pdf/2411.14432) \u003cbr\u003e | arXiv | 2024-11-21 | [Github](https://github.com/dongyh20/Insight-V) | - |\n| ![Star](https://img.shields.io/github/stars/ggg0919/cantor.svg?style=social\u0026label=Star) \u003cbr\u003e [**Cantor: Inspiring Multimodal Chain-of-Thought of MLLM**](https://arxiv.org/pdf/2404.16033.pdf) \u003cbr\u003e | arXiv | 2024-04-24 | [Github](https://github.com/ggg0919/cantor) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/deepcs233/Visual-CoT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models**](https://arxiv.org/pdf/2403.16999.pdf) \u003cbr\u003e | arXiv | 2024-03-25 | [Github](https://github.com/deepcs233/Visual-CoT) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/chancharikmitra/CCoT.svg?style=social\u0026label=Star) \u003cbr\u003e [**Compositional Chain-of-Thought Prompting for Large Multimodal Models**](https://arxiv.org/pdf/2311.17076) \u003cbr\u003e | CVPR | 2023-11-27 | [Github](https://github.com/chancharikmitra/CCoT) | - |\n| ![Star](https://img.shields.io/github/stars/SooLab/DDCOT.svg?style=social\u0026label=Star) \u003cbr\u003e [**DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models**](https://arxiv.org/pdf/2310.16436.pdf) \u003cbr\u003e | NeurIPS | 2023-10-25 | [Github](https://github.com/SooLab/DDCOT) | - |\n| ![Star](https://img.shields.io/github/stars/shikras/shikra.svg?style=social\u0026label=Star) 
\u003cbr\u003e [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https://arxiv.org/pdf/2306.15195.pdf) \u003cbr\u003e | arXiv | 2023-06-27 | [Github](https://github.com/shikras/shikra) | [Demo](http://demo.zhaozhang.net:7860/) |\n| ![Star](https://img.shields.io/github/stars/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning.svg?style=social\u0026label=Star) \u003cbr\u003e [**Explainable Multimodal Emotion Reasoning**](https://arxiv.org/pdf/2306.15401.pdf) \u003cbr\u003e | arXiv | 2023-06-27 | [Github](https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning) | - | \n| ![Star](https://img.shields.io/github/stars/EmbodiedGPT/EmbodiedGPT_Pytorch.svg?style=social\u0026label=Star) \u003cbr\u003e [**EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought**](https://arxiv.org/pdf/2305.15021.pdf) \u003cbr\u003e | arXiv | 2023-05-24 | [Github](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch) | - | \n| [**Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction**](https://arxiv.org/pdf/2305.13903.pdf) | arXiv | 2023-05-23 | - | - |\n| [**T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering**](https://arxiv.org/pdf/2305.03453.pdf) | arXiv | 2023-05-05 | - | - |\n| ![Star](https://img.shields.io/github/stars/ttengwang/Caption-Anything.svg?style=social\u0026label=Star) \u003cbr\u003e [**Caption Anything: Interactive Image Description with Diverse Multimodal Controls**](https://arxiv.org/pdf/2305.02677.pdf) \u003cbr\u003e | arXiv | 2023-05-04 | [Github](https://github.com/ttengwang/Caption-Anything) | [Demo](https://huggingface.co/spaces/TencentARC/Caption-Anything) |\n| [**Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings**](https://arxiv.org/pdf/2305.02317.pdf) | arXiv | 2023-05-03 | [Coming soon](https://github.com/dannyrose30/VCOT) | - |\n| 
![Star](https://img.shields.io/github/stars/lupantech/chameleon-llm.svg?style=social\u0026label=Star) \u003cbr\u003e [**Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models**](https://arxiv.org/pdf/2304.09842.pdf) \u003cbr\u003e | arXiv | 2023-04-19 | [Github](https://github.com/lupantech/chameleon-llm) | [Demo](https://chameleon-llm.github.io/) | \n| [**Chain of Thought Prompt Tuning in Vision Language Models**](https://arxiv.org/pdf/2304.07919.pdf) | arXiv | 2023-04-16 | [Coming soon]() | - |\n| ![Star](https://img.shields.io/github/stars/microsoft/MM-REACT.svg?style=social\u0026label=Star) \u003cbr\u003e [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https://arxiv.org/pdf/2303.11381.pdf) \u003cbr\u003e | arXiv | 2023-03-20 | [Github](https://github.com/microsoft/MM-REACT) | [Demo](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react) |\n| ![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models**](https://arxiv.org/pdf/2303.04671.pdf) \u003cbr\u003e | arXiv | 2023-03-08 | [Github](https://github.com/microsoft/TaskMatrix) | [Demo](https://huggingface.co/spaces/microsoft/visual_chatgpt) |\n| ![Star](https://img.shields.io/github/stars/amazon-science/mm-cot.svg?style=social\u0026label=Star) \u003cbr\u003e [**Multimodal Chain-of-Thought Reasoning in Language Models**](https://arxiv.org/pdf/2302.00923.pdf) \u003cbr\u003e | arXiv | 2023-02-02 | [Github](https://github.com/amazon-science/mm-cot) | - |\n| ![Star](https://img.shields.io/github/stars/allenai/visprog.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual Programming: Compositional visual reasoning without training**](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003cbr\u003e | CVPR | 2022-11-18 | 
[Github](https://github.com/allenai/visprog) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/lupantech/ScienceQA.svg?style=social\u0026label=Star) \u003cbr\u003e [**Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering**](https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) \u003cbr\u003e | NeurIPS | 2022-09-20 | [Github](https://github.com/lupantech/ScienceQA) | - |\n\n\n## LLM-Aided Visual Reasoning\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https://img.shields.io/github/stars/LaVi-Lab/Visual-Table.svg?style=social\u0026label=Star) \u003cbr\u003e [**Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models**](https://arxiv.org/pdf/2403.18252.pdf) \u003cbr\u003e | arXiv | 2024-03-27 | [Github](https://github.com/LaVi-Lab/Visual-Table) | - |\n| ![Star](https://img.shields.io/github/stars/penghao-wu/vstar.svg?style=social\u0026label=Star) \u003cbr\u003e [**V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs**](https://arxiv.org/pdf/2312.14135.pdf) \u003cbr\u003e | arXiv | 2023-12-21 | [Github](https://github.com/penghao-wu/vstar) | Local Demo |\n| ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-Interactive-Demo.svg?style=social\u0026label=Star) \u003cbr\u003e [**LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing**](https://arxiv.org/pdf/2311.00571.pdf) \u003cbr\u003e | arXiv | 2023-11-01 | [Github](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo) | [Demo](https://6dd3-20-163-117-69.ngrok-free.app/) |\n| [**MM-VID: Advancing Video Understanding with GPT-4V(vision)**](https://arxiv.org/pdf/2310.19773.pdf) | arXiv | 2023-10-30 | - | - |\n| ![Star](https://img.shields.io/github/stars/OpenGVLab/ControlLLM.svg?style=social\u0026label=Star) \u003cbr\u003e [**ControlLLM: Augment Language Models with Tools by 
Searching on Graphs**](https://arxiv.org/pdf/2310.17796.pdf) \u003cbr\u003e | arXiv | 2023-10-26 | [Github](https://github.com/OpenGVLab/ControlLLM) | - |\n| ![Star](https://img.shields.io/github/stars/BradyFU/Woodpecker.svg?style=social\u0026label=Star) \u003cbr\u003e [**Woodpecker: Hallucination Correction for Multimodal Large Language Models**](https://arxiv.org/pdf/2310.16045.pdf) \u003cbr\u003e | arXiv | 2023-10-24 | [Github](https://github.com/BradyFU/Woodpecker) | [Demo](https://deb6a97bae6fab67ae.gradio.live/) |\n| ![Star](https://img.shields.io/github/stars/mindagent/mindagent.svg?style=social\u0026label=Star) \u003cbr\u003e [**MindAgent: Emergent Gaming Interaction**](https://arxiv.org/pdf/2309.09971.pdf) \u003cbr\u003e | arXiv | 2023-09-18 | [Github](https://github.com/mindagent/mindagent) | - | \n| ![Star](https://img.shields.io/github/stars/ContextualAI/lens.svg?style=social\u0026label=Star) \u003cbr\u003e [**Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language**](https://arxiv.org/pdf/2306.16410.pdf) \u003cbr\u003e | arXiv | 2023-06-28 | [Github](https://github.com/ContextualAI/lens) | [Demo](https://lens.contextual.ai/) |\n| [**Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models**](https://arxiv.org/pdf/2306.11732.pdf) | arXiv | 2023-06-15 | - | - |\n| ![Star](https://img.shields.io/github/stars/showlab/assistgpt.svg?style=social\u0026label=Star) \u003cbr\u003e [**AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**](https://arxiv.org/pdf/2306.08640.pdf) \u003cbr\u003e | arXiv | 2023-06-14 | [Github](https://github.com/showlab/assistgpt) | - |\n| [**AVIS: Autonomous Visual Information Seeking with Large Language Models**](https://arxiv.org/pdf/2306.08129.pdf) | arXiv | 2023-06-13 | - | - |\n| ![Star](https://img.shields.io/github/stars/StevenGrove/GPT4Tools.svg?style=social\u0026label=Star) \u003cbr\u003e [**GPT4Tools: Teaching Large 
Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf) \u003cbr\u003e | arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://c60eb7e9400930f31b.gradio.live/) | \n| [**Mindstorms in Natural Language-Based Societies of Mind**](https://arxiv.org/pdf/2305.17066.pdf) | arXiv | 2023-05-26 | - | - | \n| ![Star](https://img.shields.io/github/stars/weixi-feng/LayoutGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**LayoutGPT: Compositional Visual Planning and Generation with Large Language Models**](https://arxiv.org/pdf/2305.15393.pdf) \u003cbr\u003e | arXiv | 2023-05-24 | [Github](https://github.com/weixi-feng/LayoutGPT) | - |\n| ![Star](https://img.shields.io/github/stars/Hxyou/IdealGPT.svg?style=social\u0026label=Star) \u003cbr\u003e [**IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models**](https://arxiv.org/pdf/2305.14985.pdf) \u003cbr\u003e | arXiv | 2023-05-24 | [Github](https://github.com/Hxyou/IdealGPT) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/matrix-alpha/Accountable-Textual-Visual-Chat.svg?style=social\u0026label=Star) \u003cbr\u003e [**Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation**](https://arxiv.org/pdf/2303.05983.pdf) \u003cbr\u003e | arXiv | 2023-05-10 | [Github](https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat) | - |\n| ![Star](https://img.shields.io/github/stars/ttengwang/Caption-Anything.svg?style=social\u0026label=Star) \u003cbr\u003e [**Caption Anything: Interactive Image Description with Diverse Multimodal Controls**](https://arxiv.org/pdf/2305.02677.pdf) \u003cbr\u003e | arXiv | 2023-05-04 | [Github](https://github.com/ttengwang/Caption-Anything) | [Demo](https://huggingface.co/spaces/TencentARC/Caption-Anything) |\n| ![Star](https://img.shields.io/github/stars/lupantech/chameleon-llm.svg?style=social\u0026label=Star) \u003cbr\u003e [**Chameleon: Plug-and-Play 
Compositional Reasoning with Large Language Models**](https://arxiv.org/pdf/2304.09842.pdf) \u003cbr\u003e | arXiv | 2023-04-19 | [Github](https://github.com/lupantech/chameleon-llm) | [Demo](https://chameleon-llm.github.io/) | \n| ![Star](https://img.shields.io/github/stars/microsoft/JARVIS.svg?style=social\u0026label=Star) \u003cbr\u003e [**HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace**](https://arxiv.org/pdf/2303.17580.pdf) \u003cbr\u003e | arXiv | 2023-03-30 | [Github](https://github.com/microsoft/JARVIS) | [Demo](https://huggingface.co/spaces/microsoft/HuggingGPT) | \n| ![Star](https://img.shields.io/github/stars/microsoft/MM-REACT.svg?style=social\u0026label=Star) \u003cbr\u003e [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https://arxiv.org/pdf/2303.11381.pdf) \u003cbr\u003e | arXiv | 2023-03-20 | [Github](https://github.com/microsoft/MM-REACT) | [Demo](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react) |\n| ![Star](https://img.shields.io/github/stars/cvlab-columbia/viper.svg?style=social\u0026label=Star) \u003cbr\u003e [**ViperGPT: Visual Inference via Python Execution for Reasoning**](https://arxiv.org/pdf/2303.08128.pdf) \u003cbr\u003e | arXiv | 2023-03-14 | [Github](https://github.com/cvlab-columbia/viper) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social\u0026label=Star) \u003cbr\u003e [**ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions**](https://arxiv.org/pdf/2303.06594.pdf) \u003cbr\u003e | arXiv | 2023-03-12 | [Github](https://github.com/Vision-CAIR/ChatCaptioner) | Local Demo |\n| [**ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction**](https://arxiv.org/pdf/2303.05063.pdf) | ICCV | 2023-03-09 | - | - |\n| ![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual 
ChatGPT: Talking, Drawing and Editing with Visual Foundation Models**](https://arxiv.org/pdf/2303.04671.pdf) \u003cbr\u003e | arXiv | 2023-03-08 | [Github](https://github.com/microsoft/TaskMatrix) | [Demo](https://huggingface.co/spaces/microsoft/visual_chatgpt) |\n| ![Star](https://img.shields.io/github/stars/ZrrSkywalker/CaFo.svg?style=social\u0026label=Star) \u003cbr\u003e [**Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners**](https://arxiv.org/pdf/2303.02151.pdf) \u003cbr\u003e | CVPR | 2023-03-03 | [Github](https://github.com/ZrrSkywalker/CaFo) | - |\n| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social\u0026label=Star) \u003cbr\u003e [**From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models**](https://arxiv.org/pdf/2212.10846.pdf) \u003cbr\u003e | CVPR | 2022-12-21 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb) | \n| ![Star](https://img.shields.io/github/stars/vishaal27/SuS-X.svg?style=social\u0026label=Star) \u003cbr\u003e [**SuS-X: Training-Free Name-Only Transfer of Vision-Language Models**](https://arxiv.org/pdf/2211.16198.pdf) \u003cbr\u003e | arXiv | 2022-11-28 | [Github](https://github.com/vishaal27/SuS-X) | - |\n| ![Star](https://img.shields.io/github/stars/yangyangyang127/PointCLIP_V2.svg?style=social\u0026label=Star) \u003cbr\u003e [**PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning**](https://arxiv.org/pdf/2211.11682.pdf) \u003cbr\u003e | CVPR | 2022-11-21 | [Github](https://github.com/yangyangyang127/PointCLIP_V2) | - |\n| ![Star](https://img.shields.io/github/stars/allenai/visprog.svg?style=social\u0026label=Star) \u003cbr\u003e [**Visual Programming: Compositional visual reasoning without 
training**](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003cbr\u003e | CVPR | 2022-11-18 | [Github](https://github.com/allenai/visprog) | Local Demo | \n| ![Star](https://img.shields.io/github/stars/google-research/google-research.svg?style=social\u0026label=Star) \u003cbr\u003e [**Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**](https://arxiv.org/pdf/2204.00598.pdf) \u003cbr\u003e | arXiv | 2022-04-01 | [Github](https://github.com/google-research/google-research/tree/master/socraticmodels) | - |\n\n\n## Foundation Models\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA3.svg?style=social\u0026label=Star) \u003cbr\u003e [**VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**](https://arxiv.org/pdf/2501.13106) \u003cbr\u003e | arXiv | 2025-01-22 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA3) | [Demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3) |\n| ![Star](https://img.shields.io/github/stars/baaivision/Emu3.svg?style=social\u0026label=Star) \u003cbr\u003e [**Emu3: Next-Token Prediction is All You Need**](https://arxiv.org/pdf/2409.18869) \u003cbr\u003e | arXiv | 2024-09-27 | [Github](https://github.com/baaivision/Emu3) | Local Demo |\n| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | Meta | 2024-09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) | \n| [**Pixtral-12B**](https://mistral.ai/news/pixtral-12b/) | Mistral | 2024-09-17 | - | - |\n| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social\u0026label=Star) \u003cbr\u003e [**xGen-MM (BLIP-3): A Family of Open Large Multimodal 
Models**](https://arxiv.org/pdf/2","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBradyFU%2FAwesome-Multimodal-Large-Language-Models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBradyFU%2FAwesome-Multimodal-Large-Language-Models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBradyFU%2FAwesome-Multimodal-Large-Language-Models/lists"}