Awesome-Multimodal-LLM
Reading list for Multimodal Large Language Models
https://github.com/vincentlux/Awesome-Multimodal-LLM
Table of Contents
- Recent Advances in Vision Foundation Models
- M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
- LLaVA Instruction 150K (record format sketched after this list)
- Youku-mPLUG 10M
- MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
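The instruction-tuning corpora above are typically distributed as JSON files that pair an image with a multi-turn conversation. The minimal sketch below assumes the commonly published LLaVA-Instruct-150K layout (fields `image`, `conversations`, `from`, `value`); treat the schema and the local filename as assumptions and check the dataset card before relying on them.

```python
import json

# Minimal sketch: iterating over a visual-instruction-tuning file in the
# LLaVA-Instruct-150K style. The field names ("image", "conversations",
# "from", "value") and the local path are assumptions, not guaranteed.
with open("llava_instruct_150k.json") as f:  # hypothetical local copy
    records = json.load(f)

for rec in records[:3]:
    image_file = rec["image"]                # e.g. a COCO image filename
    for turn in rec["conversations"]:
        speaker = turn["from"]               # "human" or "gpt"
        text = turn["value"]                 # "<image>" marks where the image goes
        print(f"{speaker}: {text[:80]}")
    print("image:", image_file)
```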
Survey Papers
- A Survey on Multimodal Large Language Models
- Vision-Language Models for Vision Tasks: A Survey
Core Areas
Multimodal Understanding
- PandaGPT: One Model To Instruction-Follow Them All
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (usage sketch after this list)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Language Is Not All You Need: Aligning Perception with Language Models
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
- Visual Instruction Tuning (LLaVA)
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- Grounding Language Models to Images for Multimodal Inputs and Outputs
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Flamingo: a Visual Language Model for Few-Shot Learning
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
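Several entries in this section (BLIP-2, InstructBLIP, LLaVA) have checkpoints on the Hugging Face Hub. As a minimal sketch, zero-shot visual question answering with BLIP-2 looks roughly like the following, assuming transformers >= 4.27, the Salesforce/blip2-opt-2.7b checkpoint, and enough memory for a 2.7B-parameter model.

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Minimal sketch of zero-shot VQA with BLIP-2 (frozen image encoder + LLM).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw)

# BLIP-2 checkpoints follow a "Question: ... Answer:" prompting convention.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```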
Vision-Centric Understanding
- LISA: Reasoning Segmentation via Large Language Model
- Contextual Object Detection with Multimodal Large Language Models
- KOSMOS-2: Grounding Multimodal Large Language Models to the World
- Fast Segment Anything
- Multi-Modal Classifiers for Open-Vocabulary Object Detection
- Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- SegGPT: Segmenting Everything In Context
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
- Segment Anything (usage sketch after this list)
- Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
- Personalize Segment Anything Model with One Shot
- A Generalist Framework for Panoptic Segmentation of Images and Videos
- A Unified Sequence Interface for Vision Tasks
- Pix2seq: A Language Modeling Framework for Object Detection
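Many of these vision-centric models expose promptable inference APIs. As a minimal sketch, point-prompted segmentation with the Hugging Face port of Segment Anything looks roughly like this, assuming the facebook/sam-vit-base checkpoint and a recent transformers release.

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

# Minimal sketch of point-prompted segmentation with Segment Anything (SAM).
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base").to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# One (x, y) point prompt; SAM proposes several candidate masks for it.
input_points = [[[450, 200]]]
inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Resize the low-resolution mask logits back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print("mask tensor shape:", masks[0].shape, "IoU scores:", outputs.iou_scores.cpu())
```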
Embodied-Centric Understanding
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
- MotionGPT: Human Motion as a Foreign Language
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (planner-over-skills pattern sketched after this list)
- PaLM-E: An Embodied Multimodal Language Model
- Generative Agents: Interactive Simulacra of Human Behavior
- Vision-Language Models as Success Detectors
- TidyBot: Personalized Robot Assistance with Large Language Models
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
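A pattern most of these embodied works share is a language model that decomposes a natural-language instruction into calls against a small library of low-level skills, which the robot stack then executes. The sketch below only illustrates that planner-over-skills structure; it is not code from any listed system, and every name in it (query_llm, SKILLS, pick, place) is a hypothetical placeholder.

```python
from typing import Callable, Dict, List

# Purely illustrative sketch of the "LLM plans, skill primitives execute" pattern
# behind systems like Instruct2Act and TidyBot. All names here are hypothetical.

def pick(obj: str) -> None:
    print(f"[robot] picking up {obj}")             # stand-in for a grasping skill

def place(obj: str, location: str) -> None:
    print(f"[robot] placing {obj} on {location}")  # stand-in for a placement skill

SKILLS: Dict[str, Callable[..., None]] = {"pick": pick, "place": place}

def query_llm(instruction: str) -> List[dict]:
    # Hypothetical: a real system would prompt an LLM to emit a skill sequence
    # (e.g. as JSON). Hard-coded here so the sketch runs without a model.
    return [
        {"skill": "pick", "args": ["red mug"]},
        {"skill": "place", "args": ["red mug", "dish rack"]},
    ]

def execute(instruction: str) -> None:
    for step in query_llm(instruction):
        SKILLS[step["skill"]](*step["args"])

execute("put the red mug in the dish rack")
```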
Domain-Specific Models
Multimodal Evaluation