Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zjr2000/Awesome-Multimodal-Chatbot
Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience.
List: Awesome-Multimodal-Chatbot
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/zjr2000/Awesome-Multimodal-Chatbot
- Owner: zjr2000
- Created: 2023-05-12T07:28:39.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-06-18T08:38:31.000Z (over 1 year ago)
- Last Synced: 2024-05-19T22:23:32.502Z (7 months ago)
- Topics: awesome, chat-application, chatbot, general-ai, instruction-following, instruction-tuning, multimodal, multimodal-assistant, multimodal-dialogue, papers, vision-language
- Homepage:
- Size: 17.6 KB
- Stars: 57
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome_Prompting_Papers_in_Computer_Vision - Awesome Multimodal Assistant - Vision-language instruction tuning and LLM-based chatbots. (More Resources / Vision-Language Instruction Tuning)
- ultimate-awesome - Awesome-Multimodal-Chatbot - Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience. (Other Lists / Monkey C Lists)
- awesome-of-multimodal-dialogue-models - Awesome-Multimodal-Chatbot (Awesome Surveys / Previous Venues)
README
# Awesome-Multimodal-Chatbot [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
**Awesome Multimodal Assistant** is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience. It is designed to assist users in performing various tasks, from simple information retrieval to complex multimedia reasoning.
## Multimodal Instruction Tuning
- **MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**
```arXiv 2022/12``` [[paper]](https://arxiv.org/abs/2212.10773)
- **GPT-4**
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.08774) [[blog]](https://openai.com/research/gpt-4)
- **Visual Instruction Tuning** [![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)](https://github.com/haotian-liu/LLaVA)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.08485) [[code]](https://github.com/haotian-liu/LLaVA) [[project page]](https://llava-vl.github.io/) [[demo]](https://llava.hliu.cc/)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** [![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)](https://github.com/Vision-CAIR/MiniGPT-4)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.10592) [[code]](https://github.com/Vision-CAIR/MiniGPT-4) [[project page]](https://minigpt-4.github.io/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/minigpt4)
- **mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality** [![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)](https://github.com/X-PLUG/mPLUG-Owl)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.14178) [[code]](https://github.com/X-PLUG/mPLUG-Owl) [[demo]](https://modelscope.cn/studios/damo/mPLUG-Owl/summary)
- **LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model** [![Star](https://img.shields.io/github/stars/ZrrSkywalker/LLaMA-Adapter.svg?style=social&label=Star)](https://github.com/ZrrSkywalker/LLaMA-Adapter)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.15010) [[code]](https://github.com/ZrrSkywalker/LLaMA-Adapter) [[demo]](https://huggingface.co/spaces/csuhan/LLaMA-Adapter)
- **Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding** [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[[code]](https://github.com/DAMO-NLP-SG/Video-LLaMA)
- **LMEye: An Interactive Perception Network for Large Language Models** [![Star](https://img.shields.io/github/stars/YunxinLi/LingCloud.svg?style=social&label=Star)](https://github.com/YunxinLi/LingCloud)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.03701) [[code]](https://github.com/YunxinLi/LingCloud)
- **MultiModal-GPT: A Vision and Language Model for Dialogue with Humans** [![Star](https://img.shields.io/github/stars/open-mmlab/Multimodal-GPT.svg?style=social&label=Star)](https://github.com/open-mmlab/Multimodal-GPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.04790) [[code]](https://github.com/open-mmlab/Multimodal-GPT) [[demo]](https://mmgpt.openmmlab.org.cn/)
- **X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages** [![Star](https://img.shields.io/github/stars/phellonchen/X-LLM.svg?style=social&label=Star)](https://github.com/phellonchen/X-LLM)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.04160) [[code]](https://github.com/phellonchen/X-LLM) [[project page]](https://x-llm.github.io/)
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning** [![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social&label=Star)](https://github.com/Luodian/Otter)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.03726) [[code]](https://github.com/Luodian/Otter) [[demo]](https://otter.cliangyu.com/)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** [![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)](https://github.com/salesforce/LAVIS)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.06500) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
- **InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language** [![Star](https://img.shields.io/github/stars/OpenGVLab/InternGPT.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternGPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.05662) [[code]](https://github.com/OpenGVLab/InternGPT) [[demo]](https://igpt.opengvlab.com/)
- **VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks** [![Star](https://img.shields.io/github/stars/OpenGVLab/VisionLLM.svg?style=social&label=Star)](https://github.com/OpenGVLab/VisionLLM)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.11175) [[code]](https://github.com/OpenGVLab/VisionLLM)
- **Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** [![Star](https://img.shields.io/github/stars/luogen1996/LaVIN.svg?style=social&label=Star)](https://github.com/luogen1996/LaVIN)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15023) [[code]](https://github.com/luogen1996/LaVIN) [[project page]](https://luogen1996.github.io/lavin/)
- **EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought** [![Star](https://img.shields.io/github/stars/EmbodiedGPT/EmbodiedGPT_Pytorch.svg?style=social&label=Star)](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15021) [[code]](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch) [[project page]](https://embodiedgpt.github.io/)
- **DetGPT: Detect What You Need via Reasoning** [![Star](https://img.shields.io/github/stars/OptimalScale/DetGPT.svg?style=social&label=Star)](https://github.com/OptimalScale/DetGPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.14167) [[code]](https://github.com/OptimalScale/DetGPT) [[project page]](https://detgpt.github.io/)
- **PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology** [![Star](https://img.shields.io/github/stars/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.svg?style=social&label=Star)](https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15072) [[code]](https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology)
- **ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst** [![Star](https://img.shields.io/github/stars/joez17/ChatBridge.svg?style=social&label=Star)](https://github.com/joez17/ChatBridge)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.16103) [[code]](https://github.com/joez17/ChatBridge) [[project page]](https://iva-chatbridge.github.io/)
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** [![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-ChatGPT.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/Video-ChatGPT)
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.05424) [[code]](https://github.com/mbzuai-oryx/Video-ChatGPT)
- **LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.06687)
- **Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation**
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2303.05983) [[project page]](https://matrix-alpha.github.io/)
- **VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY** [![Star](https://img.shields.io/github/stars/RupertLuo/Valley.svg?style=social&label=Star)](https://github.com/RupertLuo/Valley)
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.07207) [[code]](https://github.com/RupertLuo/Valley)
## LLM-Based Modularized Frameworks
- **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** [![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social&label=Star)](https://github.com/microsoft/TaskMatrix)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.04671) [[code]](https://github.com/microsoft/TaskMatrix) [[demo]](https://huggingface.co/spaces/microsoft/visual_chatgpt)
- **ViperGPT: Visual Inference via Python Execution for Reasoning** [![Star](https://img.shields.io/github/stars/cvlab-columbia/viper.svg?style=social&label=Star)](https://github.com/cvlab-columbia/viper)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.08128) [[code]](https://github.com/cvlab-columbia/viper) [[project page]](https://viper.cs.columbia.edu/)
- **TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs** [![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social&label=Star)](https://github.com/microsoft/TaskMatrix)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.16434) [[code]](https://github.com/microsoft/TaskMatrix/tree/main/TaskMatrix.AI)
- **ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions** [![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.06594) [[code]](https://github.com/Vision-CAIR/ChatCaptioner)
- **MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action** [![Star](https://img.shields.io/github/stars/microsoft/MM-REACT.svg?style=social&label=Star)](https://github.com/microsoft/MM-REACT)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.11381) [[code]](https://github.com/microsoft/MM-REACT) [[project page]](https://multimodal-react.github.io/) [[demo]](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react)
- **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** [![Star](https://img.shields.io/github/stars/microsoft/JARVIS.svg?style=social&label=Star)](https://github.com/microsoft/JARVIS)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.17580) [[code]](https://github.com/microsoft/JARVIS) [[demo]](https://huggingface.co/spaces/microsoft/HuggingGPT)
- **VLog: Video as a Long Document** [![Star](https://img.shields.io/github/stars/showlab/VLog.svg?style=social&label=Star)](https://github.com/showlab/VLog)
[[code]](https://github.com/showlab/VLog) [[demo]](https://huggingface.co/spaces/TencentARC/VLog)
- **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions** [![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.04227) [[code]](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner)
- **ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.14407) [[project page]](https://www.wangjunke.info/ChatVideo/)
- **VideoChat: Chat-Centric Video Understanding** [![Star](https://img.shields.io/github/stars/OpenGVLab/Ask-Anything.svg?style=social&label=Star)](https://github.com/OpenGVLab/Ask-Anything)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.06355) [[code]](https://github.com/OpenGVLab/Ask-Anything) [[demo]](https://huggingface.co/spaces/ynhe/AskAnything)