Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zjr2000/Awesome-Multimodal-Chatbot
Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience.
List: Awesome-Multimodal-Chatbot
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/zjr2000/Awesome-Multimodal-Chatbot
- Owner: zjr2000
- Created: 2023-05-12T07:28:39.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-06-18T08:38:31.000Z (over 1 year ago)
- Last Synced: 2024-05-19T22:23:32.502Z (7 months ago)
- Topics: awesome, chat-application, chatbot, general-ai, instruction-following, instruction-tuning, multimodal, multimodal-assistant, multimodal-dialogue, papers, vision-language
- Homepage:
- Size: 17.6 KB
- Stars: 57
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome_Prompting_Papers_in_Computer_Vision - Awesome Multimodal Assistant - Vision-language instruction tuning and LLM-based chatbots. (More Resources / Vision-Language Instruction Tuning)
- ultimate-awesome - Awesome-Multimodal-Chatbot - Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience. (Other Lists / Monkey C Lists)
- awesome-of-multimodal-dialogue-models - Awesome-Multimodal-Chatbot (Awesome Surveys / Previous Venues)
README
# Awesome-Multimodal-Chatbot [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
**Awesome Multimodal Assistant** is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience. It is designed to assist users in performing various tasks, from simple information retrieval to complex multimedia reasoning.
## Multimodal Instruction Tuning
- **MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**
```arXiv 2022/12``` [[paper]](https://arxiv.org/abs/2212.10773)
- **GPT-4**
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.08774) [[blog]](https://openai.com/research/gpt-4)
- **Visual Instruction Tuning** [![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)](https://github.com/haotian-liu/LLaVA)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.08485) [[code]](https://github.com/haotian-liu/LLaVA) [[project page]](https://llava-vl.github.io/) [[demo]](https://llava.hliu.cc/)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** [![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)](https://github.com/Vision-CAIR/MiniGPT-4)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.10592) [[code]](https://github.com/Vision-CAIR/MiniGPT-4) [[project page]](https://minigpt-4.github.io/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/minigpt4)
- **mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality** [![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)](https://github.com/X-PLUG/mPLUG-Owl)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.14178) [[code]](https://github.com/X-PLUG/mPLUG-Owl) [[demo]](https://modelscope.cn/studios/damo/mPLUG-Owl/summary)
- **LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model** [![Star](https://img.shields.io/github/stars/ZrrSkywalker/LLaMA-Adapter.svg?style=social&label=Star)](https://github.com/ZrrSkywalker/LLaMA-Adapter)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.15010) [[code]](https://github.com/ZrrSkywalker/LLaMA-Adapter) [[demo]](https://huggingface.co/spaces/csuhan/LLaMA-Adapter)
- **Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding** [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/Video-LLaMA)
[[code]](https://github.com/DAMO-NLP-SG/Video-LLaMA)
- **LMEye: An Interactive Perception Network for Large Language Models** [![Star](https://img.shields.io/github/stars/YunxinLi/LingCloud.svg?style=social&label=Star)](https://github.com/YunxinLi/LingCloud)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.03701) [[code]](https://github.com/YunxinLi/LingCloud)
- **MultiModal-GPT: A Vision and Language Model for Dialogue with Humans** [![Star](https://img.shields.io/github/stars/open-mmlab/Multimodal-GPT.svg?style=social&label=Star)](https://github.com/open-mmlab/Multimodal-GPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.04790) [[code]](https://github.com/open-mmlab/Multimodal-GPT) [[demo]](https://mmgpt.openmmlab.org.cn/)
- **X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages** [![Star](https://img.shields.io/github/stars/phellonchen/X-LLM.svg?style=social&label=Star)](https://github.com/phellonchen/X-LLM)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.04160) [[code]](https://github.com/phellonchen/X-LLM) [[project page]](https://x-llm.github.io/)
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning** [![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social&label=Star)](https://github.com/Luodian/Otter)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.03726) [[code]](https://github.com/Luodian/Otter) [[demo]](https://otter.cliangyu.com/)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** [![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)](https://github.com/salesforce/LAVIS)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.06500) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
- **InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language** [![Star](https://img.shields.io/github/stars/OpenGVLab/InternGPT.svg?style=social&label=Star)](https://github.com/OpenGVLab/InternGPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.05662) [[code]](https://github.com/OpenGVLab/InternGPT) [[demo]](https://igpt.opengvlab.com/)
- **VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks** [![Star](https://img.shields.io/github/stars/OpenGVLab/VisionLLM.svg?style=social&label=Star)](https://github.com/OpenGVLab/VisionLLM)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.11175) [[code]](https://github.com/OpenGVLab/VisionLLM)
- **Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** [![Star](https://img.shields.io/github/stars/luogen1996/LaVIN.svg?style=social&label=Star)](https://github.com/luogen1996/LaVIN)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15023) [[code]](https://github.com/luogen1996/LaVIN) [[project page]](https://luogen1996.github.io/lavin/)
- **EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought** [![Star](https://img.shields.io/github/stars/EmbodiedGPT/EmbodiedGPT_Pytorch.svg?style=social&label=Star)](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15021) [[code]](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch) [[project page]](https://embodiedgpt.github.io/)
- **DetGPT: Detect What You Need via Reasoning** [![Star](https://img.shields.io/github/stars/OptimalScale/DetGPT.svg?style=social&label=Star)](https://github.com/OptimalScale/DetGPT)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.14167) [[code]](https://github.com/OptimalScale/DetGPT) [[project page]](https://detgpt.github.io/)
- **PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology** [![Star](https://img.shields.io/github/stars/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.svg?style=social&label=Star)](https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.15072) [[code]](https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology)
- **ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst** [![Star](https://img.shields.io/github/stars/joez17/ChatBridge.svg?style=social&label=Star)](https://github.com/joez17/ChatBridge)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.16103) [[code]](https://github.com/joez17/ChatBridge) [[project page]](https://iva-chatbridge.github.io/)
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** [![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-ChatGPT.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/Video-ChatGPT)
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.05424) [[code]](https://github.com/mbzuai-oryx/Video-ChatGPT)
- **LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.06687)
- **Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation**
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2303.05983) [[project page]](https://matrix-alpha.github.io/)
- **VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY** [![Star](https://img.shields.io/github/stars/RupertLuo/Valley.svg?style=social&label=Star)](https://github.com/RupertLuo/Valley)
```arXiv 2023/06``` [[paper]](https://arxiv.org/abs/2306.07207) [[code]](https://github.com/RupertLuo/Valley)
## LLM-Based Modularized Frameworks
- **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** [![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social&label=Star)](https://github.com/microsoft/TaskMatrix)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.04671) [[code]](https://github.com/microsoft/TaskMatrix) [[demo]](https://huggingface.co/spaces/microsoft/visual_chatgpt)
- **ViperGPT: Visual Inference via Python Execution for Reasoning** [![Star](https://img.shields.io/github/stars/cvlab-columbia/viper.svg?style=social&label=Star)](https://github.com/cvlab-columbia/viper)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.08128) [[code]](https://github.com/cvlab-columbia/viper) [[project page]](https://viper.cs.columbia.edu/)
- **TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs** [![Star](https://img.shields.io/github/stars/microsoft/TaskMatrix.svg?style=social&label=Star)](https://github.com/microsoft/TaskMatrix)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.16434) [[code]](https://github.com/microsoft/TaskMatrix/tree/main/TaskMatrix.AI)
- **ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions** [![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.06594) [[code]](https://github.com/Vision-CAIR/ChatCaptioner)
- **MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action** [![Star](https://img.shields.io/github/stars/microsoft/MM-REACT.svg?style=social&label=Star)](https://github.com/microsoft/MM-REACT)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.11381) [[code]](https://github.com/microsoft/MM-REACT) [[project page]](https://multimodal-react.github.io/) [[demo]](https://huggingface.co/spaces/microsoft-cognitive-service/mm-react)
- **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** [![Star](https://img.shields.io/github/stars/microsoft/JARVIS.svg?style=social&label=Star)](https://github.com/microsoft/JARVIS)
```arXiv 2023/03``` [[paper]](https://arxiv.org/abs/2303.17580) [[code]](https://github.com/microsoft/JARVIS) [[demo]](https://huggingface.co/spaces/microsoft/HuggingGPT)
- **VLog: Video as a Long Document** [![Star](https://img.shields.io/github/stars/showlab/VLog.svg?style=social&label=Star)](https://github.com/showlab/VLog)
[[code]](https://github.com/showlab/VLog) [[demo]](https://huggingface.co/spaces/TencentARC/VLog)
- **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions** [![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner)
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.04227) [[code]](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner)
- **ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**
```arXiv 2023/04``` [[paper]](https://arxiv.org/abs/2304.14407) [[project page]](https://www.wangjunke.info/ChatVideo/)
- **VideoChat: Chat-Centric Video Understanding** [![Star](https://img.shields.io/github/stars/OpenGVLab/Ask-Anything.svg?style=social&label=Star)](https://github.com/OpenGVLab/Ask-Anything)
```arXiv 2023/05``` [[paper]](https://arxiv.org/abs/2305.06355) [[code]](https://github.com/OpenGVLab/Ask-Anything) [[demo]](https://huggingface.co/spaces/ynhe/AskAnything)