Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-large-multimodal-agents
https://github.com/jun0wanan/awesome-large-multimodal-agents
Last synced: 2 days ago
JSON representation
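
The "JSON representation" link above points at the service's open API. Below is a minimal sketch of fetching this list programmatically; the endpoint path and the response shape are assumptions for illustration, not documented API details.

```python
# Minimal sketch: pull this awesome list's JSON representation from the
# ecosyste.ms API. The API root and endpoint path below are assumptions
# (hypothetical), inferred from the page's "JSON representation" link.
import requests

BASE = "https://awesome.ecosyste.ms/api/v1"  # assumed API root

resp = requests.get(
    f"{BASE}/lists/jun0wanan/awesome-large-multimodal-agents",
    timeout=10,
)
resp.raise_for_status()
data = resp.json()

# Print indexed project names and URLs, assuming the response carries a
# "projects" array shaped like the entries rendered on this page.
for project in data.get("projects", []):
    print(project.get("name"), project.get("url"))
```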
Application
Taxonomy
- **MLLM-Tool** - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Github](https://github.com/MLLM-Tool/MLLM-Tool)
- **AppAgent** - AppAgent: Multimodal Agents as Smartphone Users [Github](https://github.com/mnotgod96/AppAgent)
- **MM-Navigator** - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [Github](https://github.com/zzxslp/MM-Navigator)
- **DroidBot-GPT** - DroidBot-GPT: GPT-powered UI Automation for Android [Github](https://github.com/MobileLLM/DroidBot-GPT)
- **MemoDroid** - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
- **GPT-Driver** - GPT-Driver: Learning to Drive with GPT [Github](https://github.com/PointsCoder/GPT-Driver)
- **M3** - Towards Robust Multi-Modal Reasoning via Model Selection [Github](https://github.com/LINs-lab/M3)
- **ASSISTGUI** - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [Github](https://github.com/showlab/assistgui)
- **AudioGPT** - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [Github](https://github.com/AIGC-Audio/AudioGPT)
- **MP5** - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Github](https://github.com/IranQin/MP5)
- **MM-REACT** - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [Github](https://github.com/microsoft/MM-REACT)
- **STEVE** - See and Think: Embodied Agent in Virtual Environment [Github](https://github.com/rese1f/STEVE)
- **AutoDroid** - Empowering LLM to use Smartphone for Intelligent Task Automation [Github](https://github.com/MobileLLM/AutoDroid)
- **OpenAdapt** [Github](https://github.com/OpenAdaptAI/OpenAdapt)
- **AssistGPT** - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn [Github](https://github.com/showlab/assistgpt)
- **GPT-4V-Act** - GPT-4V-Act: Chromium Copilot [Github](https://github.com/ddupont808/GPT-4V-Act)
- **VideoAgent-L** - VideoAgent: Long-form Video Understanding with Large Language Model as Agent [Project page](https://wxh1996.github.io/VideoAgent-Website/)
- **HuggingGPT** - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Github](https://github.com/microsoft/JARVIS)
- **MusicAgent** - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models [Github](https://github.com/microsoft/muzic/tree/main)
- **GRID** - GRID: A Platform for General Robot Intelligence Development [Github](https://github.com/ScaledFoundations/GRID-playground)
- **WebWISE** - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
- **ChatVideo** - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [Github](https://www.wangjunke.info/ChatVideo/)
- **CRAFT** - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
- **EnvDistraction**
- **GenArtist** - GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing [Github](https://github.com/zhenyuw16/GenArtist) ![Star](https://img.shields.io/github/stars/zhenyuw16/GenArtist.svg?style=social&label=Star)
- **Kubrick** - Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation [Github](https://github.com/gd3kr/BlenderGPT)
- **GenAI** - The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives
- **OpenOmni** - OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents [Github](https://github.com/AI4WA/OpenOmniFramework)
- **Anim-Director** - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [Github](https://github.com/HITsz-TMG/Anim-Director)
- **WirelessAgent** - WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks
- **PhishAgent** - PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection
- **MMRole** - MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents [Github](https://github.com/YanqiDai/MMRole) ![Star](https://img.shields.io/github/stars/YanqiDai/MMRole.svg?style=social&label=Star)
- **ViperGPT** - ViperGPT: Visual Inference via Python Execution for Reasoning [Github](https://github.com/cvlab-columbia/viper)
- **Chameleon** - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [Github](https://github.com/lupantech/chameleon-llm)
- **DDCoT** - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models [Github](https://github.com/SooLab/DDCOT)
- **GPT4Tools** - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [Github](https://github.com/AILab-CVC/GPT4Tools)
- **Octopus** - Octopus: Embodied Vision-Language Programmer from Environmental Feedback [Github](https://github.com/dongyh20/Octopus)
- **LLaVA-Interactive** - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing [Github](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo)
- **WavJourney** - WavJourney: Compositional Audio Creation with Large Language Models [Github](https://github.com/Audio-AGI/WavJourney)
- **DLAH** - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [Github](https://github.com/PJLab-ADG/DriveLikeAHuman)
- **VisionGPT** - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
- **Visual ChatGPT** - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Github](https://github.com/moymix/TaskMatrix)
- **Avis** - AVIS: Autonomous Visual Information Seeking with Large Language Model Agents
- **LLaVA-Plus** - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)
- **Auto-UI** - You Only Look at Screens: Multimodal Chain-of-Action Agents [Github](https://github.com/cooelf/Auto-UI)
- **Mobile-Agent** - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [Github](https://github.com/X-PLUG/MobileAgent) ![Star](https://img.shields.io/github/stars/X-PLUG/MobileAgent.svg?style=social&label=Star)
- **MEIA** - Multimodal Embodied Interactive Agent for Cafe Scene
- **VisProgram** - Visual Programming: Compositional visual reasoning without training
- **EMMA** - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld [Github](https://github.com/stevenyangyj/Emma-Alfworld)
- **VideoAgent-M** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page](https://videoagent.github.io/)
- **Cradle** - Cradle: Empowering Foundation Agents Towards General Computer Control [Github](https://github.com/BAAI-Agents/Cradle) ![Star](https://img.shields.io/github/stars/BAAI-Agents/Cradle.svg?style=social&label=Star)
- **CLOVA** - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
- **DEPS** - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
- **MuLan** - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion [Github](https://github.com/measure-infinity/mulan-code)
- **SeeAct** - GPT-4V(ision) is a Generalist Web Agent, if Grounded
Benchmark
Taxonomy
- **OmniACT** - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- **VisualWebArena** - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [Github](https://github.com/web-arena-x/visualwebarena)
- **DSBench** - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [Github](https://github.com/LiqiangJing/DSBench)
- **GTA** - GTA: A Benchmark for General Tool Agents [Github](https://github.com/open-compass/GTA)
- **GAIA** - GAIA: a benchmark for General AI Assistants [Github](https://huggingface.co/gaia-benchmark)
- **SmartPlay** - SmartPlay: A Benchmark for LLMs as Intelligent Agents [Github](https://github.com/microsoft/SmartPlay)
- **Mind2Web** - MIND2WEB: Towards a Generalist Agent for the Web [Github](https://github.com/OSU-NLP-Group/Mind2Web)
Papers
Taxonomy
- **P2H** - Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs
- **Agent-Smith** - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Github](https://github.com/sail-sg/Agent-Smith) ![Star](https://img.shields.io/github/stars/sail-sg/Agent-Smith.svg?style=social&label=Star)
Keywords
- language-model (2)
- minecraft (2)
- lmm (1)
- multimodal (1)
- audio (1)
- gpt (1)
- music (1)
- sound (1)
- speech (1)
- talking-head (1)
- agent (1)
- gpt-4 (1)
- gpt-4-api (1)
- gpt-4-vision-preview (1)
- gpt4-turbo (1)
- gpt4-vision (1)
- huggingface (1)
- huggingface-transformers (1)
- large-action-model (1)
- large-language-models (1)
- large-multimodal-models (1)
- process-automation (1)
- process-mining (1)
- python (1)
- segment-anything (1)
- transformers (1)