Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-large-multimodal-agents


https://github.com/jun0wanan/awesome-large-multimodal-agents

Last synced: 4 days ago

  • Application

    • Taxonomy

      • **MLLM-Tool** - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Github](https://github.com/MLLM-Tool/MLLM-Tool)
      • **AppAgent** - AppAgent: Multimodal Agents as Smartphone Users [Github](https://github.com/mnotgod96/AppAgent)
      • **MM-Navigator** - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [Github](https://github.com/zzxslp/MM-Navigator)
      • **DroidBot-GPT** - DroidBot-GPT: GPT-powered UI Automation for Android [Github](https://github.com/MobileLLM/DroidBot-GPT)
      • **MemoDroid** - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
      • **GPT-Driver** - GPT-Driver: Learning to Drive with GPT [Github](https://github.com/PointsCoder/GPT-Driver)
      • **M3** - Towards Robust Multi-Modal Reasoning via Model Selection [Github](https://github.com/LINs-lab/M3)
      • **ASSISTGUI** - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [Github](https://github.com/showlab/assistgui)
      • **MusicAgent** - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models [Github](https://github.com/microsoft/muzic/tree/main)
      • **AudioGPT** - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
      • **WavJourney** - WavJourney: Compositional Audio Creation with Large Language Models [Github](https://github.com/Audio-AGI/WavJourney)
      • **DLAH** - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [Github](https://github.com/PJLab-ADG/DriveLikeAHuman)
      • **MP5** - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Github](https://github.com/IranQin/MP5)
      • **MM-REACT** - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [Github](https://github.com/microsoft/MM-REACT)
      • **STEVE** - See and Think: Embodied Agent in Virtual Environment [Github](https://github.com/rese1f/STEVE)
      • **AutoDroid** - Empowering LLM to use Smartphone for Intelligent Task Automation [Github](https://github.com/MobileLLM/AutoDroid)
      • **GPT4Tools** - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [Github](https://github.com/AILab-CVC/GPT4Tools)
      • **OpenAdapt**
      • **AssistGPT** - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn [Github](https://github.com/showlab/assistgpt)
      • **ChatVideo** - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [Github](https://www.wangjunke.info/ChatVideo/)
      • **GPT-4V-Act** - GPT-4V-Act: Chromium Copilot [Github](https://github.com/ddupont808/GPT-4V-Act)
      • **VisionGPT** - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
      • **VideoAgent-L** - VideoAgent: Long-form Video Understanding with Large Language Model as Agent [Project page](https://wxh1996.github.io/VideoAgent-Website/)
      • **HuggingGPT** - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Github](https://github.com/microsoft/JARVIS)
      • **GRID** - GRID: A Platform for General Robot Intelligence Development [Github](https://github.com/ScaledFoundations/GRID-playground)
      • **WebWISE** - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
      • **LLaVA-Interactive**
      • **CRAFT** - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
      • **LLaVA-Plus** - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)
      • **EMMA** - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld [Github](https://github.com/stevenyangyj/Emma-Alfworld)
      • **Chameleon** - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [Github](https://github.com/lupantech/chameleon-llm)
      • **DDCoT** - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models [Github](https://github.com/SooLab/DDCOT)
      • **Octopus** - Octopus: Embodied Vision-Language Programmer from Environmental Feedback [Github](https://github.com/dongyh20/Octopus)
      • **ViperGPT** - ViperGPT: Visual Inference via Python Execution for Reasoning [Github](https://github.com/cvlab-columbia/viper)
      • **Visual ChatGPT** - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Github](https://github.com/moymix/TaskMatrix)
      • **Avis** - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
      • **Auto-UI** - You Only Look at Screens: Multimodal Chain-of-Action Agents [Github](https://github.com/cooelf/Auto-UI)
      • **Mobile-Agent** - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [Github](https://github.com/X-PLUG/MobileAgent) ![Star](https://img.shields.io/github/stars/X-PLUG/MobileAgent.svg?style=social&label=Star)
      • **MEIA** - Multimodal Embodied Interactive Agent for Cafe Scene
      • **VisProgram** - Visual Programming: Compositional visual reasoning without training
      • **VideoAgent** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page](https://videoagent.github.io/)
      • **Cradle** - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study [Github](https://github.com/BAAI-Agents/Cradle) ![Star](https://img.shields.io/github/stars/BAAI-Agents/Cradle.svg?style=social&label=Star)
      • **CLOVA** - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
      • **DEPS** - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
      • **MuLan** - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion [Github](https://github.com/measure-infinity/mulan-code)
      • **SeeAct** - GPT-4V(ision) is a Generalist Web Agent, if Grounded
  • Benchmark

    • Taxonomy

      • **SmartPlay** - SmartPlay: A Benchmark for LLMs as Intelligent Agents [Github](https://github.com/microsoft/SmartPlay)
      • **VisualWebArena** - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [Github](https://github.com/web-arena-x/visualwebarena)
      • **GAIA** - GAIA: a benchmark for General AI Assistants [Github](https://huggingface.co/gaia-benchmark)
      • **OmniACT** - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
      • **Mind2Web** - MIND2WEB: Towards a Generalist Agent for the Web [Github](https://github.com/OSU-NLP-Group/Mind2Web)
  • Papers

    • Taxonomy

      • **Agent-Smith** - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Github](https://github.com/sail-sg/Agent-Smith) ![Star](https://img.shields.io/github/stars/sail-sg/Agent-Smith.svg?style=social&label=Star)