Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-large-multimodal-agents


https://github.com/jun0wanan/awesome-large-multimodal-agents

Last synced: 4 days ago

  • Application

    • Taxonomy

      • **MLLM-Tool** - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Github](https://github.com/MLLM-Tool/MLLM-Tool)
      • **AppAgent** - AppAgent: Multimodal Agents as Smartphone Users [Github](https://github.com/mnotgod96/AppAgent)
      • **MM-Navigator** - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [Github](https://github.com/zzxslp/MM-Navigator)
      • **DroidBot-GPT** - DroidBot-GPT: GPT-powered UI Automation for Android [Github](https://github.com/MobileLLM/DroidBot-GPT)
      • **MemoDroid** - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
      • **GPT-Driver** - GPT-Driver: Learning to Drive with GPT [Github](https://github.com/PointsCoder/GPT-Driver)
      • **M3** - Towards Robust Multi-Modal Reasoning via Model Selection [Github](https://github.com/LINs-lab/M3)
      • **ASSISTGUI** - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [Github](https://github.com/showlab/assistgui)
      • **MusicAgent** - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models [Github](https://github.com/microsoft/muzic/tree/main)
      • **AudioGPT** - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
      • **WavJourney** - WavJourney: Compositional Audio Creation with Large Language Models [Github](https://github.com/Audio-AGI/WavJourney)
      • **DLAH** - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [Github](https://github.com/PJLab-ADG/DriveLikeAHuman)
      • **MP5** - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Github](https://github.com/IranQin/MP5)
      • **MM-REACT** - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [Github](https://github.com/microsoft/MM-REACT)
      • **STEVE** - See and Think: Embodied Agent in Virtual Environment [Github](https://github.com/rese1f/STEVE)
      • **AutoDroid** - Empowering LLM to use Smartphone for Intelligent Task Automation [Github](https://github.com/MobileLLM/AutoDroid)
      • **GPT4Tools** - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [Github](https://github.com/AILab-CVC/GPT4Tools)
      • **OpenAdapt**
      • **AssistGPT** - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn [Github](https://github.com/showlab/assistgpt)
      • **ChatVideo** - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [Github](https://www.wangjunke.info/ChatVideo/)
      • **GPT-4V-Act** - GPT-4V-Act: Chromium Copilot [Github](https://github.com/ddupont808/GPT-4V-Act)
      • **VisionGPT** - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
      • **VideoAgent-L** - VideoAgent: Long-form Video Understanding with Large Language Model as Agent [Project page](https://wxh1996.github.io/VideoAgent-Website/)
      • **HuggingGPT** - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Github](https://github.com/microsoft/JARVIS)
      • **GRID** - GRID: A Platform for General Robot Intelligence Development [Github](https://github.com/ScaledFoundations/GRID-playground)
      • **WebWISE** - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
      • **LLaVA-Interactive**
      • **CRAFT** - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
      • **LLaVA-Plus** - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)
      • **EMMA** - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld [Github](https://github.com/stevenyangyj/Emma-Alfworld)
      • **Chameleon** - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [Github](https://github.com/lupantech/chameleon-llm)
      • **DDCoT** - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models [Github](https://github.com/SooLab/DDCOT)
      • **Octopus** - Octopus: Embodied Vision-Language Programmer from Environmental Feedback [Github](https://github.com/dongyh20/Octopus)
      • **ViperGPT** - ViperGPT: Visual Inference via Python Execution for Reasoning [Github](https://github.com/cvlab-columbia/viper)
      • **Visual ChatGPT** - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Github](https://github.com/moymix/TaskMatrix)
      • **Avis** - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
      • **Auto-UI** - You Only Look at Screens: Multimodal Chain-of-Action Agents [Github](https://github.com/cooelf/Auto-UI)
      • **Mobile-Agent** - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [Github](https://github.com/X-PLUG/MobileAgent) ![Star](https://img.shields.io/github/stars/X-PLUG/MobileAgent.svg?style=social&label=Star)
      • **MEIA** - Multimodal Embodied Interactive Agent for Cafe Scene
      • **VisProgram** - Visual Programming: Compositional visual reasoning without training
      • **VideoAgent** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page](https://videoagent.github.io/)
      • **Cradle** - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study [Github](https://github.com/BAAI-Agents/Cradle) ![Star](https://img.shields.io/github/stars/BAAI-Agents/Cradle.svg?style=social&label=Star)
      • **CLOVA** - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
      • **DEPS** - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
      • **MuLan** - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion [Github](https://github.com/measure-infinity/mulan-code)
      • **SeeAct** - GPT-4V(ision) is a Generalist Web Agent, if Grounded
  • Benchmark

    • Taxonomy

      • **SmartPlay** - SmartPlay: A Benchmark for LLMs as Intelligent Agents [Github](https://github.com/microsoft/SmartPlay)
      • **VisualWebArena** - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [Github](https://github.com/web-arena-x/visualwebarena)
      • **GAIA** - GAIA: a benchmark for General AI Assistants [Github](https://huggingface.co/gaia-benchmark)
      • **OmniACT** - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
      • **Mind2Web** - MIND2WEB: Towards a Generalist Agent for the Web [Github](https://github.com/OSU-NLP-Group/Mind2Web)
  • Papers

    • Taxonomy

      • **Agent-Smith** - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Github](https://github.com/sail-sg/Agent-Smith) ![Star](https://img.shields.io/github/stars/sail-sg/Agent-Smith.svg?style=social&label=Star)