awesome-gui-agent
💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
https://github.com/showlab/awesome-gui-agent
Datasets / Benchmarks
- Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration
- World of Bits: An Open-Domain Platform for Web-Based Agents
- Mapping Natural Language Instructions to Mobile UI Action Sequences
- WebSRC: A Dataset for Web-Based Structural Reading Comprehension
- AndroidEnv: A Reinforcement Learning Platform for Android
- META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
- A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
- GUICourse: From General Vision Language Models to Versatile GUI Agents
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
- AssistGUI: Task-Oriented Desktop Graphical User Interface Automation
- On the Multi-turn Instruction Following for Conversational Web Agents
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Language Models can Solve Computer Tasks
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
- Mind2Web: Towards a Generalist Agent for the Web
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
- AgentStudio: A Toolkit for Building General Virtual Agents
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Benchmarking Mobile Device Control Agents across Diverse Configurations
- MMInA: Benchmarking Multihop Multimodal Internet Agents
- Autonomous Evaluation and Refinement of Digital Agents
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Practical, Automated Scenario-based Mobile App Testing
- WebCanvas: Benchmarking Web Agents in Online Environments
- On the Effects of Data Scale on Computer Control Agents
- Windows Agent Arena
- Harnessing Webpage UIs for Text-Rich Visual Understanding
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- WebVLN: Vision-and-Language Navigation on Websites
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- A Unified Solution for Structured Web Data Extraction
Models / Agents
- Grounding Open-Domain Instructions to Automate Web Support Tasks
- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- A Data-Driven Approach for Learning to Control Computers
- Augmenting Autotelic Agents with Large Language Models
- UFO: A UI-Focused Agent for Windows OS Interaction
- Comprehensive Cognitive LLM Agent for Smartphone GUI Automation
- Improving Language Understanding from Screenshots
- AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
- Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- LASER: LLM Agent with State-Space Exploration for Web Navigation
- CogAgent: A Visual Language Model for GUI Agents
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models
- You Only Look at Screens: Multimodal Chain-of-Action Agents
- Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
- OpenAgents: An Open Platform for Language Agents in the Wild
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
- AppAgent: Multimodal Agents as Smartphone Users
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- GPT-4V(ision) is a Generalist Web Agent, if Grounded