# Awesome GUI Agent [[Awesome](https://github.com/sindresorhus/awesome)]
A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.
*Build a digital assistant on your screen. Generated by DALL-E-3.*

**CONTRIBUTIONS WELCOME!**
🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.
🤖 Try our [Awesome-Paper-Agent](https://chatgpt.com/g/g-qqs9km6wi-awesome-paper-agent). Just provide an arXiv URL, and it will automatically return a formatted entry, like this:
```
User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023) [[Code](https://github.com/showlab/assistgui)] [[Paper](https://arxiv.org/abs/2312.13108)] [[Website](https://showlab.github.io/assistgui/)]
```
You can then easily copy this formatted entry into your pull request.
⭐ If you find this repository useful, please give it a star.
---
**Quick Navigation**: [[Datasets / Benchmarks]](#datasets--benchmarks) [[Models / Agents]](#models--agents) [[Surveys]](#surveys) [[Projects]](#projects) [[Safety]](#safety)

## Datasets / Benchmarks
+ [World of Bits: An Open-Domain Platform for Web-Based Agents](https://proceedings.mlr.press/v70/shi17a.html) (Aug. 2017, ICML 2017) [[Paper](https://proceedings.mlr.press/v70/shi17a/shi17a.pdf)]
+ [A Unified Solution for Structured Web Data Extraction](https://dl.acm.org/doi/10.1145/2009916.2010020) (Jul. 2011, SIGIR 2011) [[Paper](https://dl.acm.org/doi/10.1145/2009916.2010020)]
+ [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) (Oct. 2017) [[Paper](https://dl.acm.org/doi/10.1145/3126594.3126651)]
+ [Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration](https://arxiv.org/abs/1802.08802) (Feb. 2018, ICLR 2018) [[Code](https://github.com/stanfordnlp/wge)] [[Paper](https://arxiv.org/abs/1802.08802)]
+ [Mapping Natural Language Instructions to Mobile UI Action Sequences](https://arxiv.org/abs/2005.03776) (May 2020, ACL 2020) [[Code](https://github.com/deepneuralmachine/seq2act-tensorflow)] [[Paper](https://arxiv.org/abs/2005.03776)]
+ [WebSRC: A Dataset for Web-Based Structural Reading Comprehension](https://arxiv.org/abs/2101.09465) (Jan. 2021, EMNLP 2021) [[Paper](https://arxiv.org/abs/2101.09465)] [[Website](https://x-lance.github.io/WebSRC/)]
+ [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) (May 2021) [[Code](https://github.com/deepmind/android_env)] [[Paper](https://arxiv.org/abs/2105.13231)]
+ [A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility](https://arxiv.org/abs/2202.02312) (Feb. 2022) [[Paper](https://arxiv.org/abs/2202.02312)]
+ [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI](https://arxiv.org/abs/2205.11029) (May 2022) [[Paper](https://arxiv.org/abs/2205.11029)] [[Leaderboard](https://x-lance.github.io/META-GUI-Leaderboard/)]
+ [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) (Jul. 2022) [[Code](https://github.com/princeton-nlp/WebShop)] [[Paper](https://arxiv.org/abs/2207.01206)] [[Website](https://webshop-pnlp.github.io/)]
+ [Language Models can Solve Computer Tasks](https://arxiv.org/abs/2303.17491) (Mar. 2023) [[Code](https://github.com/posgnu/rci-agent)] [[Paper](https://arxiv.org/abs/2303.17491)] [[Website](https://posgnu.github.io/rci-web/)]
+ [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) (May 2023) [[Code](https://github.com/X-LANCE/Mobile-Env)] [[Paper](https://arxiv.org/abs/2305.08144)]
+ [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) (Jun. 2023) [[Code](https://github.com/osu-nlp-group/mind2web)] [[Paper](https://arxiv.org/abs/2306.06070)] [[Website](https://osu-nlp-group.github.io/Mind2Web/)]
+ [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) (Jul. 2023) [[Code](https://github.com/google-research/google-research/tree/master/android_in_the_wild)] [[Paper](https://arxiv.org/abs/2307.10088)]
+ [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) (Jul. 2023) [[Code](https://github.com/web-arena-x/webarena)] [[Paper](https://arxiv.org/abs/2307.13854)] [[Website](https://webarena.dev/)]
+ [Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models](https://arxiv.org/abs/2311.09278) (Nov. 2023) [[Code](https://github.com/xufangzhi/ENVISIONS)] [[Paper](https://arxiv.org/abs/2311.09278)] [[Website](https://xufangzhi.github.io/symbol-llm-page/)]
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2401.07781) (Dec. 2023, CVPR 2024) [[Code](https://github.com/showlab/assistgui)] [[Paper](https://arxiv.org/abs/2401.07781)] [[Website](https://showlab.github.io/assistgui/)]
+ [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) (Jan. 2024, ACL 2024) [[Code](https://github.com/jykoh/visualwebarena)] [[Paper](https://arxiv.org/abs/2401.13649)] [[Website](https://jykoh.com/vwa)]
+ [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) (Feb. 2024) [[Paper](https://arxiv.org/abs/2402.17553)]
+ [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) (Feb. 2024) [[Code](https://github.com/mcgill-nlp/weblinx)] [[Paper](https://arxiv.org/abs/2402.05930)] [[Website](https://mcgill-nlp.github.io/weblinx/)]
+ [On the Multi-turn Instruction Following for Conversational Web Agents](https://arxiv.org/abs/2402.15057) (Feb. 2024) [[Code](https://github.com/magicgh/self-map)] [[Paper](https://arxiv.org/abs/2402.15057)]
+ [AgentStudio: A Toolkit for Building General Virtual Agents](https://arxiv.org/abs/2403.17918) (Mar. 2024) [[Code](https://github.com/skyworkai/agent-studio)] [[Paper](https://arxiv.org/abs/2403.17918)] [[Website](https://skyworkai.github.io/agent-studio/)]
+ [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) (Apr. 2024) [[Code](https://github.com/xlang-ai/OSWorld)] [[Paper](https://arxiv.org/abs/2404.07972)] [[Website](https://os-world.github.io/)]
+ [Benchmarking Mobile Device Control Agents across Diverse Configurations](https://arxiv.org/abs/2404.16660) (Apr. 2024, ICLR 2024) [[Code](https://github.com/gimme1dollar/b-moca)] [[Paper](https://arxiv.org/abs/2404.16660)]
+ [MMInA: Benchmarking Multihop Multimodal Internet Agents](https://arxiv.org/abs/2404.09992) (Apr. 2024) [[Code](https://github.com/shulin16/MMInA)] [[Paper](https://arxiv.org/abs/2404.09992)] [[Website](https://mmina.cliangyu.com)]
+ [Autonomous Evaluation and Refinement of Digital Agents](https://arxiv.org/abs/2404.06474) (Apr. 2024) [[Code](https://github.com/Berkeley-NLP/Agent-Eval-Refine)] [[Paper](https://arxiv.org/abs/2404.06474)]
+ [LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation](https://arxiv.org/abs/2404.16054) (Apr. 2024) [[Code](https://github.com/LlamaTouch/LlamaTouch)] [[Paper](https://arxiv.org/abs/2404.16054)]
+ [VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?](https://arxiv.org/abs/2404.05955) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.05955)]
+ [GUICourse: From General Vision Language Models to Versatile GUI Agents](https://arxiv.org/abs/2406.11317) (Jun. 2024) [[Code](https://github.com/yiye3/GUICourse)] [[Paper](https://arxiv.org/abs/2406.11317)]
+ [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents](https://arxiv.org/abs/2406.10819) (Jun. 2024) [[Code](https://github.com/Dongping-Chen/GUI-World)] [[Paper](https://arxiv.org/abs/2406.10819)] [[Website](https://gui-world.github.io/)]
+ [GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices](https://arxiv.org/abs/2406.08451) (Jun. 2024) [[Code](https://github.com/OpenGVLab/GUI-Odyssey)] [[Paper](https://arxiv.org/abs/2406.08451)]
+ [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) (Jun. 2024) [[Code](https://github.com/showlab/videogui)] [[Paper](https://arxiv.org/abs/2406.10227)] [[Website](https://showlab.github.io/videogui/)]
+ [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://arxiv.org/abs/2406.19263) (Jun. 2024) [[Code](https://github.com/eric-ai-lab/Screen-Point-and-Read)] [[Paper](https://arxiv.org/abs/2406.19263)] [[Website](https://screen-point-and-read.github.io/)]
+ [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) (Jun. 2024) [[Code](https://github.com/MobileAgentBench/mobile-agent-bench)] [[Paper](https://arxiv.org/abs/2406.08184)] [[Website](https://mobileagentbench.github.io)]
+ [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) (Jun. 2024) [[Code](https://github.com/google-research/android_world)] [[Paper](https://arxiv.org/abs/2405.14573)]
+ [Practical, Automated Scenario-based Mobile App Testing](https://arxiv.org/abs/2406.08340) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.08340)]
+ [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.12373)] [[Website](https://www.imean.ai/web-canvas)]
+ [On the Effects of Data Scale on Computer Control Agents](https://arxiv.org/abs/2406.03679) (Jun. 2024) [[Code](https://github.com/google-research/google-research/tree/master/android_control)] [[Paper](https://arxiv.org/abs/2406.03679)]
+ [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) (Jul. 2024) [[Code](https://github.com/camel-ai/crab)] [[Paper](https://arxiv.org/abs/2407.01511)]
+ [WebVLN: Vision-and-Language Navigation on Websites](https://ojs.aaai.org/index.php/AAAI/article/view/27878) (AAAI 2024) [[Code](https://github.com/WebVLN/WebVLN)] [[Paper](https://ojs.aaai.org/index.php/AAAI/article/view/27878)]
+ [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://arxiv.org/abs/2407.10956) (Jul. 2024) [[Code](https://github.com/xlang-ai/Spider2-V)] [[Paper](https://arxiv.org/abs/2407.10956)] [[Website](https://spider2-v.github.io/)]
+ [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://arxiv.org/abs/2407.17490) (Jul. 2024) [[Code](https://github.com/YuxiangChai/AMEX-codebase)] [[Paper](https://arxiv.org/abs/2407.17490)] [[Website](https://yuxiangchai.github.io/AMEX/)]
+ [Windows Agent Arena](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf) [[Code](https://github.com/microsoft/WindowsAgentArena)] [[Paper](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf)] [[Website](https://microsoft.github.io/WindowsAgentArena/)]
+ [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) (Oct. 2024) [[Code](https://github.com/neulab/multiui)] [[Website](https://neulab.github.io/MultiUI/)]
+ [GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent](https://arxiv.org/abs/2412.18426) (Dec. 2024) [[Code](https://github.com/ZJU-ACES-ISE/ChatUITest)] [[Paper](https://arxiv.org/abs/2412.18426)]
+ [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) (Jan. 2025) [[Paper](https://arxiv.org/abs/2501.01149)] [[Website](https://yuxiangchai.github.io/Android-Agent-Arena/)]
+ [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) [[Code](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding)] [[Paper](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf)] [[Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)]
+ [WebWalker: Benchmarking LLMs in Web Traversal](https://github.com/Alibaba-nlp/WebWalker) (Jan. 2025) [[Code](https://github.com/Alibaba-nlp/WebWalker)] [[Paper](https://arxiv.org/pdf/2501.07572)] [[Website](https://alibaba-nlp.github.io/WebWalker/)]
+ [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ai-agents-2030.github.io/SPA-Bench/) (ICLR 2025) [[Paper](https://arxiv.org/abs/2410.15164)] [[Website](https://ai-agents-2030.github.io/SPA-Bench/)]
+ [WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation](https://arxiv.org/abs/2502.08047) (Feb. 2025) [[Code](https://github.com/showlab/GUI-Thinker)] [[Paper](https://arxiv.org/abs/2502.08047)] [[Website](https://showlab.github.io/GUI-Thinker/)]

## Models / Agents
+ [Grounding Open-Domain Instructions to Automate Web Support Tasks](https://web3.arxiv.org/abs/2103.16057) (Mar. 2021) [[Paper](https://web3.arxiv.org/abs/2103.16057)]
+ [Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning](https://arxiv.org/abs/2108.03353) (Aug. 2021) [[Paper](http://arxiv.org/abs/2108.03353)]
+ [A Data-Driven Approach for Learning to Control Computers](https://arxiv.org/abs/2202.08137) (Feb. 2022) [[Paper](https://arxiv.org/abs/2202.08137)]
+ [Augmenting Autotelic Agents with Large Language Models](https://arxiv.org/pdf/2305.12487) (May 2023) [[Paper](https://arxiv.org/pdf/2305.12487)]
+ [Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control](https://arxiv.org/abs/2306.07863) (Jun. 2023, ICLR 2024) [[Code](https://github.com/ltzheng/synapse)] [[Paper](https://arxiv.org/abs/2306.07863)]
+ [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis](https://arxiv.org/abs/2307.12856) (Jul. 2023, ICLR 2024) [[Paper](http://arxiv.org/abs/2307.12856)]
+ [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) (Sep. 2023) [[Paper](https://arxiv.org/abs/2309.08172)]
+ [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) (Dec. 2023, CVPR 2024) [[Code](https://github.com/THUDM/CogVLM)] [[Paper](https://arxiv.org/abs/2312.08914)]
+ [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) (Jan. 2024) [[Code](https://github.com/MinorJerry/WebVoyager)] [[Paper](https://arxiv.org/abs/2401.13919)]
+ [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) (Feb. 2024) [[Code](https://github.com/OS-Copilot/OS-Copilot)] [[Paper](https://arxiv.org/abs/2402.07456)] [[Website](https://os-copilot.github.io/)]
+ [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) (Feb. 2024) [[Code](https://github.com/microsoft/UFO)] [[Paper](https://arxiv.org/abs/2402.07939)] [[Website](https://microsoft.github.io/UFO/)]
+ [Comprehensive Cognitive LLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) (Feb. 2024) [[Paper](https://arxiv.org/abs/2402.11941)]
+ [Improving Language Understanding from Screenshots](https://arxiv.org/abs/2402.14073) (Feb. 2024) [[Paper](https://arxiv.org/abs/2402.14073)]
+ [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024, KDD 2024) [[Code](https://github.com/THUDM/AutoWebGLM)] [[Paper](https://arxiv.org/abs/2404.03648)]
+ [SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models](https://arxiv.org/abs/2305.19308) (May 2023, NeurIPS 2023) [[Code](https://github.com/BraveGroup/SheetCopilot)] [[Paper](https://arxiv.org/abs/2305.19308)] [[Website](https://sheetcopilot.github.io/)]
+ [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) (Sep. 2023) [[Code](https://github.com/cooelf/Auto-UI)] [[Paper](https://arxiv.org/abs/2309.11436)]
+ [Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API](https://arxiv.org/abs/2310.04716) (Oct. 2023) [[Paper](https://arxiv.org/abs/2310.04716)]
+ [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/pdf/2310.10634) (Oct. 2023) [[Code](https://github.com/xlang-ai/OpenAgents)] [[Paper](https://arxiv.org/pdf/2310.10634)]
+ [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) (Oct. 2024) [[Code](https://github.com/chengyou-jia/AgentStore)] [[Paper](https://arxiv.org/abs/2410.18603)] [[Website](https://chengyou-jia.github.io/AgentStore-Home/)]
+ [GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation](https://arxiv.org/abs/2311.07562) (Nov. 2023) [[Code](https://github.com/zzxslp/MM-Navigator)] [[Paper](https://arxiv.org/abs/2311.07562)]
+ [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) (Dec. 2023) [[Code](https://github.com/mnotgod96/AppAgent)] [[Paper](https://arxiv.org/abs/2312.13771)] [[Website](https://appagent-official.github.io)]
+ [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) (Jan. 2024, ACL 2024) [[Code](https://github.com/njucckevin/SeeClick)] [[Paper](https://arxiv.org/abs/2401.10935)]
+ [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) (Jan. 2024, ICML 2024) [[Code](https://github.com/OSU-NLP-Group/SeeAct)] [[Paper](https://arxiv.org/abs/2401.01614)] [[Website](https://osu-nlp-group.github.io/SeeAct/)]
+ [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](http://arxiv.org/abs/2401.16158) (Jan. 2024) [[Paper](http://arxiv.org/abs/2401.16158)]
+ [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) (Feb. 2024, CVPR 2024) [[Paper](https://arxiv.org/abs/2402.04476)]
+ [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) (Jun. 2024) [[Code](https://github.com/DigiRL-agent/digirl)] [[Paper](https://arxiv.org/abs/2406.11896)] [[Website](https://digirl-agent.github.io/)]
+ [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9.pdf) (NAACL 2024) [[Paper](https://aclanthology.org/2024.naacl-industry.9.pdf)]
+ [ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model](https://arxiv.org/abs/2402.07945) (Feb. 2024) [[Code](https://github.com/niuzaisheng/ScreenAgent)] [[Paper](https://arxiv.org/abs/2402.07945)] [[Website](https://screenagent.pages.dev/)]
+ [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) (Feb. 2024) [[Paper](https://arxiv.org/abs/2402.04615)]
+ [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https://arxiv.org/abs/2404.05719) (Apr. 2024) [[Code](https://github.com/apple/ml-ferret)] [[Paper](https://arxiv.org/abs/2404.05719)]
+ [Octopus: On-device language model for function calling of software APIs](https://arxiv.org/abs/2404.01549) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.01549)]
+ [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.01744)]
+ [Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent](https://arxiv.org/abs/2404.11459) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.11459)] [[Website](https://www.nexa4ai.com/octopus-v3)]
+ [Octopus v4: Graph of language models](https://arxiv.org/abs/2404.19296) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.19296)]
+ [Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning](https://arxiv.org/abs/2404.10887) (Apr. 2024) [[Paper](https://arxiv.org/abs/2404.10887)]
+ [Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking](https://arxiv.org/pdf/2404.08860v3) (Apr. 2024, SIGIR 2024) [[Paper](https://arxiv.org/pdf/2404.08860v3)]
+ [AutoDroid: LLM-powered Task Automation in Android](https://arxiv.org/abs/2308.15272) (Aug. 2023) [[Paper](https://arxiv.org/abs/2308.15272)]
+ [Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation](https://arxiv.org/abs/2312.03003) (Dec. 2023, MobiCom 2024) [[Paper](https://arxiv.org/abs/2312.03003)] [[Website](https://mobile-gpt.github.io/)]
+ [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) (Mar. 2024) [[Code](https://github.com/BAAI-Agents/Cradle)] [[Paper](https://arxiv.org/abs/2403.03186)] [[Website](https://baai-agents.github.io/Cradle/)]
+ [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) (Mar. 2024) [[Code](https://github.com/IMNearth/CoAT)] [[Paper](https://arxiv.org/abs/2403.02713)]
+ [Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning](https://arxiv.org/abs/2405.00516v1) (May 2024) [[Paper](https://arxiv.org/abs/2405.00516v1)]
+ [GUI Action Narrator: Where and When Did That Action Take Place?](https://arxiv.org/abs/2406.13719) (Jun. 2024) [[Code](https://github.com/showlab/GUI-Action-Narrator)] [[Paper](https://arxiv.org/abs/2406.13719)] [[Website](https://showlab.github.io/GUI-Narrator)]
+ [Identifying User Goals from UI Trajectories](https://arxiv.org/abs/2406.14314) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.14314)]
+ [VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning](https://arxiv.org/abs/2406.14056) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.14056)]
+ [Octo-planner: On-device Language Model for Planner-Action Agents](https://arxiv.org/abs/2406.18082) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.18082)] [[Website](https://www.nexa4ai.com/octo-planner#video)]
+ [E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion](https://arxiv.org/abs/2406.14250) (Jun. 2024) [[Paper](https://arxiv.org/abs/2406.14250)]
+ [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) (Jun. 2024) [[Code](https://github.com/X-PLUG/MobileAgent)] [[Paper](https://arxiv.org/abs/2406.01014)]
+ [MobileFlow: A Multimodal LLM For Mobile GUI Agent](https://arxiv.org/abs/2407.04346) (Jul. 2024) [[Paper](https://arxiv.org/abs/2407.04346)]
+ [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037) (Jul. 2024) [[Paper](https://arxiv.org/abs/2407.03037)]
+ [Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence](https://arxiv.org/abs/2407.07061) (Jul. 2024) [[Code](https://github.com/OpenBMB/IoA)] [[Paper](https://arxiv.org/abs/2407.07061)]
+ [MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices](https://arxiv.org/abs/2407.03913) (Jul. 2024) [[Paper](https://arxiv.org/abs/2407.03913)]
+ [AUITestAgent: Automatic Requirements Oriented GUI Function Testing](https://arxiv.org/abs/2407.09018) (Jul. 2024) [[Code](https://github.com/bz-lab/AUITestAgent)] [[Paper](https://arxiv.org/abs/2407.09018)]
+ [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) (Jul. 2024) [[Code](https://github.com/EmergenceAI/Agent-E)] [[Paper](https://arxiv.org/abs/2407.13032)]
+ [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/pdf/2408.00203) (Aug. 2024) [[Paper](https://arxiv.org/pdf/2408.00203)]
+ [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327) (Aug. 2024) [[Code](https://github.com/THUDM/VisualAgentBench)] [[Paper](https://arxiv.org/abs/2408.06327)]
+ [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://web3.arxiv.org/abs/2408.07199v1) (Aug. 2024) [[Paper](https://web3.arxiv.org/abs/2408.07199v1)] [[Blog](https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities)]
+ [MindSearch: Mimicking Human Minds Elicits Deep AI Searcher](https://arxiv.org/abs/2407.20183) (Jul. 2024) [[Code](https://github.com/InternLM/MindSearch)] [[Paper](https://arxiv.org/abs/2407.20183)] [[Website](https://mindsearch.netlify.app/)]
+ [AppAgent v2: Advanced Agent for Flexible Mobile Interactions](https://arxiv.org/abs/2408.11824) (Aug. 2024) [[Paper](https://arxiv.org/abs/2408.11824)]
+ [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) (Aug. 2024) [[Paper](https://arxiv.org/pdf/2408.02544)]
+ [Agent Workflow Memory](https://arxiv.org/abs/2409.07429) (Sep. 2024) [[Code](https://github.com/zorazrw/agent-workflow-memory)] [[Paper](https://arxiv.org/abs/2409.07429)]
+ [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding](https://arxiv.org/abs/2409.14818) (Sep. 2024) [[Paper](https://arxiv.org/abs/2409.14818)]
+ [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164) (Oct. 2024) [[Code](https://github.com/simular-ai/Agent-S)] [[Paper](https://arxiv.org/abs/2410.08164)]
+ [MobA: A Two-Level Agent System for Efficient Mobile Task Automation](https://arxiv.org/abs/2410.13757) (Oct. 2024) [[Code](https://github.com/OpenDFM/MobA)] [[Paper](https://arxiv.org/abs/2410.13757)] [[Dataset](https://huggingface.co/datasets/OpenDFM/MobA-MobBench)]
+ [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243) (Oct. 2024) [[Code](https://github.com/OSU-NLP-Group/UGround)] [[Paper](https://arxiv.org/abs/2410.05243)] [[Website](https://osu-nlp-group.github.io/UGround/)]
+ [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://arxiv.org/pdf/2410.23218) (Oct. 2024) [[Code](https://github.com/OS-Copilot/OS-Atlas)] [[Paper](https://arxiv.org/abs/2410.23218)] [[Website](https://osatlas.github.io/)] [[Dataset](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data)]
+ [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) (Nov. 2024) [[Code](https://github.com/SALT-NLP/PopupAttack)] [[Paper](https://arxiv.org/abs/2411.02391)]
+ [AutoGLM: Autonomous Foundation Agents for GUIs](https://arxiv.org/abs/2411.00820) (Nov. 2024) [[Code](https://github.com/THUDM/AutoGLM)] [[Paper](https://arxiv.org/abs/2411.00820)]
+ [AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations](https://arxiv.org/abs/2411.13451) (Nov. 2024) [[Paper](https://arxiv.org/abs/2411.13451)]
+ [ShowUI: One Vision-Language-Action Model for Generalist GUI Agent](https://arxiv.org/abs/2411.17465) (Nov. 2024) [[Code](https://github.com/showlab/ShowUI)] [[Paper](https://arxiv.org/abs/2411.17465)]
+ [Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction](https://arxiv.org/abs/2412.04454) (Dec. 2024) [[Code](https://github.com/xlang-ai/aguvis)] [[Paper](https://arxiv.org/abs/2412.04454)] [[Website](https://aguvis-project.github.io/)]
+ [Falcon-UI: Understanding GUI Before Following User Instructions](https://arxiv.org/abs/2412.09362) (Dec. 2024) [[Paper](https://arxiv.org/abs/2412.09362)]
+ [PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World](https://arxiv.org/abs/2412.17589) (Dec. 2024) [[Code](https://github.com/GAIR-NLP/PC-Agent)] [[Paper](https://arxiv.org/abs/2412.17589)] [[Website](https://gair-nlp.github.io/PC-Agent/)]
+ [Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining](https://arxiv.org/pdf/2412.10342) (Dec. 2024) [[Paper](https://arxiv.org/pdf/2412.10342)]
+ [Aria-UI: Visual Grounding for GUI Instructions](https://arxiv.org/abs/2412.16256) (Dec. 2024) [[Code](https://github.com/AriaUI/Aria-UI)] [[Paper](https://arxiv.org/abs/2412.16256)] [[Website](https://ariaui.github.io)] [[Dataset](https://huggingface.co/datasets/Aria-UI/Aria-UI_Data)]
+ [CogAgent v2](https://github.com/THUDM/CogAgent) (Dec. 2024) [[Code](https://github.com/THUDM/CogAgent)]
+ [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://arxiv.org/abs/2412.19723) (Dec. 2024) [[Code](https://github.com/OS-Copilot/OS-Genesis)] [[Paper](https://arxiv.org/abs/2412.19723)] [[Website](https://qiushisun.github.io/OS-Genesis-Home/)]
+ [InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection](https://arxiv.org/pdf/2501.04575) (Jan. 2025) [[Code](https://github.com/Reallm-Labs/InfiGUIAgent)] [[Paper](https://arxiv.org/pdf/2501.04575)]
+ [GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration](https://arxiv.org/pdf/2501.13896) (Jan. 2025) [[Paper](https://arxiv.org/pdf/2501.13896)] [[Website](https://gui-bee.github.io/)]
+ [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) (ICLR 2025) [[Paper](https://arxiv.org/abs/2410.17883)]
+ [DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents](https://arxiv.org/abs/2410.14803) (ICLR 2025) [[Paper](https://arxiv.org/abs/2410.14803)] [[Website](https://ai-agents-2030.github.io/DistRL/)]
+ [AppVLM: A Lightweight Vision Language Model for Online App Control](https://arxiv.org/abs/2502.06395) (Feb. 2025) [[Paper](https://arxiv.org/abs/2502.06395)]
+ [VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning](https://arxiv.org/abs/2502.07949) (Feb. 2025) [[Paper](https://arxiv.org/abs/2502.07949)]
+ [GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection](https://arxiv.org/abs/2502.08047) (Feb. 2025) [[Code](https://github.com/showlab/GUI-Thinker)] [[Paper](https://arxiv.org/abs/2502.08047)] [[Website](https://showlab.github.io/GUI-Thinker/)]

## Surveys
+ [OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use](https://github.com/OS-Agent-Survey/OS-Agent-Survey) (Dec. 2024) [[Code](https://github.com/OS-Agent-Survey/OS-Agent-Survey)] [[Paper](https://github.com/OS-Agent-Survey/OS-Agent-Survey/blob/main/paper.pdf)] [[Website](https://os-agent-survey.github.io/)]
+ [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890) (Nov. 2024) [[Paper](https://arxiv.org/abs/2411.04890)]
+ [Large Language Model-Brained GUI Agents: A Survey](https://arxiv.org/abs/2411.18279) (Nov. 2024) [[Paper](https://arxiv.org/abs/2411.18279)] [[Website](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/)]
+ [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501) (Dec. 2024) [[Paper](https://arxiv.org/abs/2412.13501)]
## Projects
+ [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/index.html) [[Code](https://github.com/asweigart/pyautogui/tree/master)] [[Docs](https://pyautogui.readthedocs.io/en/latest/)]
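  Toolkits like PyAutoGUI provide the low-level mouse, keyboard, and screenshot primitives that many of the agents above use to actually act on the screen. Below is a minimal sketch of an observe-then-act step with PyAutoGUI; the query text, timing values, and output filename are arbitrary placeholders, not taken from any listed project:

  ```python
  # Minimal PyAutoGUI sketch: move, click, type, then capture the screen.
  # Requires `pip install pyautogui` (screenshots also need Pillow).
  import pyautogui

  pyautogui.FAILSAFE = True  # slamming the mouse into a screen corner aborts the script

  width, height = pyautogui.size()                 # current screen resolution
  pyautogui.moveTo(width // 2, height // 2, duration=0.5)  # glide to screen center
  pyautogui.click()                                # left-click at the current position
  pyautogui.write("GUI agents", interval=0.05)     # type as if on a real keyboard
  pyautogui.press("enter")                         # submit

  # Capture what the "agent" now sees; vision-based agents feed an image like
  # this to a multimodal model that predicts the next click or keystroke.
  pyautogui.screenshot("observation.png")
  ```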
+ [nut.js](https://nutjs.dev/) [[Code](https://github.com/nut-tree/nut.js)] [[Website](https://nutjs.dev/)]
+ [GPT-4V-Act: AI agent using GPT-4V(ision) for web UI interaction](https://github.com/ddupont808/GPT-4V-Act) [[Code](https://github.com/ddupont808/GPT-4V-Act)]
+ [gpt-computer-assistant](https://github.com/onuratakan/gpt-computer-assistant) [[Code](https://github.com/onuratakan/gpt-computer-assistant)]
+ [Mobile-Agent: The Powerful Mobile Device Operation Assistant Family](https://github.com/X-PLUG/MobileAgent) [[Code](https://github.com/X-PLUG/MobileAgent)]
+ [OpenUI](https://github.com/wandb/openui) [[Code](https://github.com/wandb/openui)] [[Website](https://openui.fly.dev)]
+ [ACT-1](https://www.adept.ai/blog/act-1) [[Blog](https://www.adept.ai/blog/act-1)]
+ [NatBot](https://github.com/nat/natbot) [[Code](https://github.com/nat/natbot)]
+ [Multion](https://www.multion.ai) [[Website](https://www.multion.ai/)]
+ [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT) [[Code](https://github.com/Significant-Gravitas/Auto-GPT)]
+ [WebLlama](https://github.com/McGill-NLP/webllama) [[Code](https://github.com/McGill-NLP/webllama)] [[Website](https://webllama.github.io)]
+ [LaVague: Large Action Model Framework to Develop AI Web Agents](https://github.com/lavague-ai/LaVague) [[Code](https://github.com/lavague-ai/LaVague)] [[Docs](https://docs.lavague.ai/)]
+ [OpenAdapt: AI-First Process Automation with Large Multimodal Models](https://github.com/OpenAdaptAI/OpenAdapt) [[Code](https://github.com/OpenAdaptAI/OpenAdapt)]
+ [Surfkit: A toolkit for building and sharing AI agents that operate on devices](https://github.com/agentsea/surfkit) [[Code](https://github.com/agentsea/surfkit)]
+ [AGI Computer Control](https://github.com/James4Ever0/agi_computer_control) [[Code](https://github.com/James4Ever0/agi_computer_control)]
+ [Open Interpreter](https://github.com/OpenInterpreter/open-interpreter) [[Code](https://github.com/OpenInterpreter/open-interpreter)] [[Website](https://openinterpreter.com/)]
+ [WebMarker: Mark web pages for use with vision-language models](https://github.com/reidbarber/webmarker) [[Code](https://github.com/reidbarber/webmarker)] [[Website](https://www.webmarkerjs.com/)]
+ [Computer Use Out-of-the-box](https://github.com/showlab/computer_use_ootb) [[Code](https://github.com/showlab/computer_use_ootb/tree/master)] [[Website](https://computer-use-ootb.github.io/)]

## Safety
+ [Adversarial Attacks on Multimodal Agents](https://github.com/ChenWu98/agent-attack) [[Code](https://github.com/ChenWu98/agent-attack)] [[Website](https://chenwu.io/attack-agent/)]
+ [AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents](https://github.com/AI-secure/AdvWeb) [[Code](https://github.com/AI-secure/AdvWeb)] [[Website](https://ai-secure.github.io/AdvWeb/)]
+ [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://github.com/jylee425/mobilesafetybench) [[Code](https://github.com/jylee425/mobilesafetybench)] [[Website](https://mobilesafetybench.github.io/)]
+ [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://github.com/OSU-NLP-Group/EIA_against_webagent) [[Code](https://github.com/OSU-NLP-Group/EIA_against_webagent)]
+ [Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents](https://github.com/OSU-NLP-Group/WebDreamer) [[Code](https://github.com/OSU-NLP-Group/WebDreamer)]
+ [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) [[Paper](https://arxiv.org/abs/2408.02544)]
+ [Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study](https://arxiv.org/html/2407.09295v2) [[Paper](https://arxiv.org/html/2407.09295v2)]
## Related Repositories
- [awesome-llm-powered-agent](https://github.com/hyp1231/awesome-llm-powered-agent)
- [Awesome-LLM-based-Web-Agent-and-Tools](https://github.com/albzni/Awesome-LLM-based-Web-Agent-and-Tools)
- [awesome-ui-agents](https://github.com/opendilab/awesome-ui-agents/)
- [computer-control-agent-knowledge-base](https://github.com/James4Ever0/computer_control_agent_knowledge_base)
- [Awesome GUI Agent Paper List](https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List/)

## Acknowledgements
This template is provided by [Awesome-Video-Diffusion](https://github.com/showlab/Awesome-Video-Diffusion) and [Awesome-MLLM-Hallucination](https://github.com/showlab/Awesome-MLLM-Hallucination).