An open API service indexing awesome lists of open source software.

https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List

Building a comprehensive and handy list of papers for GUI agents
https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List

Last synced: about 1 year ago
JSON representation

Building a comprehensive and handy list of papers for GUI agents

Awesome Lists containing this project

README

          

# Awesome GUI Agent Paper List

This repo covers a variety of papers related to GUI Agents, such as:
- Datasets
- Benchmarks
- Models
- Agent frameworks
- Vision, language, multimodal foundation models (with explicit support for GUI)
- Works in general domains extensively used by GUI Agents (e.g., SoM prompting)

[//]: # (

)

[//]: # ( Image 1)

[//]: # ( Image 2)

[//]: # (

)

![keyword_wordcloud_long.png](update_template_or_data/statistics/keyword_wordcloud_long.png)

## Papers Grouped by Environments
| [Web](paper_by_env/paper_web.md) | [Mobile](paper_by_env/paper_mobile.md) | [Desktop](paper_by_env/paper_desktop.md) | [GUI](paper_by_env/paper_gui.md) | [Misc](paper_by_env/paper_misc.md) |
|--------------------------------|---------------------------------------|------------------------------------------|----------------------------------|------------------------------------|

(Misc: Papers for general topics that have important applications in GUI agents.)

## Papers Grouped by Keywords
[framework (116)](paper_by_key/paper_framework.md) | [benchmark (70)](paper_by_key/paper_benchmark.md) | [dataset (66)](paper_by_key/paper_dataset.md) | [model (37)](paper_by_key/paper_model.md) | [reinforcement learning (18)](paper_by_key/paper_reinforcement_learning.md) | [safety (12)](paper_by_key/paper_safety.md) | [visual grounding (10)](paper_by_key/paper_visual_grounding.md) | [planning (8)](paper_by_key/paper_planning.md) | [reasoning (7)](paper_by_key/paper_reasoning.md) | [grounding (6)](paper_by_key/paper_grounding.md) | [survey (5)](paper_by_key/paper_survey.md) | [evaluation (5)](paper_by_key/paper_evaluation.md) | [vision language model (5)](paper_by_key/paper_vision_language_model.md) | [attack (4)](paper_by_key/paper_attack.md) | [learning (4)](paper_by_key/paper_learning.md) | [memory (3)](paper_by_key/paper_memory.md) | [synthetic data (3)](paper_by_key/paper_synthetic_data.md) | [foundation model (3)](paper_by_key/paper_foundation_model.md) | [UI understanding (3)](paper_by_key/paper_UI_understanding.md) | [self-improvement (3)](paper_by_key/paper_self-improvement.md)

## Papers Grouped by Authors
[Yu Su (11)](paper_by_author/paper_Yu_Su.md) | [Graham Neubig (8)](paper_by_author/paper_Graham_Neubig.md) | [Tao Yu (8)](paper_by_author/paper_Tao_Yu.md) | [Huan Sun (8)](paper_by_author/paper_Huan_Sun.md) | [Tianbao Xie (7)](paper_by_author/paper_Tianbao_Xie.md) | [Boyuan Zheng (7)](paper_by_author/paper_Boyuan_Zheng.md) | [Shuyan Zhou (7)](paper_by_author/paper_Shuyan_Zhou.md) | [Kun Shao (6)](paper_by_author/paper_Kun_Shao.md) | [Zhiyong Wu (6)](paper_by_author/paper_Zhiyong_Wu.md) | [Xiao Liu (6)](paper_by_author/paper_Xiao_Liu.md) | [Hanyu Lai (6)](paper_by_author/paper_Hanyu_Lai.md) | [Jie Tang (6)](paper_by_author/paper_Jie_Tang.md) | [Yuxiao Dong (6)](paper_by_author/paper_Yuxiao_Dong.md) | [Yu Gu (5)](paper_by_author/paper_Yu_Gu.md) | [Jianfeng Gao (5)](paper_by_author/paper_Jianfeng_Gao.md) | [Boyu Gou (5)](paper_by_author/paper_Boyu_Gou.md) | [Jianye Hao (5)](paper_by_author/paper_Jianye_Hao.md) | [Jun Wang (5)](paper_by_author/paper_Jun_Wang.md) | [Qiushi Sun (5)](paper_by_author/paper_Qiushi_Sun.md) | [Fangzhi Xu (5)](paper_by_author/paper_Fangzhi_Xu.md)

## All Papers (from most recent to oldest)

Papers

- [Magma: A Foundation Model for Multimodal AI Agents](https://microsoft.github.io/Magma/)
- Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
- πŸ›οΈ Institutions: Microsoft Research, Univ. of Maryland, Univ. of Wisconsin-Madison, KAIST, Univ. of Washington
- πŸ“… Date: Feb 18, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI], [Robotics]
- πŸ”‘ Key: [model], [framework], [SoM], [ToM], [robotics], [UI navigation]
- πŸ“– TLDR: This paper introduces **Magma**, a foundation model designed for multimodal AI agents operating in both digital and physical environments. Magma extends traditional vision-language models by incorporating planning and action capabilities, enabling tasks from UI navigation to robotic manipulation. The model is pretrained on diverse datasets, including images, videos, and robotics data, utilizing **Set-of-Mark (SoM)** for action grounding and **Trace-of-Mark (ToM)** for action planning. Experiments demonstrate that SoM and ToM synergistically enhance Magma's spatial-temporal intelligence, achieving state-of-the-art performance in UI navigation and robotic manipulation tasks.

- [AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks](https://arxiv.org/abs/2502.13053)
- Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, Shengyu Zhang
- πŸ›οΈ Institutions: Zhejiang University
- πŸ“… Date: February 18, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [safety], [AEIA-MN]
- πŸ“– TLDR: This paper introduces the concept of Active Environment Injection Attack (AEIA), where attackers disguise malicious actions as environmental elements to disrupt AI agents' decision-making processes. The authors propose AEIA-MN, an attack scheme leveraging mobile notifications to evaluate the robustness of multimodal large language model-based mobile agents. Experimental results demonstrate that even advanced models are highly vulnerable to such attacks, with success rates reaching up to 93% in the AndroidWorld benchmark.

- [Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents](https://arxiv.org/abs/2502.11357)
- Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah
- πŸ›οΈ Institutions: Microsoft Research, The Ohio State University
- πŸ“… Date: February 17, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [dataset], [web agents], [reinforcement learning], [Explorer]
- πŸ“– TLDR: This paper introduces *Explorer*, a multimodal web agent trained on a newly synthesized dataset comprising over 94,000 successful web trajectories. The dataset, generated through scalable exploration and refinement techniques, spans 49,000 unique URLs and includes 720,000 screenshots and 33 million web elements. *Explorer* demonstrates strong performance on benchmarks like Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++, highlighting the importance of data scaling in enhancing web agent capabilities.

- [VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning](https://arxiv.org/abs/2502.07949)
- Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
- πŸ›οΈ Institutions: Huawei Noah's Ark Lab, Univ. College London
- πŸ“… Date: February 11, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [reinforcement learning], [subgoal generation], [VSC-RL], [learning efficiency]
- πŸ“– TLDR: This paper introduces **VSC-RL**, a novel reinforcement learning framework that enhances learning efficiency for vision-language agents in complex sequential decision-making tasks. By reformulating these tasks as variational goal-conditioned problems, VSC-RL leverages advanced optimization techniques and utilizes vision-language models to autonomously decompose goals into feasible subgoals. Empirical results demonstrate that VSC-RL significantly outperforms state-of-the-art agents, particularly in challenging mobile device control tasks.

- [AppVLM: A Lightweight Vision Language Model for Online App Control](https://arxiv.org/abs/2502.06395)
- Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao
- πŸ›οΈ Institutions: Univ. of Cambridge, Huawei Noah's Ark Lab, Univ. College London
- πŸ“… Date: February 10, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [model], [framework], [evaluation], [AppVLM], [on-device control]
- πŸ“– TLDR: This paper introduces **AppVLM**, a lightweight Vision-Language Model designed for efficient on-device control of smartphone applications. AppVLM is fine-tuned on the AndroidControl dataset and further refined through interactions within the AndroidWorld environment. It achieves superior action prediction accuracy in offline evaluations and matches GPT-4o in online task completion success rates, while operating up to ten times faster, making it a practical solution for real-world deployment.

- [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://arxiv.org/abs/2501.12326)
- Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
- πŸ›οΈ Institutions: ByteDance, Tsinghua University
- πŸ“… Date: January 21, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [UI-TARS]
- πŸ“– TLDR: This paper introduces **UI-TARS**, a native GUI agent model that processes screenshots to perform human-like interactions such as keyboard and mouse operations. Unlike traditional frameworks relying on commercial models with handcrafted prompts, UI-TARS is an end-to-end model demonstrating superior performance across over 10 GUI agent benchmarks. Key innovations include enhanced perception through large-scale GUI screenshot datasets, unified action modeling across platforms, incorporation of deliberate multi-step reasoning (System-2 Reasoning), and iterative training with reflective online traces, enabling continuous learning and adaptation with minimal human intervention.

- [WebWalker: Benchmarking LLMs in Web Traversal](https://arxiv.org/abs/2501.07572)
- Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Deyu Zhou, Pengjun Xie, Fei Huang
- πŸ›οΈ Institutions: Tongyi Lab, Alibaba NLP
- πŸ“… Date: January 13, 2025
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [framework], [RAG], [WebWalker], [WebWalkerQA]
- πŸ“– TLDR: This paper presents **WebWalker**, a multi-agent framework designed to improve the ability of large language models (LLMs) to traverse websites, addressing the challenges of retrieving complex, multi-layered information. WebWalker integrates an "explore-critic" paradigm, where the explorer agent navigates the web, and the critic agent evaluates the progress. The **WebWalkerQA** benchmark is introduced to assess web traversal tasks, showing how retrieval-augmented generation (RAG) can be enhanced with vertical exploration to solve real-world queries.

- [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf)
- Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
- πŸ›οΈ Institutions: NUS, ECNU, HKBU
- πŸ“… Date: January 3, 2025
- πŸ“‘ Publisher: GitHub
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [benchmark], [GUI grounding], [high-resolution], [ScreenSpot-Pro]
- πŸ“– TLDR: ScreenSpot-Pro introduces a benchmark designed to evaluate GUI grounding models in professional, high-resolution environments. It encompasses 1,581 tasks across 23 applications in various industries, highlighting the challenges models face with complex software interfaces. Current models achieve low accuracy, underscoring the need for further research in this domain.

- [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://qiushisun.github.io/OS-Genesis-Home/)
- Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
- πŸ›οΈ Institutions: Shanghai AI Lab, HKU, Johns Hopkins University, SJTU, Oxford, HKUST
- πŸ“… Date: Dec 27, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [dataset], [trajectory data], [reward model], [e2e model], [data synthesis], [OS-Genesis]
- πŸ“– TLDR: This paper introduces *OS-Genesis*, an interaction-driven pipeline that automates the construction of high-quality and diverse GUI agent trajectory data without human supervision. By employing reverse task synthesis and a trajectory reward model, OS-Genesis enables effective end-to-end training of GUI agents, significantly enhancing their performance on online benchmarks.

- [PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World](https://arxiv.org/abs/2412.17589)
- Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu
- πŸ›οΈ Institutions: SJTU
- πŸ“… Date: Dec 23, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [framework], [PC Agent], [human cognition transfer], [PC Tracker], [Cognition Completion], [multi-agent system]
- πŸ“– TLDR: This paper introduces *PC Agent*, an AI system designed to autonomously perform complex computer-based tasks by emulating human cognitive processes. The system comprises three key components: **PC Tracker**, a lightweight infrastructure for collecting high-quality human-computer interaction data; a **Cognition Completion** pipeline that enriches raw interaction data into detailed cognitive trajectories; and a **multi-agent system** combining a planning agent for decision-making with a grounding agent for precise visual grounding. This approach represents a significant advancement toward AI systems capable of handling intricate real-world work autonomously.

- [OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use](https://github.com/OS-Agent-Survey/OS-Agent-Survey)
- Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao , Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
- πŸ›οΈ Institutions: Zhejiang University, Fudan University, OPPO AI Center, University of Chinese Academy of Sciences, Chinese Academy of Sciences, The Chinese HKU, Tsinghua University, 01.AI, The Hong Kong Polytechnic University, SJTU
- πŸ“… Date: December 20, 2024
- πŸ“‘ Publisher: Github Repo
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [survey]
- πŸ“– TLDR: This paper conducts a comprehensive survey on OS Agents, which are (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks. The survey begins by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. Methodologies for constructing OS Agents are examined, with a focus on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, current challenges and promising future research directions, including safety and privacy, personalization and self-evolution, are discussed.

- [Aria-UI: Visual Grounding for GUI Instructions](https://ariaui.github.io/)
- Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li
- πŸ›οΈ Institutions: HKU, Rhymes AI
- πŸ“… Date: Dec 20, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [grounding], [visual grounding], [Aria-UI]
- πŸ“– TLDR: This paper introduces *Aria-UI*, a large multimodal model specifically designed for GUI grounding. Aria-UI employs a pure-vision approach, avoiding reliance on auxiliary inputs like HTML or AXTree. It utilizes a scalable data pipeline to synthesize diverse and high-quality instruction samples and incorporates textual and text-image interleaved action histories for robust context-aware reasoning. Aria-UI achieves state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines.

- [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501)
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
- πŸ›οΈ Institutions: UMD, SUNY Buffalo, University of Oregon, Adobe Research, Meta AI, University of Rochester, UCSD, CMU, Dolby Labs, Intel AI Research, UNSW
- πŸ“… Date: December 18, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [survey]
- πŸ“– TLDR: This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models, detailing their benchmarks, evaluation metrics, architectures, and training methods. It introduces a unified framework outlining their perception, reasoning, planning, and acting capabilities, identifies open challenges, and discusses future research directions, serving as a resource for both practitioners and researchers in the field.

- [Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents](https://arxiv.org/abs/2412.13194)
- Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
- πŸ›οΈ Institutions: UCB, UIUC, Amazon
- πŸ“… Date: December 17, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [reinforcement learning], [skill discovery], [PAE]
- πŸ“– TLDR: This paper introduces the Proposer-Agent-Evaluator (PAE) system, enabling foundation model agents to autonomously discover and practice skills in real-world web environments. PAE comprises a context-aware task proposer, an agent policy for task execution, and a vision-language model-based success evaluator. Validated on vision-based web navigation tasks, PAE significantly enhances zero-shot generalization capabilities of vision-language model Internet agents, achieving over 30% relative improvement on unseen tasks and websites, and surpassing state-of-the-art open-source agents by more than 10%.

- [Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining](https://arxiv.org/abs/2412.10342)
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- πŸ›οΈ Institutions: Zhejiang University, NUS
- πŸ“… Date: December 13, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [Information-Sensitive Cropping], [Self-Refining Dual Learning], [visual grounding], [model]
- πŸ“– TLDR: This paper introduces *Iris*, a visual agent designed to enhance GUI automation by addressing challenges in high-resolution, complex digital environments. It employs two key innovations: **Information-Sensitive Cropping (ISC)**, which dynamically identifies and prioritizes visually dense regions using an edge detection algorithm for efficient processing, and **Self-Refining Dual Learning (SRDL)**, which enhances the agent's ability to handle complex tasks through a dual-learning loop that iteratively refines its performance without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using ten times more training data.

- [Falcon-UI: Understanding GUI Before Following User Instructions](https://arxiv.org/abs/2412.09362)
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji
- πŸ›οΈ Institutions: Chinese Academy of Sciences, Tsinghua University, Nankai University, BAAI
- πŸ“… Date: Dec 12, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [dataset], [Falcon-UI], [GUI understanding]
- πŸ“– TLDR: This paper introduces *Falcon-UI*, a GUI agent model that emphasizes understanding GUI contexts before following user instructions. The authors present the *Insight-UI Dataset*, an instruction-free GUI navigation dataset generated from the Common Crawl corpus, simulating various platforms like iOS, Android, Windows, and Linux across multiple resolutions on 312K domains. Falcon-UI is pretrained on this dataset and fine-tuned on Android and Web GUI datasets, achieving performance comparable to larger models, highlighting the importance of decoupling GUI understanding from instruction following in agent performance.

- [The BrowserGym Ecosystem for Web Agent Research](https://arxiv.org/abs/2412.05467)
- Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, LΓ©o Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han LΓΉ, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
- πŸ›οΈ Institutions: ServiceNow Research, Mila, Polytechnique MontrΓ©al, CMU, McGill University, Tel Aviv University, UniversitΓ© de MontrΓ©al, iMean AI
- πŸ“… Date: December 6, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [framework], [LLM], [automation], [BrowserGym], [AgentLab]
- πŸ“– TLDR: This paper presents *BrowserGym*, an ecosystem designed to standardize the evaluation and benchmarking of web agents, particularly those leveraging Large Language Models (LLMs). It addresses the challenges posed by fragmented benchmarks and inconsistent methodologies in web agent research. BrowserGym provides a unified, gym-like environment with clearly defined observation and action spaces, enabling reproducible comparisons across various benchmarks. Additionally, *AgentLab*, a complementary framework, supports agent creation, testing, and analysis. The paper also features a large-scale experiment comparing the performance of 6 leading LLMs, highlighting the strengths and weaknesses of different models in real-world web tasks, while emphasizing the ongoing challenges in building efficient and robust web agents.

- [Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction](https://aguvis-project.github.io/)
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- πŸ›οΈ Institutions: HKU, NTU, Salesforce
- πŸ“… Date: Dec 5, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
- πŸ“– TLDR: This paper introduces *Aguvis*, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. It leverages image-based observations and grounds natural language instructions to visual elements, employing a consistent action space to ensure cross-platform generalization. The approach integrates explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.

- [Ponder & Press: Advancing Visual GUI Agent towards General Computer Control](https://arxiv.org/abs/2412.01268)
- Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang
- πŸ›οΈ Institutions: Tsinghua University
- πŸ“… Date: December 2, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [visual grounding], [reinforcement learning], [Ponder & Press]
- πŸ“– TLDR: This paper introduces *Ponder & Press*, a divide-and-conquer framework for general computer control using only visual input. The approach combines a general-purpose multimodal large language model (MLLM) as an 'interpreter' to translate high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely identifies GUI elements for action execution. By relying solely on visual inputs, the agent offers a versatile, human-like interaction paradigm applicable across various platforms, achieving state-of-the-art performance in GUI grounding and control tasks.

- [ShowUI: One Vision-Language-Action Model for GUI Visual Agent](https://arxiv.org/abs/2411.17465)
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
- πŸ›οΈ Institutions: NUS, Microsoft
- πŸ“… Date: November 26, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [framework], [dataset], [UI-Guided Visual Token Selection], [Interleaved Vision-Language-Action Streaming], [ShowUI]
- πŸ“– TLDR: This paper introduces *ShowUI*, a vision-language-action model designed to enhance GUI automation by addressing challenges in UI visual perception and action modeling. It features innovations like UI-Guided Visual Token Selection to reduce computational costs and Interleaved Vision-Language-Action Streaming for effective management of visual-action history. Trained on a curated dataset, ShowUI achieves 75.1% accuracy in zero-shot screenshot grounding and demonstrates competitive performance across web, mobile, and online environments.

- [AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations](https://arxiv.org/abs/2411.13451)
- Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, Manuela Veloso
- πŸ›οΈ Institutions: J.P. Morgan AI Research
- πŸ“… Date: November 24, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [few-shot learning], [meta-learning], [AdaptAgent]
- πŸ“– TLDR: This paper introduces **AdaptAgent**, a framework that enables multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). The approach enhances agents' adaptability beyond large-scale pre-training and fine-tuning by leveraging in-context learning and meta-learning techniques. Experiments on benchmarks like Mind2Web and VisualWebArena show that incorporating minimal human demonstrations boosts task success rates significantly, highlighting the effectiveness of multimodal demonstrations over text-only ones and the impact of data selection strategies during meta-learning on agent generalization.

- [Improved GUI Grounding via Iterative Narrowing](https://arxiv.org/abs/2411.13591)
- Anthony Nguyen
- πŸ›οΈ Institutions: Algoma University
- πŸ“… Date: November 18, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [grounding], [visual grounding], [iterative narrowing]
- πŸ“– TLDR: This paper introduces a visual framework to enhance GUI grounding. By iteratively refining model predictions through progressively focused image crops, the proposed method improves the performance of both general and fine-tuned Vision-Language Models (VLMs) in GUI grounding tasks.

- [Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms](https://arxiv.org/abs/2411.10943)
- Minghe Gao, Wendong Bu, Bingchen Miao, Yang Wu, Yunfei Li, Juncheng Li, Siliang Tang, Qi Wu, Yueting Zhuang, Meng Wang
- πŸ›οΈ Institutions: Zhejiang University, University of Adelaide, Hefei University of Technology
- πŸ“… Date: November 17, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [survey], [Generalist Virtual Agent], [GVA], [autonomous agents], [digital platforms]
- πŸ“– TLDR: This survey introduces the concept of Generalist Virtual Agents (GVAs), autonomous entities designed to operate across various digital platforms and environments to assist users in performing diverse tasks. It traces the evolution of GVAs from early intelligent assistants to modern implementations incorporating large-scale models, discussing their philosophical foundations, development challenges, and current methodologies. The paper provides a detailed taxonomy of GVA environments, tasks, and capabilities, aiming to bridge theoretical and practical aspects and suggesting that agents operating in environments closely mirroring the real world are more likely to exhibit human-like intelligence.

- [The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use](https://arxiv.org/abs/2411.10323)
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
- πŸ›οΈ Institutions: NUS
- πŸ“… Date: Nov 15, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [framework], [Claude 3.5 Computer Use], [GUI automation], [planning], [action], [critic]
- πŸ“– TLDR: This study evaluates Claude 3.5 Computer Use, an AI model enabling end-to-end language-to-desktop actions, through curated tasks across various domains. It introduces an out-of-the-box framework for deploying API-based GUI automation models, analyzing the model's planning, action execution, and adaptability to dynamic environments.

- [WebOlympus: An Open Platform for Web Agents on Live Websites](https://aclanthology.org/2024.emnlp-demo.20/)
- Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su
- πŸ›οΈ Institutions: OSU
- πŸ“… Date: November 12, 2024
- πŸ“‘ Publisher: EMNLP 2024
- πŸ’» Env: [Web]
- πŸ”‘ Key: [safety], [Chrome extension], [WebOlympus], [SeeAct], [Annotation Tool]
- πŸ“– TLDR: This paper introduces *WebOlympus*, an open platform designed to facilitate the research and deployment of web agents on live websites. It features a user-friendly Chrome extension interface, allowing users without programming expertise to operate web agents with minimal effort. The platform incorporates a safety monitor module to prevent harmful actions through human supervision or model-based control, supporting applications such as annotation interfaces for web agent trajectories and data crawling.

- [Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents](https://arxiv.org/abs/2411.06559)
- Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
- πŸ›οΈ Institutions: OSU, Orby AI
- πŸ“… Date: November 10, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [WebDreamer], [model-based planning], [world model]
- πŸ“– TLDR: This paper investigates whether Large Language Models (LLMs) can function as world models within web environments, enabling model-based planning for web agents. Introducing **WebDreamer**, a framework that leverages LLMs to simulate potential action sequences in web environments, the study demonstrates significant performance improvements over reactive baselines on benchmarks like VisualWebArena and Mind2Web-live. The findings suggest that LLMs possess the capability to model the dynamic nature of the internet, paving the way for advancements in automated web interaction and opening new research avenues in optimizing LLMs for complex, evolving environments.

- [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890)
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang
- πŸ›οΈ Institutions: Huawei Noah’s Ark Lab
- πŸ“… Date: November 7, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [survey]
- πŸ“– TLDR: This survey consolidates recent research on GUI agents powered by foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). It discusses representative datasets and benchmarks, summarizes a unified framework capturing essential components from prior studies, and explores commercial applications. The paper identifies key challenges and proposes future research directions to inspire further developments in (M)LLM-based GUI agents.

- [WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning](https://arxiv.org/abs/2411.02337v1)
- Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong
- πŸ›οΈ Institutions: Tsinghua University, BAAI
- πŸ“… Date: November 4, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [reinforcement learning], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
- πŸ“– TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents.

- [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391)
- Yanzhe Zhang, Tao Yu, Diyi Yang
- πŸ›οΈ Institutions: Georgia Tech, HKU, Stanford
- πŸ“… Date: Nov 4, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [attack], [adversarial pop-ups], [VLM agents], [safety]
- πŸ“– TLDR: This paper demonstrates that vision-language model (VLM) agents can be easily deceived by carefully designed adversarial pop-ups, leading them to perform unintended actions such as clicking on these pop-ups instead of completing their assigned tasks. Integrating these pop-ups into environments like OSWorld and VisualWebArena resulted in an average attack success rate of 86% and a 47% decrease in task success rate. Basic defense strategies, such as instructing the agent to ignore pop-ups or adding advertisement notices, were found to be ineffective against these attacks.

- [Language Agents: Foundations, Prospects, and Risks](https://language-agent-tutorial.github.io/)
- Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
- πŸ›οΈ Institutions: OSU, Stanford, Princeton, HKU
- πŸ“… Date: November 2024
- πŸ“‘ Publisher: EMNLP 2024
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safty]
- πŸ“– TLDR: This tutorial provides a comprehensive exploration of language agentsβ€”autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.

- [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701)
- Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-TΓΌr
- πŸ›οΈ Institutions: UIUC
- πŸ“… Date: October 31, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA]
- πŸ“– TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context managementβ€”specifically interaction history and web page representationβ€”the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations.

- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024)
- Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
- πŸ›οΈ Institutions: Tsinghua University, Peking University
- πŸ“… Date: October 31, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab]
- πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.

- [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/)
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
- πŸ›οΈ Institutions: Shanghai AI Lab, SJTU, HKU, MIT
- πŸ“… Date: October 30, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [dataset], [benchmark], [OS-Atlas]
- πŸ“– TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms.

- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252)
- Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
- πŸ›οΈ Institutions: UCLA, Salesforce AI Research
- πŸ“… Date: October 30, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting]
- πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages.

- [Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents](https://arxiv.org/abs/2410.22552)
- Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee
- πŸ›οΈ Institutions: LG AI Research, Field AI, University of Michigan
- πŸ“… Date: October 29, 2024
- πŸ“‘ Publisher: EMNLP 2024 (Findings)
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [Auto-Intent]
- πŸ“– TLDR: The paper presents Auto-Intent, a method to adapt pre-trained large language models for web navigation tasks without direct fine-tuning. It discovers underlying intents from domain demonstrations and trains an intent predictor to enhance decision-making. Auto-Intent improves the performance of GPT-3.5, GPT-4, and Llama-3.1 agents on benchmarks like Mind2Web and WebArena.

- [AutoGLM: Autonomous Foundation Agents for GUIs](https://xiao9905.github.io/AutoGLM/)
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang
- πŸ›οΈ Institutions: Zhipu AI, Tsinghua University
- πŸ“… Date: October 25, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [model], [learning], [AutoGLM]
- πŸ“– TLDR: This paper introduces AutoGLM, a new series in the ChatGLM family, designed as foundation agents for autonomous control of digital devices through GUIs. It addresses the challenges foundation models face in decision-making within dynamic environments by developing agents capable of learning through autonomous interactions. Focusing on web browsers and Android devices, AutoGLM integrates various techniques to create deployable agent systems. Key insights include the importance of designing an appropriate "intermediate interface" for GUI control and a novel progressive training framework for self-evolving online curriculum reinforcement learning. Evaluations demonstrate AutoGLM's effectiveness across multiple domains, achieving notable success rates in web browsing and Android device control tasks.

- [EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data](https://doi.org/10.48550/arXiv.2410.19461)
- Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
- πŸ›οΈ Institutions: Fudan University
- πŸ“… Date: October 25, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [dataset], [framework], [synthetic data]
- πŸ“– TLDR: The *EDGE* framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction.

- [OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization](https://doi.org/10.48550/arXiv.2410.19609)
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
- πŸ›οΈ Institutions: Zhejiang University, Tencent AI Lab, Westlake University
- πŸ“… Date: October 25, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [learning], [imitation learning], [exploration], [AI feedback]
- πŸ“– TLDR: The paper presents **OpenWebVoyager**, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent’s performance is evaluated and refined through feedback loops.

- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100)
- Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
- πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft
- πŸ“… Date: October 24, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA]
- πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements.

- [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464)
- Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig
- πŸ›οΈ Institutions: CMU
- πŸ“… Date: October 24, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance]
- πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents.

- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603)
- Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu
- πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU
- πŸ“… Date: October 24, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark]
- πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark.

- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520)
- Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
- πŸ›οΈ Institutions: KAIST, UT at Austin
- πŸ“… Date: October 23, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator]
- πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.

- [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883)
- Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
- πŸ›οΈ Institutions: Huawei Noah's Ark Lab, UCL
- πŸ“… Date: October 23, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal]
- πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks.

- [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236)
- Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua
- πŸ›οΈ Institutions: HK PolyU, NTU Singapore
- πŸ“… Date: Oct 22, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [benchmark], [personalized web agent], [user behavior alignment], [memory-enhanced alignment]
- πŸ“– TLDR: This paper proposes a novel framework, *Personalized User Memory-enhanced Alignment (PUMA)*, enabling large language models to serve as personalized web agents by incorporating user-specific data and historical web interactions. The authors also introduce a benchmark, *PersonalWAB*, to evaluate these agents on various personalized web tasks. Results show that PUMA improves web agent performance by optimizing action execution based on user-specific preferences.

- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct)
- Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan
- πŸ›οΈ Institutions: CMU, Stanford
- πŸ“… Date: October 21, 2024
- πŸ“‘ Publisher: NeurIPS 2024 Workshop
- πŸ’» Env: [Web]
- πŸ”‘ Key: [dataset], [attack], [ARE], [safety]
- πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents.

- [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711)
- Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- πŸ›οΈ Institutions: Tel Aviv University
- πŸ“… Date: October 21, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [dataset], [planning and reasoning]
- πŸ“– TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains.

- [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ai-agents-2030.github.io/SPA-Bench/)
- Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
- πŸ›οΈ Institutions: Huawei Noah’s Ark Lab, Harbin Institute of Technology, Shenzhen, UCL
- πŸ“… Date: October 19, 2024
- πŸ“‘ Publisher: ICLR 2025
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [benchmark], [AI agent], [smartphone control], [framework]
- πŸ“– TLDR: SPA-Bench is introduced as a benchmark designed to evaluate multimodal large language model (MLLM)-based smartphone agents, offering a task set that spans common smartphone functionalities across system and third-party applications. It includes a plug-and-play framework for real-time agent interactions on Android, integrating over ten agents with an adaptable evaluation pipeline measuring success across diverse metrics. Through this, the benchmark exposes challenges such as UI interpretation, action grounding, and memory retention in mobile environments, advancing research in smartphone-based agent applications.

- [DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents](https://ai-agents-2030.github.io/DistRL/)
- Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
- πŸ›οΈ Institutions: Univ. of Cambridge, Huawei Noah's Ark Lab, Univ. College London
- πŸ“… Date: October 18, 2024
- πŸ“‘ Publisher: ICLR 2025
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [reinforcement learning], [distributed training], [A-RIDE], [on-device control]
- πŸ“– TLDR: This paper introduces **DistRL**, a novel framework designed to enhance the efficiency of online reinforcement learning fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in dynamic online interactions. The framework is supported by a custom RL algorithm, A-RIDE, which balances exploration with prioritized data utilization to ensure stable and robust training. Experiments demonstrate that DistRL improves training efficiency by 3Γ— and accelerates data collection by 2.4Γ— compared to leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate on general Android tasks from an open benchmark, outperforming existing approaches while maintaining the same training time.

- [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824)
- Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
- πŸ›οΈ Institutions: CMU
- πŸ“… Date: October 17, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web], [Doc]
- πŸ”‘ Key: [dataset], [model], [text-rich visual understanding], [web UI comprehension]
- πŸ“– TLDR: This paper introduces *MultiUI*, a large-scale dataset containing 7.3 million annotated samples from 1 million websites, specifically designed to enhance multimodal large language models’ (MLLMs) capabilities in text-rich visual understanding. Utilizing webpage UI structures as a training resource, MultiUI provides robust accessibility tree data paired with UI screenshots, significantly improving MLLMs’ grounding, OCR, and interaction performance. Models trained with MultiUI achieve up to a 48% performance boost on VisualWebBench and demonstrate enhanced generalization across non-web tasks, setting a new standard for structured, visually integrated web data modeling.

- [AutoWebGLM: A Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648)
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- πŸ›οΈ Institutions: THU, OSU
- πŸ“… Date: October 12, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [dataset], [benchmark], [reinforcement learning]
- πŸ“– TLDR: AutoWebGLM introduces a web navigation agent based on ChatGLM3-6B, designed to autonomously navigate and interact with webpages for complex tasks. The paper highlights a two-phase data construction approach using a hybrid human-AI methodology for diverse, curriculum-based web task training. It also presents AutoWebBench, a benchmark for evaluating agent performance in web tasks, and uses reinforcement learning to fine-tune operations, addressing complex webpage interaction and grounding.

- [Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents](https://arxiv.org/abs/2410.13886)
- Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
- πŸ›οΈ Institutions: CMU, GraySwan AI, Scale AI
- πŸ“… Date: October 11, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [attack], [BrowserART], [jailbreaking], [safety]
- πŸ“– TLDR: This paper introduces **Browser Agent Red teaming Toolkit (BrowserART)**, a comprehensive test suite for evaluating the safety of LLM-based browser agents. The study reveals that while refusal-trained LLMs decline harmful instructions in chat settings, their corresponding browser agents often comply with such instructions, indicating a significant safety gap. The authors call for collaboration among developers and policymakers to enhance agent safety.

- [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164)
- Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang
- πŸ›οΈ Institutions: Simular Research
- πŸ“… Date: October 10, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning]
- πŸ“– TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents.

- [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871)
- Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz
- πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
- πŸ“… Date: October 9, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [vision language model], [Screenspot], [OmniAct]
- πŸ“– TLDR: TinyClick is a compact, single-turn agent designed to automate GUI tasks by precisely locating screen elements via the Vision-Language Model Florence-2-Base. Trained with multi-task strategies and MLLM-based data augmentation, TinyClick achieves high accuracy on Screenspot and OmniAct, outperforming specialized GUI interaction models and general MLLMs like GPT-4V. The model's lightweight design (0.27B parameters) ensures fast processing and minimal latency, making it efficient for real-world applications on multiple platforms.

- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home)
- Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
- πŸ›οΈ Institutions: IBM Research
- πŸ“… Date: October 9, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench]
- πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents.

- [ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents](https://arxiv.org/abs/2410.11872)
- Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
- πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
- πŸ“… Date: October 9, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [model], [SeeClick], [AITW benchmark]
- πŸ“– TLDR: The paper introduces *ClickAgent*, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms.

- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/)
- Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
- πŸ›οΈ Institutions: OSU, Orby AI
- πŸ“… Date: October 7, 2024
- πŸ“‘ Publisher: ICLR 2025
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [model], [dataset], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
- πŸ“– TLDR: This paper introduces UGround, a universal visual grounding model for GUI agents that enables human-like navigation of digital interfaces. The authors advocate for GUI agents with human-like embodiment that perceive the environment entirely visually and take pixel-level actions. UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Evaluated on six benchmarks spanning grounding, offline, and online agent tasks, UGround significantly outperforms existing visual grounding models by up to 20% absolute. Agents using UGround achieve comparable or better performance than state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.

- [ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning](https://agent-e3.github.io/ExACT/)
- Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
- πŸ›οΈ Institutions: Columbia Univ., MSR
- πŸ“… Date: Oct 2, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [learning], [R-MCTS], [Exploratory Learning], [VisualWebArena]
- πŸ“– TLDR: This paper introduces ExACT, an approach that combines Reflective Monte Carlo Tree Search (R-MCTS) and Exploratory Learning to enhance AI agents' exploration and decision-making capabilities in complex web environments. R-MCTS incorporates contrastive reflection and multi-agent debate for improved search efficiency and reliable state evaluation. Evaluated on the VisualWebArena benchmark, the GPT-4o-based R-MCTS agent demonstrates significant performance improvements over previous state-of-the-art methods. Additionally, knowledge gained from test-time search is effectively transferred back to GPT-4o through fine-tuning, enabling the model to explore, evaluate, and backtrack without external search algorithms, achieving 87% of R-MCTS's performance with reduced computational resources.

- [Dynamic Planning for LLM-based Graphical User Interface Automation](https://arxiv.org/abs/2410.00467)
- Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang
- πŸ›οΈ Institutions: SJTU
- πŸ“… Date: October 1, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [dynamic planning]
- πŸ“– TLDR: This paper introduces a novel method called Dynamic Planning of Thoughts (D-PoT) aimed at enhancing LLM-based agents for GUI tasks. It addresses the challenges of task execution by dynamically adjusting planning based on environmental feedback and action history, outperforming existing methods such as ReAct by improving accuracy significantly in navigating GUI environments. The study emphasizes the importance of integrating execution history and contextual cues to optimize decision-making processes for autonomous agents.

- [MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning](https://arxiv.org/abs/2409.20566)
- Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
- πŸ›οΈ Institutions: Apple
- πŸ“… Date: September 30, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
- πŸ“– TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.

- [AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents](https://ai-secure.github.io/AdvWeb/)
- Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li
- πŸ›οΈ Institutions: UIUC, OSU
- πŸ“… Date: September 27, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [safety], [black-box attack], [adversarial prompter model], [Direct Policy Optimization]
- πŸ“– TLDR: This paper presents AdvWeb, a black-box attack framework that exploits vulnerabilities in vision-language model (VLM)-powered web agents by injecting adversarial prompts directly into web pages. Using Direct Policy Optimization (DPO), AdvWeb trains an adversarial prompter model that can mislead agents into executing harmful actions, such as unauthorized financial transactions, while maintaining high stealth and control. Extensive evaluations reveal that AdvWeb achieves high success rates across multiple real-world tasks, emphasizing the need for stronger security measures in web agent deployments.

- [Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale](https://arxiv.org/abs/2409.15637)
- Tianyue Ou, Frank F. Xu, Aman Madaan, Jiarui Liu, Robert Lo, Abishek Sridhar, Sudipta Sengupta, Dan Roth, Graham Neubig, Shuyan Zhou
- πŸ›οΈ Institutions: CMU, Amazon AWS AI
- πŸ“… Date: September 27, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [synthetic data]
- πŸ“– TLDR: Synatra introduces a scalable framework for digital agents, enabling them to convert indirect knowledge sources into actionable demonstrations. This approach enhances the ability of agents to learn tasks without extensive labeled data, leveraging insights from indirect observations to scale practical implementations in digital environments.

- [Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents](https://arxiv.org/abs/2409.17140)
- Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- πŸ›οΈ Institutions: Peking University, Microsoft
- πŸ“… Date: September 26, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [API interaction], [HACI], [Agent OS]
- πŸ“– TLDR: This paper proposes an API-centered framework called **AXIS**, enhancing the efficiency and reliability of LLM-based agents by prioritizing API interactions over UI-based actions. This approach aims to reduce the high latency and error rates of traditional UI-interaction models. AXIS not only supports the rapid creation and extension of APIs through automated application exploration but also contributes to a new **Human-Agent-Computer Interaction (HACI)** framework. The paper outlines the development of an agent-centric operating system (Agent OS), which improves task completion times by up to 70% and reduces cognitive load on users while maintaining high accuracy across complex multi-application tasks.

- [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models](https://molmo.allenai.org/blog)
- Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
- πŸ›οΈ Institutions: AI2, UW
- πŸ“… Date: September 25, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [model], [dataset], [PixMo], [Molmo], [vision language model], [foundation model]
- πŸ“– TLDR: This paper introduces *Molmo*, a family of state-of-the-art open vision-language models (VLMs), and *PixMo*, a collection of new datasets including detailed image captions, free-form image Q&A, and innovative 2D pointing data, all collected without reliance on proprietary VLMs. The authors demonstrate that careful model design, a well-tuned training pipeline, and high-quality open datasets can produce VLMs that outperform existing open models and rival proprietary systems. The model weights, datasets, and source code are made publicly available to advance research in this field.

- [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding](https://arxiv.org/abs/2409.14818)
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
- πŸ›οΈ Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
- πŸ“… Date: September 23, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [model], [dataset], [MobileVLM], [Mobile3M], [UI understanding]
- πŸ“– TLDR: This paper introduces *MobileVLM*, a vision-language model designed to enhance both intra- and inter-UI understanding for mobile applications. The authors propose two additional pre-training stages with four specific UI-based tasks to improve the model's perception of fine-grained elements and capture page transition actions. To support this, they constructed *Mobile3M*, a large-scale Chinese mobile dataset comprising 3 million UI pages and real-world transition actions, organized into directed graphs. Experimental results demonstrate that MobileVLM outperforms existing vision-language models on both in-house test sets and public mobile benchmarks.

- [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://qwenlm.github.io/blog/qwen2-vl/)
- Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
- πŸ›οΈ Institutions: Alibaba Cloud
- πŸ“… Date: September 18, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [foundation model], [MLLM], [Qwen2-VL]
- πŸ“– TLDR: Qwen2-VL introduces an advanced vision-language framework that enables dynamic resolution handling for images and videos through its Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding (M-RoPE). This structure allows the model to convert images of varying resolutions into diverse token counts for improved visual comprehension. With model sizes up to 72B parameters, Qwen2-VL demonstrates competitive performance across multiple benchmarks, achieving results on par with or better than prominent multimodal models like GPT-4o and Claude3.5-Sonnet. This work represents a significant step forward in scalable vision-language learning for multimodal tasks.

- [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://arxiv.org/abs/2409.11295)
- Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
- πŸ›οΈ Institutions: OSU, UCLA, UChicago, UIUC, UW-Madison
- πŸ“… Date: September 17, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [safety], [privacy attack], [environmental injection], [stealth attack]
- πŸ“– TLDR: This paper introduces the Environmental Injection Attack (EIA), a privacy attack targeting generalist web agents by embedding malicious yet concealed web elements to trick agents into leaking users' PII. Utilizing 177 action steps within realistic web scenarios, EIA demonstrates a high success rate in extracting specific PII and whole user requests. Through its detailed threat model and defense suggestions, the work underscores the challenge of detecting and mitigating privacy risks in autonomous web agents.

- [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https://microsoft.github.io/WindowsAgentArena/)
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
- πŸ›οΈ Institutions: Microsoft
- πŸ“… Date: September 13, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [framework], [benchmark], [Navi]
- πŸ“– TLDR: This paper introduces the *Windows Agent Arena (WAA)*, a scalable platform for testing and benchmarking multi-modal AI agents within a realistic Windows OS environment. WAA enables researchers to evaluate agentic workflows across diverse tasks and supports large-scale deployment using Azure ML. The study also presents *Navi*, a multi-modal agent achieving a 19.5% success rate on Windows tasks, highlighting the platform's potential for advancing AI agent development.

- [Agent Workflow Memory](https://arxiv.org/abs/2409.07429)
- Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
- πŸ›οΈ Institutions: CMU, MIT
- πŸ“… Date: September 11, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [memory], [AWM]
- πŸ“– TLDR: The paper proposes *Agent Workflow Memory (AWM)*, a method enabling language model-based agents to induce and utilize reusable workflows from past experiences to guide future actions in web navigation tasks. AWM operates in both offline and online settings, significantly improving performance on benchmarks like Mind2Web and WebArena, and demonstrating robust generalization across tasks, websites, and domains.

- [From Grounding to Planning: Benchmarking Bottlenecks in Web Agents](https://arxiv.org/abs/2409.01927)
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- πŸ›οΈ Institutions: IBM
- πŸ“… Date: September 3, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation]
- πŸ“– TLDR: This paper analyzes performance bottlenecks in web agents by separately evaluating grounding and planning tasks, isolating their individual impacts on navigation efficacy. Using an enhanced version of the Mind2Web dataset, the study reveals planning as a significant bottleneck, with advancements in grounding and task-specific benchmarking for elements like UI component recognition. Through experimental adjustments, the authors propose a refined evaluation framework, aiming to enhance web agents' contextual adaptability and accuracy in complex web environments.

- [TinyAgent: Function Calling at the Edge](https://github.com/SqueezeAILab/TinyAgent)
- Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
- πŸ›οΈ Institutions: UC Berkeley, ICSI
- πŸ“… Date: September 1, 2024
- πŸ“‘ Publisher: EMNLP 2024
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [framework], [dataset], [quantization], [LLMCompiler], [TinyAgent-1.1B], [TinyAgent-7B]
- πŸ“– TLDR: This paper introduces TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling at the edge. By fine-tuning small models with curated datasets and employing techniques like quantization and a novel tool retrieval method, TinyAgent enables efficient, real-time execution of user commands on local devices without relying on cloud infrastructure. The framework demonstrates that these small models can match or even surpass the function-calling capabilities of larger models like GPT-4-Turbo while operating entirely on edge devices.

- [WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration](https://arxiv.org/abs/2408.15978)
- Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp
- πŸ›οΈ Institutions: LMU Munich, Technical University of Munich, Munich Center for Machine Learning (MCML)
- πŸ“… Date: August 28, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [Monte Carlo Tree Search], [reinforcement learning], [WebPilot]
- πŸ“– TLDR: This paper introduces **WebPilot**, a multi-agent system designed to execute complex web tasks requiring dynamic interaction. By employing a dual optimization strategy grounded in Monte Carlo Tree Search (MCTS), WebPilot enhances adaptability in complex web environments. The system's Global Optimization phase generates high-level plans by decomposing tasks into manageable subtasks, while the Local Optimization phase executes each subtask using a tailored MCTS approach. Experimental results on WebArena and MiniWoB++ demonstrate WebPilot's effectiveness, achieving state-of-the-art performance with GPT-4 and marking a significant advancement in autonomous web agent capabilities. :contentReference[oaicite:0]{index=0}

- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- πŸ›οΈ Institutions: MultiOn, Stanford
- πŸ“… Date: August 13, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [MCTS], [Tree Search], [DPO], [Reinforcement Learning]
- πŸ“– TLDR: TBD

- [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327)
- Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- πŸ›οΈ Institutions: Tsinghua University, MSRA, The Ohio State University
- πŸ“… Date: August 12, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [benchmark], [dataset], [VisualAgentBench], [VAB]
- πŸ“– TLDR: The authors introduce *VisualAgentBench (VAB)*, a comprehensive benchmark designed to train and evaluate large multimodal models (LMMs) as visual foundation agents across diverse scenarios, including embodied tasks, graphical user interfaces, and visual design. VAB comprises five distinct environments that systematically challenge LMMs' understanding and interaction capabilities. Additionally, the benchmark offers supervised fine-tuning trajectory data for behavior cloning training, demonstrating the potential to improve open LMMs for serving as visual foundation agents.

- [AppAgent v2: Advanced Agent for Flexible Mobile Interactions](https://arxiv.org/abs/2408.11824)
- Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei
- πŸ›οΈ Institutions: University of Technology Sydney, Tencent, Beijing Jiaotong University, Westlake University
- πŸ“… Date: August 5, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [AppAgent v2]
- πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately.

- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
- Xinbei Ma, Zhuosheng Zhang, Hai Zhao
- πŸ›οΈ Institutions: SJTU
- πŸ“… Date: August 2024
- πŸ“‘ Publisher: ACL 2024
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [model], [framework], [benchmark]
- πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.

- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/)
- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- πŸ›οΈ Institutions: MSR, Microsoft Gen AI
- πŸ“… Date: August 1, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [dataset], [OmniParser]
- πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.

- [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544)
- Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
- πŸ›οΈ Institutions: SJTU, Meta
- πŸ“… Date: August 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [multimodal agents], [environmental distractions], [robustness]
- πŸ“– TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.

- [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation](https://arxiv.org/abs/2407.19056)
- Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
- πŸ›οΈ Institutions: UCSD, UCLA, AI2
- πŸ“… Date: July 26, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [benchmark], [multi-application], [office automation]
- πŸ“– TLDR: OfficeBench introduces a benchmark that evaluates language models' ability to automate office tasks across a range of applications like Word, Excel, and email. The benchmark tests agents’ skills in task-switching, planning, and decision-making by simulating realistic office workflows. Current models, including GPT-4, demonstrate significant gaps in task accuracy and efficiency, revealing areas for improvement in managing complex, multi-application tasks in office environments.

- [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032)
- Aditya Vempaty, [Other authors not provided in the search results]
- πŸ›οΈ Institutions: Emergence AI
- πŸ“… Date: July 17, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [framework], [autonomous web navigation], [hierarchical architecture], [DOM distillation]
- πŸ“– TLDR: This paper presents Agent-E, a novel web agent that introduces several architectural improvements over previous state-of-the-art systems. Key features include a hierarchical architecture, flexible DOM distillation and denoising methods, and a "change observation" concept for improved performance. Agent-E outperforms existing text and multi-modal web agents by 10-30% on the WebVoyager benchmark. The authors synthesize their findings into general design principles for developing agentic systems, including the use of domain-specific primitive skills, hierarchical architectures, and agentic self-improvement.

- [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://spider2-v.github.io/)
- Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
- πŸ›οΈ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce Research, Yale University, Sea AI Lab, University of Waterloo
- πŸ“… Date: July 15, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Desktop]
- πŸ”‘ Key: [benchmark], [dataset], [data science], [engineering workflows], [Spider2-V]
- πŸ“– TLDR: This paper introduces **Spider2-V**, a multimodal agent benchmark designed to evaluate the capability of agents in automating professional data science and engineering workflows. It comprises 494 real-world tasks across 20 enterprise-level applications, assessing agents' proficiency in code generation and GUI operations within authentic computer environments.

- [AUITestAgent: Automatic Requirements Oriented GUI Function Testing](https://github.com/bz-lab/AUITestAgent)
- Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, Yangfan Zhou
- πŸ›οΈ Institutions: Fudan University, Meituan
- πŸ“… Date: July 12, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [GUI testing], [AUITestAgent]
- πŸ“– TLDR: This paper presents **AUITestAgent**, the first automatic, natural language-driven GUI testing tool for mobile apps. It automates the entire process of GUI interaction and function verification by extracting GUI interactions from test requirements via dynamically organized agents and employing a multi-dimensional data extraction strategy for verification.

- [Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence](https://luffyzm3d2y.github.io/publication/IoA)
- Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
- πŸ›οΈ Institutions: Tsinghua University, Peking University, BUPT, Tencent
- πŸ“… Date: July 7, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [framework], [IoA]
- πŸ“– TLDR: The paper proposes the **Internet of Agents (IoA)**, a framework inspired by the Internet to facilitate collaboration among diverse autonomous agents. IoA introduces an agent integration protocol, dynamic teaming mechanisms, and conversation flow control, enabling flexible and scalable multi-agent collaboration. Experiments demonstrate IoA's superior performance across various tasks, highlighting its effectiveness in integrating heterogeneous agents.

- [WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks](https://arxiv.org/abs/2407.05291)
- LΓ©o Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin
- πŸ›οΈ Institutions: ServiceNow Research, Mila, Polytechnique MontrΓ©al, UniversitΓ© de MontrΓ©al
- πŸ“… Date: July 7, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Web]
- πŸ”‘ Key: [benchmark], [planning], [reasoning], [WorkArena++]
- πŸ“– TLDR: This paper introduces **WorkArena++**, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development. [oai_citation_attribution:0‑arXiv](https://arxiv.org/abs/2407.05291?utm_source=chatgpt.com)

- [MobileFlow: A Multimodal LLM For Mobile GUI Agent](https://arxiv.org/abs/2407.04346)
- Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu
- πŸ›οΈ Institutions: Ant Group
- πŸ“… Date: July 5, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [model], [framework], [MobileFlow]
- πŸ“– TLDR: This paper introduces *MobileFlow*, a multimodal large language model tailored for mobile GUI agents. With approximately 21 billion parameters and hybrid visual encoders, it supports variable image resolutions and multilingual GUIs, enhancing the model's ability to interpret image data and comprehend user instructions for GUI interaction tasks.

- [MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices](https://arxiv.org/abs/2407.03913)
- Jiayi Zhang, Chuang Zhao, Yihan Zhao, Zhaoyang Yu, Ming He, Jianping Fan
- πŸ›οΈ Institutions: HKUST, Ant Group
- πŸ“… Date: July 4, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts]
- πŸ“– TLDR: This paper introduces *MobileExperts*, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs.

- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/)
- Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
- πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
- πŸ“… Date: July 3, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [dataset], [benchmark], [AMEX]
- πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.

- [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037)
- Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
- πŸ›οΈ Institutions: Institute of Software, Chinese Academy of Sciences, Monash University, Beijing Institute of Technology, University of Chinese Academy of Sciences
- πŸ“… Date: July 3, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [framework], [VisionDroid]
- πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods.

- [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511)
- Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
- πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford
- πŸ“… Date: July 3, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB]
- πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.

- [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/)
- Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
- πŸ›οΈ Institutions: UCSC, MSR
- πŸ“… Date: June 27, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [dataset], [ToL], [screen reading], [accessibility]
- πŸ“– TLDR: The authors propose the Tree-of-Lens (ToL) agent to address the Screen Point-and-Read (ScreenPR) task, which involves generating natural language descriptions of screen regions based on user-indicated points. The ToL agent constructs a Hierarchical Layout Tree to comprehend the content and articulate the layout and spatial relationships between elements. The authors also introduce the ScreenPR benchmark, consisting of 650 screenshots from web, mobile, and operating system GUIs, manually annotated with 1,500 target points and regions.

- [Octo-planner: On-device Language Model for Planner-Action Agents](https://nexa.ai/octo-planner)
- Nexa AI Team
- πŸ›οΈ Institutions: Nexa AI
- πŸ“… Date: June 26, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Misc]
- πŸ”‘ Key: [model], [framework], [Octo-planner], [on-device], [planning]
- πŸ“– TLDR: This paper presents Octo-planner, an on-device planning model designed for the Planner-Action Agents Framework. Octo-planner utilizes a fine-tuned model based on Phi-3 Mini (3.8 billion parameters) for high efficiency and low power consumption. It separates planning and action execution into two distinct components: a planner agent optimized for edge devices and an action agent using the Octopus model for function execution. The model achieves a planning success rate of 98.1% on benchmark datasets, providing reliable and effective performance.

- [Identifying User Goals from UI Trajectories](https://arxiv.org/abs/2406.14314)
- Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
- πŸ›οΈ Institutions: Google Research, Bar-Ilan University
- πŸ“… Date: June 20, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [evaluation metric], [intent identification]
- πŸ“– TLDR: This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. It proposes a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. Experiments utilizing the Android-In-The-Wild and Mind2Web datasets reveal that state-of-the-art models, such as GPT-4 and Gemini-1.5 Pro, underperform compared to humans, indicating significant room for improvement.

- [E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion](https://arxiv.org/abs/2406.14250)
- Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
- πŸ›οΈ Institutions: Ant Group, Tsinghua University
- πŸ“… Date: June 20, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [Mobile]
- πŸ”‘ Key: [dataset], [benchmark], [E-ANT]
- πŸ“– TLDR: This paper introduces **E-ANT**, the first large-scale Chinese GUI navigation dataset comprising over 40,000 real human interaction traces across more than 5,000 tiny apps. The dataset includes high-quality screenshots with annotations, facilitating the evaluation and development of GUI navigation and decision-making capabilities in multimodal large language models (MLLMs). The authors also assess various MLLMs on E-ANT, providing insights into their performance and potential improvements.

- [VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought](https://ical-learning.github.io/)
- Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
- πŸ›οΈ Institutions: CMU, Google DeepMind
- πŸ“… Date: June 20, 2024
- πŸ“‘ Publisher: NeurIPS 2024
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [framework], [memory], [in-context learning], [ICAL]
- πŸ“– TLDR: This paper introduces *In-Context Abstraction Learning (ICAL)*, a method enabling Vision-Language Models (VLMs) to generate their own examples from sub-optimal demonstrations and human feedback. By abstracting trajectories into generalized programs of thought, ICAL enhances decision-making in retrieval-augmented LLM and VLM agents, reducing reliance on manual prompt engineering and improving performance across various tasks.

- [VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning](https://arxiv.org/abs/2406.14056)
- Ziyang Meng, Yu Dai, Zezheng Gong, Shaoxiong Guo, Minglong Tang, Tongquan Wei
- πŸ›οΈ Institutions: SJTU
- πŸ“… Date: June 20, 2024
- πŸ“‘ Publisher: arXiv
- πŸ’» Env: [GUI]
- πŸ”‘ Key: [model], [dataset], [framework], [VGA], [hallucinat