awesome-ui-agents
A curated list of awesome UI agents resources, encompassing Web, App, OS, and beyond (continually updated)
https://github.com/opendilab/awesome-ui-agents
Papers
Models
- VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning
- AppVLM: A Lightweight Vision Language Model for Online App Control
- DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
- code
- Apple Intelligence Foundation Language Models
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- code
- Intention-in-Interaction (IN3): Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
- code
- code
- code
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- ScreenAgent: A Vision Language Model-driven Computer Control Agent
- code
- AppAgent: Multimodal Agents as Smartphone Users
- code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- code
- UFO: A UI-Focused Agent for Windows OS Interaction
- code
- Octopus v2: On-device language model for super agent
- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
- code
- CogAgent: A Visual Language Model for GUI Agents
- code
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback
- code
- You Only Look at Screens: Multimodal Chain-of-Action Agents
- code
- Augmenting Autotelic Agents with Large Language Models
- Language Models can Solve Computer Tasks
- code
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent
- Dual-View Visual Contextualization for Web Navigation
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- Tree Search for Language Model Agents
- SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- LASER: LLM Agent with State-Space Exploration for Web Navigation
- LATS: Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
- OpenAgents: An Open Platform for Language Agents in the Wild
- code
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- code
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
- code
- On the Effects of Data Scale on UI Control Agents
- Cradle: Empowering Foundation Agents Towards General Computer Control
- Lightweight Neural App Control
- MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
- code
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- code
- OpenAI Operator
- OpenAI Computer-Using Agent
- Claude computer use
- Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
- Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback
Tools
- Opera Browser Operator: AI-based Agentic Browsing
- code
- code
- AndroidEnv: A Reinforcement Learning Platform for Android
- LEGENT: Open Platform for Embodied Agents
- code
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
- code
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- code
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
- code
- code
- OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent
- code
- Make Websites Accessible for Agents
- code
- ToolGen: Unified Tool Retrieval and Calling via Generation
- code
Datasets
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
- code
- code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks
- code
- code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- code
- VillagerBench/VillagerAgent: A Graph-Based Multi-Agent Framework for Coordinating Complex Task Dependencies in Minecraft
- code
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions
- code
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following for Conversational Web Agents
- code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- code
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- code
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- code
- Mind2Web: Towards a Generalist Agent for the Web
- code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications
- WebCanvas: Benchmarking Web Agents in Online Environments
- code
- code
Keywords
- llm: 17
- agent: 12
- nlp: 7
- agents: 6
- large-language-models: 6
- ai: 5
- android: 4
- automation: 4
- copilot: 4
- gui: 4
- generative-ai: 4
- multimodal: 4
- decision-making: 4
- vlm: 4
- minecraft: 3
- language-model: 3
- llms: 3
- mllm: 2
- grounding: 2
- ios: 2
- lmm: 2
- multimodality: 2
- personoid: 2
- vision-language-model: 2
- cross-modality: 2
- harmony: 2
- multi-modal: 2
- pretrained-models: 2
- visual-language-models: 2
- prompting: 2
- reasoning: 2
- reinforcement-learning: 2
- gpt4v: 2
- app: 2
- multimodal-large-language-models: 2
- multimodal-agent: 2
- information-ui: 2
- infoui: 2
- interaction-platform: 2
- mobile-agents: 2
- rl-environments: 2
- rl-platform: 2
- windows: 2
- ai-agent: 2
- ai-agents-framework: 2
- computer-control: 2
- mobile: 2
- cradle: 2
- foundation-agent: 2
- gcc: 2