acu
A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
https://github.com/trycua/acu
Projects
Frameworks & Models
- Upsonic
- Multion
- AutoGen
- Auto-GPT
- Browser Use
- Surfkit
- WebMarker
- Runner H
- Claude Computer Use Demo
- Claude Minecraft Use
- Computer Use OOTB
- Cybergod
- Grunty
- Inferable
- LaVague
- Mac Computer Use
- NatBot
- OpenAdapt
- OpenInterface
- OpenInterpreter
- Open Source Computer Use by E2B
- Self-Operating Computer
- Skyvern
- Anthropic Claude Computer Use
UI Grounding
Environment & Sandbox
Automation
Papers
Dataset
- UiPad: UI Parsing and Accessibility Dataset
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
- Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following
- Code
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation
- Code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks
- Mind2Web: Towards a Generalist Agent for the Web
- Code
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- WebShop: Towards Scalable Real-World Web Interaction
- Code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
- Code
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
- Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
UI Grounding
- OmniParser for Pure Vision Based GUI Agent
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- Code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Code
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents
- Code
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding
- Grounding Multimodal Large Language Model in GUI World
Frameworks & Models
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Code
- UFO: A UI-Focused Agent for Windows OS Interaction
- Code
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
- Intention-in-Interaction (IN3): Tell Me More!
- Dual-View Visual Contextualization for Web Navigation
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Code
- CogAgent: A Visual Language Model for GUI Agents
- Code
- AppAgent: Multimodal Agents as Smartphone Users
- LASER: LLM Agent with State-Space Exploration for Web Navigation
- Code
- AndroidEnv: A Reinforcement Learning Platform for Android
- Code
- Reinforcement Learning for Long-Horizon Interactive LLM Agents
- Large Action Models: From Inception to Implementation
- Website
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models
- Code
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- Apple Intelligence Foundation Language Models
- Tree Search for Language Model Agents
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- Code
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL
- Magentic-One
- Agent Workflow Memory
- Code
- Simulate Before Act: Model-Based Planning for Web Agents
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- Code
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents
- The Impact of Element Ordering on LM Agent Performance
- Code
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation
- SpiritSight Agent: Advanced GUI Agent with One Look
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Code
- Octopus Series: On-device Language Models for Computer Control
- Website
- Code
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent
- Code
- Cradle: Empowering Foundation Agents towards General Computer Control
- Code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Code
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model
- Code
Benchmark
- Code
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
- Code
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- Code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- Code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- Code
- Website
- Code
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Code
- A3: Android Agent Arena for Mobile GUI Agents
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Code
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- Code
- Website
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
- Code
- Website
Safety
Surveys
Contributing
Frameworks & Models
Articles