https://github.com/ziuus/ui-agent
Vision-assisted desktop automation CLI for computer-use experiments.
https://github.com/ziuus/ui-agent
collaborate github-copilot
Last synced: 7 days ago
JSON representation
Vision-assisted desktop automation CLI for computer-use experiments.
- Host: GitHub
- URL: https://github.com/ziuus/ui-agent
- Owner: ziuus
- License: mit
- Created: 2026-04-20T07:34:17.000Z (about 2 months ago)
- Default Branch: master
- Last Pushed: 2026-05-21T06:47:18.000Z (25 days ago)
- Last Synced: 2026-05-21T13:36:49.695Z (25 days ago)
- Topics: collaborate, github-copilot
- Language: Python
- Size: 62.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# 🖱️ UI Agent
> **Autonomous Desktop Automation — A vision-powered agentic CLI for intelligent "Computer Use" and UI orchestration.**
UI Agent is a production-ready, high-fidelity automation tool that bridges the gap between natural language intent and native desktop interaction. By leveraging **Gemini-1.5 Flash Vision**, it autonomously navigates, clicks, and types within any local application, providing a zero-friction interface for complex multi-app workflows and system management.
## ⚡ Core Features
- **Vision-Powered Detection**: Describe any UI element in natural language ("the blue submit button in the top right") and let the AI resolve exact coordinates via real-time screen analysis.
- **Agentic Orchestration**: High-fidelity execution of mouse and keyboard events with configurable kinetic delays and safety failsafes.
- **Multi-OS Intelligence**: Native support for Linux, Windows, and macOS with resolution-aware coordinate mapping.
- **Professional CLI**: Polished terminal interface powered by `typer` and `rich`, complete with status indicators, spinners, and formatted tables.
- **Safety First**: Integrated fail-safe protocols—instantly abort any action by moving the mouse to a screen corner.
## đź› Tech Stack
- **Engine**: Python 3.10+
- **Intelligence**: Google Gemini-1.5 Flash (Vision API)
- **Automation**: PyAutoGUI + MSS (Screen Capture)
- **CLI**: Typer + Rich
- **Packaging**: Modern PEP 517/518 standards
## 🚀 Getting Started
1. **Global Installation**:
```bash
pip install .
```
2. **Configure Environment**:
```bash
export GEMINI_API_KEY="your_key_here"
```
3. **Execute Command**:
```bash
ui-agent click "Open the web browser"
```
## đź“‚ Project Structure
- `ui_agent/cli.py`: Primary command interface and logic.
- `ui_agent/vision.py`: Gemini Vision integration and element detection.
- `ui_agent/automator.py`: Kinetic mouse and keyboard control.
---
*UI Agent: Connecting Human Intent to Machine Execution.*