https://github.com/moses000/aidesktoppilot
AI-driven desktop automation
- Host: GitHub
- URL: https://github.com/moses000/aidesktoppilot
- Owner: moses000
- Created: 2025-05-28T09:03:27.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-28T22:26:28.000Z (about 1 year ago)
- Last Synced: 2025-06-25T18:45:50.379Z (about 1 year ago)
- Topics: ai, automation, computer-vision, desktop-automation, opencv, pytesseract, python, rpa, screenshots, tkinter
- Language: Python
- Homepage:
- Size: 38.4 MB
- Stars: 0
- Watchers: 0
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AI Desktop Mentor
AI Desktop Mentor is a Python-based desktop automation tool designed to emulate human-like interactions with a computer system. It uses **YOLOv8** for UI element detection, **Vosk** for offline speech recognition, and **DistilBERT** for natural language processing (NLP) to perform tasks such as opening applications, navigating websites, logging in, and processing screenshots. With a **Tkinter GUI**, it supports voice commands, task scripting, and automated workflows, which makes it suited to business automation and personal productivity.
---
## Features
- **Automation**: Open apps (e.g., Chrome, Notepad), type text, navigate URLs, and log in to websites.
- **Screenshots**: Capture manually (`Ctrl+Shift+S`) or auto (every 15 minutes).
- **AI Navigation**: YOLOv8 detects UI elements (e.g., login fields); OCR reads screen text.
- **Task Scripting**: Execute sequences defined in `tasks.json`.
- **Voice Control**: Use offline voice commands via Vosk.
- **NLP Understanding**: Parse natural language with DistilBERT.
- **Context Awareness**: Detect CAPTCHAs/pop-ups with OCR.
- **Cross-Platform**: Works on Windows, macOS, and Linux.
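The context-awareness feature can be pictured as a keyword check over OCR output. The sketch below is an assumption about how such a check might look, not the tool's actual code: `POPUP_KEYWORDS` and `detect_interruptions` are hypothetical names, the phrases are illustrative, and in practice the input text would come from an OCR call such as `pytesseract.image_to_string`.

```python
import re
from typing import Set

# Hypothetical keyword groups; the real tool may look for different phrases.
POPUP_KEYWORDS = {
    "captcha": ("captcha", "i'm not a robot", "verify you are human"),
    "popup": ("accept cookies", "allow notifications", "subscribe"),
}


def detect_interruptions(ocr_text: str) -> Set[str]:
    """Return the categories whose keywords appear in the OCR text."""
    # OCR output is noisy: collapse whitespace and ignore case before matching.
    normalized = re.sub(r"\s+", " ", ocr_text).lower()
    return {
        category
        for category, phrases in POPUP_KEYWORDS.items()
        if any(phrase in normalized for phrase in phrases)
    }
```

A non-empty result could be used to pause a task sequence until the pop-up is handled.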
---
## Directory Structure
```
AIDesktopMentor/
├── automation/
│   └── automation_tool.py
├── config/
│   └── tasks.json
├── docs/
│   ├── README.md
│   └── requirements.txt
├── models/
│   ├── yolo_ui_model.pt
│   └── vosk-model-small-en-us/
├── outputs/
│   └── screenshots/
└── dataset/
    ├── images/
    │   ├── train/
    │   └── val/
    ├── labels/
    │   ├── train/
    │   └── val/
    └── data.yaml
```
## Technical Workflow
```mermaid
graph TD
A[User Input] --> B[GUI Tkinter]
A --> C[Voice Listener Vosk]
C --> D[NLP Parser DistilBERT]
D --> E[Command Processor]
B --> E
E --> F[Automation Engine PyAutoGUI]
E --> G[UI Detection YOLOv8]
E --> H[Screenshot Module]
E --> I[OCR Pytesseract]
E --> J[Check Popups]
F --> K[OS Interaction]
H --> L[Save to screenshots/]
I --> M[Context Feedback]
K --> N[Screen Output]
M --> N
```
## Business Workflow
```mermaid
graph TD
A[Business User] --> B[Define Task]
B -->|Manual| C[GUI Interaction]
B -->|Automated| D[Configure tasks.json]
B -->|Voice| E[Voice Command]
C --> F[Execute Task]
D --> F
E --> F
F -->|Open App| G[Access System]
F -->|Login| H[Authenticate]
F -->|Navigate| I[Access Resource]
F -->|Screenshot| J[Generate Report]
H -->|YOLO Detection| I
I --> K[Perform Business Function]
J --> L[Save Output]
K --> M[Business Outcome]
L --> M
```
## Prerequisites
* Python **3.8+**
* Tesseract OCR
  * Windows: [Install](https://github.com/tesseract-ocr/tesseract/wiki)
  * macOS: `brew install tesseract`
  * Linux: `sudo apt-get install tesseract-ocr`
* Vosk Model
  * [Download](https://alphacephei.com/vosk/models) `vosk-model-small-en-us`
  * Extract into `models/vosk-model-small-en-us/`
* YOLO Model
  * Use `yolov8n.pt` or custom-trained model saved as `yolo_ui_model.pt`
* Python dependencies
  ```bash
  pip install -r requirements.txt
  ```
* Microphone access, plus screen-recording and input-control permissions on macOS/Linux.
---
## Installation
```bash
# Clone the repo
git clone https://github.com/moses000/AIDesktopMentor.git
cd AIDesktopMentor
# Set up folder structure
mkdir -p outputs/screenshots dataset/images/train dataset/images/val dataset/labels/train dataset/labels/val
# Install dependencies
pip install -r requirements.txt
```
> **Don't forget to install Tesseract OCR, the Vosk model, and a YOLO model.**
---
## YOLO Setup
### Option 1: Pre-trained
```python
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
```
```bash
mv yolov8n.pt models/yolo_ui_model.pt
```
> Accuracy for UI tasks may be limited.
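
Once a model produces detections, each box has to be turned into a point to click. The sketch below shows one way to do that on plain tuples so it runs without a model. The `Detection` layout and `best_click_targets` are hypothetical; with Ultralytics, such tuples could presumably be built from `results[0].boxes` (`xyxy`, `conf`, `cls`) and `model.names`.

```python
from typing import Dict, Iterable, Tuple

# (class_name, confidence, x1, y1, x2, y2) in screen pixels; a hypothetical
# flattened form of what an object detector returns.
Detection = Tuple[str, float, float, float, float, float]


def best_click_targets(
    detections: Iterable[Detection], min_conf: float = 0.5
) -> Dict[str, Tuple[int, int]]:
    """Map each class name to the center of its highest-confidence box."""
    best: Dict[str, Detection] = {}
    for det in detections:
        name, conf = det[0], det[1]
        if conf < min_conf:
            continue  # drop weak detections instead of clicking on them
        if name not in best or conf > best[name][1]:
            best[name] = det
    return {
        name: (round((x1 + x2) / 2), round((y1 + y2) / 2))
        for name, (_, _, x1, y1, x2, y2) in best.items()
    }
```

The resulting coordinates could then be passed to `pyautogui.click(x, y)`. The `min_conf` default of 0.5 is an arbitrary starting point, not a tuned value.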
---
### Option 2: Train Your Own
1. **Capture screenshots**
```python
import time
import pyautogui
# Capture 100 screenshots of the login screen, one every 2 seconds
for i in range(100):
    pyautogui.screenshot(f"dataset/images/train/login_{i}.png")
    time.sleep(2)
```
2. **Label with LabelImg**
```bash
pip install labelImg
labelImg dataset/images/train dataset/labels/train
```
3. **Create `data.yaml`**
```yaml
train: dataset/images/train/
val: dataset/images/val/
nc: 3
names: ['username_field', 'password_field', 'login_button']
```
4. **Train**
```python
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
model.train(data="dataset/data.yaml", epochs=50, imgsz=640, batch=16)
```
5. **Save model**
```bash
# Ultralytics saves detection runs under runs/detect/train (train2, train3, ... on later runs)
cp runs/detect/train/weights/best.pt models/yolo_ui_model.pt
```
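The capture step writes every screenshot to `dataset/images/train/`, while `data.yaml` also expects images under `dataset/images/val/`. One way to fill the validation folders before training is to move a random share of the labeled images. This is a minimal sketch, assuming YOLO-format `.txt` labels that share the image's file stem; `split_validation` and its 20% default are illustrative choices, not part of the project.

```python
import random
import shutil
from pathlib import Path
from typing import List


def split_validation(dataset_dir: str, val_fraction: float = 0.2, seed: int = 0) -> List[str]:
    """Move a random share of labeled training images (and labels) to val/."""
    root = Path(dataset_dir)
    img_train, img_val = root / "images" / "train", root / "images" / "val"
    lbl_train, lbl_val = root / "labels" / "train", root / "labels" / "val"
    img_val.mkdir(parents=True, exist_ok=True)
    lbl_val.mkdir(parents=True, exist_ok=True)

    # Only consider images that already have a matching label file.
    labeled = sorted(
        p for p in img_train.glob("*.png") if (lbl_train / f"{p.stem}.txt").exists()
    )
    random.Random(seed).shuffle(labeled)  # fixed seed keeps the split reproducible
    n_val = int(len(labeled) * val_fraction)

    moved = []
    for image in labeled[:n_val]:
        label = lbl_train / f"{image.stem}.txt"
        shutil.move(str(image), str(img_val / image.name))
        shutil.move(str(label), str(lbl_val / label.name))
        moved.append(image.name)
    return moved
```

Calling `split_validation("dataset")` from the project root after labeling and before training would move roughly one in five labeled images into the validation folders.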
---
## Usage
```bash
python automation/automation_tool.py
```
### GUI Tasks:
* Open Notepad & type
* Execute `tasks.json`
* Take screenshot
* OCR read screen
* Login via GUI
* Enable voice commands
### Voice Commands:
* "open Chrome"
* "go to example.com"
* "log in to example.com"
* "type hello world"
* "take screenshot"
* "read text"
* "execute tasks"
* "stop listening"
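
The phrases above follow a handful of fixed patterns, so the step from recognized speech to a structured command can be illustrated with simple rules. This sketch is a rule-based stand-in, not the DistilBERT parser the tool describes. `parse_command` is a hypothetical helper, and the action names `type`, `ocr`, `execute`, and `stop` are assumptions (only `open`, `navigate`, `login`, and `screenshot` appear in the sample `tasks.json`).

```python
import re
from typing import Dict, Optional

# Ordered (pattern, action, argument key) rules; a hypothetical stand-in
# for the NLP parser, covering only the phrases listed in this README.
_RULES = [
    (re.compile(r"^open (?P<arg>.+)$"), "open", "app"),
    (re.compile(r"^go to (?P<arg>.+)$"), "navigate", "url"),
    (re.compile(r"^log ?in to (?P<arg>.+)$"), "login", "url"),
    (re.compile(r"^type (?P<arg>.+)$"), "type", "text"),
    (re.compile(r"^take (a )?screenshot$"), "screenshot", None),
    (re.compile(r"^read text$"), "ocr", None),
    (re.compile(r"^execute tasks$"), "execute", None),
    (re.compile(r"^stop listening$"), "stop", None),
]


def parse_command(utterance: str) -> Optional[Dict[str, str]]:
    """Turn a recognized phrase into an action dict, or None if nothing matches."""
    # Lowercase and collapse whitespace; note that this also lowercases typed text.
    text = " ".join(utterance.lower().split())
    for pattern, action, key in _RULES:
        match = pattern.match(text)
        if match:
            command = {"action": action}
            if key:
                command[key] = match.group("arg")
            return command
    return None
```

Returning `None` for unmatched input lets the caller ignore background speech rather than guess at an action.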
---
## Sample `tasks.json`
```json
[
{
"action": "open",
"app": "chrome"
},
{
"action": "navigate",
"url": "https://example.com"
},
{
"action": "login",
"url": "https://example.com"
},
{
"action": "screenshot",
"prefix": "login_task"
}
]
```
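A file like this can be read as a list of steps, each routed to a handler by its `action` field. The sketch below shows that dispatch pattern under stated assumptions: `load_tasks`, `run_tasks`, and the handler registry are hypothetical names, and the project's own executor may be structured differently.

```python
import json
from typing import Any, Callable, Dict, List

Task = Dict[str, Any]
Handler = Callable[[Task], None]


def load_tasks(path: str) -> List[Task]:
    """Read a task list from a JSON file such as config/tasks.json."""
    with open(path, encoding="utf-8") as fh:
        tasks = json.load(fh)
    if not isinstance(tasks, list):
        raise ValueError("tasks file must contain a JSON array")
    return tasks


def run_tasks(tasks: List[Task], handlers: Dict[str, Handler]) -> List[str]:
    """Run every task through the handler registered for its action."""
    # Validate first so an unknown action does not leave a sequence half done.
    for index, task in enumerate(tasks):
        action = task.get("action")
        if action not in handlers:
            raise ValueError(f"task {index}: unknown action {action!r}")
    executed = []
    for task in tasks:
        handlers[task["action"]](task)
        executed.append(task["action"])
    return executed
```

In use, `handlers` would map `"open"`, `"navigate"`, `"login"`, and `"screenshot"` to functions that drive the desktop. Passing them in as a dictionary also makes the runner easy to exercise with stub handlers.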
---
## Notes
* **Permissions**: macOS/Linux may need screen/microphone/input access.
* **YOLO**: Required for login automation.
* **Vosk**: Ensure correct folder structure in `models/`.
* **Performance Tip**: Keep automation interval ≥ 5s to avoid resource strain.
---
## Deployment
```bash
pip install pyinstaller
pyinstaller --onefile automation/automation_tool.py
```
---
## Troubleshooting
* **YOLO Errors**: Check `yolo_ui_model.pt` & class IDs
* **Vosk Errors**: Confirm model directory/mic permissions
* **GUI Not Working**: Verify Python/Tkinter setup
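
Several of these checks can be scripted and run before launching the tool. The sketch below is one possible preflight check, assuming the model paths given earlier in this README and a `tesseract` executable on `PATH`. `preflight` is a hypothetical helper and not part of the project; on Windows, Tesseract may be installed without being on `PATH`.

```python
import shutil
from pathlib import Path
from typing import List


def preflight(project_root: str = ".") -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    root = Path(project_root)
    problems = []
    if not (root / "models" / "yolo_ui_model.pt").is_file():
        problems.append("missing models/yolo_ui_model.pt")
    if not (root / "models" / "vosk-model-small-en-us").is_dir():
        problems.append("missing models/vosk-model-small-en-us/")
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    try:
        import tkinter  # noqa: F401  (import only; no window is created)
    except ImportError:
        problems.append("tkinter is not importable")
    return problems
```

Printing the returned list from the project root gives a quick summary of what still needs installing.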
---
## Future Improvements
* Expand YOLO UI detection classes
* Add reinforcement learning for adaptive workflows
* CAPTCHA solvers
* Larger NLP models (e.g., BERT)
* GUI task builder for `tasks.json`
---
## License
MIT License
---
## Contributing
Open issues or submit pull requests on GitHub.
---
## Contact
For support, [create an issue](https://github.com/moses000/AIDesktopMentor/issues) or email [im.imoleayomoses@gmail.com](mailto:im.imoleayomoses@gmail.com).
---
## Acknowledgements
* [Ultralytics](https://github.com/ultralytics/ultralytics) for YOLOv8
* [Vosk](https://alphacephei.com/vosk/) for speech recognition
* [Hugging Face](https://huggingface.co/) for transformers