{"id":26914829,"url":"https://github.com/bytebot-ai/bytebot","last_synced_at":"2025-04-01T17:40:24.684Z","repository":{"id":280740141,"uuid":"926709003","full_name":"bytebot-ai/bytebot","owner":"bytebot-ai","description":"A containerized framework for computer use agents with a virtual desktop environment.","archived":false,"fork":false,"pushed_at":"2025-03-26T22:37:56.000Z","size":9225,"stargazers_count":45,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T23:29:24.687Z","etag":null,"topics":["ai-agents","anthropic","computer-use","docker","llm","openai","qemu"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bytebot-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-03T18:18:50.000Z","updated_at":"2025-03-26T03:10:45.000Z","dependencies_parsed_at":"2025-03-26T23:35:41.062Z","dependency_job_id":null,"html_url":"https://github.com/bytebot-ai/bytebot","commit_stats":null,"previous_names":["bytebot-ai/bytebot"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytebot-ai%2Fbytebot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytebot-ai%2Fbytebot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytebot-ai%2Fbytebot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytebot-ai%2Fbytebot/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/bytebot-ai","download_url":"https://codeload.github.com/bytebot-ai/bytebot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246681683,"owners_count":20816931,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","anthropic","computer-use","docker","llm","openai","qemu"],"created_at":"2025-04-01T17:40:16.716Z","updated_at":"2025-04-01T17:40:24.673Z","avatar_url":"https://github.com/bytebot-ai.png","language":"TypeScript","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"bytebot-logo.png\" width=\"300\" alt=\"Bytebot Logo\"\u003e\n\u003c/p\u003e\n\n# Bytebot\n\nA containerized framework for computer use agents with a virtual desktop environment.\n\n## Overview\n\nBytebot provides a complete, self-contained environment for developing and deploying computer use agents. 
It encapsulates a lightweight Linux desktop environment with pre-installed tools inside a Docker container, making it easy to deploy across different platforms.\n\n## Features\n\n- **Containerized Desktop Environment**: Runs a lightweight Lubuntu 22.04 virtual machine with QEMU\n- **VNC Access**: View and interact with the desktop through VNC or browser-based noVNC\n- **Agent API**: Control the desktop environment programmatically through a NestJS-based hypervisor\n- **Pre-installed Tools**: Comes with Chrome and other essential tools pre-installed\n- **Cross-Platform**: Works on any system that supports Docker\n\n## Computer Use Models and Agent Development\n\nBytebot provides the infrastructure for computer use agents, but the intelligence driving these agents can come from various sources. Developers have complete flexibility in how they build and deploy their agents.\n\n![Bytebot Architecture Diagram](bytebot-diagram.png)\n\n## Desktop Environment\n\n### Default Desktop Image\n\nBytebot comes with a default Lubuntu 22.04 desktop image that includes:\n\n- Pre-installed Google Chrome browser\n- Default user account: `agent` with password `password`\n- Lightweight LXDE desktop environment\n- Basic utilities and tools\n\n\u003e **⚠️ Security Warning**: The default desktop image is intended for development and testing purposes only. It uses a known username and password combination and should **not** be used in production environments.\n\n### Creating Custom Desktop Images\n\nDevelopers are encouraged to create their own custom QEMU-compatible desktop images for production use. You can:\n\n1. Build a custom QEMU disk image with your preferred:\n\n   - Operating system (any Linux distribution, Windows, etc.)\n   - Pre-installed software and tools\n   - User accounts with secure credentials\n   - System configurations and optimizations\n\n2. 
Replace the default image by:\n   - Hosting your custom image on your preferred storage (S3, GCS, etc.)\n   - Modifying the Dockerfile to download your image instead of the default one\n   - Or mounting your local image when running the container\n\n#### Example: Using a Custom Image\n\n```bash\n# Modify the Dockerfile to use your custom image\n# In docker/Dockerfile, change:\nRUN wget https://your-storage-location.com/your-custom-image.qcow2 -P /opt/ \u0026\u0026 \\\n    chmod 777 /opt/your-custom-image.qcow2\n\n# Or mount your local image when running the container\ndocker run -d --privileged \\\n  -p 3000:3000 \\\n  -p 5900:5900 \\\n  -p 6080:6080 \\\n  -p 6081:6081 \\\n  -v /path/to/your/custom-image.qcow2:/opt/bytebot-lubuntu-22.04.5.qcow2 \\\n  bytebot:latest\n```\n\n#### QEMU Image Compatibility\n\nYour custom images must be:\n\n- In QCOW2 format for optimal performance\n- Compatible with QEMU/KVM virtualization\n- Configured with appropriate drivers for virtual hardware\n- Sized appropriately for your use case (recommended minimum: 10GB)\n\n## Quick Start\n\n### Prerequisites\n\n- Docker installed on your system\n\n### Building the Image\n\n```bash\n./build.sh\n```\n\nOr with custom options:\n\n```bash\n./build.sh --tag custom-tag --no-cache\n```\n\n### Running the Container\n\n```bash\ndocker run -d --privileged \\\n  -p 3000:3000 \\\n  -p 5900:5900 \\\n  -p 6080:6080 \\\n  -p 6081:6081 \\\n  bytebot:latest\n```\n\n### Accessing the Desktop\n\n- **VNC Client**: Connect to `localhost:5900`\n- **Web Browser**: Navigate to `http://localhost:3000/vnc`\n\n### Using the Agent API\n\nThe hypervisor exposes a REST API on port 3000 that allows you to programmatically control the desktop environment.\n\n## Computer Use API\n\nBytebot provides a unified computer action API that allows granular control over all aspects of the virtual desktop environment through a single endpoint, `http://localhost:3000/computer-use`.\n\n### Unified Endpoint\n\n| Endpoint        | 
Method | Description                                    |\n| --------------- | ------ | ---------------------------------------------- |\n| `/computer-use` | POST   | Unified endpoint for all computer interactions |\n\n### Available Actions\n\nThe unified API supports the following actions:\n\n| Action                | Description                                        | Parameters                                                                                                                     |\n| --------------------- | -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |\n| `move_mouse`          | Move the mouse cursor to a specific position       | `coordinates: { x: number, y: number }`                                                                                        |\n| `trace_mouse`         | Moves the mouse along a specified path             | `path: { x: number, y: number }[]`, `holdKeys?: string[]`                                                                      |\n| `click_mouse`         | Perform a mouse click                              | `coordinates?: { x: number, y: number }`, `button: 'left' \\| 'right' \\| 'middle'`, `numClicks?: number`, `holdKeys?: string[]` |\n| `press_mouse`         | Press or release a mouse button                    | `coordinates?: { x: number, y: number }`, `button: 'left' \\| 'right' \\| 'middle'`, `press: 'down' \\| 'up'`                     |\n| `drag_mouse`          | Click and drag the mouse from one point to another | `path: { x: number, y: number }[]`, `button: 'left' \\| 'right' \\| 'middle'`, `holdKeys?: string[]`                             |\n| `scroll`              | Scroll vertically or horizontally                  | `coordinates?: { x: number, y: number }`, `axis: 'vertical' \\| 'horizontal'`, `distance: number`, `holdKeys?: string[]`        |\n| `type_keys`           
| Type one or more keyboard keys                     | `keys: string[]`, `delay?: number`                                                                                             |\n| `press_keys`          | Press or release keyboard keys                     | `keys: string[]`, `press: 'down' \\| 'up'`                                                                                      |\n| `type_text`           | Type a text string                                 | `text: string`, `delay?: number`                                                                                               |\n| `wait`                | Wait for a specified duration                      | `duration: number` (milliseconds)                                                                                              |\n| `screenshot`          | Capture a screenshot of the desktop                | None                                                                                                                           |\n| `get_cursor_position` | Get the current cursor position                    | None                                                                                                                           |\n\n### Example Usage\n\n```bash\n# Move the mouse to coordinates (100, 200)\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"move_mouse\", \"coordinates\": {\"x\": 100, \"y\": 200}}'\n\n# Trace mouse movement along a path\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"trace_mouse\", \"path\": [{\"x\": 100, \"y\": 100}, {\"x\": 200, \"y\": 100}, {\"x\": 200, \"y\": 200}]}'\n\n# Click the mouse with the left button\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"click_mouse\", \"button\": \"left\"}'\n\n# Press the mouse button down at specific 
coordinates\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"press_mouse\", \"coordinates\": {\"x\": 100, \"y\": 100}, \"button\": \"left\", \"press\": \"down\"}'\n\n# Release the mouse button\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"press_mouse\", \"button\": \"left\", \"press\": \"up\"}'\n\n# Type text with a 50ms delay between keystrokes\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"type_text\", \"text\": \"Hello, Bytebot!\", \"delay\": 50}'\n\n# Take a screenshot\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"screenshot\"}'\n\n# Double-click at specific coordinates\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"click_mouse\", \"coordinates\": {\"x\": 150, \"y\": 250}, \"button\": \"left\", \"numClicks\": 2}'\n\n# Press and hold multiple keys simultaneously (e.g., Alt+Tab)\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"press_keys\", \"keys\": [\"alt\", \"tab\"], \"press\": \"down\"}'\n\n# Release the held keys\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"press_keys\", \"keys\": [\"alt\", \"tab\"], \"press\": \"up\"}'\n\n# Drag from one position to another\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"drag_mouse\", \"path\": [{\"x\": 100, \"y\": 100}, {\"x\": 200, \"y\": 200}], \"button\": \"left\"}'\n\n# Get the current cursor position\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"get_cursor_position\"}'\n\n# Wait for 2 
seconds\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"wait\", \"duration\": 2000}'\n```\n\n## Supported Keys\n\nBytebot supports a wide range of keyboard inputs through the QEMU key codes. Here are the supported key categories:\n\n### Control Keys\n\n| Key Name           | QEMU Code   |\n| ------------------ | ----------- |\n| Escape             | `esc`       |\n| Backspace          | `backspace` |\n| Tab                | `tab`       |\n| Return/Enter       | `ret`       |\n| Caps Lock          | `caps_lock` |\n| Left Shift         | `shift`     |\n| Right Shift        | `shift_r`   |\n| Left Ctrl          | `ctrl`      |\n| Right Ctrl         | `ctrl_r`    |\n| Left Alt           | `alt`       |\n| Right Alt          | `alt_r`     |\n| Left Meta/Windows  | `meta_l`    |\n| Right Meta/Windows | `meta_r`    |\n| Space              | `spc`       |\n| Insert             | `insert`    |\n| Delete             | `delete`    |\n| Home               | `home`      |\n| End                | `end`       |\n| Page Up            | `pgup`      |\n| Page Down          | `pgdn`      |\n\n### Arrow Keys\n\n| Key Name    | QEMU Code |\n| ----------- | --------- |\n| Up Arrow    | `up`      |\n| Down Arrow  | `down`    |\n| Left Arrow  | `left`    |\n| Right Arrow | `right`   |\n\n### Function Keys\n\n| Key Name | QEMU Code          |\n| -------- | ------------------ |\n| F1 - F12 | `f1` through `f12` |\n\n### Key Combinations\n\nYou can send key combinations by using the `/computer-use` endpoint with special syntax:\n\n```bash\n# Send Ctrl+C\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"type_keys\", \"keys\": [\"ctrl\", \"c\"]}'\n\n# Send Alt+Tab\ncurl -X POST http://localhost:3000/computer-use \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": \"type_keys\", \"keys\": [\"alt\", \"tab\"]}'\n```\n\n## Architecture\n\nBytebot 
consists of three main components:\n\n1. **QEMU Virtual Machine**: Runs a lightweight Lubuntu 22.04 desktop environment\n2. **NestJS Hypervisor**: Provides an API for controlling the desktop environment\n3. **noVNC Server**: Enables browser-based access to the desktop\n\nAll components are orchestrated using Supervisor within a single Docker container.\n\n## Development\n\n### Project Structure\n\n```\nbytebot/\n├── build.sh                  # Build script for the Docker image\n├── docker/                   # Docker configuration\n│   ├── Dockerfile            # Main Dockerfile\n│   └── supervisord.conf      # Supervisor configuration\n└── hypervisor/               # NestJS-based agent API\n    ├── src/                  # Source code\n    ├── package.json          # Dependencies\n    └── ...\n```\n\n### Extending the Hypervisor\n\nThe hypervisor is built with NestJS, making it easy to extend with additional functionality. See the hypervisor directory for more details.\n\n### Local Development\n\nDevelopers can use the Bytebot container as is for local development:\n\n- Run the container with exposed ports as shown in the Quick Start section\n- Connect to the desktop via VNC client at `localhost:5900` or web browser at `http://localhost:3000/vnc`\n- Make API requests to `http://localhost:3000/computer-use` endpoints from your local agent code\n- Iterate quickly by developing your agent logic separately from the Bytebot container\n\nThis separation of concerns allows for rapid development cycles where you can modify your agent's code without rebuilding the Bytebot container.\n\n### Deployment\n\nFor production deployments, developers can:\n\n- Bundle their agent code directly into the Bytebot container by modifying the Dockerfile\n- Add authentication to secure the API endpoints\n- Restrict port exposure to prevent unauthorized access\n- Configure logging and monitoring for production use\n\n#### Example: Bundling an Agent into the Container\n\n```bash\n# Example 
Dockerfile modifications to bundle a Python agent\n...\n\n# Install additional dependencies for your agent\nRUN apk add --no-cache python3 py3-pip\nWORKDIR /agent\nCOPY requirements.txt .\nRUN pip3 install -r requirements.txt\n\n# Copy your agent code\nCOPY agent/ /agent/\n\n# Modify supervisord.conf to run your agent\nCOPY custom-supervisord.conf /etc/supervisor/conf.d/supervisord.conf\n\n# Only expose VNC ports if needed, not the API\nEXPOSE 5900 6080 6081\n```\n\n#### Example: Custom Supervisor Configuration\n\n```ini\n# custom-supervisord.conf\n[supervisord]\nnodaemon=true\nlogfile=/dev/stdout\nlogfile_maxbytes=0\nloglevel=info\nuser=root\n\n# Original Bytebot services\n[program:desktop-vm]\ncommand=sh -c '...' # Original QEMU command\nautostart=true\nautorestart=true\n...\n\n[program:hypervisor]\ncommand=sh -c '...' # Original hypervisor command\ndirectory=/hypervisor\nautostart=true\nautorestart=true\n...\n\n[program:novnc-http]\ncommand=sh -c '...' # Original noVNC command\nautostart=true\nautorestart=true\n...\n\n# Add your custom agent\n[program:my-agent]\ncommand=python3 /agent/main.py\ndirectory=/agent\nautostart=true\nautorestart=true\nstdout_logfile=/dev/stdout\nstdout_logfile_maxbytes=0\nstderr_logfile=/dev/stderr\nstderr_logfile_maxbytes=0\nredirect_stderr=true\n```\n\n### Leveraging AI Models for Computer Use\n\nYou can integrate Bytebot with various AI models to create intelligent computer use agents:\n\n#### Large Language Models (LLMs)\n\n- **Anthropic Claude**: Excellent for understanding complex visual contexts and reasoning about UI elements\n- **OpenAI GPT-4V**: Strong capabilities for visual understanding and task planning\n- **Google Gemini**: Offers multimodal understanding for complex desktop interactions\n- **Mistral Large**: Provides efficient reasoning for task automation\n- **DeepSeek**: Specialized in code understanding and generation for automation scripts\n\n#### Computer Vision Models\n\n- **OmniParser**: For extracting structured 
data from desktop UI elements\n- **CLIP/ViT**: For identifying and classifying visual elements on screen\n- **Segment Anything Model (SAM)**: For precise identification of UI components\n\n### Integration Approaches\n\nThere are several ways to integrate AI models with Bytebot:\n\n1. **API-based Integration**: Use the model provider's API to send screenshots and receive instructions\n2. **Local Model Deployment**: Run smaller models locally alongside Bytebot\n3. **Hybrid Approaches**: Combine local processing with cloud-based intelligence\n\n### Flexible Development Options\n\nBytebot's REST API allows developers to build agents in any programming language or framework they prefer:\n\n- **Python**: Ideal for data science and ML integration with libraries like requests, Pillow, and PyTorch\n- **JavaScript/TypeScript**: Great for web-based agents using Node.js or browser environments\n- **Java/Kotlin**: Robust options for enterprise applications\n- **Go**: Excellent for high-performance, concurrent agents\n- **Rust**: For memory-safe, high-performance implementations\n- **C#/.NET**: Strong integration with Windows environments and enterprise systems\n\n### Sample Agent Implementations\n\n#### Python Example\n\n```python\nimport requests\nimport base64\nfrom PIL import Image\nimport io\nimport anthropic\n\n# Bytebot API URL\nBYTEBOT_API = \"http://localhost:3000/computer-use\"\n\n# Get screenshot (all computer-use actions go through the unified POST endpoint;\n# the response is assumed to contain raw PNG bytes)\nresponse = requests.post(BYTEBOT_API, json={\"action\": \"screenshot\"})\nscreenshot = Image.open(io.BytesIO(response.content))\n\n# Convert to base64 for Claude\nbuffered = io.BytesIO()\nscreenshot.save(buffered, format=\"PNG\")\nimg_str = base64.b64encode(buffered.getvalue()).decode()\n\n# Ask Claude what to do\nclient = anthropic.Anthropic()\nmessage = client.messages.create(\n    model=\"claude-3-opus-20240229\",\n    max_tokens=1000,\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", 
\"text\": \"What should I do with this desktop screenshot?\"},\n                {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/png\", \"data\": img_str}}\n            ]\n        }\n    ]\n)\n\n# Execute Claude's suggestion\naction = message.content[0].text\nif \"click\" in action.lower():\n    # Extract coordinates from Claude's response\n    # This is a simplified example\n    x, y = 100, 200  # Replace with actual parsing\n    requests.post(BYTEBOT_API, json={\"action\": \"click_mouse\", \"coordinates\": {\"x\": x, \"y\": y}})\n```\n\n#### JavaScript/TypeScript Example\n\n```typescript\nimport axios from \"axios\";\nimport { OpenAI } from \"openai\";\n\nconst BYTEBOT_API = \"http://localhost:3000/computer-use\";\nconst openai = new OpenAI();\n\nasync function runAgent() {\n  // Get screenshot via the unified POST endpoint\n  const screenshotResponse = await axios.post(\n    BYTEBOT_API,\n    { action: \"screenshot\" },\n    { responseType: \"arraybuffer\" },\n  );\n  const base64Image = Buffer.from(screenshotResponse.data).toString(\"base64\");\n\n  // Ask GPT-4V for analysis\n  const gptResponse = await openai.chat.completions.create({\n    model: \"gpt-4-vision-preview\",\n    messages: [\n      {\n        role: \"user\",\n        content: [\n          { type: \"text\", text: \"What should I do with this desktop?\" },\n          {\n            type: \"image_url\",\n            image_url: { url: `data:image/png;base64,${base64Image}` },\n          },\n        ],\n      },\n    ],\n    max_tokens: 500,\n  });\n\n  // Process GPT's response and take action\n  const action = gptResponse.choices[0].message.content;\n  console.log(`GPT suggests: ${action}`);\n\n  // Example action: Type text\n  await axios.post(BYTEBOT_API, {\n    action: \"type_text\",\n    text: \"Hello from my JavaScript agent!\",\n    delay: 50,\n  });\n}\n\nrunAgent();\n```\n\n## Use Cases\n\n- **Automated Testing**: Run end-to-end tests in a consistent environment\n- **Web 
Scraping**: Automate web browsing and data collection\n- **UI Automation**: Create agents that interact with desktop applications\n- **AI Training**: Generate training data for computer vision and UI interaction models\n\n## License\n\nSee the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":["📚 Projects (1974 total)","TypeScript","Speech Recognition \u0026 Synthesis_Other","Chatbots","Repos","AI Web Automation Tools","GUI \u0026 Computer Control AI Agents","Agent Categories","Agents \u0026 Orchestration"],"sub_categories":["MCP Clients","Resource Transfer \u0026 Download","Dev Tools","Desktop Automation","Unclassified"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytebot-ai%2Fbytebot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytebot-ai%2Fbytebot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytebot-ai%2Fbytebot/lists"}