{"id":20184738,"url":"https://github.com/simular-ai/Agent-S","last_synced_at":"2025-05-07T03:30:39.648Z","repository":{"id":258313722,"uuid":"870189331","full_name":"simular-ai/Agent-S","owner":"simular-ai","description":"Agent S: an open agentic framework that uses computers like a human","archived":false,"fork":false,"pushed_at":"2025-05-03T03:41:10.000Z","size":40755,"stargazers_count":3934,"open_issues_count":6,"forks_count":383,"subscribers_count":45,"default_branch":"main","last_synced_at":"2025-05-03T04:27:31.874Z","etag":null,"topics":["agent-computer-interface","ai-agents","computer-automation","computer-use","grounding","gui-agents","in-context-reinforcement-learning","memory","mllm","planning","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://www.simular.ai/agent-s2","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simular-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-09T15:43:12.000Z","updated_at":"2025-05-03T04:19:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"21719b9f-21ac-4a65-b6c9-0453611e1925","html_url":"https://github.com/simular-ai/Agent-S","commit_stats":null,"previous_names":["simular-ai/agent-s"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simular-ai%2FAgent-S","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simular-ai%2FAgent-S/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/simular-ai%2FAgent-S/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simular-ai%2FAgent-S/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simular-ai","download_url":"https://codeload.github.com/simular-ai/Agent-S/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252806335,"owners_count":21807189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-computer-interface","ai-agents","computer-automation","computer-use","grounding","gui-agents","in-context-reinforcement-learning","memory","mllm","planning","retrieval-augmented-generation"],"created_at":"2024-11-14T03:01:32.016Z","updated_at":"2025-05-07T03:30:39.638Z","avatar_url":"https://github.com/simular-ai.png","language":"Python","funding_links":[],"categories":["Web Automation and UI Interaction","智能体 Agents","Python","Papers","🌐 Browser and Desktop Agents","Repos","🧠 AI Applications \u0026 Platforms","Tools","4. Agentic AI \u0026 Multi-Agent Systems","Open Source Agents","\u003ca name=\"Python\"\u003e\u003c/a\u003ePython","🤖 AI \u0026 Machine Learning","1. 
Local Agents","AI开源项目","Operating System / Desktop / Mobile Environments"],"sub_categories":["UI Interaction","Models","Developer Infrastructure","Tools","Computer Use","Computer Use \u0026 OS Agents","AI Agent"],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003cimg src=\"images/agent_s.png\" alt=\"Logo\" style=\"vertical-align:middle\" width=\"60\"\u003e Agent S2:\n  \u003csmall\u003eA Compositional Generalist-Specialist Framework for Computer Use Agents\u003c/small\u003e\n\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\u0026nbsp;\n  🌐 \u003ca href=\"https://www.simular.ai/articles/agent-s2-technical-review\"\u003e[S2 blog]\u003c/a\u003e\u0026nbsp;\n  📄 \u003ca href=\"https://arxiv.org/abs/2504.00906\"\u003e[S2 Paper]\u003c/a\u003e\u0026nbsp;\n  🎥 \u003ca href=\"https://www.youtube.com/watch?v=wUGVQl7c0eg\"\u003e[S2 Video]\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u0026nbsp;\n  🌐 \u003ca href=\"https://www.simular.ai/agent-s\"\u003e[S1 blog]\u003c/a\u003e\u0026nbsp;\n  📄 \u003ca href=\"https://arxiv.org/abs/2410.08164\"\u003e[S1 Paper (ICLR 2025)]\u003c/a\u003e\u0026nbsp;\n  🎥 \u003ca href=\"https://www.youtube.com/watch?v=OBDE3Knte0g\"\u003e[S1 Video]\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u0026nbsp;\n\u003ca href=\"https://trendshift.io/repositories/13151\" target=\"_blank\"\u003e\u003cimg src=\"https://trendshift.io/api/badge/repositories/13151\" alt=\"simular-ai%2FAgent-S | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://discord.gg/E2XfsK9fPV\"\u003e\n    \u003cimg src=\"https://dcbadge.limes.pink/api/server/https://discord.gg/E2XfsK9fPV?style=flat\" alt=\"Discord\"\u003e\n  \u003c/a\u003e\n  \u0026nbsp;\u0026nbsp;\n  \u003ca href=\"https://pepy.tech/projects/gui-agents\"\u003e\n    \u003cimg src=\"https://static.pepy.tech/badge/gui-agents\" alt=\"PyPI Downloads\"\u003e\n  
\u003c/a\u003e\n\u003c/p\u003e\n\n## 🥳 Updates\n- [x] **2025/04/01**: Released \u003ca href=\"https://arxiv.org/abs/2504.00906\"\u003eAgent S2 paper\u003c/a\u003e with new SOTA results on OSWorld, WindowsAgentArena, and AndroidWorld!\n- [x] **2025/03/12**: Released Agent S2 along with v0.2.0 of [gui-agents](https://github.com/simular-ai/Agent-S), the new state-of-the-art for computer use agents (CUA), outperforming OpenAI's CUA/Operator and Anthropic's Claude 3.7 Sonnet Computer-Use!\n- [x] **2025/01/22**: The [Agent S paper](https://arxiv.org/abs/2410.08164) was accepted to ICLR 2025!\n- [x] **2025/01/21**: Released v0.1.2 of the [gui-agents](https://github.com/simular-ai/Agent-S) library, with support for Linux and Windows!\n- [x] **2024/12/05**: Released v0.1.0 of the [gui-agents](https://github.com/simular-ai/Agent-S) library, allowing you to use Agent-S for Mac, OSWorld, and WindowsAgentArena with ease!\n- [x] **2024/10/10**: Released the [Agent S paper](https://arxiv.org/abs/2410.08164) and codebase!\n\n## Table of Contents\n\n1. [💡 Introduction](#-introduction)\n2. [🎯 Current Results](#-current-results)\n3. [🛠️ Installation \u0026 Setup](#%EF%B8%8F-installation--setup)\n4. [🚀 Usage](#-usage)\n5. [🤝 Acknowledgements](#-acknowledgements)\n6. [💬 Citations](#-citations)\n\n## 💡 Introduction\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/agent_s2_teaser.png\" width=\"800\"\u003e\n\u003c/p\u003e\n\nWelcome to **Agent S**, an open-source framework designed to enable autonomous interaction with computers through an Agent-Computer Interface. Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer. 
\n\nWhether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!\n\n## 🎯 Current Results\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/agent_s2_osworld_result.png\" width=\"600\"\u003e\n    \u003cbr\u003e\n    Agent S2's success rate (%) on the OSWorld full test set using screenshot input only.\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"5\"\u003e\n    \u003ctr\u003e\n      \u003cth\u003eBenchmark\u003c/th\u003e\n      \u003cth\u003eAgent S2\u003c/th\u003e\n      \u003cth\u003ePrevious SOTA\u003c/th\u003e\n      \u003cth\u003eΔ Improvement\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eOSWorld (15 step)\u003c/td\u003e\n      \u003ctd\u003e27.0%\u003c/td\u003e\n      \u003ctd\u003e22.7% (UI-TARS)\u003c/td\u003e\n      \u003ctd\u003e+4.3%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eOSWorld (50 step)\u003c/td\u003e\n      \u003ctd\u003e34.5%\u003c/td\u003e\n      \u003ctd\u003e32.6% (OpenAI CUA)\u003c/td\u003e\n      \u003ctd\u003e+1.9%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eWindowsAgentArena\u003c/td\u003e\n      \u003ctd\u003e29.8%\u003c/td\u003e\n      \u003ctd\u003e19.5% (NAVI)\u003c/td\u003e\n      \u003ctd\u003e+10.3%\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eAndroidWorld\u003c/td\u003e\n      \u003ctd\u003e54.3%\u003c/td\u003e\n      \u003ctd\u003e46.8% (UI-TARS)\u003c/td\u003e\n      \u003ctd\u003e+7.5%\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/table\u003e\n\u003c/div\u003e\n\n\n## 🛠️ Installation \u0026 Setup\n\n\u003e ❗**Warning**❗: If you are on a Linux machine, creating a `conda` environment will interfere with `pyatspi`. As of now, there's no clean solution for this issue. 
Proceed through the installation without using `conda` or any virtual environment.\n\n\u003e ⚠️**Disclaimer**⚠️: To leverage the full potential of Agent S2, we utilize [UI-TARS](https://github.com/bytedance/UI-TARS) as a grounding model (7B-DPO or 72B-DPO for better performance). These models can be hosted locally or on Hugging Face Inference Endpoints. Our code supports Hugging Face Inference Endpoints. Check out [Hugging Face Inference Endpoints](https://huggingface.co/learn/cookbook/en/enterprise_dedicated_endpoints) for more information on how to set up and query this endpoint. However, running Agent S2 does not require this model, and you can use alternative API-based models for visual grounding, such as Claude.\n\nInstall the package:\n```\npip install gui-agents\n```\n\nSet your LLM API keys and other environment variables. You can do this by adding the following lines to your `.bashrc` (Linux) or `.zshrc` (macOS) file.\n\n```\nexport OPENAI_API_KEY=\u003cYOUR_API_KEY\u003e\nexport ANTHROPIC_API_KEY=\u003cYOUR_ANTHROPIC_API_KEY\u003e\nexport HF_TOKEN=\u003cYOUR_HF_TOKEN\u003e\n```\n\nAlternatively, you can set the environment variable in your Python script:\n\n```\nimport os\nos.environ[\"OPENAI_API_KEY\"] = \"\u003cYOUR_API_KEY\u003e\"\n```\n\nWe also support Azure OpenAI, Anthropic, Gemini, Open Router, and vLLM inference. For more information, refer to [models.md](models.md).\n\n### Setup Retrieval from Web using Perplexica\nAgent S works best with web-knowledge retrieval. To enable this feature, you need to set up Perplexica:\n\n1. Ensure Docker Desktop is installed and running on your system.\n\n2. Navigate to the directory containing the project files.\n\n   ```bash\n    cd Perplexica\n    git submodule update --init\n   ```\n\n3. Rename the `sample.config.toml` file to `config.toml`. For Docker setups, you need only fill in the following fields:\n\n   - `OPENAI`: Your OpenAI API key. 
**You only need to fill this if you wish to use OpenAI's models**.\n   - `OLLAMA`: Your Ollama API URL. You should enter it as `http://host.docker.internal:PORT_NUMBER`. If you installed Ollama on port 11434, use `http://host.docker.internal:11434`. For other ports, adjust accordingly. **You need to fill this if you wish to use Ollama's models instead of OpenAI's**.\n   - `GROQ`: Your Groq API key. **You only need to fill this if you wish to use Groq's hosted models**.\n   - `ANTHROPIC`: Your Anthropic API key. **You only need to fill this if you wish to use Anthropic models**.\n\n     **Note**: You can change these after starting Perplexica from the settings dialog.\n\n   - `SIMILARITY_MEASURE`: The similarity measure to use (This is filled by default; you can leave it as is if you are unsure about it.)\n\n4. Ensure you are in the directory containing the `docker-compose.yaml` file and execute:\n\n   ```bash\n   docker compose up -d\n   ```\n5. Next, export your Perplexica URL. This URL is used to interact with the Perplexica API backend. The port is given by the `config.toml` in your Perplexica directory.\n\n   ```bash\n   export PERPLEXICA_URL=http://localhost:{port}/api/search\n   ```\n6. Our implementation of Agent S incorporates the Perplexica API to integrate a search engine capability, which allows for a more convenient and responsive user experience. If you want to tailor the API to your settings and specific requirements, you may modify the URL and the message of request parameters in  `agent_s/query_perplexica.py`. For a comprehensive guide on configuring the Perplexica API, please refer to [Perplexica Search API Documentation](https://github.com/ItzCrazyKns/Perplexica/blob/master/docs/API/SEARCH.md).\nFor a more detailed setup and usage guide, please refer to the [Perplexica Repository](https://github.com/ItzCrazyKns/Perplexica.git).\n\n\u003e ❗**Warning**❗: The agent will directly run python code to control your computer. 
Please use with care.\n\n## 🚀 Usage\n\n\n\u003e **Note**: Our best configuration uses Claude 3.7 with extended thinking and UI-TARS-72B-DPO. If you are unable to run UI-TARS-72B-DPO due to resource constraints, UI-TARS-7B-DPO can be used as a lighter alternative with minimal performance degradation.\n\n### CLI\n\nRun Agent S2 with a specific model (default is `gpt-4o`):\n\n```sh\nagent_s2 \\\n  --provider \"anthropic\" \\\n  --model \"claude-3-7-sonnet-20250219\" \\\n  --grounding_model_provider \"anthropic\" \\\n  --grounding_model \"claude-3-7-sonnet-20250219\"\n```\n\nOr use a custom endpoint:\n\n```bash\nagent_s2 \\\n  --provider \"anthropic\" \\\n  --model \"claude-3-7-sonnet-20250219\" \\\n  --endpoint_provider \"huggingface\" \\\n  --endpoint_url \"\u003cendpoint_url\u003e/v1/\"\n```\n\n#### Main Model Settings\n- **`--provider`**, **`--model`** \n  - Purpose: Specifies the main generation model\n  - Supports: all model providers in [models.md](models.md)\n  - Default: `--provider \"anthropic\" --model \"claude-3-7-sonnet-20250219\"`\n\n#### Grounding Configuration Options\n\nYou can use either Configuration 1 or Configuration 2:\n\n##### **(Default) Configuration 1: API-Based Models**\n- **`--grounding_model_provider`**, **`--grounding_model`**\n  - Purpose: Specifies the model for visual grounding (coordinate prediction)\n  - Supports: all model providers in [models.md](models.md)\n  - Default: `--grounding_model_provider \"anthropic\" --grounding_model \"claude-3-7-sonnet-20250219\"`\n- ❗**Important**❗ **`--grounding_model_resize_width`**\n  - Purpose: Some API providers automatically rescale images. 
Therefore, the generated (x, y) will be relative to the rescaled image dimensions, instead of the original image dimensions.\n  - Supports: [Anthropic rescaling](https://docs.anthropic.com/en/docs/build-with-claude/vision#)\n  - Tips: If your grounding is inaccurate even for very simple queries, double check your rescaling width is correct for your machine's resolution.\n  - Default: `--grounding_model_resize_width 1366` (Anthropic)\n\n##### **Configuration 2: Custom Endpoint**\n- **`--endpoint_provider`**\n  - Purpose: Specifies the endpoint provider\n  - Supports: HuggingFace TGI, vLLM, Open Router\n  - Default: None\n\n- **`--endpoint_url`**\n  - Purpose: The URL for your custom endpoint\n  - Default: None\n\n\u003e **Note**: Configuration 2 takes precedence over Configuration 1.\n\nThis will show a user query prompt where you can enter your query and interact with Agent S2. You can use any model from the list of supported models in [models.md](models.md).\n\n### `gui_agents` SDK\n\nFirst, we import the necessary modules. `AgentS2` is the main agent class for Agent S2. `OSWorldACI` is our grounding agent that translates agent actions into executable python code.\n```\nimport pyautogui\nimport io\nfrom gui_agents.s2.agents.agent_s import AgentS2\nfrom gui_agents.s2.agents.grounding import OSWorldACI\n\n# Load in your API keys.\nfrom dotenv import load_dotenv\nload_dotenv()\n\ncurrent_platform = \"linux\"  # \"darwin\", \"windows\"\n```\n\nNext, we define our engine parameters. `engine_params` is used for the main agent, and `engine_params_for_grounding` is for grounding. 
For `engine_params_for_grounding`, we support Claude, the GPT series, and Hugging Face Inference Endpoints.\n\n```\nengine_type_for_grounding = \"huggingface\"\n\nengine_params = {\n    \"engine_type\": \"openai\",\n    \"model\": \"gpt-4o\",\n}\n\nif engine_type_for_grounding == \"huggingface\":\n    engine_params_for_grounding = {\n        \"engine_type\": \"huggingface\",\n        \"endpoint_url\": \"\u003cendpoint_url\u003e/v1/\",\n    }\nelif engine_type_for_grounding == \"claude\":\n    engine_params_for_grounding = {\n        \"engine_type\": \"claude\",\n        \"model\": \"claude-3-7-sonnet-20250219\",\n    }\nelif engine_type_for_grounding == \"gpt\":\n    engine_params_for_grounding = {\n        \"engine_type\": \"gpt\",\n        \"model\": \"gpt-4o\",\n    }\nelse:\n    raise ValueError(\"Invalid engine type for grounding\")\n```\n\nThen, we define our grounding agent and Agent S2.\n\n```\ngrounding_agent = OSWorldACI(\n    platform=current_platform,\n    engine_params_for_generation=engine_params,\n    engine_params_for_grounding=engine_params_for_grounding\n)\n\nagent = AgentS2(\n    engine_params,\n    grounding_agent,\n    platform=current_platform,\n    action_space=\"pyautogui\",\n    observation_type=\"mixed\",\n    search_engine=\"Perplexica\"  # Assuming you have set up Perplexica.\n)\n```\n\nFinally, let's query the agent!\n\n```\n# Get screenshot.\nscreenshot = pyautogui.screenshot()\nbuffered = io.BytesIO()\nscreenshot.save(buffered, format=\"PNG\")\nscreenshot_bytes = buffered.getvalue()\n\nobs = {\n    \"screenshot\": screenshot_bytes,\n}\n\ninstruction = \"Close VS Code\"\ninfo, action = agent.predict(instruction=instruction, observation=obs)\n\n# Runs the generated Python code on this machine; use with care.\nexec(action[0])\n```\n\nRefer to `gui_agents/s2/cli_app.py` for more details on how the inference loop works.\n\n#### Downloading the Knowledge Base\n\nAgent S2 uses a knowledge base that continually updates with new knowledge during inference. The knowledge base is initially downloaded when initializing `AgentS2`. 
The knowledge base is stored as assets under our [GitHub Releases](https://github.com/simular-ai/Agent-S/releases). The `AgentS2` initialization will only download the knowledge base for your specified platform and agent version (e.g., s1, s2). If you'd like to download the knowledge base programmatically, you can use the following code:\n\n```\ndownload_kb_data(\n    version=\"s2\",\n    release_tag=\"v0.2.2\",\n    download_dir=\"kb_data\",\n    platform=\"linux\"  # \"darwin\", \"windows\"\n)\n```\n\nThis will download Agent S2's knowledge base for Linux from release tag `v0.2.2` to the `kb_data` directory. Refer to our [GitHub Releases](https://github.com/simular-ai/Agent-S/releases) for release tags that include the knowledge bases.\n\n### OSWorld\n\nTo deploy Agent S2 in OSWorld, follow the [OSWorld Deployment instructions](OSWorld.md).\n\n## 🤝 Acknowledgements\n\nWe extend our sincere thanks to Tianbao Xie for developing OSWorld and discussing computer use challenges. We also appreciate the engaging discussions with Yujia Qin and Shihao Liang regarding UI-TARS.\n\n## 💬 Citations\n\nIf you find this codebase useful, please cite:\n\n```\n@misc{Agent-S2,\n      title={Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents}, \n      author={Saaket Agashe and Kyle Wong and Vincent Tu and Jiachen Yang and Ang Li and Xin Eric Wang},\n      year={2025},\n      eprint={2504.00906},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI},\n      url={https://arxiv.org/abs/2504.00906}, \n}\n```\n\n```\n@inproceedings{Agent-S,\n    title={{Agent S: An Open Agentic Framework that Uses Computers Like a Human}},\n    author={Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang},\n    booktitle={International Conference on Learning Representations (ICLR)},\n    year={2025},\n    
url={https://arxiv.org/abs/2410.08164}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimular-ai%2FAgent-S","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimular-ai%2FAgent-S","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimular-ai%2FAgent-S/lists"}