{"id":24791226,"url":"https://github.com/bytedance/UI-TARS","last_synced_at":"2025-10-12T15:31:11.886Z","repository":{"id":273470319,"uuid":"918932479","full_name":"bytedance/UI-TARS","owner":"bytedance","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-21T05:04:08.000Z","size":34993,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-21T05:26:32.449Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bytedance.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-19T09:04:17.000Z","updated_at":"2025-01-21T05:04:21.000Z","dependencies_parsed_at":"2025-01-21T05:36:36.298Z","dependency_job_id":null,"html_url":"https://github.com/bytedance/UI-TARS","commit_stats":null,"previous_names":["bytedance/ui-tars"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FUI-TARS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FUI-TARS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FUI-TARS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FUI-TARS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bytedance","download_url":"https://codeload.github.com/bytedance/UI-TARS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://git
hub.com","kind":"github","repositories_count":236239250,"owners_count":19117154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-29T19:02:03.304Z","updated_at":"2025-10-12T15:31:01.875Z","avatar_url":"https://github.com/bytedance.png","language":null,"funding_links":[],"categories":["HarmonyOS","Python","Repos","RelatedRepos","🌐 Web Development - Frontend","[Paper List](#paperlist)","UI Understanding and Computer Use"],"sub_categories":["Windows Manager","Advanced Reasoning for Agent","[GUI Agents List](#agentlist)","Projects and references"],"readme":"\u003c!-- \u003cp align=\"center\"\u003e\n  \u003cimg alt=\"UI-TARS\"  width=\"260\" src=\"figures/icon.png\"\u003e\n\u003c/p\u003e\n\n# UI-TARS: Pioneering Automated GUI Interaction with Native Agents --\u003e\n![Local Image](figures/writer.png)\n\u003cp align=\"center\"\u003e\n        🤗 \u003ca href=\"https://huggingface.co/bytedance-research/UI-TARS-7B-DPO\"\u003eHugging Face Models\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp🤖 \u003ca href=\"https://www.modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO\"\u003eModelScope\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp 📑 \u003ca href=\"https://arxiv.org/abs/2501.12326\"\u003ePaper\u003c/a\u003e \u0026nbsp\u0026nbsp  |\u0026nbsp\u0026nbsp\u003c/a\u003e\n🖥️ \u003ca href=\"https://github.com/bytedance/UI-TARS-desktop\"\u003eUI-TARS-desktop\u003c/a\u003e\u0026nbsp\u0026nbsp  \u003cbr\u003e🏄 \u003ca href=\"https://github.com/web-infra-dev/Midscene\"\u003eMidscene (Browser Automation) \u003c/a\u003e\u0026nbsp\u0026nbsp | 
\u0026nbsp\u0026nbsp🤗 \u003ca href=\"https://huggingface.co/spaces/bytedance-research/UI-TARS\"\u003eSpace\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp🫨 \u003ca href=\"https://discord.gg/txAE43ps\"\u003eDiscord\u003c/a\u003e\u0026nbsp\u0026nbsp\n\u003c/p\u003e\n\nWe also offer a **UI-TARS-desktop** version, which can operate on your **local personal device**. To use it, please visit [https://github.com/bytedance/UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). To use UI-TARS in web automation, you may refer to the open-source project [Midscene.js](https://github.com/web-infra-dev/Midscene).\n\n### ⚠️ Important Announcement: GGUF Model Performance\n\nThe **GGUF model** has undergone quantization, but unfortunately, its performance cannot be guaranteed. As a result, we have decided to **downgrade** it. \n\n💡 **Alternative Solution**:  \nYou can use **[Cloud Deployment](#cloud-deployment)** or, if you have enough GPU resources, **[Local Deployment [vLLM]](#local-deployment-vllm)** instead. \n\nWe appreciate your understanding and patience as we work to ensure the best possible experience.\n\n## Updates\n\n- 🚀 01.25: We updated the **[Cloud Deployment](#cloud-deployment)** section of the Chinese-language guide, [GUI Model Deployment Tutorial (GUI模型部署教程)](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb), with information on the ModelScope platform, which can now be used for deployment.\n\n\n\n## Overview\nUI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. 
Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.\n![Local Image](figures/UI-TARS-vs-Previous-SOTA.png)\n![Local Image](figures/UI-TARS.png)\n\n\n## Core Features\n### Perception\n- **Comprehensive GUI Understanding**: Processes multimodal inputs (text, images, interactions) to build a coherent understanding of interfaces.\n- **Real-Time Interaction**: Continuously monitors dynamic GUIs and responds accurately to changes in real time.\n\n### Action\n- **Unified Action Space**: Standardized action definitions across platforms (desktop, mobile, and web).\n- **Platform-Specific Actions**: Supports additional actions like hotkeys, long press, and platform-specific gestures.\n\n### Reasoning\n- **System 1 \u0026 System 2 Reasoning**: Combines fast, intuitive responses with deliberate, high-level planning for complex tasks.\n- **Task Decomposition \u0026 Reflection**: Supports multi-step planning, reflection, and error correction for robust task execution.\n\n### Memory\n- **Short-Term Memory**: Captures task-specific context for situational awareness.\n- **Long-Term Memory**: Retains historical interactions and knowledge for improved decision-making.\n\n## Capabilities\n- **Cross-Platform Interaction**: Supports desktop, mobile, and web environments with a unified action framework.\n- **Multi-Step Task Execution**: Trained to handle complex tasks through multi-step trajectories and reasoning.\n- **Learning from Synthetic and Real Data**: Combines large-scale annotated and synthetic datasets for improved generalization and robustness.\n\n## Performance\n**Perception Capability Evaluation**\n\n| Model                     | VisualWebBench | WebSRC  | SQAshort |\n|---------------------------|---------------|---------|----------|\n| Qwen2-VL-7B              | 73.3          | 81.8    
| 84.9     |\n| Qwen-VL-Max              | 74.1          | 91.1    | 78.6     |\n| Gemini-1.5-Pro           | 75.4          | 88.9    | 82.2     |\n| UIX-Qwen2-7B             | 75.9          | 82.9    | 78.8     |\n| Claude-3.5-Sonnet        | 78.2          | 90.4    | 83.1     |\n| GPT-4o                   | 78.5          | 87.7    | 82.3     |\n| **UI-TARS-2B**          | 72.9          | 89.2    | 86.4     |\n| **UI-TARS-7B**          | 79.7          | **93.6** | 87.7     |\n| **UI-TARS-72B**         | **82.8**      | 89.3    | **88.6** |\n\n**Grounding Capability Evaluation**\n- **ScreenSpot Pro**\n\n| Agent Model              | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |\n|--------------------------|----------|----------|----------|--------------|--------------|--------------|---------|---------|---------|---------------|---------------|---------------|------------|------------|------------|--------|--------|--------|---------|---------|------|\n| QwenVL-7B               | 0.0      | 0.0      | 0.0      | 0.0          | 0.0          | 0.0          | 0.0     | 0.0     | 0.0     | 0.7           | 0.0           | 0.4           | 0.0        | 0.0        | 0.0        | 0.0    | 0.0    | 0.0    | 0.1     | 0.0     | **0.1**  |\n| GPT-4o                  | 1.3      | 0.0      | 0.7      | 1.0          | 0.0          | 0.6          | 2.0     | 0.0     | 1.5     | 2.1           | 0.0           | 1.2           | 1.1        | 0.0        | 0.9        | 0.0    | 0.0    | 0.0    | 1.3     | 0.0     | **0.8**  |\n| SeeClick                | 0.6      | 0.0      | 0.3      | 1.0          | 0.0          | 0.6          | 2.5     | 0.0     | 1.9     | 3.5           | 0.0           | 2.0           | 1.1        | 0.0        | 0.9        | 2.8    | 0.0    | 1.5    | 1.8 
    | 0.0     | **1.1**  |\n| Qwen2-VL-7B             | 2.6      | 0.0      | 1.3      | 1.5          | 0.0          | 0.9          | 0.5     | 0.0     | 0.4     | 6.3           | 0.0           | 3.5           | 3.4        | 1.9        | 3.0        | 0.9    | 0.0    | 0.5    | 2.5     | 0.2     | **1.6**  |\n| OS-Atlas-4B            | 7.1      | 0.0      | 3.7      | 3.0          | 1.4          | 2.3          | 2.0     | 0.0     | 1.5     | 9.0           | 5.5           | 7.5           | 5.1        | 3.8        | 4.8        | 5.6    | 0.0    | 3.1    | 5.0     | 1.7     | **3.7**  |\n| ShowUI-2B              | 16.9     | 1.4      | 9.4      | 9.1          | 0.0          | 5.3          | 2.5     | 0.0     | 1.9     | 13.2          | 7.3           | 10.6          | 15.3       | 7.5        | 13.5       | 10.3   | 2.2    | 6.6    | 10.8    | 2.6     | **7.7**  |\n| CogAgent-18B           | 14.9     | 0.7      | 8.0      | 9.6          | 0.0          | 5.6          | 7.1     | 3.1     | 6.1     | 22.2          | 1.8           | 13.4          | 13.0       | 0.0        | 10.0       | 5.6    | 0.0    | 3.1    | 12.0    | 0.8     | **7.7**  |\n| Aria-UI                | 16.2     | 0.0      | 8.4      | 23.7         | 2.1          | 14.7         | 7.6     | 1.6     | 6.1     | 27.1          | 6.4           | 18.1          | 20.3       | 1.9        | 16.1       | 4.7    | 0.0    | 2.6    | 17.1    | 2.0     | **11.3**  |\n| UGround-7B             | 26.6     | 2.1      | 14.7     | 27.3         | 2.8          | 17.0         | 14.2    | 1.6     | 11.1    | 31.9          | 2.7           | 19.3          | 31.6       | 11.3       | 27.0       | 17.8   | 0.0    | 9.7    | 25.0    | 2.8     | **16.5**  |\n| Claude Computer Use      | 22.0  | 3.9   | 12.6  | 25.9  | 3.4   | 16.8  | 14.5  | 3.7   | 11.9  | 33.9  | 15.8  | 25.8  | 30.1  | 16.3  | 26.9  | 11.0  | 4.5   | 8.1   | 23.4  | 7.1  | **17.1**  |\n| OS-Atlas-7B              | 33.1  | 1.4   | 17.7  | 28.8  | 2.8   | 17.9  | 12.2 
 | 4.7   | 10.3  | 37.5  | 7.3   | 24.4  | 33.9  | 5.7   | 27.4  | 27.1  | 4.5   | 16.8  | 28.1  | 4.0  | **18.9**  |\n| UGround-V1-7B            | -     | -     | 35.5  | -     | -     | 27.8  | -     | -     | 13.5  | -     | -     | 38.8  | -     | -     | 48.8  | -     | -     | 26.1  | -     | -    | **31.1**  |\n| **UI-TARS-2B**        | 47.4     | 4.1      | 26.4     | 42.9         | 6.3          | 27.6         | 17.8    | 4.7     | 14.6    | 56.9          | 17.3          | 39.8          | 50.3       | 17.0       | 42.6       | 21.5   | 5.6    | 14.3   | 39.6    | 8.4     | **27.7**  |\n| **UI-TARS-7B**        | 58.4     | 12.4     | 36.1     | 50.0         | 9.1          | 32.8         | **20.8**| 9.4     | **18.0**| 63.9          | **31.8**      | **50.0**      | **63.3**   | 20.8       | 53.5       | 30.8   | **16.9**| 24.5   | 47.8    | 16.2    | **35.7**  |\n| **UI-TARS-72B**       | **63.0** | **17.3** | **40.8** | **57.1**     | **15.4**     | **39.6**     | 18.8    | **12.5**| 17.2    | **64.6**      | 20.9          | 45.7          | **63.3**   | **26.4**   | **54.8**   | **42.1**| 15.7    | **30.1**| **50.9**| **17.5**| **38.1**  |\n\n\n- **ScreenSpot**\n\n| Method |  Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |\n|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|\n| **Agent Framework**  | | | | | | | |\n| GPT-4 (SeeClick) |  76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |\n| GPT-4 (OmniParser)  | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |\n| GPT-4 (UGround-7B)  | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |\n| GPT-4o (SeeClick)  | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |\n| GPT-4o (UGround-7B)  | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |\n| **Agent Model**   | | | | | | | |\n| GPT-4  | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |\n| GPT-4o  | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 
**18.3** |\n| CogAgent  | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |\n| SeeClick  | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |\n| Qwen2-VL  | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |\n| UGround-7B  | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |\n| Aguvis-G-7B  | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |\n| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |\n| Claude Computer Use  | - | - | - | - | - | - | **83.0** |\n| Gemini 2.0 (Project Mariner)  | - | - | - | - | - | - | **84.0** |\n| Aguvis-7B  | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |\n| Aguvis-72B  | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |\n| **Our Model**   | | | | | | | |\n| **UI-TARS-2B**  | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |\n| **UI-TARS-7B**  | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |\n| **UI-TARS-72B**  | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |\n\n\n- **ScreenSpot v2**\n\n| Method |  Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |\n|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|\n| **Agent Framework**  | | | | | | | |\n| GPT-4o (SeeClick)  | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | **63.6** |\n| GPT-4o (OS-Atlas-4B)  | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | **79.1** |\n| GPT-4o (OS-Atlas-7B)  | 96.2 | 83.4 | 89.7 | 69.3 | **94.0** | 79.8 | **87.1** |\n| **Agent Model**  | | | | | | | |\n| SeeClick  | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | **55.1** |\n| OS-Atlas-4B  | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | **71.9** |\n| OS-Atlas-7B  | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | **84.1** |\n| **Our Model**  | | | | | | | |\n| **UI-TARS-2B**  | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | **84.7** |\n| **UI-TARS-7B** | **96.9** | **89.1** | **95.4** | 85.0 | 93.6 | 85.2 | **91.6** |\n| **UI-TARS-72B**  | 94.8 | 86.3 | 91.2 | 
**87.9** | 91.5 | **87.7** | **90.3** |\n\n\n**Offline Agent Capability Evaluation**\n- **Multimodal Mind2Web**\n\n| Method |  Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |\n|--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|\n| **Agent Framework**  | | | | | | | | | |\n| GPT-4o (SeeClick)  | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |\n| GPT-4o (UGround)  | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |\n| GPT-4o (Aria-UI)  | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |\n| GPT-4V (OmniParser)  | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |\n| **Agent Model** |  | | | | | | | | |\n| GPT-4o  | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |\n| GPT-4 (SOM)  | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |\n| GPT-3.5 (Text-only)  | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |\n| GPT-4 (Text-only)  | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |\n| Claude  | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |\n| Aguvis-7B  | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |\n| CogAgent  | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |\n| Aguvis-72B  | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |\n| **Our Model**  | | | | | | | | | |\n| **UI-TARS-2B**  | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |\n| **UI-TARS-7B**  | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |\n| **UI-TARS-72B**  | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |\n\n\n- **Android Control and GUI Odyssey**\n\n| Agent Models        | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | 
AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |\n|---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|\n| Claude             | 74.3                 | 0.0                  | 19.4           | 63.7                 | 0.0                  | 12.5           | 60.9           | 0.0            | 3.1            |\n| GPT-4o             | 74.3                 | 0.0                  | 19.4           | 66.3                 | 0.0                  | 20.8           | 34.3           | 0.0            | 3.3            |\n| SeeClick           | 93.0                 | 73.4                 | 75.0           | 82.9                 | 62.9                 | 59.1           | 71.0           | 52.4           | 53.9           |\n| InternVL-2-4B      | 90.9                 | 84.1                 | 80.1           | 84.1                 | 72.7                 | 66.7           | 82.1           | 55.5           | 51.5           |\n| Qwen2-VL-7B       | 91.9                 | 86.5                 | 82.6           | 83.8                 | 77.7                 | 69.7           | 83.5           | 65.9           | 60.2           |\n| Aria-UI           | --                   | 87.7                 | 67.3           | --                   | 43.2                 | 10.2           | --             | 86.8           | 36.5           |\n| OS-Atlas-4B       | 91.9                 | 83.8                 | 80.6           | 84.7                 | 73.8                 | 67.5           | 83.5           | 61.4           | 56.4           |\n| OS-Atlas-7B       | 93.6                 | 88.0                 | 85.2           | 85.2                 | 78.5                 | 71.2           | 84.5           | 67.8           | 62.0           |\n| Aguvis-7B         | --                   | --     
              | 80.5           | --                   | --                   | 61.5           | --             | --             | --             |\n| Aguvis-72B        | --                   | --                   | 84.4           | --                   | --                   | 66.4           | --             | --             | --             |\n| **UI-TARS-2B**   | **98.1**             | 87.3                 | 89.3           | 81.2                 | 78.4                 | 68.9           | 93.9           | 86.8           | 83.4           |\n| **UI-TARS-7B**   | 98.0                 | 89.3                 | 90.8           | 83.7                 | 80.5                 | 72.5           | 94.6           | 90.1           | 87.0           |\n| **UI-TARS-72B**  | **98.1**             | **89.9**             | **91.3**       | **85.2**             | **81.5**             | **74.7**       | **95.4**       | **91.4**       | **88.6**       |\n\n**Online Agent Capability Evaluation**\n\n| Method |  OSWorld (Online) | AndroidWorld (Online) |\n|--------|-------------------|------------------|\n| **Agent Framework**  | | |\n| GPT-4o (UGround)  | - | 32.8 |\n| GPT-4o (Aria-UI)  | 15.2 | 44.8 |\n| GPT-4o (Aguvis-7B)  | 14.8 | 37.1 |\n| GPT-4o (Aguvis-72B)  | 17.0 | - |\n| GPT-4o (OS-Atlas-7B)  | 14.6 | - |\n| **Agent Model**  | | |\n| GPT-4o  | 5.0 | 34.5 (SoM) |\n| Gemini-Pro-1.5  | 5.4 | 22.8 (SoM) |\n| Aguvis-72B  | 10.3 | 26.1 |\n| Claude Computer-Use  | 14.9 (15 steps) | 27.9 |\n| Claude Computer-Use  | 22.0 (50 steps) | - |\n| **Our Model**  | | |\n| **UI-TARS-7B-SFT**  | 17.7 (15 steps) | 33.0 |\n| **UI-TARS-7B-DPO**  | 18.7 (15 steps) | - |\n| **UI-TARS-72B-SFT**  | 18.8 (15 steps) | **46.6** |\n| **UI-TARS-72B-DPO**  | **22.7** (15 steps) | - |\n| **UI-TARS-72B-DPO**  | **24.6** (50 steps) | - |\n\n## Deployment\n\n### Cloud Deployment\nWe recommend using HuggingFace Inference Endpoints for fast deployment.\nWe provide two docs for reference:\n\nEnglish version: [GUI Model 
Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)\n\nChinese version: [GUI Model Deployment Tutorial (GUI模型部署教程)](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)\n\n### Local Deployment [Transformers]\nDeployment with Transformers follows the same procedure as Qwen2-VL. Check this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for more details.\n\n### Local Deployment [vLLM]\nWe recommend using vLLM for fast deployment and inference. This requires `vllm\u003e=0.6.1`.\n```bash\npip install -U transformers\nVLLM_VERSION=0.6.6\nCUDA_VERSION=cu124\npip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}\n\n```\n#### Download the Model\nWe provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. To achieve the best performance, we recommend using the **7B-DPO** or **72B-DPO** model, depending on your GPU configuration:\n\n- [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)\n- [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)\n- [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)\n- [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)\n- [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)\n\n\n#### Start an OpenAI API Service\nRun the command below to start an OpenAI-compatible API service. 
We recommend setting the tensor parallel size to `-tp=1` for 7B models and `-tp=4` for 72B models.\n\n```bash\npython -m vllm.entrypoints.openai.api_server --served-model-name ui-tars \\\n    --model \u003cpath to your model\u003e --limit-mm-per-prompt image=5 -tp \u003ctp\u003e\n```\n\nThe service can also be used from [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). To call the chat API directly, send the GUI prompt (choose the mobile or the computer template) together with base64-encoded local images (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details), as shown below:\n```python\nimport base64\nfrom openai import OpenAI\n\n\ninstruction = \"search for today's weather\"\nscreenshot_path = \"screenshot.png\"\nclient = OpenAI(\n    base_url=\"http://127.0.0.1:8000/v1\",\n    api_key=\"empty\",\n)\n\n## Below is the prompt for computer\nprompt = r\"\"\"You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 
\n\n## Output Format\n```\\nThought: ...\nAction: ...\\n```\n\n## Action Space\n\nclick(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\nleft_double(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\nright_single(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\ndrag(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', end_box='\u003c|box_start|\u003e(x3,y3)\u003c|box_end|\u003e')\nhotkey(key='')\ntype(content='') #If you want to submit your input, use \\\"\\\n\\\" at the end of `content`.\nscroll(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', direction='down or up or right or left')\nwait() #Sleep for 5s and take a screenshot to check for any changes.\nfinished()\ncall_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.\n\n\n## Note\n- Use Chinese in `Thought` part.\n- Summarize your next action (with its target element) in one sentence in `Thought` part.\n\n## User Instruction\n\"\"\"\n\nwith open(screenshot_path, \"rb\") as image_file:\n    encoded_string = base64.b64encode(image_file.read()).decode(\"utf-8\")\nresponse = client.chat.completions.create(\n    model=\"ui-tars\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": prompt + instruction},\n                {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/png;base64,{encoded_string}\"}},\n            ],\n        },\n    ],\n    frequency_penalty=1,\n    max_tokens=128,\n)\nprint(response.choices[0].message.content)\n```\n\nFor single step grounding task or inference on grounding dataset such as Seeclick, kindly refer to the following script:\n```python\nimport base64\nfrom openai import OpenAI\n\n\ninstruction = \"search for today's weather\"\nscreenshot_path = \"screenshot.png\"\nclient = OpenAI(\n    base_url=\"http://127.0.0.1:8000/v1\",\n    api_key=\"empty\",\n)\n\n## 
Below is the prompt for single-step grounding\nprompt = r\"\"\"Output only the coordinate of one point in your response. What element matches the following task: \"\"\"\n\nwith open(screenshot_path, \"rb\") as image_file:\n    encoded_string = base64.b64encode(image_file.read()).decode(\"utf-8\")\nresponse = client.chat.completions.create(\n    model=\"ui-tars\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/png;base64,{encoded_string}\"}},\n                {\"type\": \"text\", \"text\": prompt + instruction}\n            ],\n        },\n    ],\n    frequency_penalty=1,\n    max_tokens=128,\n)\nprint(response.choices[0].message.content)\n```\n\n### Prompt Templates\nWe currently provide two prompt templates for stable operation and performance: one for mobile scenarios and one for personal computer scenarios.\n- Prompt template for mobile:\n```python\n## Below is the prompt for mobile\nprompt = r\"\"\"You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n```\\nThought: ...\nAction: ...\\n```\n\n## Action Space\nclick(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\nlong_press(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', time='')\ntype(content='')\nscroll(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', end_box='\u003c|box_start|\u003e(x3,y3)\u003c|box_end|\u003e')\npress_home()\npress_back()\nfinished(content='') # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in `Thought` part.\n\n- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.\n\n## User Instruction\n\"\"\"\n```\n\n- Prompt template for computer:\n```python\n## Below is the prompt for computer\nprompt = r\"\"\"You are a GUI agent. 
You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n```\\nThought: ...\nAction: ...\\n```\n\n## Action Space\n\nclick(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\nleft_double(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\nright_single(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e')\ndrag(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', end_box='\u003c|box_start|\u003e(x3,y3)\u003c|box_end|\u003e')\nhotkey(key='')\ntype(content='') #If you want to submit your input, use \\\"\\\n\\\" at the end of `content`.\nscroll(start_box='\u003c|box_start|\u003e(x1,y1)\u003c|box_end|\u003e', direction='down or up or right or left')\nwait() #Sleep for 5s and take a screenshot to check for any changes.\nfinished()\ncall_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.\n\n\n## Note\n- Use Chinese in `Thought` part.\n- Summarize your next action (with its target element) in one sentence in `Thought` part.\n\n## User Instruction\n\"\"\"\n```\n\n### Local Deployment [Ollama]\n\u003c!-- Ollama can deploy the model via gguf format. Bugs exist for safetensors. --\u003eOllama will be coming soon. 
Please be patient and wait~ 😊\n\u003c!-- #### Get the model in GGUF format\nWe provide 2B and 7B model in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:\n\n2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf\n\n7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf\n\nUsers can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):\n\n```bash\npython3 convert_hf_to_gguf.py \u003cpath to your model\u003e\n```\n\nThe GGUF file will be generated under the path provided.\n\n#### Deploy GGUF model\nWe deploy the model by following Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).\n\n```bash\n# Create Modelfile, Windows users can just create a file named Modelfile\necho \"FROM ./path/to/model.gguf\" \u003e Modelfile\n\n# Create model in Ollama\nollama create ui-tars -f Modelfile\n\n# Run the model\nollama run ui-tars\n\n```\n\nTest script is same as vLLM except two changes:\n\n```python\n...\nclient = OpenAI(\n    base_url=\"http://127.0.0.1:11434/v1/\",\n    ...\n)\n...\nresponse = client.chat.completions.create(\n    model=\"ui-tars\" # the name we create via Ollama cli\n    ...\n)\n\n``` --\u003e\n\n### Explanation of Inference Results\n\n#### Coordinate Mapping\nThe model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can then be calculated as: \n- X absolute = X relative × image width\n- Y absolute = Y relative × image height\n\nFor example, given a screen size of 1920 × 1080 and a model coordinate output of (235, 512), the X absolute is `round(1920*235/1000)=451` and the Y absolute is `round(1080*512/1000)=553`. 
The absolute coordinate is therefore (451, 553).\n\n## Use in desktop and web automation\n\nTo try the UI-TARS agent on desktop, refer to [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). We recommend using the **7B/72B DPO model** on desktop.\n\n[Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details about setting up the model.\n\n## License\n\nUI-TARS is licensed under the Apache License 2.0.\n\n## Acknowledgements\nThis project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.\n\nAdditionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. 
These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.\n\n## Citation\nIf you find our paper and code useful in your research, please consider giving the repository a star :star: and citing the paper :pencil:\n\n```BibTeX\n@article{qin2025ui,\n  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},\n  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},\n  journal={arXiv preprint arXiv:2501.12326},\n  year={2025}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2FUI-TARS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytedance%2FUI-TARS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2FUI-TARS/lists"}