{"id":19520131,"url":"https://github.com/osu-nlp-group/uground","last_synced_at":"2025-05-16T07:07:40.108Z","repository":{"id":251689941,"uuid":"837299429","full_name":"OSU-NLP-Group/UGround","owner":"OSU-NLP-Group","description":"[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents","archived":false,"fork":false,"pushed_at":"2025-05-02T07:43:45.000Z","size":82384,"stargazers_count":214,"open_issues_count":3,"forks_count":11,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-02T08:41:36.012Z","etag":null,"topics":["artificial-intelligence","gui-agents","web-agents"],"latest_commit_sha":null,"homepage":"https://osu-nlp-group.github.io/UGround/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OSU-NLP-Group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-02T16:37:08.000Z","updated_at":"2025-05-02T07:43:49.000Z","dependencies_parsed_at":"2024-08-05T04:44:05.998Z","dependency_job_id":"226b1510-f161-4712-95e3-f6d7ce8b7a03","html_url":"https://github.com/OSU-NLP-Group/UGround","commit_stats":{"total_commits":68,"total_committers":4,"mean_commits":17.0,"dds":0.3529411764705882,"last_synced_commit":"965535c6951b28a16694fed21b2ed3369a14d242"},"previous_names":["osu-nlp-group/uground"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FUGround","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FUGround/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FUGround/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FUGround/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OSU-NLP-Group","download_url":"https://codeload.github.com/OSU-NLP-Group/UGround/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254485065,"owners_count":22078767,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","gui-agents","web-agents"],"created_at":"2024-11-11T00:23:57.399Z","updated_at":"2025-05-16T07:07:35.093Z","avatar_url":"https://github.com/OSU-NLP-Group.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n\n# UGround\nThis is the official code repository for the project: *Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents* [**ICLR'25 Oral**]. 
This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).\n\u003cimg width=\"1556\" alt=\"image\" src=\"https://github.com/user-attachments/assets/18c6a9f4-31cc-4817-a252-bfd0dbaf3fd6\"\u003e\n\n- [🏠Homepage](https://osu-nlp-group.github.io/UGround)\n- [📖Paper](https://arxiv.org/abs/2410.05243)\n- [😊Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)\n- [😊Demo](https://huggingface.co/spaces/orby-osu/UGround)\n- [😊Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)\n\n\u003ch3\u003eUpdates\u003c/h3\u003e\n\n- 2025/01/23: Our [training data](https://huggingface.co/datasets/osunlp/UGround-V1-Data) for the UGround-V1 series (Initial/Qwen2-VL) has been released. We have also provided a comprehensive evaluation suite packed with useful resources to help researchers test GUI agents and grounding models with ease. Try them out! The performance of Qwen2-VL-based UGround-V1 on several benchmarks has also been updated on the [homepage](https://osu-nlp-group.github.io/UGround) (e.g., AndroidWorld: 33-\u003e44). \n\n- 2025/01/05: Qwen2-VL-based UGround-V1 achieves SOTA results on ScreenSpot-Pro, a new and comprehensive GUI grounding benchmark, substantially outperforming prior models (18.9-\u003e31.1). Check the [results](https://gui-agent.github.io/grounding-leaderboard/) and [our tweet](https://x.com/BoyuGouNLP/status/1876299190889742391).\n\n- 2025/01/03: Qwen2-VL-based UGround-V1 has been released ([2B](https://huggingface.co/osunlp/UGround-V1-2B), [7B](https://huggingface.co/osunlp/UGround-V1-7B), [72B](https://huggingface.co/osunlp/UGround-V1-72B)). Check their performance in [Main Results](#main-results).\n\n- 2024/10/07: Preprint is arXived. Demo is live. Code coming soon.\n\n- 2024/08/06: Website is live. 
The initial manuscript and results are available.\n\n\n\n\u003ch3\u003eRelease Plans:\u003c/h3\u003e\n\n- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)\n  - [x] Initial Version (the one used in the paper)\n  - [x] Qwen2-VL-Based V1 (2B, 7B, 72B)\n- [x] Code\n  - [x] [Training and Inference](https://github.com/OSU-NLP-Group/UGround/tree/main/train)\n  - [x] Offline Experiments (Code, Results, and Useful Resources)\n    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)\n    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)\n    - [x] [OmniAct](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)\n    - [x] [Android Control](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)\n  - [x] Online Experiments\n    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)\n    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)\n  - [ ] Data Synthesis Pipeline (Coming Soon)\n- [x] [Training Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)\n- [x] Online Demo (HF Spaces)\n\n\n# Main Results\n\n## GUI Visual Grounding: ScreenSpot (Standard Setting)\n\n![image](https://github.com/user-attachments/assets/d608c189-2cac-4fd9-9b25-d60847916159)\n\n| ScreenSpot (Standard)         | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |\n| ----------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |\n| InternVL-2-4B                 | InternVL-2       |                  | 9.2         | 4.8         | 4.6          | 4.3          | 0.9      | 0.1      | 4.0      |\n| Groma                         | Groma            |       
           | 10.3        | 2.6         | 4.6          | 4.3          | 5.7      | 3.4      | 5.2      |\n| Qwen-VL                       | Qwen-VL          |                  | 9.5         | 4.8         | 5.7          | 5.0          | 3.5      | 2.4      | 5.2      |\n| MiniGPT-v2                    | MiniGPT-v2       |                  | 8.4         | 6.6         | 6.2          | 2.9          | 6.5      | 3.4      | 5.7      |\n| GPT-4                         |                  |                  | 22.6        | 24.5        | 20.2         | 11.8         | 9.2      | 8.8      | 16.2     |\n| GPT-4o                        |                  |                  | 20.2        | 24.9        | 21.1         | 23.6         | 12.2     | 7.8      | 18.3     |\n| Fuyu                          | Fuyu             |                  | 41.0        | 1.3         | 33.0         | 3.6          | 33.9     | 4.4      | 19.5     |\n| Qwen-GUI                      | Qwen-VL          | GUICourse        | 52.4        | 10.9        | 45.9         | 5.7          | 43.0     | 13.6     | 28.6     |\n| Ferret-UI-Llama8b             | Ferret-UI        |                  | 64.5        | 32.3        | 45.9         | 11.4         | 28.3     | 11.7     | 32.3     |\n| Qwen2-VL                      | Qwen2-VL         |                  | 61.3        | 39.3        | 52.0         | 45.0         | 33.0     | 21.8     | 42.1     |\n| CogAgent                      | CogAgent         |                  | 67.0        | 24.0        | 74.2         | 20.0         | 70.4     | 28.6     | 47.4     |\n| SeeClick                      | Qwen-VL          | SeeClick         | 78.0        | 52.0        | 72.2         | 30.0         | 55.7     | 32.5     | 53.4     |\n| OS-Atlas-Base-4B              | InternVL-2       | OS-Atlas         | 85.7        | 58.5        | 72.2         | 45.7         | 82.6     | 63.1     | 68.0     |\n| OmniParser                    |                  |                  | 93.9        | 57.0 
       | 91.3         | 63.6         | 81.3     | 51.0     | 73.0     |\n| **UGround (Initial)**         | LLaVA-UGround-V1 | UGround-V1       | 82.8        | 60.3        | 82.5         | 63.6         | 80.4     | 70.4     | 73.3     |\n| Iris                          | Iris             | SeeClick         | 85.3        | 64.2        | 86.7         | 57.5         | 82.6     | 71.2     | 74.6     |\n| ShowUI-G                      | ShowUI           | ShowUI           | 91.6        | 69.0        | 81.8         | 59.0         | 83.0     | 65.5     | 75.0     |\n| ShowUI                        | ShowUI           | ShowUI           | 92.3        | 75.5        | 76.3         | 61.1         | 81.7     | 63.6     | 75.1     |\n| Molmo-7B-D                    |                  |                  | 85.4        | 69.0        | 79.4         | 70.7         | 81.3     | 65.5     | 75.2     |\n| **UGround-V1-2B (Qwen2-VL)**  | Qwen2-VL         | UGround-V1       | 89.4        | 72.0        | 88.7         | 65.7         | 81.3     | 68.9     | 77.7     |\n| Molmo-72B                     |                  |                  | 92.7        | 79.5        | 86.1         | 64.3         | 83.0     | 66.0     | 78.6     |\n| Aguvis-G-7B                   | Qwen2-VL         | Aguvis-Stage-1   | 88.3        | 78.2        | 88.1         | 70.7         | 85.7     | 74.8     | 81.0     |\n| OS-Atlas-Base-7B              | Qwen2-VL         | OS-Atlas         | 93.0        | 72.9        | 91.8         | 62.9         | 90.9     | 74.3     | 81.0     |\n| Aria-UI                       | Aria             | Aria-UI          | 92.3        | 73.8        | 93.3         | 64.3         | 86.5     | 76.2     | 81.1     |\n| Claude (Computer-Use)         |                  |                  | **98.2**    | **85.6**    | 79.9         | 57.1         | **92.2** | 84.5     | 82.9     |\n| Aguvis-7B                     | Qwen2-VL         | Aguvis-Stage-1\u00262 | 95.6        | 77.7        | 93.8         | 
67.1         | 88.3     | 75.2     | 83.0     |\n| Project Mariner               |                  |                  |             |             |              |              |          |          | 84.0     |\n| CogAgent-9B-20241220          | GLM-4V-9B        |                  |             |             |              |              |          |          | 85.4     |\n| **UGround-V1-7B (Qwen2-VL)**  | Qwen2-VL         | UGround-V1       | 93.0        | 79.9        | 93.8         | 76.4         | 90.9     | 84.0     | 86.3     |\n| AGUVIS-72B                    | Qwen2-VL         | Aguvis-Stage-1\u00262 | 94.5        | 85.2        | **95.4**     | 77.9         | 91.3     | 85.9     | 88.4     |\n| **UGround-V1-72B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | 94.1        | 83.4        | 94.9         | **85.7**     | 90.4     | **87.9** | **89.4** |\n\n\n\n\n\n\n\n\n\n## GUI Visual Grounding: ScreenSpot (Agent Setting)\n\n\n\n\n\n\n| Planner | Agent-Screenspot         | arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg  |\n| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | ---- |\n| GPT-4o  | Qwen-VL                  | Qwen-VL          |                  | 21.3        | 21.4        | 18.6         | 10.7         | 9.1      | 5.8      | 14.5 |\n| GPT-4o  | Qwen-GUI                 | Qwen-VL          | GUICourse        | 67.8        | 24.5        | 53.1         | 16.4         | 50.4     | 18.5     | 38.5 |\n| GPT-4o  | SeeClick                 | Qwen-VL          | Web, Mobile, ... 
| 81.0        | 59.8        | 69.6         | 33.6         | 43.9     | 26.2     | 52.4 |\n| GPT-4o  | OS-Atlas-Base-4B         | InternVL         | OS-Atlas         | 94.1        | 73.8        | 77.8         | 47.1         | 86.5     | 65.3     | 74.1 |\n| GPT-4o  | UGround (Initial)        | LLaVA-UGround-V1 | UGround-V1       | 93.4        | 76.9        | 92.8         | 67.9         | 88.7     | 68.9     | 81.4 |\n| GPT-4o  | UGround-V1-2B (Qwen2-VL) | Qwen2-VL         | UGround-V1       | 94.1        | 77.7        | 92.8         | 63.6         | 90.0     | 70.9     | 81.5 |\n| GPT-4o  | Molmo-72B                |                  |                  | 94.1        | 79.0        | 92.3         | 70.0         | 88.7     | 67.0     | 81.9 |\n| GPT-4o  | Molmo-7B-D               |                  |                  | 93.4        | 80.8        | 91.2         | 72.9         | 88.7     | 69.4     | 82.7 |\n| GPT-4o  | OS-Atlas-Base-7B         | Qwen2-VL         | OS-Atlas         | 93.8        | 79.9        | 90.2         | 66.4         | 92.6     | 79.1     | 83.7 |\n| GPT-4o  | UGround-V1-7B (Qwen2-VL) | Qwen2-VL         | UGround-V1       | 94.1        | 79.9        | 93.3         | 73.6         | 89.6     | 73.3     | 84.0 |\n| GPT-4o  | UGround-V1-72B (Qwen2-VL)| Qwen2-VL         | UGround-V1       | 94.5        | 79.9        | 93.8         | 75.0         | 88.7     | 75.2     | 84.5 |\n\n\n\n## Inference of Qwen2-VL-Based UGround\n\n### Python Environment (following Qwen2-VL's official repo)\n\n```bash\n# inference\npip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830\npip install accelerate\npip install qwen-vl-utils\npip install 'vllm==0.6.1' \n```\n\n\n### vLLM server\n\n```bash\nvllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16\n```\nor\n\n```bash\npython -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16 \n```\nYou 
can find more instructions on training and inference in [Qwen2-VL's Official Repo](https://github.com/QwenLM/Qwen2-VL).\n\nHere we use float16 instead of bfloat16 for more stable decoding (see details in [vLLM's doc](https://docs.vllm.ai/en/latest/usage/faq.html#:~:text=Mitigation%20Strategies)).\n\n### Visual Grounding Prompt\n```python\ndef format_openai_template(description: str, base64_image: str):\n    return [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": f\"data:image/jpeg;base64,{base64_image}\"},\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"\"\"\n  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.\n\n  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.\n  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.\n  - Your answer should be a single string (x, y) corresponding to the point of the interest.\n\n  Description: {description}\n\n  Answer:\"\"\"\n                },\n            ],\n        },\n    ]\n\n\n# Assumption: `client` is an async OpenAI-compatible client (e.g. openai.AsyncOpenAI) pointed at the vLLM server,\n# and this call runs inside an async function, since it uses `await`.\n# `description`, `base64_image`, and `args.model_path` are supplied by the caller.\nmessages = format_openai_template(description, base64_image)\n\ncompletion = await client.chat.completions.create(\n    model=args.model_path,\n    messages=messages,\n    temperature=0  # REMEMBER to set temperature to ZERO!\n# REMEMBER to set temperature to ZERO!\n# REMEMBER to set temperature to ZERO!\n)\n\n# The output coordinates are in the range [0, 1000), which is compatible with the original Qwen2-VL,\n# so the actual pixel coordinates are (x / 1000 * width, y / 1000 * height).\n\n```\n\n\n![Untitled design](https://github.com/user-attachments/assets/31758aff-7fc8-4c83-a259-86dc27a5b90a)\n\n\n## Citation 
Information\n\n\nIf you find this work useful, please consider starring our repo and citing our papers: \n\n```\n@inproceedings{gou2025uground,\ntitle={Navigating the Digital World as Humans Do: Universal Visual Grounding for {GUI} Agents},\nauthor={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},\nbooktitle={The Thirteenth International Conference on Learning Representations},\nyear={2025},\nurl={https://openreview.net/forum?id=kxnoqaisCT}\n}\n\n@inproceedings{zheng2024seeact,\n  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},\n  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},\n  booktitle={Forty-first International Conference on Machine Learning},\n  year={2024},\n  url={https://openreview.net/forum?id=piecKJ2DlB},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fosu-nlp-group%2Fuground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fosu-nlp-group%2Fuground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fosu-nlp-group%2Fuground/lists"}