https://github.com/osu-nlp-group/uground
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
artificial-intelligence gui-agents web-agents
- Host: GitHub
- URL: https://github.com/osu-nlp-group/uground
- Owner: OSU-NLP-Group
- Created: 2024-08-02T16:37:08.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-05-02T07:43:45.000Z (about 1 year ago)
- Last Synced: 2025-05-02T08:41:36.012Z (about 1 year ago)
- Topics: artificial-intelligence, gui-agents, web-agents
- Language: Python
- Homepage: https://osu-nlp-group.github.io/UGround/
- Size: 78.6 MB
- Stars: 214
- Watchers: 7
- Forks: 11
- Open Issues: 3
# UGround
This is the official code repository for the project: *Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents* [**ICLR'25 Oral**]. This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).

- [🏠Homepage](https://osu-nlp-group.github.io/UGround)
- [📖Paper](https://arxiv.org/abs/2410.05243)
- [😊Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
- [😊Demo](https://huggingface.co/spaces/orby-osu/UGround)
- [😊Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
## Updates
- 2025/01/23: Our [training data](https://huggingface.co/datasets/osunlp/UGround-V1-Data) for the UGround-V1 series (Initial/Qwen2-VL) has been released. We have also provided a comprehensive evaluation suite, packed with useful resources, to help researchers test GUI agents and grounding models with ease. Try them out! The performance of Qwen2-VL-based UGround-V1 on several benchmarks has also been updated on the [homepage](https://osu-nlp-group.github.io/UGround) (e.g., AndroidWorld: 33->44).
- 2025/01/05: Qwen2-VL-based UGround-V1 achieves SOTA results on ScreenSpot-Pro, a new and comprehensive GUI grounding benchmark, substantially outperforming prior models (18.9->31.1). Check the [results](https://gui-agent.github.io/grounding-leaderboard/) and [our tweet](https://x.com/BoyuGouNLP/status/1876299190889742391).
- 2025/01/03: Qwen2-VL-based UGround-V1 has been released ([2B](https://huggingface.co/osunlp/UGround-V1-2B), [7B](https://huggingface.co/osunlp/UGround-V1-7B), [72B](https://huggingface.co/osunlp/UGround-V1-72B)). Check their performance in [Main Results](#main-results).
- 2024/10/07: Preprint is arXived. Demo is live. Code coming soon.
- 2024/08/06: Website is live. The initial manuscript and results are available.
## Release Plans

- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
- [x] Code
  - [x] [Training and Inference](https://github.com/OSU-NLP-Group/UGround/tree/main/train)
  - [x] Offline Experiments (Code, Results, and Useful Resources)
    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
    - [x] [OmniAct](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
    - [x] [Android Control](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
  - [x] Online Experiments
    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
- [ ] Data Synthesis Pipeline (Coming Soon)
- [x] [Training Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
- [x] Online Demo (HF Spaces)

# Main Results
## GUI Visual Grounding: ScreenSpot (Standard Setting)

| ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ----------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| **UGround (Initial)** | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| Claude (Computer-Use) | | | **98.2** | **85.6** | 79.9 | 57.1 | **92.2** | 84.5 | 82.9 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| Project Mariner | | | | | | | | | 84.0 |
| CogAgent-9B-20241220 | GLM-4V-9B | | | | | | | | 85.4 |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
| AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | **95.4** | 77.9 | 91.3 | 85.9 | 88.4 |
| **UGround-V1-72B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | **85.7** | 90.4 | **87.9** | **89.4** |
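
The table does not say how the Avg column is derived. The reported values are consistent with an unweighted mean of the six split scores, rounded to one decimal place. The sketch below checks that reading against two rows copied from the table; the helper name `macro_average` is ours, not part of the repository.

```python
from statistics import mean

# Split order: Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
# Values are copied from the table above.
scores = {
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
    "UGround-V1-72B (Qwen2-VL)": [94.1, 83.4, 94.9, 85.7, 90.4, 87.9],
}


def macro_average(split_scores):
    """Unweighted mean over the six splits, rounded to one decimal place.

    Assumption: this is how Avg is computed; it matches the reported numbers
    but is not documented in the table itself.
    """
    return round(mean(split_scores), 1)


for model, split_scores in scores.items():
    print(f"{model}: {macro_average(split_scores)}")
```

Under this reading each split counts equally, regardless of how many examples it contains.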
## GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Agent-ScreenSpot | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | ---- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | SeeClick | Qwen-VL | Web, Mobile, ... | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | UGround (Initial) | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | Molmo-72B | | | 94.1 | 79.0 | 92.3 | 70.0 | 88.7 | 67.0 | 81.9 |
| GPT-4o | Molmo-7B-D | | | 93.4 | 80.8 | 91.2 | 72.9 | 88.7 | 69.4 | 82.7 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
| GPT-4o | UGround-V1-72B (Qwen2-VL)| Qwen2-VL | UGround-V1 | 94.5 | 79.9 | 93.8 | 75.0 | 88.7 | 75.2 | 84.5 |
## Inference of Qwen2-VL-Based UGround
### Python Environment (following Qwen2-VL's official repo)
```bash
# inference
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
pip install 'vllm==0.6.1'
```
### vLLM server
```bash
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
```
or
```bash
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
```
You can find more instructions on training and inference in [Qwen2-VL's Official Repo](https://github.com/QwenLM/Qwen2-VL).
Here we use float16 instead of bfloat16 for more stable decoding (see details in [vLLM's doc](https://docs.vllm.ai/en/latest/usage/faq.html#:~:text=Mitigation%20Strategies)).
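
Requests to the server carry the screenshot inline as a base64 data URL (the prompt template in the next section expects a `base64_image` string). A minimal sketch of producing that string from an image file follows; the helper names `encode_image_to_base64` and `to_data_url` are illustrative and not part of this repository.

```python
import base64
from pathlib import Path


def encode_image_to_base64(image_path: str) -> str:
    """Read an image file and return its raw bytes as a base64 ASCII string."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")


def to_data_url(base64_image: str, mime: str = "image/jpeg") -> str:
    """Wrap a base64 string in a data URL; the MIME type should match the file."""
    return f"data:{mime};base64,{base64_image}"
```

The bytes are sent as-is without re-encoding, so a PNG screenshot would need `mime="image/png"` (or a conversion to JPEG first) to match the `image/jpeg` prefix used in the template.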
### Visual Grounding Prompt
```python
from openai import AsyncOpenAI  # requires the `openai` package

# Assumption: the vLLM server above is running locally on vLLM's default port (8000)
# and was started with --api-key token-abc123. Adjust both to match your setup.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")


def format_openai_template(description: str, base64_image: str):
    """Build a single user turn: the screenshot first, then the grounding prompt."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of the interest.
Description: {description}
Answer:"""
                },
            ],
        },
    ]


async def ground(description: str, base64_image: str, model_path: str = "osunlp/UGround-V1-7B") -> str:
    """Send one grounding request and return the model's raw text answer."""
    messages = format_openai_template(description, base64_image)
    completion = await client.chat.completions.create(
        model=model_path,
        messages=messages,
        temperature=0,  # REMEMBER to set temperature to ZERO!
    )
    return completion.choices[0].message.content

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
```
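
The rescaling noted in the comments above can be sketched as follows. This assumes the model replies with a literal `(x, y)` pair of non-negative integers, as the prompt requests; real outputs may deviate, so the parser raises an error rather than guessing. The function names `parse_point` and `to_pixels` are illustrative, not part of this repository.

```python
import re


def parse_point(response: str) -> tuple[int, int]:
    """Extract the first "(x, y)" integer pair from the model's text output."""
    match = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", response)
    if match is None:
        raise ValueError(f"No (x, y) point found in: {response!r}")
    return int(match.group(1)), int(match.group(2))


def to_pixels(point: tuple[int, int], width: int, height: int) -> tuple[float, float]:
    """Map a point from the model's [0, 1000) range to pixel coordinates."""
    x, y = point
    return x / 1000 * width, y / 1000 * height


# Example: a 1920x1080 screenshot and a model answer of "(500, 250)".
print(to_pixels(parse_point("(500, 250)"), width=1920, height=1080))
```

The `width` and `height` here are those of the original screenshot, so the resulting point can be used directly as a click target on that screen.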

## Citation Information
If you find this work useful, please consider starring our repo and citing our papers:
```
@inproceedings{gou2025uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for {GUI} Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=kxnoqaisCT}
}
@inproceedings{zheng2024seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=piecKJ2DlB},
}
```