{"id":19287188,"url":"https://github.com/njucckevin/SeeClick","last_synced_at":"2025-04-22T04:31:52.137Z","repository":{"id":218636412,"uuid":"746548151","full_name":"njucckevin/SeeClick","owner":"njucckevin","description":"The model, data and code for the visual GUI Agent SeeClick","archived":false,"fork":false,"pushed_at":"2024-08-27T07:00:02.000Z","size":7151,"stargazers_count":173,"open_issues_count":4,"forks_count":9,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-08-27T08:24:37.338Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/njucckevin.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-22T08:24:55.000Z","updated_at":"2024-08-27T08:24:39.670Z","dependencies_parsed_at":"2024-01-29T14:05:28.450Z","dependency_job_id":"8eaa5e55-b6c7-4065-8d61-e1c41d7dfe68","html_url":"https://github.com/njucckevin/SeeClick","commit_stats":null,"previous_names":["njucckevin/seeclick"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/njucckevin%2FSeeClick","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/njucckevin%2FSeeClick/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/njucckevin%2FSeeClick/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/njucckevin%2FSeeClick/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/njucckevin","do
wnload_url":"https://codeload.github.com/njucckevin/SeeClick/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223888466,"owners_count":17220083,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T22:05:37.282Z","updated_at":"2025-04-22T04:31:52.117Z","avatar_url":"https://github.com/njucckevin.png","language":"HTML","funding_links":[],"categories":["Papers"],"sub_categories":["Models","UI Grounding"],"readme":"# SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents\n\n[![arXiv](https://img.shields.io/badge/arXiv-2401.10935-b31b1b.svg)](https://arxiv.org/abs/2401.10935) \n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) \n[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)\n[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n\nThe model, data, and code for the paper: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)\n\nRelease Plans:\n\n- [x] GUI grounding benchmark: *ScreenSpot*\n- [x] Data for the GUI grounding Pre-training of SeeClick\n- [x] Inference code \u0026 model checkpoint\n- [x] Other code and resources\n- [x] Code for pre-training and evaluation on ScreenSpot\n- [x] Code for collecting pre-training data\n\nNews: SeeClick is accepted by ACL 2024.\n\n***\n### GUI Grounding Benchmark: *ScreenSpot*\n\n*ScreenSpot* is an evaluation benchmark for GUI 
grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget). See details and more examples in our paper.\n\nDownload the images and annotations of [*ScreenSpot*](https://box.nju.edu.cn/d/5b8892c1901c4dbeb715/) (or download with [Google Drive](https://drive.google.com/drive/folders/1FuFT05yXOV_QxhwYft85YTLOgaIYm_fS?usp=sharing)). \n\nEach test sample contains: \n* `img_filename`: the interface screenshot file\n* `instruction`: the human instruction\n* `bbox`: the bounding box of the target element corresponding to the instruction\n* `data_type`: \"icon\"/\"text\", indicating the type of the target element\n* `data_souce`: the interface platform, including iOS, Android, macOS, Windows and Web (Gitlab, Shop, Forum and Tool)\n\n![Examples of *ScreenSpot*](assets/screenspot.png)\n\n#### Evaluation Results\n\n| LVLMs      | Model Size | GUI Specific | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |\n|------------|------------|--------------|-------------|--------------------|--------------|---------------------|----------|-----------------|---------|\n| MiniGPT-v2 | 7B         | ❌            | 8.4%        | 6.6%               | 6.2%         | 2.9%                | 6.5%     | 3.4%            | 5.7%    |\n| Qwen-VL    | 9.6B       | ❌            | 9.5%        | 4.8%               | 5.7%         | 5.0%                | 3.5%     | 2.4%            | 5.2%    |\n| GPT-4V     | -          | ❌            | 22.6%       | 24.5%              | 20.2%        | 11.8%               | 9.2%     | 8.8%            | 16.2%   |\n| Fuyu       | 8B         | ✅            | 41.0%       | 1.3%               | 33.0%        | 3.6%                | 33.9%    | 4.4%            | 19.5%   |\n| CogAgent   | 18B        | ✅            | 67.0%       | 24.0%              | **74.2%**    | 20.0%               | **70.4%**| 28.6%           | 47.4%   |\n| SeeClick      
 | 9.6B       | ✅            | **78.0%**   | **52.0%**          | 72.2%        | **30.0%**           | 55.7%    | **32.5%**       | **53.4%**|\n\n\n\u003c!-- ![Results on *ScreenSpot*](assets/screenspot_result.png) --\u003e\n\n***\n### GUI Grounding Pre-training Data for SeeClick\nCheck [data](readme_data.md) for the GUI grounding pre-training datasets,\nincluding the first open source large-scale web GUI grounding corpus collected from Common Crawl.\n\n***\n### Inference code \u0026 model checkpoint\nSeeClick is built on [Qwen-VL](https://github.com/QwenLM/Qwen-VL) and is compatible with its Transformers 🤗 inference code.\n\nRunning inference takes only a few lines of code, as in the example below.\n\nBefore running, set up the environment and install the required packages.\n```bash\npip install -r requirements.txt\n```\n\u003e Note: To fine-tune the model, follow the [setup](https://github.com/njucckevin/SeeClick/blob/main/agent_tasks/readme_agent.md) and install with requirements_agent.txt.\n\nThen,\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom transformers.generation import GenerationConfig\n\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen-VL-Chat\", trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\"SeeClick-ckpt-dir\", device_map=\"cuda\", trust_remote_code=True, bf16=True).eval()\nmodel.generation_config = GenerationConfig.from_pretrained(\"Qwen/Qwen-VL-Chat\", trust_remote_code=True)\n\nimg_path = \"assets/test_img.png\"\nprompt = \"In this UI screenshot, what is the position of the element corresponding to the command \\\"{}\\\" (with point)?\"\n# prompt = \"In this UI screenshot, what is the position of the element corresponding to the command \\\"{}\\\" (with bbox)?\"  # Use this prompt for generating bounding box\nref = \"add an event\"   # response (0.17,0.06)\nref = \"switch to Year\"   # response (0.59,0.06)\nref = \"search for events\"   # response 
(0.82,0.06)\nquery = tokenizer.from_list_format([\n    {'image': img_path}, # Either a local path or a URL\n    {'text': prompt.format(ref)},\n])\nresponse, history = model.chat(tokenizer, query=query, history=None)\nprint(response)\n```\nThe SeeClick checkpoint can be downloaded from [Hugging Face](https://huggingface.co/cckevinn/SeeClick/tree/main).\nReplace `SeeClick-ckpt-dir` with the actual checkpoint directory. \n\nThe prediction output is either a point `(x, y)` or a bounding box `(left, top, right, down)`;\neach value is a decimal in [0, 1] giving the ratio of the corresponding position to the width or height of the image.\nWe recommend point prediction because SeeClick is mainly trained to predict click points on GUIs.\n\nThanks to [Qwen-VL](https://github.com/QwenLM/Qwen-VL) for their powerful model and wonderful open-sourced work.\n\n***\n### Downstream Agent Task\nSee [here](agent_tasks/readme_agent.md) for details of training and testing on three downstream agent tasks;\nit also provides a guideline for fine-tuning SeeClick.\n```\nbash finetune/finetune_lora_ds.sh --save-name SeeClick_test --max-length 704 --micro-batch-size 4 --save-interval 500 \\\n    --train-epochs 10 --nproc-per-node 2 --data-path xxxx/data_sft.json --learning-rate 3e-5 \\\n    --gradient-accumulation-steps 8 --qwen-ckpt xxxx/Qwen-VL-Chat --pretrain-ckpt xxxx/SeeClick-pretrain \\\n    --save-path xxxx/checkpoint_qwen\n```\n* `data-path`: generated SFT data; the format is described [here](https://github.com/QwenLM/Qwen-VL#data-preparation)\n* `qwen-ckpt`: path to the original Qwen-VL checkpoint, used for loading the tokenizer\n* `pretrain-ckpt`: base model for fine-tuning, e.g. 
SeeClick-pretrain or Qwen-VL\n* `save-path`: directory to save training checkpoints\n\nThe fine-tuning scripts are similar to Qwen-VL's, except that we use LoRA to fine-tune customized parameters, as in `finetune/finetune.py lines 315-327`.\nThese scripts fine-tune a pre-trained LVLM with LoRA and multi-GPU training; for more options such as full fine-tuning, Q-LoRA and single-GPU training, please\nrefer to [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master?tab=readme-ov-file#finetuning).\n\n***\n### Pre-training and Evaluation on ScreenSpot\nYou can easily organize the above data yourself for model training and testing on ScreenSpot. \nAs an alternative, we provide a set of scripts used for data processing, pre-training, and testing on ScreenSpot.\n```\ncd pretrain\n```\n#### Data Processing for Pre-Training\n```\npython pretrain_process.py --mobile_imgs xxxx/combined --web_imgs xxxx/seeclick_web_imgs \\\n    --widgetcap_json xxxx/widget_captioning.json --ricosca_json xxxx/ricosca.json \\\n    --screensum_json xxxx/screen_captioning.json --web_json xxxx/seeclick_web.json \\\n    --coco_imgs xxxx/coco/train2017 --llava_json xxxx/llava_instruct_150k.jsonl\n```\nThis generates a dataset of about 1M samples for continual pre-training at `../data/sft_train.json`.\n\n#### GUI Grounding Pre-training\n```\ncd ..\nbash finetune/finetune_lora_ds.sh --save-name seeclick_sft --max-length 768 --micro-batch-size 8 \\\n    --save-interval 4000 --train-epochs 3 --nproc-per-node 8 --data-path ./data/sft_train.json \\\n    --learning-rate 3e-5 --gradient-accumulation-steps 1 --qwen-ckpt xxxx/Qwen-VL-Chat \\\n    --pretrain-ckpt xxxx/Qwen-VL-Chat  --save-path xxxx/checkpoint_qwen\n```\n#### Evaluation on ScreenSpot\n```\ncd pretrain\npython screenspot_test.py --qwen_path xxxx/Qwen-VL-Chat --lora_path xxxx/checkpoint_qwen/seeclick_sft/checkpoint-20000 --screenspot_imgs xxxx/screenspot_imgs --screenspot_test xxxx/ScreenSpot --task all\n```\n***\n### Collecting Pre-training Data from Common Crawl\nWe 
used Selenium to crawl web pages from Common Crawl. See details in this [repo](https://github.com/chuyg1005/seeclick-crawler).\n\n***\n### Citation\n```\n@inproceedings{cheng2024seeclick,\n    title = \"{S}ee{C}lick: Harnessing {GUI} Grounding for Advanced Visual {GUI} Agents\",\n    author = \"Cheng, Kanzhi  and\n      Sun, Qiushi  and\n      Chu, Yougang  and\n      Xu, Fangzhi  and\n      YanTao, Li  and\n      Zhang, Jianbing  and\n      Wu, Zhiyong\",\n    booktitle = \"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = aug,\n    year = \"2024\",\n    address = \"Bangkok, Thailand\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2024.acl-long.505\",\n    pages = \"9313--9332\"\n}\n```\n\n***\n### License \nThis project incorporates specific datasets and checkpoints governed by their original licenses. Users are required to adhere to all terms of these licenses. No additional restrictions are imposed by this project beyond those specified in the original licenses.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnjucckevin%2FSeeClick","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnjucckevin%2FSeeClick","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnjucckevin%2FSeeClick/lists"}