{"id":25616713,"url":"https://github.com/microsoft/omniparser","last_synced_at":"2025-05-13T20:04:42.224Z","repository":{"id":278516718,"uuid":"860252770","full_name":"microsoft/OmniParser","owner":"microsoft","description":"A simple screen parsing tool towards pure vision based GUI agent","archived":false,"fork":false,"pushed_at":"2025-03-26T20:33:47.000Z","size":44232,"stargazers_count":21932,"open_issues_count":191,"forks_count":1841,"subscribers_count":170,"default_branch":"master","last_synced_at":"2025-05-06T19:52:10.494Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-20T05:18:18.000Z","updated_at":"2025-05-06T15:37:30.000Z","dependencies_parsed_at":"2025-02-20T07:55:18.028Z","dependency_job_id":"6be9ac6f-840b-48a2-9ef9-584a8b8c13a7","html_url":"https://github.com/microsoft/OmniParser","commit_stats":null,"previous_names":["microsoft/omniparser"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FOmniParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FOmniParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FOmniParser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FOmniParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/OmniParser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254020477,"owners_count":22000750,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-22T04:18:09.402Z","updated_at":"2025-05-13T20:04:42.205Z","avatar_url":"https://github.com/microsoft.png","language":"Jupyter Notebook","readme":"# OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"imgs/logo.png\" alt=\"Logo\"\u003e\n\u003c/p\u003e\n\u003c!-- \u003ca href=\"https://trendshift.io/repositories/12975\" target=\"_blank\"\u003e\u003cimg src=\"https://trendshift.io/api/badge/repositories/12975\" alt=\"microsoft%2FOmniParser | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/\u003e\u003c/a\u003e --\u003e\n\n[![arXiv](https://img.shields.io/badge/Paper-green)](https://arxiv.org/abs/2408.00203)\n[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[V2 Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/)] [[Models V2](https://huggingface.co/microsoft/OmniParser-v2.0)] [[Models V1.5](https://huggingface.co/microsoft/OmniParser)] [[HuggingFace Space Demo](https://huggingface.co/spaces/microsoft/OmniParser-v2)]\n\n**OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. \n\n## News\n- [2025/3] We support local logging of trajecotry so that you can use OmniParser+OmniTool to build training data pipeline for your favorate agent in your domain. [Documentation WIP]\n- [2025/3] We are gradually adding multi agents orchstration and improving user interface in OmniTool for better experience.\n- [2025/2] We release OmniParser V2 [checkpoints](https://huggingface.co/microsoft/OmniParser-v2.0). [Watch Video](https://1drv.ms/v/c/650b027c18d5a573/EWXbVESKWo9Buu6OYCwg06wBeoM97C6EOTG6RjvWLEN1Qg?e=alnHGC)\n- [2025/2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. [Watch Video](https://1drv.ms/v/c/650b027c18d5a573/EehZ7RzY69ZHn-MeQHrnnR4BCj3by-cLLpUVlxMjF4O65Q?e=8LxMgX)\n- [2025/1] V2 is coming. We achieve new state of the art results 39.5% on the new grounding benchmark [Screen Spot Pro](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding/tree/main) with OmniParser v2 (will be released soon)! Read more details [here](https://github.com/microsoft/OmniParser/tree/master/docs/Evaluation.md).\n- [2024/11] We release an updated version, OmniParser V1.5 which features 1) more fine grained/small icon detection, 2) prediction of whether each screen element is interactable or not. Examples in the demo.ipynb. \n- [2024/10] OmniParser was the #1 trending model on huggingface model hub (starting 10/29/2024). \n- [2024/10] Feel free to checkout our demo on [huggingface space](https://huggingface.co/spaces/microsoft/OmniParser)! (stay tuned for OmniParser + Claude Computer Use)\n- [2024/10] Both Interactive Region Detection Model and Icon functional description model are released! [Hugginface models](https://huggingface.co/microsoft/OmniParser)\n- [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)! \n\n## Install \nFirst clone the repo, and then install environment:\n```python\ncd OmniParser\nconda create -n \"omni\" python==3.12\nconda activate omni\npip install -r requirements.txt\n```\n\nEnsure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:\n```\n   # download the model checkpoints to local directory OmniParser/weights/\n   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 \"$f\" --local-dir weights; done\n   mv weights/icon_caption weights/icon_caption_florence\n```\n\n\u003c!-- ## [deprecated]\nThen download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2. \n\nFor v1: \nconvert the safetensor to .pt file. \n```python\npython weights/convert_safetensor_to_pt.py\n\nFor v1.5: \ndownload 'model_v1_5.pt' from https://huggingface.co/microsoft/OmniParser/tree/main/icon_detect_v1_5, make a new dir: weights/icon_detect_v1_5, and put it inside the folder. No weight conversion is needed. \n``` --\u003e\n\n## Examples:\nWe put together a few simple examples in the demo.ipynb. \n\n## Gradio Demo\nTo run gradio demo, simply run:\n```python\npython gradio_demo.py\n```\n\n## Model Weights License\nFor the model checkpoints on huggingface model hub, please note that icon_detect model is under AGPL license since it is a license inherited from the original yolo model. And icon_caption_blip2 \u0026 icon_caption_florence is under MIT license. Please refer to the LICENSE file in the folder of each model: https://huggingface.co/microsoft/OmniParser.\n\n## 📚 Citation\nOur technical report can be found [here](https://arxiv.org/abs/2408.00203).\nIf you find our work useful, please consider citing our work:\n```\n@misc{lu2024omniparserpurevisionbased,\n      title={OmniParser for Pure Vision Based GUI Agent}, \n      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},\n      year={2024},\n      eprint={2408.00203},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2408.00203}, \n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fomniparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fomniparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fomniparser/lists"}