{"id":18372682,"url":"https://github.com/os-copilot/os-atlas","last_synced_at":"2025-04-11T00:58:25.634Z","repository":{"id":260258302,"uuid":"880798084","full_name":"OS-Copilot/OS-Atlas","owner":"OS-Copilot","description":"OS-ATLAS: A Foundation Action Model For Generalist GUI Agents","archived":false,"fork":false,"pushed_at":"2025-02-20T08:29:25.000Z","size":3944,"stargazers_count":317,"open_issues_count":12,"forks_count":16,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-11T00:58:17.626Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OS-Copilot.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-30T11:32:49.000Z","updated_at":"2025-04-09T15:22:17.000Z","dependencies_parsed_at":"2024-11-17T17:32:20.373Z","dependency_job_id":"e3002e80-be18-4180-b87e-b670b832c162","html_url":"https://github.com/OS-Copilot/OS-Atlas","commit_stats":null,"previous_names":["os-copilot/os-atlas"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OS-Copilot%2FOS-Atlas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OS-Copilot%2FOS-Atlas/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OS-Copilot%2FOS-Atlas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OS-Copilot%2FOS-Atlas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OS-Copil
ot","download_url":"https://codeload.github.com/OS-Copilot/OS-Atlas/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248322611,"owners_count":21084336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T00:06:45.530Z","updated_at":"2025-04-11T00:58:25.616Z","avatar_url":"https://github.com/OS-Copilot.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OS-Atlas: A Foundation Action Model For Generalist GUI Agents\n\n\u003cdiv align=\"center\"\u003e\n\n[\\[🏠Homepage\\]](https://osatlas.github.io) [\\[💻Code\\]](https://github.com/OS-Copilot/OS-Atlas) [\\[🚀Quick Start\\]](#quick-start) [\\[📝Paper\\]](https://arxiv.org/abs/2410.23218) [\\[🤗Models\\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045)[\\[🤗Data\\]](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) [\\[🤗ScreenSpot-v2\\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2) \n\n\u003c/div\u003e\n\n## Overview\n\n**OS-Atlas** paper is accepted by ICLR 2025.\n\n![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)\n\n\u003c!-- ## TODO List\n- [] \n- [ ]\n- [ ]\n- [ ] --\u003e\n\n## Quick Start\nOS-Atlas provides two base grounding models: [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B) and [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B). 
OS-Atlas-Base-4B is finetuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B), and OS-Atlas-Base-7B is finetuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).\n\nThis section provides instructions on how to run inference with our pre-trained grounding models.\n\n**Notes:** Our models accept images of any size as input. The model outputs are normalized to relative coordinates within a 0-1000 range (either a center point or a bounding box defined by top-left and bottom-right coordinates). For visualization, please remember to convert these relative coordinates back to the original image dimensions.\n\n### OS-Atlas-Base-4B\nFirst, install the `transformers` library:\n```\npip install transformers\n```\nFor additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).\n\nInference code example:\n```python\nimport numpy as np\nimport torch\nimport torchvision.transforms as T\nfrom PIL import Image\nfrom torchvision.transforms.functional import InterpolationMode\nfrom transformers import AutoModel, AutoTokenizer\nIMAGENET_MEAN = (0.485, 0.456, 0.406)\nIMAGENET_STD = (0.229, 0.224, 0.225)\n\ndef build_transform(input_size):\n    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD\n    transform = T.Compose([\n        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),\n        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),\n        T.ToTensor(),\n        T.Normalize(mean=MEAN, std=STD)\n    ])\n    return transform\n\ndef find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):\n    best_ratio_diff = float('inf')\n    best_ratio = (1, 1)\n    area = width * height\n    for ratio in target_ratios:\n        target_aspect_ratio = ratio[0] / ratio[1]\n        ratio_diff = abs(aspect_ratio - target_aspect_ratio)\n        if ratio_diff \u003c best_ratio_diff:\n            best_ratio_diff = 
ratio_diff\n            best_ratio = ratio\n        elif ratio_diff == best_ratio_diff:\n            if area \u003e 0.5 * image_size * image_size * ratio[0] * ratio[1]:\n                best_ratio = ratio\n    return best_ratio\n\ndef dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):\n    orig_width, orig_height = image.size\n    aspect_ratio = orig_width / orig_height\n\n    # calculate the existing image aspect ratio\n    target_ratios = set(\n        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if\n        i * j \u003c= max_num and i * j \u003e= min_num)\n    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])\n\n    # find the closest aspect ratio to the target\n    target_aspect_ratio = find_closest_aspect_ratio(\n        aspect_ratio, target_ratios, orig_width, orig_height, image_size)\n\n    # calculate the target width and height\n    target_width = image_size * target_aspect_ratio[0]\n    target_height = image_size * target_aspect_ratio[1]\n    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]\n\n    # resize the image\n    resized_img = image.resize((target_width, target_height))\n    processed_images = []\n    for i in range(blocks):\n        box = (\n            (i % (target_width // image_size)) * image_size,\n            (i // (target_width // image_size)) * image_size,\n            ((i % (target_width // image_size)) + 1) * image_size,\n            ((i // (target_width // image_size)) + 1) * image_size\n        )\n        # split the image\n        split_img = resized_img.crop(box)\n        processed_images.append(split_img)\n    assert len(processed_images) == blocks\n    if use_thumbnail and len(processed_images) != 1:\n        thumbnail_img = image.resize((image_size, image_size))\n        processed_images.append(thumbnail_img)\n    return processed_images\n\ndef load_image(image_file, input_size=448, max_num=12):\n    image = 
Image.open(image_file).convert('RGB')\n    transform = build_transform(input_size=input_size)\n    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)\n    pixel_values = [transform(image) for image in images]\n    pixel_values = torch.stack(pixel_values)\n    return pixel_values\n\n# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.\npath = 'OS-Copilot/OS-Atlas-Base-4B'\nmodel = AutoModel.from_pretrained(\n    path,\n    torch_dtype=torch.bfloat16,\n    low_cpu_mem_usage=True,\n    trust_remote_code=True).eval().cuda()\ntokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)\n\n# set the max number of tiles in `max_num`\npixel_values = load_image('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()\ngeneration_config = dict(max_new_tokens=1024, do_sample=True)\n\nquestion = \"In the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions(with point).\\n\\\"'Champions League' link\\\"\"\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n```\n\n\n### OS-Atlas-Base-7B\nFirst, ensure that the necessary dependencies are installed:\n```\npip install transformers\npip install qwen-vl-utils\n```\n\nInference code example:\n```python\nfrom transformers import Qwen2VLForConditionalGeneration, AutoProcessor\nfrom qwen_vl_utils import process_vision_info\n\n# Default: Load the model on the available device(s)\nmodel = Qwen2VLForConditionalGeneration.from_pretrained(\n    \"OS-Copilot/OS-Atlas-Base-7B\", torch_dtype=\"auto\", device_map=\"auto\"\n)\nprocessor = AutoProcessor.from_pretrained(\"OS-Copilot/OS-Atlas-Base-7B\")\n\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n            
    \"type\": \"image\",\n                \"image\": \"./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png\",\n            },\n            {\"type\": \"text\", \"text\": \"In this UI screenshot, what is the position of the element corresponding to the command \\\"switch language of current page\\\" (with bbox)?\"},\n        ],\n    }\n]\n\n\n# Preparation for inference\ntext = processor.apply_chat_template(\n    messages, tokenize=False, add_generation_prompt=True\n)\nimage_inputs, video_inputs = process_vision_info(messages)\ninputs = processor(\n    text=[text],\n    images=image_inputs,\n    videos=video_inputs,\n    padding=True,\n    return_tensors=\"pt\",\n)\ninputs = inputs.to(\"cuda\")\n\n# Inference: Generation of the output\ngenerated_ids = model.generate(**inputs, max_new_tokens=128)\n\ngenerated_ids_trimmed = [\n    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n]\n\noutput_text = processor.batch_decode(\n    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False\n)\nprint(output_text)\n# \u003c|object_ref_start|\u003elanguage switch\u003c|object_ref_end|\u003e\u003c|box_start|\u003e(576,12),(592,42)\u003c|box_end|\u003e\u003c|im_end|\u003e\n```\n\n\n## Citation\nIf you find this repository helpful, feel free to cite our paper:\n```bibtex\n@article{wu2024atlas,\n        title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},\n        author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},\n        journal={arXiv preprint arXiv:2410.23218},\n        year={2024}\n      
}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fos-copilot%2Fos-atlas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fos-copilot%2Fos-atlas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fos-copilot%2Fos-atlas/lists"}