{"id":20216294,"url":"https://github.com/thudm/cogagent","last_synced_at":"2025-05-15T09:00:23.633Z","repository":{"id":213498564,"uuid":"724535804","full_name":"THUDM/CogAgent","owner":"THUDM","description":"An open-sourced end-to-end VLM-based GUI Agent","archived":false,"fork":false,"pushed_at":"2025-04-04T13:29:55.000Z","size":5362,"stargazers_count":875,"open_issues_count":27,"forks_count":67,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-04-08T17:16:21.636Z","etag":null,"topics":["agent","computer-use","glm","gui-agent","vlm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-28T09:28:08.000Z","updated_at":"2025-04-08T11:18:48.000Z","dependencies_parsed_at":"2024-12-23T15:19:49.775Z","dependency_job_id":"30eedecf-0d1d-47a2-abec-4b41898df686","html_url":"https://github.com/THUDM/CogAgent","commit_stats":null,"previous_names":["thudm/cogagent"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogAgent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogAgent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogAgent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogAgent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https
://codeload.github.com/THUDM/CogAgent/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254310509,"owners_count":22049467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","computer-use","glm","gui-agent","vlm"],"created_at":"2024-11-14T06:27:17.639Z","updated_at":"2025-05-15T09:00:23.586Z","avatar_url":"https://github.com/THUDM.png","language":"Python","readme":"# CogAgent: An open-sourced VLM-based GUI Agent\n\n[中文文档](README_zh.md)\n\n- 🔥 🆕 **December 2024:** We open-sourced **the latest version of the CogAgent-9B-20241220 model**. Compared to the\n  previous version of CogAgent, `CogAgent-9B-20241220` features significant improvements in GUI perception, reasoning\n  accuracy, action space completeness, task universality, and generalization. 
It supports bilingual (Chinese and\n  English) interaction through both screen captures and natural language.\n\n- 🏆 **June 2024:** CogAgent was accepted by **CVPR 2024** and recognized as a conference Highlight (top 3%).\n\n- **December 2023:** We **open-sourced the first GUI Agent**: **CogAgent** (with the former repository\n  available [here](https://github.com/THUDM/CogVLM)) and **published the corresponding paper:\n  📖 [CogAgent Paper](https://arxiv.org/abs/2312.08914)**.\n\n## Model Introduction\n\n|        Model         |                                                                                                                                                 Model Download Links                                                                                                                                                 | Technical Documentation                                                                                                                                                                                                               | Online Demo                                                                                                                                                                                                                                     |                                                          
\n|:--------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|   \n| cogagent-9b-20241220 | [🤗 HuggingFace](https://huggingface.co/THUDM/cogagent-9b-20241220)\u003cbr\u003e [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/cogagent-9b-20241220) \u003cbr\u003e [🟣 WiseModel](https://wisemodel.cn/models/ZhipuAI/cogagent-9b-20241220) \u003cbr\u003e[🧩 Modelers (Ascend)](https://modelers.cn/models/zhipuai/cogagent-9b-20241220) | [📄 Official Technical Blog](https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report)\u003cbr/\u003e[📘 Practical Guide (Chinese)](https://zhipu-ai.feishu.cn/wiki/MhPYwtpBhinuoikNIYYcyu8dnKv?fromScene=spaceOverview) | [🤗 HuggingFace Space](https://huggingface.co/spaces/THUDM-HF-SPACE/CogAgent-Demo)\u003cbr/\u003e[🤖 ModelScope Space](https://modelscope.cn/studios/ZhipuAI/CogAgent-Demo)\u003cbr/\u003e[🧩 Modelers Space (Ascend)](https://modelers.cn/spaces/zhipuai/CogAgent) |\n\n### Model Overview\n\n`CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source\nVLM base model. 
Through data collection and optimization, multi-stage training, and strategy improvements,\n`CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space\ncompleteness, and generalizability across tasks. The model supports bilingual (Chinese and English) interaction with\nboth screenshots and language input. This version of the CogAgent model has already been applied in\nZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope the release of this model can assist researchers\nand developers in advancing the research and applications of GUI agents based on vision-language models.\n\n### Capability Demonstrations\n\nThe CogAgent-9b-20241220 model has achieved state-of-the-art results across multiple platforms and categories in GUI\nAgent tasks and GUI Grounding Benchmarks. In\nthe [CogAgent-9b-20241220 Technical Blog](https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report),\nwe compared it against API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial API + GUI Grounding\nmodels (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI Agent models (Qwen2-VL, ShowUI, SeeClick). 
The results\ndemonstrate that **CogAgent leads in GUI localization (Screenspot), single-step operations (OmniAct), the Chinese\nstep-wise in-house benchmark (CogAgentBench-basic-cn), and multi-step operations (OSWorld)**, with only a slight\ndisadvantage in OSWorld compared to Claude-3.5-Sonnet, which specializes in Computer Use, and GPT-4o combined with\nexternal GUI Grounding models.\n\n\u003cdiv style=\"display: flex; flex-direction: column; width: 100%; align-items: center; margin-top: 20px;\"\u003e\n    \u003cdiv style=\"text-align: center; margin-bottom: 20px; width: 100%; max-width: 600px; height: auto;\"\u003e\n        \u003cvideo src=\"https://github.com/user-attachments/assets/4d39fe6a-d460-427c-a930-b7cbe0d082f5\" width=\"100%\" height=\"auto\" controls autoplay loop\u003e\u003c/video\u003e\n        \u003cp\u003eCogAgent wishes you a Merry Christmas! Let the large model automatically send Christmas greetings to your friends.\u003c/p\u003e\n    \u003c/div\u003e\n    \u003cdiv style=\"text-align: center; width: 100%; max-width: 600px; height: auto;\"\u003e\n        \u003cvideo src=\"https://github.com/user-attachments/assets/87f00f97-1c4f-4152-b7c0-d145742cb910\" width=\"100%\" height=\"auto\" controls autoplay loop\u003e\u003c/video\u003e\n        \u003cp\u003eWant to open an issue? 
Let CogAgent help you send an email.\u003c/p\u003e\n    \u003c/div\u003e\n\u003c/div\u003e\n\n\n**Table of Contents**\n\n- [CogAgent](#cogagent)\n    - [Model Introduction](#model-introduction)\n        - [Model Overview](#model-overview)\n        - [Capability Demonstrations](#capability-demonstrations)\n        - [Inference and Fine-tuning Costs](#inference-and-fine-tuning-costs)\n    - [Model Inputs and Outputs](#model-inputs-and-outputs)\n        - [User Input](#user-input)\n        - [Model Output](#model-output)\n        - [An Example](#an-example)\n        - [Notes](#notes)\n    - [Running the Model](#running-the-model)\n        - [Environment Setup](#environment-setup)\n        - [Running an Agent APP Example](#running-an-agent-app-example)\n        - [Fine-tuning the Model](#fine-tuning-the-model)\n    - [Previous Work](#previous-work)\n    - [License](#license)\n    - [Citation](#citation)\n    - [Research and Development Team \\\u0026 Acknowledgements](#research-and-development-team---acknowledgements)\n\n### Inference and Fine-tuning Costs\n\n+ The model requires at least 29GB of VRAM for inference at `BF16` precision. Using `INT4` precision for inference is\n  not recommended due to significant performance loss. The VRAM usage for `INT4` inference is about 8GB, while for\n  `INT8` inference it is about 15GB. In the `inference/cli_demo.py` file, we have commented out these two lines. You can\n  uncomment them and use `INT4` or `INT8` inference. This solution is only supported on NVIDIA devices.\n+ All GPU references above refer to A100 or H100 GPUs. For other devices, you need to calculate the required GPU/CPU\n  memory accordingly.\n+ During SFT (Supervised Fine-Tuning), this codebase freezes the `Vision Encoder`, uses a batch size of 1, and trains on\n  `8 * A100` GPUs. The total input tokens (including images, which account for `1600` tokens) add up to 2048 tokens.\n  This codebase cannot conduct SFT fine-tuning without freezing the `Vision Encoder`. 
 \n  For LoRA fine-tuning, `Vision Encoder` is **not** frozen; the batch size is 1, using `1 * A100` GPU. The total input\n  tokens (including images, `1600` tokens) also amount to 2048 tokens. In the above setup, SFT fine-tuning requires at\n  least `60GB` of GPU memory per GPU (with 8 GPUs), while LoRA fine-tuning requires at least `70GB` of GPU memory on a\n  single GPU (cannot be split).\n+ `Ascend devices` have not been tested for SFT fine-tuning. We have only tested them on the `Atlas800` training server\n  cluster. You need to modify the inference code accordingly based on the loading mechanism described in the\n  `Ascend device` download link.\n+ The online demo link does **not** support controlling computers; it only allows you to view the model's inference\n  results. We recommend deploying the model locally.\n\n## Model Inputs and Outputs\n\n`cogagent-9b-20241220` is an agent-type execution model rather than a conversational model. It does not support\ncontinuous dialogue, but it **does** support a continuous execution history. (In other words, a new\nconversation session must be started for each step, and the past history must be provided to the model.) The workflow of\nCogAgent is illustrated as follows:\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=assets/cogagent_workflow_en.png width=90% /\u003e\n\u003c/div\u003e\n\n**To achieve optimal GUI Agent performance, we have adopted a strict input-output format.**\nBelow is how users should format their inputs and feed them to the model, and how to interpret the model’s responses.\n\n### User Input\n\nYou can refer\nto [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115)\nfor constructing user input prompts. A minimal example of user input concatenation code is shown below:\n\n``` python\n\ncurrent_platform = identify_os() # \"Mac\" or \"WIN\" or \"Mobile\". 
Pay attention to case sensitivity.\nplatform_str = f\"(Platform: {current_platform})\\n\"\nformat_str = \"(Answer in Action-Operation-Sensitive format.)\\n\" # You can use other format to replace \"Action-Operation-Sensitive\"\n\nhistory_str = \"\\nHistory steps: \"\nfor index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):\n   history_str += f\"\\n{index}. {grounded_op_func}\\t{action}\" # start from 0. \n\nquery = f\"Task: {task}{history_str}\\n{platform_str}{format_str}\" # Be careful about the \\n\n\n```\n\nThe concatenated Python string:\n\n``` python\n\"Task: Search for doors, click doors on sale and filter by brands \\\"Mastercraft\\\".\\nHistory steps: \\n0. CLICK(box=[[352,102,786,139]], element_info='Search')\\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\\tIn the search input box at the top, type 'doors'.\\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\\tLeft click on the magnifying glass icon next to the search bar to perform the search.\\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\\tScroll down the page to see the available doors.\\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\\tClick the \\\"Doors On Sale\\\" button in the middle of the page to view the doors that are currently on sale.\\n(Platform: WIN)\\n(Answer in Action-Operation format.)\\n\"\n```\n\nPrinted prompt:\n\u003e\n\u003e Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\n\u003e\n\u003e History steps:\n\u003e\n\u003e 0. CLICK(box=[[352,102,786,139]], element_info='Search')  Left click on the search box located in the middle top of\n     the screen next to the Menards logo.\n\u003e 1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search') In the search input box at the top, type '\n     doors'.\n\u003e 2. 
CLICK(box=[[787,102,809,139]], element_info='SEARCH')  Left click on the magnifying glass icon next to the search\n     bar to perform the search.\n\u003e 3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')  Scroll down the page to see the available\n     doors.\n\u003e 4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale') Click the \"Doors On Sale\" button in the middle of the\n     page to view the doors that are currently on sale.\n\u003e    \n\u003e\n\u003e (Platform: WIN)\n\u003e\n\u003e (Answer in Action-Operation format.)\n\u003e\n\nIf you want to understand the meaning and representation of each field in detail, please continue reading or refer to\nthe [Practical Documentation (in Chinese), \"Prompt Concatenation\" section](https://zhipu-ai.feishu.cn/wiki/D9FTwQ78fitS3CkZHUjcKEWTned).\n\n1. **`task` field**  \n   The user’s task description, in text format similar to a prompt. This input instructs the `cogagent-9b-20241220`\n   model on how to carry out the user’s request. Keep it concise and clear.\n\n2. **`platform` field**  \n   `cogagent-9b-20241220` supports agent operations on multiple platforms with graphical interfaces. We currently\n   support three systems:\n    - Windows 10, 11: Use the `WIN` field.\n    - macOS 14, 15: Use the `Mac` field.\n    - Android 13, 14, 15 (and other Android UI variants with similar GUI operations): Use the `Mobile` field.\n\n   If your system is not among these, the effectiveness may be suboptimal. You can try using `Mobile` for mobile\n   devices, `WIN` for Windows, or `Mac` for Mac.\n\n3. **`format` field**  \n   The format in which the user wants `cogagent-9b-20241220` to return data. We provide several options:\n    - `Answer in Action-Operation-Sensitive format.`: The default demo return type in this repo. 
Returns the model’s\n      actions, corresponding operations, and the sensitivity level.\n    - `Answer in Status-Plan-Action-Operation format.`: Returns the model’s status, plan, actions, and corresponding operations.\n    - `Answer in Status-Action-Operation-Sensitive format.`: Returns the model’s status, actions, corresponding\n      operations, and sensitivity.\n    - `Answer in Status-Action-Operation format.`: Returns the model’s status, actions, and corresponding operations.\n    - `Answer in Action-Operation format.`: Returns the model’s actions and corresponding operations.\n\n4. **`history` field**  \n   The record of previously executed steps. It should be concatenated with the other fields in the following order:\n   ```\n   query = f'{task}{history}{platform}{format}'\n   ```\n   \n5. **`Continue` field**  \n   CogAgent allows users to let the model `continue answering`. This requires users to append the `[Continue]\\n` field after `{task}`. In such cases, the concatenation sequence and result should be as follows:\n   ```\n   query = f'{task}[Continue]\\n{history}{platform}{format}'\n   ```\n   \n### Model Output\n\n1. **Sensitive operations**: Includes `\u003c\u003c敏感操作\u003e\u003e` (“sensitive operation”) and `\u003c\u003c一般操作\u003e\u003e` (“general operation”).\n   These are only returned if you request the `Sensitive` format.\n2. **`Plan`, `Status`, `Action` fields**: Used to describe the model’s behavior and operations. Only returned if you\n   request the corresponding fields. For example, if the format includes `Action`, then the model returns the `Action`\n   field.\n3. **General answer section**: A summary that appears prior to the formatted answer.\n4. **`Grounded Operation` field**:  \n   Describes the model’s specific operations, including the location of the operation, the operation type, and the\n   action details. The `box` attribute indicates the coordinate region for execution, `element_type` indicates the\n   element type, and `element_info` describes the element. 
These details are wrapped within a “操作指令” (operation\n   command). For the definition of the action space, please refer to [here](Action_space.md).\n\n### An Example\n\nSuppose the user wants to mark all emails as read. The user is on a Mac and wants the model to respond in\n`Action-Operation-Sensitive` format. The properly **concatenated prompt** should be:\n\n```\nTask: Please mark all my emails as read\nHistory steps:\n(Platform: Mac)\n(Answer in Action-Operation-Sensitive format.)\n```\n\nNote: even if there are no historical actions, \"History steps:\" still needs to be appended in the prompt. Below are\n**sample outputs** for different format requirements:\n\n\u003cdetails\u003e\n\u003csummary\u003eAnswer in Action-Operation-Sensitive format\u003c/summary\u003e\n\n```\nAction: Click the 'Mark all as read' button in the top toolbar of the page to mark all emails as read.\nGrounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')\n\u003c\u003c一般操作\u003e\u003e\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAnswer in Status-Plan-Action-Operation format\u003c/summary\u003e\n\n```\nStatus: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216, 144, 570]], and the inbox in the center [[144, 216, 998, 903]]. The \"Mark all as read\" button has been clicked [[223, 178, 311, 210]].\nPlan: Future tasks: 1. Click the 'Mark all as read' button; 2. 
Task complete.\nAction: Click the \"Mark all as read\" button at the top center of the inbox page to mark all emails as read.\nGrounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAnswer in Status-Action-Operation-Sensitive format\u003c/summary\u003e\n\n```\nStatus: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216, 144, 570]], and the inbox in the center [[144, 216, 998, 903]]. The \"Mark all as read\" button has been clicked [[223, 178, 311, 210]].\nAction: Click the \"Mark all as read\" button at the top center of the inbox page to mark all emails as read.\nGrounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')\n\u003c\u003c一般操作\u003e\u003e\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAnswer in Status-Action-Operation format\u003c/summary\u003e\n\n```\nStatus: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216, 144, 570]], and the inbox in the center [[144, 216, 998, 903]]. The \"Mark all as read\" button has been clicked [[223, 178, 311, 210]].\nAction: Click the \"Mark all as read\" button at the top center of the inbox page to mark all emails as read.\nGrounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAnswer in Action-Operation format\u003c/summary\u003e\n\n```\nAction: Right-click the first email in the left email list to open the action menu.\nGrounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')\n```\n\n\u003c/details\u003e\n\n### Notes\n\n1. This model is **not** a conversational model and does **not** support continuous dialogue. 
Please send specific\n   commands and follow our recommended method for concatenating the history.\n2. The model **requires** images as input; text-only input cannot accomplish GUI Agent tasks.\n3. The model’s output adheres to a strict format. Please parse it strictly according to our requirements. The output is\n   in **string** format; JSON output is **not** supported.\n\n## Running the Model\n\n### Environment Setup\n\nMake sure you have installed **Python 3.10.16** or above, and then install the following dependencies:\n\n```shell\npip install -r requirements.txt\n```\n\nTo run local inference based on `transformers`, you can run the command below:\n\n```shell\npython inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform \"Mac\" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive\n```\n\nThis is an interactive command-line program. You will need to provide the path to your images. If the model returns results\ncontaining bounding boxes, it will output an image with those bounding boxes, indicating the region where the operation\nshould be executed. The image is saved to `output_image_path`, with the file name `{your_input_image_name}_{round}.png`.\nThe `format_key` indicates in which format you want the model to respond. The `platform` field specifies which platform\nyou are using (e.g., `Mac`). Therefore, all uploaded screenshots must be from macOS if `platform` is set to `Mac`.\n\nIf you want to run an online web demo, which supports continuous image uploads for interactive inference, you can run:\n\n```shell\npython inference/web_demo.py --host 0.0.0.0 --port 7860 --model_dir THUDM/cogagent-9b-20241220 --format_key status_action_op_sensitive --platform \"Mac\" --output_dir ./results\n```\n\nThis code provides the same experience as the `HuggingFace Space` online demo. 
The model will return the corresponding\nbounding boxes and execution categories.\n\n### Running an Agent APP Example\n\nWe have prepared a basic demo app for developers to illustrate the GUI capabilities of `cogagent-9b-20241220`. The demo\nshows how to deploy the model on a GPU-equipped server and run the `cogagent-9b-20241220` model locally to perform\nautomated GUI operations.\n\n\u003e We cannot guarantee the safety of AI behavior; please exercise caution when using it.  \n\u003e This example is only for academic reference. We assume no legal responsibility for any issues resulting from this\n\u003e example.\n\nIf you are interested in this APP, feel free to check out the [documentation](app/README.md).\n\n### Fine-tuning the Model\n\nIf you are interested in fine-tuning the `cogagent-9b-20241220` model, please refer to [here](finetune/README.md).\n\n## Previous Work\n\nIn November 2023, we released the first generation of CogAgent. You can find related code and model weights in\nthe [CogVLM \u0026 CogAgent Official Repository](https://github.com/THUDM/CogVLM).\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=assets/cogagent_function.jpg width=70% /\u003e\n\u003c/div\u003e\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\n      \u003ch2\u003e CogVLM \u003c/h2\u003e\n      \u003cp\u003e 📖  Paper: \u003ca href=\"https://arxiv.org/abs/2311.03079\"\u003eCogVLM: Visual Expert for Pretrained Language Models\u003c/a\u003e\u003c/p\u003e\n      \u003cp\u003e\u003cb\u003eCogVLM\u003c/b\u003e is a powerful open-source Vision-Language Model (VLM). 
CogVLM-17B has 10B visual parameters and 7B language parameters, supporting image understanding at a resolution of 490x490, as well as multi-round dialogue.\u003c/p\u003e\n      \u003cp\u003e\u003cb\u003eCogVLM-17B\u003c/b\u003e achieves state-of-the-art performance on 10 classic multimodal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.\u003c/p\u003e\n    \u003c/td\u003e\n    \u003ctd\u003e\n      \u003ch2\u003e CogAgent \u003c/h2\u003e\n      \u003cp\u003e 📖  Paper: \u003ca href=\"https://arxiv.org/abs/2312.08914\"\u003eCogAgent: A Visual Language Model for GUI Agents\u003c/a\u003e\u003c/p\u003e\n      \u003cp\u003e\u003cb\u003eCogAgent\u003c/b\u003e is an open-source vision-language model improved upon CogVLM. CogAgent-18B has 11B visual parameters and 7B language parameters. \u003cb\u003eIt supports image understanding at a resolution of 1120x1120. Building on CogVLM’s capabilities, CogAgent further incorporates a GUI image agent ability.\u003c/b\u003e\u003c/p\u003e\n      \u003cp\u003e\u003cb\u003eCogAgent-18B\u003c/b\u003e delivers state-of-the-art general performance on 9 classic vision-language benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. 
It also significantly outperforms existing models on GUI operation datasets such as AITW and Mind2Web.\u003c/p\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## License\n\n- The [Apache2.0 LICENSE](LICENSE) applies to the use of the code in this GitHub repository.\n- For the model weights, please follow the [Model License](MODEL_LICENSE).\n\n## Citation\n\nIf you find our work helpful, please consider citing the following papers\n\n```\n@misc{hong2023cogagent,\n      title={CogAgent: A Visual Language Model for GUI Agents}, \n      author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},\n      year={2023},\n      eprint={2312.08914},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n\n```\n\n## Research and Development Team \u0026 Acknowledgements\n\n**R\u0026D Institutions**: Tsinghua University, Zhipu AI\n\n**Team members**: Wenyi Hong, Junhui Ji, Lihang Pan, Yuanchang Yue, Changyu Pang, Siyan Xue, Guo Wang, Weihan Wang,\nJiazheng Xu, Shen Yang, Xiaotao Gu, Yuxiao Dong, Jie Tang\n\n**Acknowledgement**: We would like to thank the Zhipu AI data team for their strong support, including Xiaohan Zhang,\nZhao Xue, Lu Chen, Jingjie Du, Siyu Wang, Ying Zhang, and all annotators. They worked hard to collect and annotate the\ntraining and testing data of the CogAgent model. We also thank Yuxuan Zhang, Xiaowei Hu, and Hao Chen from the Zhipu AI\nopen source team for their engineering efforts in open sourcing the model.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogagent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fcogagent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogagent/lists"}