{"id":19287202,"url":"https://github.com/IMNearth/CoAT","last_synced_at":"2025-04-22T04:31:52.970Z","repository":{"id":223442279,"uuid":"742366348","full_name":"IMNearth/CoAT","owner":"IMNearth","description":"Android in the Zoo: Chain-of-Action-Thought for GUI Agents","archived":false,"fork":false,"pushed_at":"2024-07-20T13:08:47.000Z","size":2176,"stargazers_count":19,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-07-20T14:27:46.571Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IMNearth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-12T10:10:41.000Z","updated_at":"2024-07-20T14:27:54.078Z","dependencies_parsed_at":null,"dependency_job_id":"4dd8aae9-4751-4515-a825-f93afdaddb64","html_url":"https://github.com/IMNearth/CoAT","commit_stats":null,"previous_names":["imnearth/coat"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IMNearth%2FCoAT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IMNearth%2FCoAT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IMNearth%2FCoAT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IMNearth%2FCoAT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IMNearth","download_url":"https://codeload.github.com/IMNearth/CoAT/tar.gz/refs/heads/main","host":{"na
me":"GitHub","url":"https://github.com","kind":"github","repositories_count":223888466,"owners_count":17220083,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T22:05:37.666Z","updated_at":"2024-11-09T22:05:39.556Z","avatar_url":"https://github.com/IMNearth.png","language":"Python","funding_links":[],"categories":["Papers"],"sub_categories":["Datasets","Frameworks \u0026 Models"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1 style=\"display: inline-block; font-size: 32px;\"\u003eAndroid in the Zoo:\u003cbr\u003eChain-of-Action-Thought for GUI Agents\u003c/\\br\u003e\u003c/h1\u003e\n\u003c/div\u003e\n\u003cp align=\"center\"\u003e\u003cstrong\u003eJiwen Zhang\u003csup\u003e1,2\u003c/sup\u003e , Jihao Wu\u003csup\u003e2\u003c/sup\u003e  , Yihua Teng\u003csup\u003e2\u003c/sup\u003e  , Minghui Liao\u003csup\u003e2\u003c/sup\u003e  , Nuo Xu\u003csup\u003e2\u003c/sup\u003e  , Xiao Xiao\u003csup\u003e2\u003c/sup\u003e  , Zhongyu Wei\u003csup\u003e1\u003c/sup\u003e , Duyu  Tang\u003csup\u003e2\u003c/sup\u003e.\n \u003c/strong\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003csup\u003e1\u003c/sup\u003eFudan University      \u003csup\u003e2\u003c/sup\u003eHuawei Inc.\u003c/p\u003e \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Version-v1.0-Green\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Licence-Apache_2.0-Green\" /\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/IMNearth/CoAT?label=Stars\" /\u003e\n    \u003ca href=\"https://hits.seeyoufarm.com\"\u003e\u003cimg 
src=\"https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FIMNearth%2FCoAT\u0026count_bg=%2333E5E3\u0026title_bg=%236C6666\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=visitors\u0026edge_flat=true\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://arxiv.org/abs/2403.02713\"\u003e\u003cimg src=\"https://img.shields.io/badge/Paper-Arxiv-red\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://pan.baidu.com/s/1dHG-4L0RE1aYINzMSA4dCw?pwd=7g82\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/Baidu Disk-CoAT Dataset-violet?logo=baidu\" /\u003e\n  \t\u003c/a\u003e\n    \u003ca href=\"https://drive.google.com/file/d/12xOV2m62fBUFLhMcWIsFiC6zwV7a2RhI/view?usp=sharing\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/Google Drive-CoAT Dataset-blue?logo=googledrive\" /\u003e\n  \t\u003c/a\u003e\n\u003c/p\u003e \n\n--------------\n\nThis work presents **Chain-of-Action-Thought** (dubbed **CoAT**), which takes the description of the previous actions, the current screen, and more importantly the action thinking of what actions should be performed and the outcomes led by the chosen action. 
To enable adaptive learning of the CoAT process, we construct the benchmark **Android-In-The-Zoo**, which contains 18,643 screen-action pairs together with CoAT annotations.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=assets/intro-total.png width=80% /\u003e\n\u003c/div\u003e\n\n\n## 📣 Update\n\n- **[2024-10-15]** Evaluation code has been released!\n\n- **[2024-09-20]** Our work has been accepted to EMNLP2024 Findings!\n\n- **[2024-07-16]** We have added demo code for using CoAT on proprietary models (GPT4V, Gemini-Pro and Qwen-VL-Max)!\n\n- **[2024-03-31]** We have released the first version of our AiTZ dataset!\n\n- **[2024-03-05]** Our paper is now on arXiv; you can access it [here](https://arxiv.org/abs/2403.02713)!\n\n\n\n## Android-in-the-Zoo\n\nAiTZ contains 18,643 screens and 2500+ instructions, all annotated with CoAT-driven semantic labels. The sample format for each time step is\n\n```json\n{\n  \"episode_id\": \"523638528775825151\",\n  \"episode_length\": 4,\n  \"step_id\": 0,\n  \"coat_screen_desc\":   \"[observe]\",\n  \"coat_action_think\":  \"[action think]\",\n  \"coat_action_desc\":   \"[next action description]\",\n  \"coat_action_result\": \"[action result]\",\n  ...\n}\n```\n\nRefer to the `data-example` folder for a more specific example.\n\n\n### Download\n\nOur dataset ([GoogleDrive](https://drive.google.com/file/d/12xOV2m62fBUFLhMcWIsFiC6zwV7a2RhI/view?usp=sharing) or [BaiduNetdisk](https://pan.baidu.com/s/1dHG-4L0RE1aYINzMSA4dCw?pwd=7g82)) contains both the screens (.png) and the annotations (.json), taking up about 2.6 GB of disk space. 
\n\n\n### Statistics\n\n| Subset      | Train      |           | Test       |           |\n| ----------- | ---------- | --------- | ---------- | --------- |\n|             | \\#Episodes | \\#Screens | \\#Episodes | \\#Screens |\n| General     | 323        | 2405      | 156        | 1202      |\n| Install     | 286        | 2519      | 134        | 1108      |\n| GoogleApps  | 166        | 1268      | 76         | 621       |\n| Single      | 844        | 2594      | 0          | 0         |\n| WebShopping | 379        | 5133      | 140        | 1793      |\n| **Total**   | **1998**   | **13919** | **506**    | **4724**  |\n\n\n\n## Chain-of-Action-Thought\n\n### Comparison with other context modeling methods\n\nWe validate the effectiveness of CoAT by conducting a preliminary experiment on 50 episodes randomly sampled from AITW dataset. \n\nThe compared baselines are [Chain-of-Thought](https://arxiv.org/abs/2201.11903) (CoT) and [Chain-of-Actions](https://arxiv.org/abs/2309.11436) (CoA). \n\n| Prompt | Metric | QwenVL | Gemini-PV | GPT-4V |\n| ------ | ------ | ------ | --------- | ------ |\n| CoA    | hit    | 94.5   | 99.8      | 99.3   |\n|        | acc    | 44.4   | 47.7      | 62.8   |\n| CoT    | hit    | 95.6   | 97.5      | 97.1   |\n|        | acc    | 49.4   | 52.0      | 64.1   |\n| CoAT   | hit    | 96.3   | 96.4      | 98.2   |\n|        | acc    | 52.4   | 54.5      | 73.5   |\n\nwhere “hit” means format hit rate, and “acc” means action type prediction accuracy. (One can refer to Table 8 in our paper for more details.)\n\n\n\n\n### CoAT demo usage\n\nHere we provide a demo code for anyone who wants to try the CoAT on GPT-4V, Qwen-VL-Max and Gemini-1.0-Pro-Vision.\n\nFirstly, go to `coat/config.yaml` and add your own api-keys and urls. 
\n\nSecondly, run the following command to generate the semantic components of the CoAT framework:\n\n```shell\npython run_coat.py --task \"flow\" --DEMO_MODE \"COAT\" --MODEL.NAME \"openai/gemini/qwenvl\" --num-threads 3\n```\n\nThen, you can obtain the action prediction results with\n\n```shell\npython run_coat.py --task \"predict\" --DEMO_MODE \"COAT\" --MODEL.NAME \"openai/gemini/qwenvl\" --num-threads 3\n```\n\n\n\n\n\n## Citation\n\nIf you find our work helpful, please consider citing our paper.\n\n```\n@misc{zhang2024android,\n      title={Android in the Zoo: Chain-of-Action-Thought for GUI Agents}, \n      author={Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang},\n      year={2024},\n      eprint={2403.02713},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIMNearth%2FCoAT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIMNearth%2FCoAT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIMNearth%2FCoAT/lists"}