{"id":20216126,"url":"https://github.com/thudm/cogcom","last_synced_at":"2025-04-11T11:03:37.251Z","repository":{"id":221276397,"uuid":"751806334","full_name":"THUDM/CogCoM","owner":"THUDM","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-05T09:56:19.000Z","size":6318,"stargazers_count":165,"open_issues_count":19,"forks_count":10,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-25T07:23:50.767Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-02T11:18:30.000Z","updated_at":"2025-03-20T06:47:16.000Z","dependencies_parsed_at":"2024-03-09T11:29:07.621Z","dependency_job_id":"8530bcc2-f65e-4ec7-a8b1-e022137db081","html_url":"https://github.com/THUDM/CogCoM","commit_stats":null,"previous_names":["thudm/cogcom"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogCoM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogCoM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogCoM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogCoM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/CogCoM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://g
ithub.com","kind":"github","repositories_count":248381758,"owners_count":21094525,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T06:26:29.551Z","updated_at":"2025-04-11T11:03:37.213Z","avatar_url":"https://github.com/THUDM.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch2 align=\"center\"\u003e \u003ca href=\"https://arxiv.org/pdf/2402.04236\"\u003eCogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations\u003c/a\u003e\u003c/h2\u003e\n\u003ch5 align=\"center\"\u003e If you like our project, please give us a star ⭐ on GitHub for the latest updates.\u003cbr\u003e\n\n[![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/qijimrc/CogCoM)\n[![arXiv](https://img.shields.io/badge/Arxiv-2402.04236-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2402.04236) \n[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/THUDM/CogCoM/blob/main/LICENSE)\n\u003c/h5\u003e\n\n\n\u003cdetails\u003e\u003csummary\u003e💡 We also have other vision-language projects that may interest you ✨. 
\u003c/summary\u003e\u003cp\u003e\n\u003c!--  may --\u003e\n\n\u003e [**CogVLM: Visual Expert for Pretrained Language Models**](https://github.com/THUDM/CogVLM) \u003cbr\u003e\n[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/THUDM/CogVLM)  [![github](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social\u0026label=Star)](https://github.com/THUDM/CogVLM) \u003cbr\u003e\n\u003e [**CogAgent: A Visual Language Model for GUI Agents**](https://github.com/THUDM/CogVLM) \u003cbr\u003e\n[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/THUDM/CogVLM)  [![github](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social\u0026label=Star)](https://github.com/THUDM/CogVLM) \u003cbr\u003e\n\n\u003c/p\u003e\u003c/details\u003e\n\n\n## 📣 News\n* **[2024/6/15]** 🎉 Released our prepared datasets: the synthesized 84K data and the manually annotated 7K math data (see [Data](/cogcom/data) or [HuggingFace](https://huggingface.co/datasets/qijimrc/CoMDataset)).\n* **[2024/2/26]** 🎉 Released the chat model CogCoM-chat-17b.\n* **[2024/2/26]**  🎉 Released the grounding model CogCoM-grounding-17b.\n* **[2024/2/4]**  🎉 Released the base model CogCoM-base-17b.\n\n\n## 😮 Highlights\n\nCogCoM enables VLMs to solve various visual problems step-by-step with evidence, without involving external tools.\n\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/cases.png\" width=100%\u003e\n\u003c/p\u003e\n\n\n## 📖 Introduction to CogCoM\n\n- CogCoM is a general **open-source visual language model** (**VLM**) equipped with Chain of Manipulations (CoM), which enables VLMs to solve complex visual problems step-by-step with evidence.\n- Based on pilot experiments, we formally design 6 basic manipulations that are capable of handling diverse visual problems.\n- We introduce a cascading data generation pipeline based on reliable large language models (e.g., LLMs, the linguistic annotators) and visual foundational 
models (e.g., VFMs, the visual annotators), which can automatically produce abundant error-free training data. We collect 70K CoM samples with this pipeline. \n- We then devise a multi-turn multi-image model architecture compatible with typical VLM structures.\n- Based on a data recipe incorporating the curated corpus, we finally train a general VLM equipped with the CoM reasoning mechanism, named CogCoM, which possesses capabilities of chat, captioning, grounding and reasoning.\n- Please refer to our paper for details.\n\n## 🤗 Demo\n\nWe support two interfaces for model inference, **Web demo** and **CLI**. If you want to use the model in your Python code, it is\neasy to modify the CLI scripts for your case.\n\n### Web Demo\n\nYou can run the [GUI demo](/cogcom/demo/web_demo.py), which we implemented with Gradio, locally. Please switch to the directory `demo/` and run:\n```bash\n# Local gradio\npython web_demo.py  --from_pretrained cogcom-base-17b --local_tokenizer path/to/tokenizer --bf16 --english\n```\n\n\n### CLI Demo\n\nWe also support interactive CLI inference using SAT.\n\n```bash\n# Launch an interactive environment\npython cli_demo_sat.py --from_pretrained cogcom-base-17b --local_tokenizer path/to/tokenizer --bf16 --english\n```\n\nThe program will automatically download the SAT model and start an interactive session in the command line (you can simply use the vicuna-7b-v1.5 tokenizer). You can generate replies by entering instructions and pressing enter. Enter `clear` to clear the conversation history and `stop` to stop the program.\n\nWe also support model parallel inference, which splits the model across multiple (2/4/8) GPUs. 
`--nproc-per-node=[n]` in the\nfollowing command controls the number of used GPUs.\n\nTips:\n  - If you want to manually download the weights, you can replace the path after ``--from_pretrained`` with the model\n  path.\n\n  - Our model supports SAT's **4-bit quantization** and **8-bit quantization**. You can change ``--bf16`` to ``--fp16``, or ``--fp16 --quant 4``, or ``--fp16 --quant 8``.\n\n  For example\n\n    ```bash\n    python cli_demo_sat.py --from_pretrained cogcom-base-17b --fp16 --quant 8\n    ```\n\n  - The program provides the following hyperparameters to control the generation process:\n      ```\n      usage: cli_demo_sat.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE]\n\n      optional arguments:\n          -h, --help                    show this help message and exit\n          --max_length MAX_LENGTH       max length of the total sequence\n          --top_p TOP_P                 top p for nucleus sampling\n          --top_k TOP_K                 top k for top k sampling\n          --temperature TEMPERATURE     temperature for sampling\n      ```\n\n\n## 🐳 Model Zoo\n\nIf you run the `demo/cli_demo*.py` from the code repository, it will automatically download SAT or Hugging Face\nweights. Alternatively, you can choose to manually download the necessary weights.\n\n  |          Model name           | Input resolution |                           Introduction                            | Huggingface model | SAT model |\n  | :-------------------------: | :----: | :-------------------------------------------------------: | :------: | :-------: |\n  |         cogcom-base-17b         |  490   |  Supports grounding, OCR, and CoM.   |  coming soon   |    [link](https://huggingface.co/qijimrc/CogCoM/blob/main/cogcom-base-17b.zip)        |\n  |         cogcom-grounding-17b         |  490   |  Supports grounding, OCR, and CoM.   
|  coming soon   |    [link](https://huggingface.co/qijimrc/CogCoM/blob/main/cogcom-grounding-17b.zip)        |\n  |         cogcom-chat-17b         |  490   |  Supports chat, grounding, OCR, and CoM.   |  coming soon   |      [link](https://huggingface.co/qijimrc/CogCoM/blob/main/cogcom-chat-17b.zip)      |\n\n\n\n\n## ⚙️ Requirements and Installation\nWe recommend the following requirements.\n* Python == 3.11\n* SwissArmyTransformer\u003e=0.4.8\n* torch\u003e=2.1.2\n* CUDA \u003e= 11.7\n* **Transformers == 4.37.0**\n* **xformers == 0.0.24**\n* **pydantic == 1.10.1**\n* **gradio == 3.50.2**\n* Install required packages:\n```bash\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n\u003e [!Warning]\n\u003e \u003cdiv align=\"left\"\u003e\n\u003e \u003cb\u003e\n\u003e 🚨 Please install the proper version of `pydantic` for smooth inference, as mentioned in [issue #3](https://github.com/THUDM/CogCoM/issues/3).\n\u003e \u003c/b\u003e\n\u003e \u003c/div\u003e\n\n\n\n\n## 🗝️ Training \u0026 Validating\n\n### Finetuning CogCoM\n\nYou may want to use CogCoM in your own task, which needs a **different output style or domain knowledge**. **All code\nfor finetuning is located in the ``finetune.sh`` and ``finetune.py`` files.**\n\n\n### Hardware requirements\n\n* Model Inference:\n  - For INT4 quantization: 1 * RTX 3090(24G)\n  - For FP16: 1 * A100(80G) or 2 * RTX 3090(24G)\n\n* Finetuning:\n  - For FP16: 4 * A100(80G) *[Recommended]* or 8 * RTX 3090(24G).\n\n\n### Evaluation\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view results on GQA, TallyVQA, TextVQA, ST-VQA. 
\u003c/summary\u003e\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eMethod\u003c/td\u003e\n        \u003ctd\u003eGQA\u003c/td\u003e\n        \u003ctd\u003eTallyVQA-s\u003c/td\u003e\n        \u003ctd\u003eTallyVQA-c\u003c/td\u003e\n        \u003ctd\u003eTextVQA\u003c/td\u003e\n        \u003ctd\u003eST-VQA\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eFlamingo\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e54.1\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n    \u003c/tr\u003e\n     \u003ctr\u003e\n        \u003ctd\u003eGIT\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e59.8\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n    \u003c/tr\u003e\n     \u003ctr\u003e\n        \u003ctd\u003eGIT2\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e67.3\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eBLIP-2\u003c/td\u003e\n        \u003ctd\u003e44.7*\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e21.7\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eInstructBLIP\u003c/td\u003e\n        \u003ctd\u003e49.5*\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e50.7*\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eQwen-VL\u003c/td\u003e\n        \u003ctd\u003e49.5*\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        
\u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e-\u003c/td\u003e\n        \u003ctd\u003e50.7*\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCogCoM\u003c/td\u003e\n        \u003ctd\u003e71.7\u003c/td\u003e\n        \u003ctd\u003e84.0\u003c/td\u003e\n        \u003ctd\u003e70.1\u003c/td\u003e\n        \u003ctd\u003e71.1\u003c/td\u003e\n        \u003ctd\u003e70.0\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view results of grounding benchmarks. \u003c/summary\u003e\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003eRefCOCO\u003c/td\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003eRefCOCO+\u003c/td\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003eRefCOCOg\u003c/td\u003e\n        \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003c/td\u003e\n        \u003ctd\u003eval\u003c/td\u003e\n        \u003ctd\u003etestA\u003c/td\u003e\n        \u003ctd\u003etestB\u003c/td\u003e\n        \u003ctd\u003eval\u003c/td\u003e\n        \u003ctd\u003etestA\u003c/td\u003e\n        \u003ctd\u003etestB\u003c/td\u003e\n        \u003ctd\u003eval\u003c/td\u003e\n        \u003ctd\u003etest\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCogCoM-grounding-generalist\u003c/td\u003e\n        \u003ctd\u003e92.34\u003c/td\u003e\n        \u003ctd\u003e94.57\u003c/td\u003e\n        \u003ctd\u003e89.15\u003c/td\u003e\n        \u003ctd\u003e88.19\u003c/td\u003e\n        \u003ctd\u003e92.80\u003c/td\u003e\n        \u003ctd\u003e82.08\u003c/td\u003e\n        \u003ctd\u003e89.32\u003c/td\u003e\n        \u003ctd\u003e90.45\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003c/details\u003e\n\n\n\n\n## 🍭 Examples\n\nCogCoM demonstrates 
the flexible capabilities for adapting to different multimodal scenarios, including evidential visual\nreasoning, Visual Grounding, Grounded Captioning, Image Captioning, Multi Choice, and Detailed Captioning.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=assets/app_case.jpg width=100% /\u003e\n\u003c/p\u003e\n\n\n\n\n\n## 💡 Cookbook\n\n### Task Prompts\n\n1. **General Multi-Round Dialogue**: Say whatever you want.\n\n2. **Chain of Manipulations**: Explicitly launch CoM reasoning.\n\n    - We randomly add launching prompts to the CoM chains for solving meticulous visual problems, so you can explicitly make CogCoM run with the CoM mechanism by adding the following launching prompt (we randomly generate numerous launching prompts for flexibility; see `com_dataset.py` for all details):\n\n    ```bash\n        Please solve the problem gradually via a chain of manipulations, where in each step you can selectively adopt one of the following manipulations GROUNDING(a phrase)-\u003eboxes, OCR(an image or a region)-\u003etexts, CROP_AND_ZOOMIN(a region on given image)-\u003enew_image, CALCULATE(a computable target)-\u003enumbers, or invent a new manipulation, if that seems helpful. {QUESTION}\n    ```\n\n\n3. **Visual Grounding**. Our model is compatible with the grounding instructions from MultiInstruct and CogVLM. We provide basic usage of three functionalities here:\n\n    - **Visual Grounding (VG)**: Returning grounding coordinates (bounding box) based on the description of objects. Use any template from [instruction template](cogcom/utils/template.py). For example (replacing \u0026lt;expr\u0026gt; with the object's description):\n\n      \u003e \"Find the region in image that \"\u0026lt;expr\u0026gt;\" describes.\"\n\n    - **Grounded Captioning (GC)**: Providing a description based on bounding box coordinates. Use a template from [instruction template](cogcom/utils/template.py). 
For example (replacing \u0026lt;objs\u0026gt; with the position coordinates):\n\n      \u003e \"Describe the content of *[[086,540,400,760]]* in the picture.\"\n\n    - **Image Description with Coordinates (IDC)**: Image description with grounding coordinates (bounding box). Use any template\n      from [caption_with_box template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L537) as model\n      input. For example:\n\n      \u003e Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?\n    \n**Coordinate format:** The bounding box coordinates in the model's input and output use the\nformat ``[[x1, y1, x2, y2]]``, with the origin at the top left corner, the x-axis to the right, and the y-axis\ndownward. (x1, y1) and (x2, y2) are the top-left and bottom-right corners, respectively, with values as relative\ncoordinates multiplied by 1000 (zero-padded to three digits).\n\n\n### FAQ\n\n* If you have trouble accessing huggingface.co, you can add `--local_tokenizer /path/to/vicuna-7b-v1.5` to load the\n  tokenizer.\n* When you download the model using 🔨[SAT](https://github.com/THUDM/SwissArmyTransformer), the model will be saved to the default\n  location `~/.sat_models`. Change the default location by setting the environment variable `SAT_HOME`. 
For example, if\n  you want to save the model to `/path/to/my/models`, you can run `export SAT_HOME=/path/to/my/models` before running\n  the python command.\n\n## 🔒 License\n\nThe code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of the CogCoM model\nweights must comply with the [Model License](./MODEL_LICENSE).\n\n## ✒️ Citation \u0026 Acknowledgements\n\n```\n@article{qi2024cogcom,\n  title={CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations},\n  author={Qi, Ji and Ding, Ming and Wang, Weihan and Bai, Yushi and Lv, Qingsong and Hong, Wenyi and Xu, Bin and Hou, Lei and Li, Juanzi and Dong, Yuxiao and Tang, Jie},\n  journal={arXiv preprint arXiv:2402.04236},\n  year={2024}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogcom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fcogcom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogcom/lists"}