{"id":13568687,"url":"https://github.com/amazon-science/mm-cot","last_synced_at":"2025-05-14T18:03:50.919Z","repository":{"id":65663108,"uuid":"596424141","full_name":"amazon-science/mm-cot","owner":"amazon-science","description":"Official implementation for \"Multimodal Chain-of-Thought Reasoning in Language Models\" (stay tuned and more will be updated)","archived":false,"fork":false,"pushed_at":"2024-06-12T13:50:10.000Z","size":3507,"stargazers_count":3906,"open_issues_count":50,"forks_count":319,"subscribers_count":56,"default_branch":"main","last_synced_at":"2025-04-11T10:00:34.581Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2302.00923","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-02T06:31:32.000Z","updated_at":"2025-04-09T18:43:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"87c4867b-9816-4476-99d2-fe56812b0f01","html_url":"https://github.com/amazon-science/mm-cot","commit_stats":{"total_commits":14,"total_committers":9,"mean_commits":"1.5555555555555556","dds":0.7857142857142857,"last_synced_commit":"8dd4ac02b94f21347973491f6e6b828502d23f9d"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fmm-cot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fmm-cot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fmm-cot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fmm-cot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/mm-cot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254198452,"owners_count":22030964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:00:30.406Z","updated_at":"2025-05-14T18:03:45.910Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","readme":"# Multimodal Chain-of-Thought Reasoning in Language Models\n\n\u003ch5 align=\"center\"\u003e\u003ci\u003e\"Imagine learning a textbook without figures or tables.\"\u003c/i\u003e\u003c/h5\u003e\n\nMultimodal-CoT incorporates vision features in a decoupled training framework. The framework consists of two training stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output.\n\n![](vision_features/mm-cot.png)\n\n\n## Requirements\n\nInstall all required python dependencies:\n\n```\npip install -r requirements.txt\n```\n\n## Datasets\n\nDownload the dataset from the following repository:\n\n```\nhttps://github.com/lupantech/ScienceQA/tree/main/data\n```\nThe vision features (detr, resnet, clip, vit) are available at https://huggingface.co/cooelf/vision_features/tree/main\n\nAlternatively, you may download the extracted vision features (detr, resnet, clip) from [vision_features](https://drive.google.com/file/d/13B0hc_F_45-UlqPLKSgRz-ALtFQ8kIJr/view?usp=share_link) and unzip the files under `vision_features`\n\n## Extract Features (optional)\n\nThe processed vision features for ScienceQA are available at https://huggingface.co/cooelf/vision_features/tree/main. \n\nThe following instructions show how we obtain those features.\n\nDownload the image files from [Google Drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev?usp=sharing) and unzip all the images (train, dev, test) in the same folder (). The structure should be:\n\n```\nimages\n├── 1\n│   └── image.png\n├── 2\n│   └── image.png\n├── 3\n│   └── image.png\n├── 5\n│   └── image.png\n├── 7\n│   └── image.png\n```\n\nRun ```extract_features.py --data_root images --output_dir vision_features --img_type vit```\n\nIf you hope to use your own images, please structure those images in the way above, or modify the script ```extract_features.py```.\n\n## Extract Captions (optional)\n\nThe processed captions for ScienceQA are available at ```data/instruct_captions.json```. \n\nThe following instructions show how we obtain those features.\n\nIntall lavis and prepare Vicuna weights to use InstructBLIP for caption extraction.\n\nhttps://github.com/salesforce/LAVIS/tree/f982acc73288408bceda2d35471a8fcf55aa04ca/projects/instructblip\n\nAssume that the images are stored in the ```images``` folder. \n\n```\npython extract_caption.py\n```\n\n## Instructions\n\n### Training \n\n```\n# rationale generation\nCUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \\\n    --data_root data/ScienceQA/data \\\n    --caption_file data/instruct_captions.json \\\n    --model declare-lab/flan-alpaca-large \\\n    --user_msg rationale --img_type vit \\\n    --bs 2 --eval_bs 4 --epoch 50 --lr 5e-5 --output_len 512 \\\n    --use_caption --use_generate --prompt_format QCM-E \\\n    --output_dir experiments\n\n# answer inference\nCUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \\\n    --data_root data/ScienceQA/data \\\n    --caption_file data/instruct_captions.json \\\n    --model declare-lab/flan-alpaca-large \\\n    --user_msg answer --img_type vit \\\n    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64 \\\n    --use_caption --use_generate --prompt_format QCMG-A \\\n    --output_dir experiments \\\n    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \\\n    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json\n\n```\n\n### Inference \n\nOur trained models are available at https://huggingface.co/cooelf/mm-cot/tree/main. To use our trained models, please put the them under the ```models``` folder.\n\n```\n# rationale generation\nCUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \\\n    --data_root data/ScienceQA/data \\\n    --caption_file data/instruct_captions.json \\\n    --model declare-lab/flan-alpaca-large \\\n    --user_msg rationale --img_type vit \\\n    --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \\\n    --use_caption --use_generate --prompt_format QCM-E \\\n    --output_dir experiments\n    --evaluate_dir models/mm-cot-large-rationale\n\n# answer inference\nCUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \\\n    --data_root data/ScienceQA/data \\\n    --caption_file data/instruct_captions.json \\\n    --model declare-lab/flan-alpaca-large \\\n    --user_msg answer --img_type vit \\\n    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64  \\\n    --use_caption --use_generate --prompt_format QCMG-A \\\n    --output_dir experiments \\\n    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \\\n    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json \\\n    --evaluate_dir models/mm-cot-large-answer\n```\n\n## Citing MM-CoT\n\n```\n@article{zhang2023multicot,\n  title={Multimodal Chain-of-Thought Reasoning in Language Models},\n  author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Zhao, Hai and Karypis, George and Smola, Alex},\n  journal={arXiv preprint arXiv:2302.00923},\n  year={2023}\n}\n```\n\n## License\n\nThis project is licensed under the Apache-2.0 License.\n\n## Acknowledgement\n\nPart of our codes are adapted from [ScienceQA](https://github.com/lupantech/ScienceQA), [Transformers](https://github.com/huggingface/transformers), [pytorch-image-models](https://github.com/huggingface/pytorch-image-models).\n\nWe thank [Pan Lu](https://lupantech.github.io/) for providing parameter size for ScienceQA baselines.\n","funding_links":[],"categories":["Amazon LLM","Python","A01_文本生成_文本对话","Reimplementations"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fmm-cot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fmm-cot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fmm-cot/lists"}