{"id":17998759,"url":"https://github.com/shikiw/modality-integration-rate","last_synced_at":"2025-04-07T13:07:09.958Z","repository":{"id":259849056,"uuid":"870131620","full_name":"shikiw/Modality-Integration-Rate","owner":"shikiw","description":"The official code of the paper \"Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate\".","archived":false,"fork":false,"pushed_at":"2024-11-27T06:26:09.000Z","size":18599,"stargazers_count":97,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-07T13:06:48.860Z","etag":null,"topics":["chatbot","gpt-4o","large-multimodal-models","llama","llava","multimodal","vision-language-learning","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shikiw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-09T13:52:42.000Z","updated_at":"2025-04-06T21:24:45.000Z","dependencies_parsed_at":"2025-01-15T01:28:56.115Z","dependency_job_id":null,"html_url":"https://github.com/shikiw/Modality-Integration-Rate","commit_stats":{"total_commits":14,"total_committers":2,"mean_commits":7.0,"dds":0.5,"last_synced_commit":"28531a42e021f7f9722a9060bf50bd4a88d402d0"},"previous_names":["shikiw/modality-integration-rate"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shikiw%2FModality-Integration-Rate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/shikiw%2FModality-Integration-Rate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shikiw%2FModality-Integration-Rate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shikiw%2FModality-Integration-Rate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shikiw","download_url":"https://codeload.github.com/shikiw/Modality-Integration-Rate/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657281,"owners_count":20974345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","gpt-4o","large-multimodal-models","llama","llava","multimodal","vision-language-learning","vision-language-model"],"created_at":"2024-10-29T22:05:15.826Z","updated_at":"2025-04-07T13:07:09.934Z","avatar_url":"https://github.com/shikiw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-g.svg)](https://opensource.org/licenses/MIT)\n[![Arxiv](https://img.shields.io/badge/arXiv-2410.07167-B21A1B)](https://arxiv.org/abs/2410.07167)\n[![Hugging Face Transformers](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace-blue)](https://huggingface.co/papers/2410.07167)\n[![GitHub Stars](https://img.shields.io/github/stars/shikiw/Modality-Integration-Rate?style=social)](https://github.com/shikiw/Modality-Integration-Rate/stargazers)\n\n\nThis 
repository provides the official PyTorch implementation of the following paper: \n\u003e [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https://arxiv.org/abs/2410.07167) \u003cbr\u003e\n\u003e [Qidong Huang](https://shikiw.github.io/)\u003csup\u003e1,2\u003c/sup\u003e, \n\u003e [Xiaoyi Dong](https://scholar.google.com/citations?user=FscToE0AAAAJ\u0026hl=en)\u003csup\u003e2,3\u003c/sup\u003e, \n\u003e [Pan Zhang](https://panzhang0212.github.io/)\u003csup\u003e2\u003c/sup\u003e,\n\u003e [Yuhang Zang](https://yuhangzang.github.io/) \u003csup\u003e2\u003c/sup\u003e,\n\u003e [Yuhang Cao](https://scholar.google.com/citations?user=sJkqsqkAAAAJ\u0026hl=zh-CN) \u003csup\u003e2\u003c/sup\u003e, \n\u003e [Jiaqi Wang](https://myownskyw7.github.io/)\u003csup\u003e2\u003c/sup\u003e,\n\u003e [Dahua Lin](http://dahua.site/)\u003csup\u003e2\u003c/sup\u003e, \n\u003e [Weiming Zhang](http://staff.ustc.edu.cn/~zhangwm/index.html)\u003csup\u003e1\u003c/sup\u003e, \n\u003e [Nenghai Yu](https://scholar.google.com/citations?user=7620QAMAAAAJ\u0026hl=en)\u003csup\u003e1\u003c/sup\u003e \u003cbr\u003e\n\u003e \u003csup\u003e1\u003c/sup\u003eUniversity of Science and Technology of China, \u003csup\u003e2\u003c/sup\u003eShanghai AI Laboratory, \u003csup\u003e3\u003c/sup\u003eThe Chinese University of Hong Kong \u003cbr\u003e\n\n## 🎯 News\n\n**[2024.10.10]** 🚀 We release the paper at [ArXiv](https://arxiv.org/abs/2410.07167) and [HuggingFace](https://huggingface.co/papers/2410.07167)!\n\n**[2024.10.10]** 🚀 This project page has been built!\n\n## 👨‍💻 Todo\n\n- [x] Release the code of MIR\n- [x] Release the training code and evaluation code of MoCa\n- [x] Release the checkpoints of MoCa\n\n\n\n## ⭐️ TL;DR\n### 1. For MIR\nIf you just want to use MIR as the pre-training indicator of your own model, no additional environment is required.\n\n1. Ensure the packages such as ```torch```, ```numpy```, and ```scipy``` are installed.\n2. 
Replace the model preprocessing and generation in ```mir.py``` with your own model's code; LLaVA's code is shown as a reference.\n3. Specify the input args and run the command:\n```\npython mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --text_data_path PATH/TO/TEXT/DATA --image_data_path PATH/TO/VISION/DATA --eval_num 100 --mode fast\n```\nNote that ```base_llm``` is not required if you train the base LLM during pre-training and include its checkpoint in the ```model_path```. \n\nYou can also adjust the args to the initialization style of your model.\n\n### 2. For MoCa\nIf you just want to use MoCa on your own model, we recommend following the steps below:\n\n1. Copy the code of [MoCa module](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/transformers-4.37.2/src/transformers/models/llama/modeling_llama.py#L122-L139) into the modeling code of your own model and ensure that the base LLM layers are equipped with MoCa in both the [initialization](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/transformers-4.37.2/src/transformers/models/llama/modeling_llama.py#L809-L814) and [forward](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/transformers-4.37.2/src/transformers/models/llama/modeling_llama.py#L868-L870) functions.\n2. Make sure that the input preprocessing computes the ```modality_mask```; see [Line183-184](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/llava_arch.py#L183-L184), [Line269-276](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/llava_arch.py#L269-L276) and [Line373-382](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/llava_arch.py#L373-L382) in ```llava/model/llava_arch.py```. 
Also, make sure that the ```modality_mask``` is passed through to the model forward pass, e.g., by adding it as a formal parameter of each forward function, as in [Line70](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L70), [Line88](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L88), [Line96](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L96), [Line106](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L106), [Line127](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L127), [Line137](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L137), [Line145](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L145), [Line157](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L157), [Line166](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L166), [Line174-175](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L174-L175) in ```llava/model/language_model/llava_llama.py```. \n3. 
Check the details needed to support ```use_moca=True``` (we recommend searching for ```use_moca``` in this repo to find the places that need revision), such as:\n   1) Add it to the model config ([here](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/model/language_model/llava_llama.py#L35)).\n   2) Add it to the training arguments ([here](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/train/train.py#L72)).\n   3) Unlock it during training ([here](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/train/train.py#L1056-L1060)).\n   4) Ensure correct checkpoint saving ([here1](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/train/train.py#L199), [here2](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/train/llava_trainer.py#L278), [here3](https://github.com/shikiw/Modality-Integration-Rate/blob/501d64dd37aa5382caf97d14c1da9b088bb8b4c7/llava/train/llava_trainer.py#L299)).\n4. Add ```--use_moca``` when running the training command to enable MoCa.\n\n\n\n## 📜 Setup\nIf you want to use our codebase (modified from LLaVA) for reproduction, we recommend building a new environment through the steps below. \nThe following steps are listed for Linux only. If you are using macOS or Windows, please refer to [LLaVA](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file).\n1. Clone this repository and navigate to the Modality-Integration-Rate folder\n```\ngit clone https://github.com/shikiw/Modality-Integration-Rate.git\ncd Modality-Integration-Rate\n```\n2. Install Package\n```\nconda create -n llava python=3.10 -y\nconda activate llava\npython -m pip install --upgrade pip  # enable PEP 660 support\npython -m pip install -e .\npython -m pip install -e transformers-4.37.2\n```\n3. 
Install additional packages for training cases\n```\npython -m pip install -e \".[train]\"\npython -m pip install flash-attn --no-build-isolation\n```\n\n\n## MIR\n\nTo reproduce the MIR implementation on this codebase, you can follow these steps:\n1. Specify the ```text_data_path``` and ```image_data_path``` for MIR calculation. You can also specify them like [Line55-64](https://github.com/shikiw/Modality-Integration-Rate/blob/b9ec4d3b080444dcf2b2b7cc3d21a3fdb9dcb42b/mir.py#L55-L64) in ```mir.py```, using TextVQA val images and CNN/DM text by default, i.e., \n   1) Download [TextVQA_0.5.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract to ```PATH/TO/VISION/DATA```.\n   2) Download [CNN stories](https://cs.nyu.edu/~kcho/DMQA/) and extract to ```PATH/TO/TEXT/DATA```.\n   3) Modify [Line55-64](https://github.com/shikiw/Modality-Integration-Rate/blob/b9ec4d3b080444dcf2b2b7cc3d21a3fdb9dcb42b/mir.py#L55-L64) with the text data path and image data path.\n2. If you pre-train only the MLP, run this command:\n```\npython mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --eval_num 100 --mode fast\n```\n3. If you pre-train any part of the ViT or base LLM, run this command:\n```\npython mir.py --model_path PATH/TO/MODEL --eval_num 100 --mode fast\n```\n\n## MoCa\nOur codebase supports ```--use_moca``` to activate the implementation of MoCa. 
Check out ```scripts/v1_5/pre_sft_moca.sh``` for more details.\n\n| Model | Size | Schedule | Average | MMStar | MME | MMB | MMB-CN | SEED-IMG | TextVQA | MM-Vet | POPE | GQA |\n|----------------|-----------|--------|---|---|---|---|---|---|---|---|---|---|\n| LLaVA-v1.5 | 7B | full_ft-1e | 59.1 | 30.3 | 1510.7 | 64.3 | 58.3 | 66.1 | 58.2 | 31.1 | 85.9 | 62.0 |\n| +MoCa | 7B | full_ft-1e | 60.6 | 36.5 | 1481.0 | 66.8 | 60.0 | 67.0 | 58.7 | 32.2 | 86.9 | 62.8 |\n\nThe [pretrained](https://huggingface.co/shikiw/LLaVA-v1.5-MoCa-7B-pretrain) and [finetuned](https://huggingface.co/shikiw/LLaVA-v1.5-MoCa-7B) checkpoints are released.\n\n## Train\nThis codebase is based on [LLaVA](https://github.com/haotian-liu/LLaVA) and [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V). We introduce some new features, and the launch script now supports the following inputs:\n   1) ```--tune_vision_tower``` and ```--tune_vit_from_layer```\n   2) ```--tune_language_model``` and ```--tune_llm_utill_layer```\n   3) ```--tune_entire_model```\n   4) ```--data_scale```\n   5) ```--use_moca``` and ```--moca_std```\n\nSome cases for reference: \n\n1. To pre-train the model with a customized data scale (e.g., 200K):\n```\nsh scripts/v1_5/pre_data_scale.sh\n```\n\n2. To pre-train the model (unlock layers 13-24 of the ViT and layers 1-16 of the base LLM), and SFT (unlock the entire LLM by default):\n```\nsh scripts/v1_5/pre_unlock_vit-12_llm-16_sft.sh\n```\n\n3. To pre-train the model (unlock layers 13-24 of the ViT and the entire base LLM), and SFT (unlock the entire LLM by default):\n```\nsh scripts/v1_5/pre_unlock_vit-12_llm-all_sft.sh\n```\n\n4. To apply MoCa in training:\n```\nsh scripts/v1_5/pre_sft_moca.sh\n```\n\n\n## Evaluation\nWe follow the original evaluation in [LLaVA](https://github.com/haotian-liu/LLaVA) for most benchmarks. 
For [MMStar](https://github.com/MMStar-Benchmark/MMStar), we use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). \n\nSee [Evaluation.md](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md). \n\n\n## Acknowledgement\nThis repo is based on the codebase of [LLaVA](https://github.com/haotian-liu/LLaVA) and [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V). Thanks for their impressive work!\n\n\n## Citation\nIf you find this work useful for your research, please cite our paper:\n```\n@article{huang2024deciphering,\n  title={Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate},\n  author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},\n  journal={arXiv preprint arXiv:2410.07167},\n  year={2024}\n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshikiw%2Fmodality-integration-rate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshikiw%2Fmodality-integration-rate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshikiw%2Fmodality-integration-rate/lists"}