{"id":20161905,"url":"https://github.com/ailab-cvc/m2pt","last_synced_at":"2025-07-25T23:35:13.526Z","repository":{"id":208673962,"uuid":"720995931","full_name":"AILab-CVC/M2PT","owner":"AILab-CVC","description":"[CVPR'24] Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities","archived":false,"fork":false,"pushed_at":"2024-03-13T06:28:04.000Z","size":18326,"stargazers_count":99,"open_issues_count":1,"forks_count":5,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-24T02:06:40.938Z","etag":null,"topics":["artificial-intelligence","deep-learning","multimodal","transformers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2401.14405","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AILab-CVC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-20T06:20:13.000Z","updated_at":"2025-03-16T06:34:38.000Z","dependencies_parsed_at":"2024-11-14T00:21:48.810Z","dependency_job_id":"10ce384e-5689-475c-93a5-7cbed0e6d689","html_url":"https://github.com/AILab-CVC/M2PT","commit_stats":null,"previous_names":["ailab-cvc/m2pt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FM2PT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FM2PT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FM2PT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FM2P
T/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AILab-CVC","download_url":"https://codeload.github.com/AILab-CVC/M2PT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248132316,"owners_count":21053022,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","multimodal","transformers"],"created_at":"2024-11-14T00:21:43.148Z","updated_at":"2025-04-10T00:23:07.219Z","avatar_url":"https://github.com/AILab-CVC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 class=\"title\"\u003eMultimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities [CVPR 2024]\u003c/h1\u003e\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets/banner.png\"  width=\"80%\" height=\"100%\"\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n     \u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://invictus717.github.io/\" target=\"_blank\"\u003eYiyuan Zhang\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e,\u003c/span\u003e\n    \u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://dingxiaohan.xyz/\" target=\"_blank\"\u003eXiaohan Ding\u003c/a\u003e\u003csup\u003e2\u003c/sup\u003e,\n    \u003c/span\u003e\n    \u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://kxgong.github.io/\" target=\"_blank\"\u003eKaixiong Gong\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e,\u003c/span\u003e\n    \u003cspan class=\"author-block\"\u003e\n    
\u003ca href=\"https://geyixiao.com/\" target=\"_blank\"\u003eYixiao Ge\u003c/a\u003e\u003csup\u003e2\u003c/sup\u003e,\n    \u003c/span\u003e\n    \u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://scholar.google.com/citations?user=4oXBp9UAAAAJ\u0026hl=en\u0026oi=ao\" target=\"_blank\"\u003eYing Shan\u003c/a\u003e\u003csup\u003e2\u003c/sup\u003e,\n    \u003c/span\u003e\n    \u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"http://people.eecs.berkeley.edu/~xyyue/\" target=\"_blank\"\u003eXiangyu Yue\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e\n    \u003c/span\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003csup\u003e1\u003c/sup\u003e\n    \u003ca href='http://mmlab.ie.cuhk.edu.hk/' target='_blank'\u003eThe Chinese University of Hong Kong\u003c/a\u003e\u0026emsp;\n    \u003csup\u003e2\u003c/sup\u003e\n    \u003ca href='https://ai.tencent.com/' target='_blank'\u003eTencent AI Lab\u003c/a\u003e\u0026emsp;\n\u003c/div\u003e\n\n-----------------\n\n[![arXiv](https://img.shields.io/badge/arxiv-2401.14405-b31b1b?style=plastic\u0026color=b31b1b\u0026link=https%3A%2F%2Farxiv.org%2Fabs%2F2401.14405)](https://arxiv.org/abs/2401.14405)\n[![website](https://img.shields.io/badge/Project-Website-blue)](https://ailab-cvc.github.io/M2PT/)\n\n### Inspiration of Multimodal Pathway\nThis diagram's composition is inspired by a figure in Jeff Dean's blog post, where he envisions \"pathways\" as a high-level concept for general AI models. Our proposed Multimodal Pathway Transformer is a novel approach, and we are delighted to discover that some of its effects align with Jeff Dean's high-level vision, such as training a single model to *do many things, enabling multiple senses, and making models sparse and efficient*. Multimodal Pathway Transformer can be seen as an initial exploration of this \"pathways\" concept in the context of basic Transformer models and multimodal learning. 
Read more about Jeff Dean's concept in \u003ca href=\"https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/\" target=\"_blank\"\u003ehis blog post\u003c/a\u003e.\n\n### Abstract\nWe propose to improve transformers of a specific modality with irrelevant data from other modalities, *e.g.*, improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (*e.g.* CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. 
On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets/result.png\"  width=\"100%\" height=\"80%\"\u003e\n\u003c/p\u003e\n\n## Model Zoo\n\n\n|      Model      |   Modality   | Pretraining | #Param |                                               Google Drive | Tencent Cloud                                               |\n| :------------: | :----------: | :----------------------: | :----: | :------: |:--------: | \n| ViT-B16  | Image |         MAE         |  86M  |   [ckpt](https://drive.google.com/file/d/1GG8W-n3rJwJKSfmFE7npARbeJTPYzDTo/view?usp=drive_link)    | [ckpt](https://share.weiyun.com/rFarvpGx)\n| Audio ViT-B16  | Audio |         Audio MAE         |  86M  |   [ckpt](https://drive.google.com/file/d/1xSelnHFBB27tZjOtP_m8ehRon3Ea8zlR/view?usp=drive_link)    | [ckpt](https://share.weiyun.com/QLoaFJyi)\n| Point ViT-B16  | Point |         Point MAE         |  86M  |   [ckpt](https://drive.google.com/file/d/1c3t4wcd34OU4E56tIjhOy91wiY3Pgt5_/view?usp=drive_link)    | [ckpt](https://share.weiyun.com/lnoMWqR8)\n| Video ViT-B16  | Video |         Video MAE         |  86M  |   [ckpt](https://drive.google.com/file/d/17FDpa7qUrPGv6NYyHsUrAH1x_ZyaxQ0g/view?usp=sharing)    | [ckpt](https://share.weiyun.com/QSYmzg0I)\n\n## Usage\n\n* Demo of use:\n\n    ```python\n    #  Create Model\n    import torch,timm\n    model = timm.create_model(\"vit_base_patch16_224\",pretrained = False)\n    aux_model = timm.create_model(\"vit_base_patch16_224\",pretrained = False)\n    #  Load Pretrained Models\n    pretrained_state_dict = torch.load(\"Image_ViT_B16.pth\")\n    aux_state_dict = torch.load(\"Audio_ViT_B16.pth\")\n    model.load_state_dict(pretrained_state_dict, strict=True)\n    aux_model.load_state_dict(aux_state_dict, strict=True)\n    #  Construct Multimodal Pathway\n    from 
multimodal_pathway import reparameterize_aux_into_target_model\n    reparameterize_aux_into_target_model(model, aux_model, layer_names=('attn.qkv', 'attn.proj', 'mlp.fc1','mlp.fc2'))\n    ```\n\n* For image recognition, please refer to [image doc](Image/README.md).\n\n* For video recognition, please refer to [video doc](Video/README.md).\n\n* For point cloud analysis, please refer to [pcd doc](Point/README.md).\n\n* For audio recognition, please refer to [audio doc](Audio/README.md).\n\n\u003csection class=\"section\" id=\"BibTeX\"\u003e\n    \u003cdiv class=\"container is-max-desktop content\"\u003e\n      \u003ch2 class=\"title\"\u003eBibTeX\u003c/h2\u003e\n      If you find our work useful, please kindly cite:\n      \u003cpre\u003e\u003ccode\u003e@article{zhang2024multimodal,\n      title={Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities},\n      author={Zhang, Yiyuan and Ding, Xiaohan and Gong, Kaixiong and Ge, Yixiao and Shan, Ying and Yue, Xiangyu},\n      journal={arXiv preprint arXiv:2401.14405},\n      year={2024}\n    }\u003c/code\u003e\u003c/pre\u003e\n    \u003c/div\u003e\n\u003c/section\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Failab-cvc%2Fm2pt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Failab-cvc%2Fm2pt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Failab-cvc%2Fm2pt/lists"}