{"id":13597922,"url":"https://github.com/invictus717/MetaTransformer","last_synced_at":"2025-04-10T06:30:35.606Z","repository":{"id":180135776,"uuid":"663918038","full_name":"invictus717/MetaTransformer","owner":"invictus717","description":"Meta-Transformer for Unified Multimodal Learning","archived":false,"fork":false,"pushed_at":"2023-12-05T07:36:11.000Z","size":22707,"stargazers_count":1513,"open_issues_count":3,"forks_count":113,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-10-29T17:41:10.051Z","etag":null,"topics":["artificial-intelligence","computer-vision","foundationmodel","machine-learning","multimedia","multimodal","transformers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2307.10802","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/invictus717.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-07-08T12:40:54.000Z","updated_at":"2024-10-29T12:36:12.000Z","dependencies_parsed_at":null,"dependency_job_id":"5fdcedce-bdac-4a43-9bd4-293ecbd11103","html_url":"https://github.com/invictus717/MetaTransformer","commit_stats":null,"previous_names":["invictus717/metatransformer"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/invictus717%2FMetaTransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/invictus717%2FMetaTransformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/invictus717%2FMetaTransformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/invictus717%2FMetaTransformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/invictus717","download_url":"https://codeload.github.com/invictus717/MetaTransformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223100261,"owners_count":17087388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","computer-vision","foundationmodel","machine-learning","multimedia","multimodal","transformers"],"created_at":"2024-08-01T17:00:43.580Z","updated_at":"2024-11-06T22:31:03.233Z","avatar_url":"https://github.com/invictus717.png","language":"Python","funding_links":[],"categories":["Python","多模态大模型","Paper List","Open Source Projects"],"sub_categories":["网络服务_其他","Seminal Papers"],"readme":"\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets\\Meta-Transformer_banner.png\"  width=\"80%\" height=\"80%\"\u003e\n\u003c/p\u003e\n\n\u003cdiv\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003ca href='https://scholar.google.com/citations?user=KuYlJCIAAAAJ\u0026hl=en' target='_blank'\u003eYiyuan Zhang\u003csup\u003e1,2*\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://kxgong.github.io/' target='_blank'\u003eKaixiong Gong\u003csup\u003e1,2*\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='http://kpzhang93.github.io/' target='_blank'\u003eKaipeng Zhang\u003csup\u003e2,†\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003c/br\u003e\n    \u003ca href='http://www.ee.cuhk.edu.hk/~hsli/' 
target='_blank'\u003eHongsheng Li \u003csup\u003e1,2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://mmlab.siat.ac.cn/yuqiao/index.html' target='_blank'\u003eYu Qiao \u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://wlouyang.github.io/' target='_blank'\u003eWanli Ouyang\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='http://people.eecs.berkeley.edu/~xyyue/' target='_blank'\u003eXiangyu Yue\u003csup\u003e1,†,‡\u003c/sup\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cdiv\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003csup\u003e1\u003c/sup\u003e\n    \u003ca href='http://mmlab.ie.cuhk.edu.hk/' target='_blank'\u003eMultimedia Lab, The Chinese University of Hong Kong\u003c/a\u003e\u0026emsp;\n    \u003c/br\u003e\n    \u003csup\u003e2\u003c/sup\u003e \u003ca href='https://github.com/OpenGVLab' target='_blank'\u003eOpenGVLab，Shanghai AI Laboratory \n    \u003c/a\u003e\u003c/br\u003e\n    \u003csup\u003e*\u003c/sup\u003e Equal Contribution\u0026emsp;\n    \u003csup\u003e†\u003c/sup\u003e Corresponding Author\u0026emsp;\n    \u003csup\u003e‡\u003c/sup\u003e Project Lead\u0026emsp;\n\u003c/div\u003e\n\n-----------------\n\n[![arXiv](https://img.shields.io/badge/arxiv-2307.10802-b31b1b?style=plastic\u0026color=b31b1b\u0026link=https%3A%2F%2Farxiv.org%2Fabs%2F2307.10802)](https://arxiv.org/abs/2307.10802)\n[![website](https://img.shields.io/badge/Project-Website-brightgreen)](https://kxgong.github.io/meta_transformer/)\n[![blog-cn](https://img.shields.io/badge/%E6%9C%BA%E5%99%A8%E4%B9%8B%E5%BF%83-%E7%AE%80%E4%BB%8B-brightgreen)](https://mp.weixin.qq.com/s/r38bzqdJxDZUvtDI0c9CEw)\n[![Hugging Face 
Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Space-blue)](https://huggingface.co/papers/2307.10802)\n[![OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/zhangyiyuan/MetaTransformer)\n![](https://img.shields.io/github/stars/invictus717/MetaTransformer?style=social)\n\u003ca href=\"https://twitter.com/_akhaliq/status/1682248055637041152\"\u003e\u003cimg src=\"https://img.icons8.com/color/48/000000/twitter.png\" width=\"25\" height=\"25\"\u003e\u003c/a\u003e\n\u003ca href=\"https://www.youtube.com/watch?v=V8L8xbsTyls\u0026ab_channel=CSBoard\"\u003e\u003cimg src=\"https://img.icons8.com/color/48/000000/youtube-play.png\" width=\"25\" height=\"25\"\u003e\u003c/a\u003e \u003ca href='https://huggingface.co/kxgong/Meta-Transformer'\u003e \u003cimg src=\"assets\\icons\\huggingface.png\" width=\"25\" height=\"25\"\u003e \u003c/a\u003e \u003ca href='https://open.spotify.com/episode/6JJxcy2zMtTwr4jXPQEXjh'\u003e \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/1/19/Spotify_logo_without_text.svg\" width=\"20\" height=\"20\"\u003e\u003c/a\u003e\n\n\n## Meta-Transformer with Large Language Models ✨✨✨\n\nWe're thrilled to present [OneLLM](https://github.com/csuhan/OneLLM), which combines the Meta-Transformer framework with multimodal large language models. It performs multimodal joint training🚀, supports more modalities, including fMRI, depth, and normal maps 🚀, and demonstrates strong performance on **25** benchmarks🚀🚀🚀. \n\n🔥🔥 The code, pretrained models, and datasets are publicly available at [OneLLM](https://github.com/csuhan/OneLLM).\n\n🔥🔥 The project website is at [OneLLM](https://onellm.csuhan.com/).\n\n### 🌟 Single Foundation Model Supports A Wide Range of Applications\n\n\n\nAs a foundation model, Meta-Transformer can handle data from 12 modalities, which allows it to support a wide range of applications. 
As shown in this figure, Meta-Transformer can provide services for downstream tasks including stock analysis 📈, weather forecasting ☀️ ☔ ☁️ ❄️ ⛄ ⚡, remote sensing 📡, autonomous driving 🚗, social network 🌍, speech recognition 🔉, etc.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets\\Meta-Transformer_application.png\"  width=\"100%\" height=\"100%\"\u003e\n\u003c/p\u003e\n\n**Table 1**: Meta-Transformer is capable of handling up to 12 modalities, including natural language \u003cimg src=\"assets\\icons\\text.jpg\" width=\"15\" height=\"15\"\u003e, RGB images \u003cimg src=\"assets\\icons\\img.jpg\" width=\"15\" height=\"15\"\u003e, point clouds \u003cimg src=\"assets\\icons\\pcd.jpg\" width=\"15\" height=\"15\"\u003e, audios \u003cimg src=\"assets\\icons\\audio.jpg\" width=\"15\" height=\"15\"\u003e, videos \u003cimg src=\"assets\\icons\\video.jpg\" width=\"15\" height=\"15\"\u003e, tabular data \u003cimg src=\"assets\\icons\\table.jpg\" width=\"15\" height=\"15\"\u003e, graph \u003cimg src=\"assets\\icons\\graph.jpg\" width=\"15\" height=\"15\"\u003e, time series data \u003cimg src=\"assets\\icons\\time.jpg\" width=\"15\" height=\"15\"\u003e, hyper-spectral images \u003cimg src=\"assets\\icons\\hyper.jpg\" width=\"15\" height=\"15\"\u003e, IMU \u003cimg src=\"assets\\icons\\imu.jpg\" width=\"15\" height=\"15\"\u003e, medical images \u003cimg src=\"assets\\icons\\xray.jpg\" width=\"15\" height=\"15\"\u003e, and infrared images \u003cimg src=\"assets\\icons\\infrared.jpg\" width=\"15\" height=\"15\"\u003e.\n\u003cp align=\"left\"\u003e\n\u003cimg src=\"assets\\Meta-Transformer_cmp.png\" width=100%\u003e\n\u003c/p\u003e\n\n## 🚩🚩🚩 Shared-Encoder, Unpaired Data, More Modalities \n\n\n\u003cdiv\u003e\n  \u003cimg class=\"image\" src=\"assets\\Meta-Transformer_teaser.png\" width=\"52%\" height=\"100%\"\u003e\n  \u003cimg class=\"image\" src=\"assets\\Meta-Transformer_exp.png\" width=\"45.2%\" height=\"100%\"\u003e\n\u003c/div\u003e\n\n\nThis 
repository is built to explore the potential and extensibility of Transformers for multimodal learning. We exploit the ability of Transformers to process variable-length sequences. We propose *Data-to-Sequence* tokenization, which follows a common meta-scheme, and apply it to 12 modalities: text, image, point cloud, audio, video, infrared, hyper-spectral, X-Ray, tabular, graph, time-series, and Inertial Measurement Unit (IMU) data.\n\n\u003cp align=\"left\"\u003e\n\u003cimg src=\"assets\\Meta-Transformer_data2seq.png\" width=100%\u003e\n\u003c/p\u003e\n\nAfter obtaining the token sequence, we employ a modality-shared encoder to extract representations across modalities. With task-specific heads, Meta-Transformer can handle various tasks on the different modalities, such as classification, detection, and segmentation.\n\n\u003cp align=\"left\"\u003e\n\u003cimg src=\"assets\\Meta-Transformer_framework.png\" width=100%\u003e\n\u003c/p\u003e\n\n\n\n# 🌟 News\n* **2023.8.17:** Released code for directly obtaining embeddings from multiple modalities. We will also release code for applying Meta-Transformer to human-centric vision tasks.\n* **2023.8.2:** 🎉🎉🎉 The implementation of Meta-Transformer for image, point cloud, graph, tabular, time-series, X-Ray, hyper-spectral, and LiDAR data has been released. We have also released a powerful foundation model for autonomous driving 🚀🚀🚀.  \n* **2023.7.22:** Pretrained weights and a usage demo for Meta-Transformer have been released. Comprehensive documentation and the implementation for the image modality are underway and will be released soon. 
Stay tuned for more exciting updates!⌛⌛⌛\n* **2023.7.21:** The paper is released on [arXiv](https://arxiv.org/abs/2307.10802), and the code will be released gradually.\n* **2023.7.8:** GitHub repository initialized.\n\n# 🔓 Model Zoo\n\n\u003c!-- \u003cdetails\u003e --\u003e\n\u003csummary\u003e Open-source Modality-Agnostic Models \u003c/summary\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n\n| Model | Pretraining | Scale | #Param | Download | Download (mainland China mirror) |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| Meta-Transformer-B16 | LAION-2B | Base | 85M | [ckpt](https://drive.google.com/file/d/19ahcN2QKknkir_bayhTW5rucuAiX0OXq/view?usp=sharing) | [ckpt](https://download.openxlab.org.cn/models/zhangyiyuan/MetaTransformer/weight//Meta-Transformer_base_patch16_encoder) |\n| Meta-Transformer-L14 | LAION-2B | Large | 302M | [ckpt](https://drive.google.com/file/d/15EtzCBAQSqmelhdLz6k880A19_RpcX9B/view?usp=drive_link) | [ckpt](https://download.openxlab.org.cn/models/zhangyiyuan/MetaTransformer/weight//Meta-Transformer_large_patch14_encoder) |\n\n\u003c/div\u003e\n\n\u003c!-- \u003c/details\u003e --\u003e\n\n\u003c!-- \u003cdetails\u003e --\u003e\n* Usage demo for the pretrained encoder\n\n```python\nimport torch\nimport torch.nn as nn\nfrom timm.models.vision_transformer import Block\nfrom Data2Seq import Data2Seq\n\n# Build one tokenizer per modality; dim=768 matches the base-scale encoder.\nvideo_tokenizer = Data2Seq(modality='video', dim=768)\naudio_tokenizer = Data2Seq(modality='audio', dim=768)\ntime_series_tokenizer = Data2Seq(modality='time-series', dim=768)\n\n# `video`, `audio`, and `time_data` are your own input tensors (not defined here).\n# Concatenate the per-modality tokens along dim=1.\nfeatures = torch.cat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)\n# For base-scale encoder:\nckpt = torch.load(\"Meta-Transformer_base_patch16_encoder.pth\")\nencoder = 
nn.Sequential(*[\n            Block(\n                dim=768,\n                num_heads=12,\n                mlp_ratio=4.,\n                qkv_bias=True,\n                norm_layer=nn.LayerNorm,\n                act_layer=nn.GELU\n            )\n            for _ in range(12)])\nencoder.load_state_dict(ckpt, strict=True)\n# For large-scale encoder (an alternative to the base encoder; input features must have dim=1024):\nckpt = torch.load(\"Meta-Transformer_large_patch14_encoder.pth\")\nencoder = nn.Sequential(*[\n            Block(\n                dim=1024,\n                num_heads=16,\n                mlp_ratio=4.,\n                qkv_bias=True,\n                norm_layer=nn.LayerNorm,\n                act_layer=nn.GELU\n            )\n            for _ in range(24)])\nencoder.load_state_dict(ckpt, strict=True)\n# Run the encoder whose width matches the feature dimension.\nencoded_features = encoder(features)\n```\n\u003c!-- \u003c/details\u003e --\u003e\n\n# 🕙 ToDo\n- [x] Meta-Transformer with Large Language Models.\n- [x] Multimodal Joint Training with Meta-Transformer.\n- [x] Support More Modalities and More Tasks.\n\n# Contact\n🚀🚀🚀 We aspire to shape this repository into **a formidable foundation for mainstream AI perception tasks across diverse modalities**. 
Your contributions can play a significant role in this endeavor, and we warmly welcome your participation in our project!\n\nTo contact us, do not hesitate to send an email to `yiyuanzhang.ai@gmail.com`, `kaixionggong@gmail.com`, `zhangkaipeng@pjlab.org.cn`, or `xyyue@ie.cuhk.edu.hk`!\n\u003cbr\u003e\u003c/br\u003e\n\n\u0026ensp;\n# Citation\nIf the code and paper help your research, please cite:\n```\n@article{zhang2023meta,\n  title={Meta-transformer: A unified framework for multimodal learning},\n  author={Zhang, Yiyuan and Gong, Kaixiong and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Ouyang, Wanli and Yue, Xiangyu},\n  journal={arXiv preprint arXiv:2307.10802},\n  year={2023}\n}\n```\n# License\nThis project is released under the [Apache 2.0 license](LICENSE).\n# Acknowledgement\nThis code builds on excellent open-source projects, including [MMClassification](https://github.com/open-mmlab/mmpretrain/tree/mmcls-1.x), [MMDetection](https://github.com/open-mmlab/mmdetection), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [OpenPoints](https://github.com/guochengqian/openpoints), [Time-Series-Library](https://github.com/thuml/Time-Series-Library), [Graphormer](https://github.com/microsoft/Graphormer), [SpectralFormer](https://github.com/danfenghong/IEEE_TGRS_SpectralFormer), and [ViT-Adapter](https://github.com/czczup/ViT-Adapter).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finvictus717%2FMetaTransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finvictus717%2FMetaTransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finvictus717%2FMetaTransformer/lists"}