{"id":18322519,"url":"https://github.com/tencentarc/vit-lens","last_synced_at":"2025-04-04T20:08:10.452Z","repository":{"id":187806029,"uuid":"677357222","full_name":"TencentARC/ViT-Lens","owner":"TencentARC","description":"[CVPR 2024] ViT-Lens: Towards Omni-modal Representations","archived":false,"fork":false,"pushed_at":"2025-02-03T03:35:05.000Z","size":138125,"stargazers_count":174,"open_issues_count":4,"forks_count":10,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-28T19:05:58.638Z","etag":null,"topics":["multimodal-learning"],"latest_commit_sha":null,"homepage":"https://ailab-cvc.github.io/seed/vitlens/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-11T11:20:31.000Z","updated_at":"2025-03-28T16:48:25.000Z","dependencies_parsed_at":"2025-02-17T15:11:32.707Z","dependency_job_id":"417fe983-7f8d-417e-b322-7ce560c447c6","html_url":"https://github.com/TencentARC/ViT-Lens","commit_stats":null,"previous_names":["tencentarc/vit-lens"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FViT-Lens","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FViT-Lens/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FViT-Lens/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FViT-Lens/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/ViT-Lens/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242671,"owners_count":20907133,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multimodal-learning"],"created_at":"2024-11-05T18:24:59.465Z","updated_at":"2025-04-04T20:08:10.418Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ViT-Lens\n\n[![Project Homepage](https://img.shields.io/badge/Project-Homepage-green)](https://ailab-cvc.github.io/seed/vitlens/)\n[![arXiv](https://img.shields.io/badge/arXiv-2311.16081-b31b1b.svg)](https://arxiv.org/abs/2311.16081)\n[![arXiv](https://img.shields.io/badge/arXiv-2308.10185-b31b1b.svg)](https://arxiv.org/abs/2308.10185)\n[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/ViT-Lens/tree/main)\n\n*TL;DR*: We present ViT-Lens, an approach for advancing omni-modal representation learning by leveraging a pretrained-ViT with modality Lens to comprehend diverse modalities.\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/vitlens-teaser.png\" alt=\"vit-lens-omni-modal\" width=\"400\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/vitlens-sc.png\" alt=\"vit-lens-capabilities\" width=\"600\" /\u003e\n\u003c/p\u003e\n\n### 📢 News\n\u003c!--  --\u003e\n- [2023.12.13] We release training code and models of ViT-Lens.\n- 
[2023.11.28] We upgrade ViT-Lens, with added modalities and applications. Stay tuned for the release of code and models [[`arXiv paper`](https://arxiv.org/abs/2311.16081)].\n- [2023.08.22] We release the arXiv paper, inference codes and checkpoints for 3D [[`arXiv paper`](https://arxiv.org/abs/2308.10185)]. \n\n### 📝 Todo\n- [x] Models for more modalities.\n- [ ] Code for ViT-Lens integration with InstructBLIP and SEED.\n- [ ] Online demo for ViT-Lens integration with InstructBLIP and SEED.\n\n## 🔨 Installation\n```shell\nconda create -n vit-lens python=3.8.8 -y\nconda activate vit-lens\n\n# Install pytorch\u003e=1.9.0 \nconda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch -y\n\n# Install ViT-Lens\ngit clone https://github.com/TencentARC/ViT-Lens.git\ncd ViT-Lens/\npip install -e vitlens/\npip install -r vitlens/requirements-training.txt\n```\n\u003cdetails\u003e\n  \u003csummary\u003eTraining/Inference on OpenShape Triplets on 3D point clouds: environment setup (click to expand)\u003c/summary\u003e\n\n```shell\nconda create -n vit-lens python=3.8.8 -y\nconda activate vit-lens\nconda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y\nconda install -c dglteam/label/cu113 dgl -y\n\n# Install ViT-Lens\ngit clone https://github.com/TencentARC/ViT-Lens.git\ncd ViT-Lens/\npip install -e vitlens/\npip install -r vitlens/requirements-training.txt\n```\n\u003c/details\u003e\n\n## 🔍 ViT-Lens Model\n|                 |   MN40   |  SUN.D   |  NYU.D   | Audioset | VGGSound |  ESC50   |    Clotho    |   AudioCaps   |  TAG.M   |  IN.EEG  |                           Download                           |\n| --------------- | :------: | :------: | :------: | :------: | :------: | :------: | :----------: | :-----------: | :------: | :------: | :----------------------------------------------------------: |\n| ImageBind(Huge) |    -     |   35.1   |   54.0   |   17.6   |   27.8   |   66.9   |   
6.0/28.4   |   9.3/42.3    |    -     |    -     |                              -                               |\n| ViT-Lens-L| **80.6** | **52.2** | **68.5** | **26.7** | **31.7** | **75.9** | **8.1/31.2** | **14.4/54.9** | **65.8** | **42.7** | [vitlensL](https://huggingface.co/TencentARC/ViT-Lens/blob/main/vitlensL.pt) |\n\nWe release a one-stop `ViT-Lens-L` model (based on Large ViT) and show its performance on ModelNet40 (MN40, top1 accuracy), SUN RGBD Depth-only (SUN.D, top1 accuracy), NYUv2 Depth-only (NYU.D, top1 accuracy), Audioset (Audioset, mAP), VGGSound (VGGSound, top1 accuracy), ESC50 (ESC50, top1 accuracy), Clotho (Clotho, R@1/R@10), AudioCaps (AudioCaps, R@1/R@10), TAG.M (Touch-and-Go Material, top1 accuracy) and IN.EEG (ImageNet EEG, top1 accuracy). ViT-Lens-L outperforms ImageBind (Huge) on every benchmark for which both report results.\n\nFor more model checkpoints (trained on different data or with better performance), please refer to [MODEL_ZOO.md](MODEL_ZOO.md).\n\n\n## 📚 Usage\n- You may set the paths for your own project in [constants.py](vitlens/src/open_clip/constants.py).\n- We provide an API ([source file](vitlens/src/mm_vit_lens/vitlens.py)) and an example ([here](example.py)) for reference. 
You can use ViT-Lens to extract and compare features across modalities:\n  ```python\n  import os\n  import torch\n\n  from open_clip import ModalityType\n  from mm_vit_lens import ViTLens\n\n  here = os.path.abspath(os.path.dirname(__file__))\n\n  model = ViTLens(modality_loaded=[ModalityType.IMAGE, ModalityType.AUDIO, ModalityType.TEXT, ModalityType.PC])\n\n  device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n  model = model.to(device)\n\n  # Example 1\n  images = [\n      os.path.join(here, \"assets/example/image_bird.jpg\"),\n      os.path.join(here, \"assets/example/image_fire.jpg\"),\n      os.path.join(here, \"assets/example/image_dog.jpg\"),\n      os.path.join(here, \"assets/example/image_beach.jpg\"),\n  ]\n  audios = [\n      os.path.join(here, \"assets/example/audio_chirping_birds.flac\"),\n      os.path.join(here, \"assets/example/audio_crackling_fire.flac\"),\n      os.path.join(here, \"assets/example/audio_dog.flac\"),\n      os.path.join(here, \"assets/example/audio_sea_wave.flac\"),\n  ]\n  texts = [\n      \"a bird\",\n      \"crackling fire\",\n      \"a dog\",\n      \"sea wave\",\n  ]\n  inputs_1 = {\n      ModalityType.IMAGE: images,\n      ModalityType.AUDIO: audios,\n      ModalityType.TEXT: texts,\n  }\n\n  with torch.no_grad(), torch.cuda.amp.autocast():\n      outputs_1 = model.encode(inputs_1, normalize=True)\n\n  sim_at = torch.softmax(100 * outputs_1[ModalityType.AUDIO] @ outputs_1[ModalityType.TEXT].T, dim=-1)\n  print(\n      \"Audio x Text:\\n\",\n      sim_at\n  )\n  # Expected output\n  # Audio x Text:\n  #  tensor([[9.9998e-01, 9.3977e-07, 2.1545e-05, 9.3642e-08],\n  #         [3.8017e-09, 1.0000e+00, 3.1551e-09, 6.9498e-10],\n  #         [9.4895e-03, 1.3270e-06, 9.9051e-01, 2.5545e-07],\n  #         [9.7020e-06, 6.4767e-07, 2.8860e-06, 9.9999e-01]], device='cuda:0')\n\n  sim_ai = torch.softmax(100 * outputs_1[ModalityType.AUDIO] @ outputs_1[ModalityType.IMAGE].T, dim=-1)\n  print(\n      \"Audio x Image:\\n\",\n      
sim_ai\n  )\n  # Expected output\n  # Audio x Image:\n  #  tensor([[1.0000e+00, 1.5798e-06, 2.0614e-06, 1.6502e-07],\n  #         [2.3712e-09, 1.0000e+00, 1.4446e-10, 1.2260e-10],\n  #         [4.9333e-03, 1.2942e-02, 9.8212e-01, 1.8582e-06],\n  #         [6.8347e-04, 1.0547e-02, 1.3476e-05, 9.8876e-01]], device='cuda:0')\n\n\n  # Example 2\n  pcs = [\n      os.path.join(here, \"assets/example/pc_car_0260.npy\"),\n      os.path.join(here, \"assets/example/pc_guitar_0243.npy\"),\n      os.path.join(here, \"assets/example/pc_monitor_0503.npy\"),\n      os.path.join(here, \"assets/example/pc_person_0102.npy\"),\n      os.path.join(here, \"assets/example/pc_piano_0286.npy\"),\n  ]\n  text_pcs = [\"a car\", \"a guitar\", \"a monitor\", \"a person\", \"a piano\"]\n  inputs_2 = {\n      ModalityType.PC: pcs,\n      ModalityType.TEXT: text_pcs,\n  }\n  with torch.no_grad(), torch.cuda.amp.autocast():\n      outputs_2 = model.encode(inputs_2, normalize=True)\n  sim_pc_t = torch.softmax(100 * outputs_2[ModalityType.PC] @ outputs_2[ModalityType.TEXT].T, dim=-1)\n  print(\n      \"PointCloud x Text:\\n\",\n      sim_pc_t\n  )\n  # Expected output:\n  # PointCloud x Text:\n  #  tensor([[9.9945e-01, 1.0483e-05, 1.4904e-04, 2.3988e-05, 3.7041e-04],\n  #         [1.2574e-09, 1.0000e+00, 6.8450e-09, 2.6463e-08, 3.3659e-07],\n  #         [6.2730e-09, 1.9918e-06, 9.9999e-01, 6.7161e-06, 4.9279e-06],\n  #         [1.8846e-06, 7.4831e-06, 4.4594e-06, 9.9998e-01, 7.9092e-06],\n  #         [1.2218e-08, 1.5571e-06, 1.8991e-07, 1.7521e-08, 1.0000e+00]],\n  #        device='cuda:0')\n\n  ```\n\n\n## 📦 Datasets\nPlease refer to [DATASETS.md](DATASETS.md) for dataset preparation. 
\n\n## 🚀 Training \u0026 Inference\nPlease refer to [TRAIN_INFERENCE.md](TRAIN_INFERENCE.md) for details.\n\n## 🧩 Model Zoo\nPlease refer to [MODEL_ZOO.md](MODEL_ZOO.md) for details.\n\n\n## 👀 Visualization of Demo\n\n\u003cdetails open\u003e\u003csummary\u003e[ Plug ViT-Lens into SEED: Video Demo ]\u003c/summary\u003e\u003cimg src=\"./assets/vid_seed.gif\" alt=\"vitlens-seed.video\" style=\"width: 80%; height: auto;\"\u003e\u003c/details\u003e\n\n\u003cdetails close\u003e\u003csummary\u003e[ Plug ViT-Lens into SEED: enabling compound Any-to-Image Generation ]\u003c/summary\u003e\u003cimg src=\"./assets/seed_integrated.png\" alt=\"vitlens-seed\" style=\"width: 70%; height: auto;\"\u003e\u003c/details\u003e\n\n\n\u003cdetails open\u003e\u003csummary\u003e[ Plug ViT-Lens into InstructBLIP: Video Demo ]\u003c/summary\u003e\u003cimg src=\"./assets/insblip.gif\" alt=\"insblip.video\" style=\"width: 80%; height: auto;\"\u003e\u003c/details\u003e\n\n\u003cdetails close\u003e\u003csummary\u003e[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following ]\u003c/summary\u003e\u003cimg src=\"./assets/insblip_2inp.png\" alt=\"vitlens.instblip2\" style=\"width: 70%; height: auto;\"\u003e\n\u003c/details\u003e\n\n\u003cdetails close\u003e\u003csummary\u003e[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following ]\u003c/summary\u003e\u003cimg src=\"./assets/insblip_3inp.png\" alt=\"mmvitlens.instblip3\" style=\"width: 70%; height: auto;\"\u003e\n\u003c/details\u003e\n\n\u003cdetails close\u003e\u003csummary\u003e[ Example: Plug 3D lens to LLM ]\u003c/summary\u003e\u003cimg src=\"./assets/e_3d_plant.png\" alt=\"plant\" style=\"width: 60%; height: auto;\"\u003e\n\u003c/details\u003e\n\n\u003cdetails close\u003e\u003csummary\u003e[ Example: Plug 3D lens to LLM ]\u003c/summary\u003e\u003cimg src=\"./assets/e_3d_piano.png\" alt=\"piano\" style=\"width: 60%; height: auto;\"\u003e\n\u003c/details\u003e\n\n\n## 🎓 Citation\nIf you find our work helpful, please give 
us a star🌟 and consider citing:\n```bib\n@InProceedings{Lei_VitLens_CVPR_2024,\n    author    = {Lei, Weixian and Ge, Yixiao and Yi, Kun and Zhang, Jianfeng and Gao, Difei and Sun, Dylan and Ge, Yuying and Shan, Ying and Shou, Mike Zheng},\n    title     = {ViT-Lens: Towards Omni-modal Representations},\n    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    month     = {June},\n    year      = {2024},\n    pages     = {26647-26657}\n}\n```\n\n\n## ✉️ Contact\nQuestions and discussions are welcome via leiwx52@gmail.com or open an issue.\n\n\n## 🙏 Acknowledgement\nThis codebase is based on [open_clip](https://github.com/mlfoundations/open_clip), [ULIP](https://github.com/salesforce/ULIP), [OpenShape](https://github.com/Colin97/OpenShape_code) and [LAVIS](https://github.com/salesforce/LAVIS). Big thanks to the authors for their awesome contributions!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fvit-lens","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fvit-lens","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fvit-lens/lists"}