{"id":15108670,"url":"https://github.com/czczup/vit-adapter","last_synced_at":"2025-05-16T03:03:30.015Z","repository":{"id":37544927,"uuid":"492936445","full_name":"czczup/ViT-Adapter","owner":"czczup","description":"[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions","archived":false,"fork":false,"pushed_at":"2024-03-18T09:37:37.000Z","size":1862,"stargazers_count":1368,"open_issues_count":80,"forks_count":144,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-16T03:03:26.405Z","etag":null,"topics":["adapter","object-detection","semantic-segmentation","vision-transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2205.08534","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/czczup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-16T17:32:59.000Z","updated_at":"2025-05-15T08:15:32.000Z","dependencies_parsed_at":"2023-02-16T04:45:48.415Z","dependency_job_id":"5fad28cf-7c84-43f0-9821-73c12dda0491","html_url":"https://github.com/czczup/ViT-Adapter","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czczup%2FViT-Adapter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czczup%2FViT-Adapter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czczup%2FViT-Adapter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czczup%2FViT-Adapter/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/czczup","download_url":"https://codeload.github.com/czczup/ViT-Adapter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254459083,"owners_count":22074604,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapter","object-detection","semantic-segmentation","vision-transformer"],"created_at":"2024-09-25T22:21:29.063Z","updated_at":"2025-05-16T03:03:27.353Z","avatar_url":"https://github.com/czczup.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
ViT-Adapter\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/semantic-segmentation-on-cityscapes)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/semantic-segmentation-on-cityscapes-val)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/semantic-segmentation-on-pascal-context)](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/instance-segmentation-on-coc
o)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=vision-transformer-adapter-for-dense)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformer-adapter-for-dense/panoptic-segmentation-on-coco-minival)](https://paperswithcode.com/sota/panoptic-segmentation-on-coco-minival?p=vision-transformer-adapter-for-dense)\n\nThe official implementation of the paper \"[Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534)\".\n\n[Paper](https://arxiv.org/abs/2205.08534) | [Blog in Chinese](https://zhuanlan.zhihu.com/p/608272954) | [Slides](https://drive.google.com/file/d/1LotIZIEnZzKhsANjBTZs3qcezk9fbVCV/view?usp=share_link) | [Poster](https://iclr.cc/media/PosterPDFs/ICLR%202023/12048.png?t=1680764158.7068026) | [Video in English](https://iclr.cc/virtual/2023/poster/12048) | [Video in Chinese](https://www.bilibili.com/video/BV1ry4y1976b/?spm_id_from=333.337.search-card.all.click)\n\n[Segmentation Colab Notebook](https://colab.research.google.com/drive/1yEd5lQMjShloicImtShkwttb74KPGY5U?usp=sharing) | [Detection Colab Notebook](https://colab.research.google.com/drive/1Im7l0dSvEgsP-AJtUOxgbU9a1C3DdSwe?usp=sharing) (thanks [@IamShubhamGupto](https://github.com/IamShubhamGupto), [@dudifrid](https://github.com/dudifrid))\n\n## News\n- `2024/01/19`: Train ViT-Adapter with frozen InternViT-6B, see [here](https://github.com/OpenGVLab/InternVL-MMDetSeg)!\n- `2023/12/23`: 🚀🚀🚀 We release a ViT-based vision foundation model with 6B parameters, see [here](https://github.com/OpenGVLab/InternVL)!\n- `2023/08/31`: 🚀🚀 DINOv2 released the ViT-g-based segmentor with ViT-Adapter, see [here](https://github.com/facebookresearch/dinov2/blob/main/notebooks/semantic_segmentation.ipynb).\n- `2023/07/10`: 🚀 Support the weights of [DINOv2](https://github.com/facebookresearch/dinov2) for object detection, see [here](detection/configs/mask_rcnn/dinov2/)!\n- `2023/06/26`: ViT-Adapter is adopted by the champion 
solution [NVOCC](https://opendrivelab.com/e2ead/AD23Challenge/Track_3_NVOCC.pdf) in Track 3 (3D Occupancy Prediction) of the CVPR 2023 Autonomous Driving Challenge.\n- `2023/06/07`: ViT-Adapter is used by [ONE-PEACE](https://github.com/OFA-Sys/ONE-PEACE), which sets a new SOTA of 63.0 mIoU on ADE20K.\n- `2023/04/14`: ViT-Adapter is used in [EVA](https://arxiv.org/abs/2211.07636) and [DINOv2](https://arxiv.org/abs/2304.07193)!\n- `2023/01/21`: Our paper was accepted by ICLR 2023!\n- `2023/01/17`: We won the [WSDM Cup 2023 Toloka VQA Challenge](/wsdm2023) using ViT-Adapter.\n- `2022/10/20`: ViT-Adapter was adopted by Zhang et al., who ranked 1st in the [UVO Challenge 2022](https://arxiv.org/pdf/2210.09629.pdf).\n- `2022/08/22`: ViT-Adapter was adopted by [BEiT-3](https://github.com/microsoft/unilm/tree/master/beit3), which set a new SOTA of 62.8 mIoU on ADE20K.\n- `2022/06/09`: ViT-Adapter-L achieves 60.4 box AP and 52.5 mask AP on COCO test-dev without Objects365.\n- `2022/06/04`: Code and models are released.\n- `2022/05/12`: ViT-Adapter-L reaches 85.2 mIoU on Cityscapes test set without coarse data.\n- `2022/05/05`: ViT-Adapter-L achieves the SOTA on ADE20K val set with 60.5 mIoU!\n\n\n## Highlights\n\n- ViT-Adapter supports various dense prediction tasks, including `object detection`, `instance segmentation`, `semantic segmentation`, `visual grounding`, `panoptic segmentation`, etc.\n- This codebase includes many SOTA detectors and segmenters to achieve top performance, such as `HTC++`, `Mask2Former`, `DINO`.\n\nhttps://user-images.githubusercontent.com/23737120/208140362-f2029060-eb16-4280-b85f-074006547a12.mp4\n\n\n\n## Abstract\n\nThis work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. 
To address this\nissue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our\nframework is a plain ViT that can learn powerful representations from large-scale\nmulti-modal data. When transferring to downstream tasks, a pre-training-free\nadapter is used to introduce the image-related inductive biases into the model,\nmaking it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields\nstate-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that\nthe ViT-Adapter could serve as an alternative for vision-specific transformers and\nfacilitate future research. The code and models will be released.\n\n## Method\n\n\u003cimg width=\"810\" alt=\"image\" src=\"https://user-images.githubusercontent.com/23737120/217998186-8a37eacb-18f8-445a-8d92-0863e35712ab.png\"\u003e\n\n\u003cimg width=\"810\" alt=\"image\" src=\"https://user-images.githubusercontent.com/23737120/194904786-ea9c40a3-f6ac-4fe1-90ad-976e7b9e8f03.png\"\u003e\n\n## Catalog\n- [ ] Support flash attention\n- [ ] Support faster deformable attention\n- [x] Segmentation checkpoints\n- [x] Segmentation code\n- [x] Detection checkpoints\n- [x] Detection code\n- [x] Initialization\n\n## Awesome Competition Solutions with ViT-Adapter\n\n**[1st Place Solution for the 5th LSVOS Challenge: Video Instance Segmentation](https://arxiv.org/abs/2306.04091)**\n\u003c/br\u003e\nTao Zhang, Xingye Tian, Yikang Zhou, Yuehua Wu, Shunping Ji, Cilin Yan, Xuebo Wang, Xin Tao, Yuanhui Zhang, Pengfei Wan\n\u003c/br\u003e\n[[`Code`](https://github.com/zhang-tao-whu/DVIS)]\n[![Star](https://img.shields.io/github/stars/zhang-tao-whu/DVIS.svg?style=social\u0026label=Star)](https://github.com/zhang-tao-whu/DVIS)\n\u003c/br\u003e\nAugust 28, 2023\n\n**2nd 
place solution in Scene Understanding for Autonomous Drone Delivery (SUADD'23) competition**\n\u003c/br\u003e\nMykola Lavreniuk, Nivedita Rufus, Unnikrishnan R Nair\n\u003c/br\u003e\n[[`Code`](https://github.com/Lavreniuk/2nd-place-solution-in-Scene-Understanding-for-Autonomous-Drone-Delivery)]\n[![Star](https://img.shields.io/github/stars/Lavreniuk/2nd-place-solution-in-Scene-Understanding-for-Autonomous-Drone-Delivery.svg?style=social\u0026label=Star)](https://github.com/Lavreniuk/2nd-place-solution-in-Scene-Understanding-for-Autonomous-Drone-Delivery)\n\u003c/br\u003e\nJuly 18, 2023\n\n**Champion solution in Track 3 (3D Occupancy Prediction) of the CVPR 2023 Autonomous Driving Challenge**\n\u003c/br\u003e\n[FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation](https://arxiv.org/abs/2307.01492)\n\u003c/br\u003e\nZhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, Jose M. Alvarez\n\u003c/br\u003e\n[[`Code`](https://github.com/NVlabs/FB-BEV)]\n[![Star](https://img.shields.io/github/stars/NVlabs/FB-BEV.svg?style=social\u0026label=Star)](https://github.com/NVlabs/FB-BEV)\n\u003c/br\u003e\nJune 26, 2023 \n\n\n**[3rd Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation](https://arxiv.org/abs/2306.06753)**\n\u003c/br\u003e\nJinming Su, Wangwang Yang, Junfeng Luo, Xiaolin Wei\n\u003c/br\u003e\nJune 6, 2023 \n\n**Champion solution in the Video Scene Parsing in the Wild Challenge at CVPR 2023**\n\u003c/br\u003e\n[Semantic Segmentation on VSPW Dataset through Contrastive Loss and Multi-dataset Training Approach](https://arxiv.org/abs/2306.03508)\n\u003c/br\u003e\nMin Yan, Qianxiong Ning, Qian Wang\n\u003c/br\u003e\nJune 3, 2023 \n\n**2nd place in the Video Scene Parsing in the Wild Challenge at CVPR 2023**\n\u003c/br\u003e\n[Recyclable Semi-supervised Method Based on Multi-model Ensemble for Video Scene Parsing](https://arxiv.org/abs/2306.02894)\n\u003c/br\u003e\nBiao Wu, Shaoli Liu, Diankai Zhang, Chengjian 
Zheng, Si Gao, Xiaofeng Zhang, Ning Wang\n\u003c/br\u003e\nJune 2, 2023 \n\n\n**[Champion Solution for the WSDM2023 Toloka VQA Challenge](https://arxiv.org/abs/2301.09045)**\n\u003c/br\u003e\nShengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu\n\u003c/br\u003e\n[[`Code`](https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023)]\n\u003c/br\u003e\nJanuary 9, 2023\n\n**[1st Place Solutions for the UVO Challenge 2022](https://arxiv.org/abs/2210.09629)**\n\u003c/br\u003e\nJiajun Zhang, Boyu Chen, Zhilong Ji, Jinfeng Bai, Zonghai Hu\n\u003c/br\u003e\nOctober 9, 2022\n\n\n## Citation\n\nIf this work is helpful for your research, please consider citing the following BibTeX entry.\n\n```\n@article{chen2022vitadapter,\n  title={Vision Transformer Adapter for Dense Predictions},\n  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},\n  journal={arXiv preprint arXiv:2205.08534},\n  year={2022}\n}\n```\n\n## License\n\nThis repository is released under the Apache 2.0 license as found in the [LICENSE](LICENSE.md) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczczup%2Fvit-adapter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fczczup%2Fvit-adapter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczczup%2Fvit-adapter/lists"}