{"id":20305529,"url":"https://github.com/vitae-transformer/vitdet","last_synced_at":"2025-10-25T12:33:48.593Z","repository":{"id":44429758,"uuid":"482198566","full_name":"ViTAE-Transformer/ViTDet","owner":"ViTAE-Transformer","description":"Unofficial implementation for [ECCV'22] \"Exploring Plain Vision Transformer Backbones for Object Detection\"","archived":false,"fork":false,"pushed_at":"2022-04-24T06:11:54.000Z","size":8696,"stargazers_count":574,"open_issues_count":18,"forks_count":45,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-05T02:29:13.426Z","etag":null,"topics":["deep-learning","object-detection","pytorch","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ViTAE-Transformer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-04-16T08:27:16.000Z","updated_at":"2025-08-29T11:22:20.000Z","dependencies_parsed_at":"2022-08-12T11:10:53.999Z","dependency_job_id":null,"html_url":"https://github.com/ViTAE-Transformer/ViTDet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ViTAE-Transformer/ViTDet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FViTDet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FViTDet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FViTDet/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FViTDet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ViTAE-Transformer","download_url":"https://codeload.github.com/ViTAE-Transformer/ViTDet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FViTDet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279007499,"owners_count":26084313,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","object-detection","pytorch","vision-transformer"],"created_at":"2024-11-14T17:08:50.274Z","updated_at":"2025-10-11T14:12:30.588Z","avatar_url":"https://github.com/ViTAE-Transformer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"left\"\u003eUnofficial PyTorch Implementation of Exploring Plain Vision Transformer Backbones for Object Detection\u003ca href=\"https://arxiv.org/abs/2203.16527\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg\" \u003e\u003c/a\u003e\u003c/h1\u003e \n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#Results\"\u003eResults\u003c/a\u003e |\n  \u003ca href=\"#Updates\"\u003eUpdates\u003c/a\u003e |\n  \u003ca 
href=\"#Usage\"\u003eUsage\u003c/a\u003e |\n  \u003ca href='#Todo'\u003eTodo\u003c/a\u003e |\n  \u003ca href=\"#Acknowledge\"\u003eAcknowledge\u003c/a\u003e\n\u003c/p\u003e\n\nThis branch contains the **unofficial** PyTorch implementation of \u003ca href=\"https://arxiv.org/abs/2203.16527\"\u003eExploring Plain Vision Transformer Backbones for Object Detection\u003c/a\u003e. Thanks to the authors for their wonderful work!\n\n## Results from this repo on COCO\n\nThe models are trained on 4 A100 machines with 2 images per GPU, which makes a batch size of 64 during training.\n\n| Model | Pretrain | Machine | Framework | Box mAP | Mask mAP | config | log | weight |\n| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |\n| ViT-Base | IN1K+MAE | TPU | Mask RCNN | 51.1 | 45.5 | [config](./configs/ViTDet/ViTDet-ViT-Base-100e.py) | [log](logs/ViT-Base-TPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQuegyG-Z3FH2LDP?e=9ij98g) |\n| ViT-Base | IN1K+MAE | GPU | Mask RCNN | 51.1 | 45.4 | [config](./configs/ViTDet/ViTDet-ViT-Base-100e.py) | [log](logs/ViT-Base-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgRA7Y9s2rA5NC4wn?e=QfpKJf) |\n| [ViTAE-Base](https://arxiv.org/abs/2202.10108) | IN1K+MAE | GPU | Mask RCNN | 51.6 | 45.8 | [config](configs/ViTDet/ViTDet-ViTAE-Base-100e.py) | [log](logs/ViTAE-Base-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQ--Ez4mzEnO-G5Y?e=ACfLxC) |\n| [ViTAE-Small](https://arxiv.org/abs/2202.10108) | IN1K+Sup | GPU | Mask RCNN | 45.6 | 40.1 | [config](configs/ViTDet/ViTDet-ViTAE-Small-100e.py) | [log](logs/ViTAE-S-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQ7PorGY53K6gIGd?e=lw81U5) |\n\n## Updates\n\n\u003e [2022-04-18] Explore using small 1K supervised trained models (20M parameters) for ViTDet (**45.6 mAP**). 
The results with multi-stage structures are **46.0 mAP** for [Swin-T](https://github.com/SwinTransformer/Swin-Transformer-Object-Detection) and **47.8 mAP** for [ViTAEv2-S](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Object-Detection) with Mask RCNN on COCO.\n\n\u003e [2022-04-17] Release the pretrained weights and logs for ViT-B and ViTAE-B on MS COCO. The models are trained entirely with PyTorch on GPU.\n\n\u003e [2022-04-16] Release the initial unofficial implementation of ViTDet with ViT-Base model! It obtains 51.1 mAP and 45.5 mAP on detection and segmentation, respectively. The weights and logs will be uploaded soon. \n\n\u003e Applications of ViTAE Transformer include: [image classification](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Image-Classification) | [object detection](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Object-Detection) | [semantic segmentation](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Semantic-Segmentation) | [animal pose estimation](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Animal-Pose-Estimation) | [remote sensing](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing) | [matting](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Matting)\n\n## Usage\n\nWe use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments.\n```bash\ngit clone https://github.com/open-mmlab/mmcv.git\ncd mmcv\ngit checkout v1.3.9\nMMCV_WITH_OPS=1 pip install -e .\ncd ..\ngit clone https://github.com/ViTAE-Transformer/ViTDet.git\ncd ViTDet\npip install -v -e .\n```\n\nAfter installing the two repos, install timm and einops:\n```bash\npip install timm==0.4.9 einops\n```\n\nDownload the pretrained models from [MAE](https://github.com/facebookresearch/mae) or [ViTAE](https://github.com/ViTAE-Transformer/ViTAE-Transformer), and then run the experiments with\n\n```bash\n# for single machine\nbash tools/dist_train.sh 
\u003cConfig PATH\u003e \u003cNUM GPUs\u003e --cfg-options model.pretrained=\u003cPretrained PATH\u003e\n\n# for multiple machines\npython -m torch.distributed.launch --nnodes \u003cNum Machines\u003e --node_rank \u003cRank of Machine\u003e --nproc_per_node \u003cGPUs Per Machine\u003e --master_addr \u003cMaster Addr\u003e --master_port \u003cMaster Port\u003e tools/train.py \u003cConfig PATH\u003e --cfg-options model.pretrained=\u003cPretrained PATH\u003e --launcher pytorch\n```\n\n## Todo\n\nThis repo currently contains the following modifications:\n\n- using LN for the convolutions in RPN and heads\n- using large scale jitter for augmentation\n- using RPE from MViT\n- using longer training epochs and 1024 test size\n- using global attention layers\n\nThere are other things to do:\n\n- [ ] Implement the conv blocks for global information communication\n\n- [ ] Tune the models for Cascade RCNN \n\n- [ ] Train ViT models for the LVIS dataset\n\n- [ ] Train ViTAE model with the ViTDet framework\n\n## Acknowledge\nWe acknowledge the excellent implementations from [mmdetection](https://github.com/open-mmlab/mmdetection), [MAE](https://github.com/facebookresearch/mae), [MViT](https://github.com/facebookresearch/mvit), and [BEiT](https://github.com/microsoft/unilm/tree/master/beit).\n\n## Citing ViTDet\n```\n@article{Li2022ExploringPV,\n  title={Exploring Plain Vision Transformer Backbones for Object Detection},\n  author={Yanghao Li and Hanzi Mao and Ross B. 
Girshick and Kaiming He},\n  journal={ArXiv},\n  year={2022},\n  volume={abs/2203.16527}\n}\n```\n\nFor ViTAE and ViTAEv2, please refer to:\n```\n@article{xu2021vitae,\n  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},\n  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},\n  journal={Advances in Neural Information Processing Systems},\n  volume={34},\n  year={2021}\n}\n\n@article{zhang2022vitaev2,\n  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},\n  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},\n  journal={arXiv preprint arXiv:2202.10108},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitae-transformer%2Fvitdet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvitae-transformer%2Fvitdet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitae-transformer%2Fvitdet/lists"}