https://github.com/vitae-transformer/vitdet

Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object Detection"
deep-learning object-detection pytorch vision-transformer


Host: GitHub
URL: https://github.com/vitae-transformer/vitdet
Owner: ViTAE-Transformer
License: apache-2.0
Created: 2022-04-16T08:27:16.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-04-24T06:11:54.000Z (about 4 years ago)
Last Synced: 2025-09-05T02:29:13.426Z (10 months ago)
Topics: deep-learning, object-detection, pytorch, vision-transformer
Language: Python
Homepage:
Size: 8.29 MB
Stars: 574
Watchers: 4
Forks: 45
Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

README

          
# Unofficial PyTorch Implementation of Exploring Plain Vision Transformer Backbones for Object Detection

Results | Updates | Usage | Todo | Acknowledge

This branch contains the **unofficial** PyTorch implementation of "Exploring Plain Vision Transformer Backbones for Object Detection". Thanks to the authors for their wonderful work!

## Results from this repo on COCO

The models are trained on 4 A100 machines with 2 images per GPU, for a total batch size of 64 during training.
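The batch-size figure can be checked with a line of arithmetic. This is a minimal sketch, assuming 8 GPUs per machine: the README does not state that number, but it is the value consistent with 4 machines, 2 images per GPU, and a batch size of 64. The function name is illustrative only.

```python
def global_batch_size(machines: int, gpus_per_machine: int, images_per_gpu: int) -> int:
    """Total number of images processed per optimizer step across all GPUs."""
    return machines * gpus_per_machine * images_per_gpu


# 4 machines and 2 images per GPU come from the README;
# 8 GPUs per machine is an assumption inferred from the stated batch size.
print(global_batch_size(machines=4, gpus_per_machine=8, images_per_gpu=2))  # 64
```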

| Model | Pretrain | Machine | Framework | Box mAP | Mask mAP | config | log | weight |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| ViT-Base | IN1K+MAE | TPU | Mask RCNN | 51.1 | 45.5 | [config](./configs/ViTDet/ViTDet-ViT-Base-100e.py) | [log](logs/ViT-Base-TPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQuegyG-Z3FH2LDP?e=9ij98g) |
| ViT-Base | IN1K+MAE | GPU | Mask RCNN | 51.1 | 45.4 | [config](./configs/ViTDet/ViTDet-ViT-Base-100e.py) | [log](logs/ViT-Base-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgRA7Y9s2rA5NC4wn?e=QfpKJf) |
| [ViTAE-Base](https://arxiv.org/abs/2202.10108) | IN1K+MAE | GPU | Mask RCNN | 51.6 | 45.8 | [config](configs/ViTDet/ViTDet-ViTAE-Base-100e.py) | [log](logs/ViTAE-Base-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQ--Ez4mzEnO-G5Y?e=ACfLxC) |
| [ViTAE-Small](https://arxiv.org/abs/2202.10108) | IN1K+Sup | GPU | Mask RCNN | 45.6 | 40.1 | [config](configs/ViTDet/ViTDet-ViTAE-Small-100e.py) | [log](logs/ViTAE-S-GPU.log.json) | [OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgQ7PorGY53K6gIGd?e=lw81U5) |
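To compare a run against the table, the linked `.log.json` files can be scanned for the best validation epoch. The sketch below assumes the logs follow the usual mmdetection layout: one JSON object per line, with validation records carrying `"mode": "val"`, `bbox_mAP`, and `segm_mAP` (typically as fractions, so 0.511 corresponds to 51.1). The helper name and the sample records are hypothetical and not taken from the real logs.

```python
import json


def best_val_record(lines):
    """Return the validation record with the highest bbox_mAP, or None if absent."""
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip training records and header lines that carry no validation metrics.
        if record.get("mode") != "val" or "bbox_mAP" not in record:
            continue
        if best is None or record["bbox_mAP"] > best["bbox_mAP"]:
            best = record
    return best


# Synthetic records for illustration only.
sample = [
    '{"mode": "train", "epoch": 1, "loss": 1.23}',
    '{"mode": "val", "epoch": 1, "bbox_mAP": 0.30, "segm_mAP": 0.28}',
    '{"mode": "val", "epoch": 2, "bbox_mAP": 0.35, "segm_mAP": 0.32}',
]
best = best_val_record(sample)
print(best["epoch"], best["bbox_mAP"], best["segm_mAP"])
```

For a real file, pass an open file handle instead of the sample list, for example `best_val_record(open("logs/ViT-Base-GPU.log.json"))`.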

## Updates

> [2022-04-18] Explored using a small model (20M parameters) with supervised ImageNet-1K pretraining for ViTDet (**45.6 mAP**). For comparison, the results with multi-stage structures are **46.0 mAP** for [Swin-T](https://github.com/SwinTransformer/Swin-Transformer-Object-Detection) and **47.8 mAP** for [ViTAEv2-S](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Object-Detection) with Mask RCNN on COCO.

> [2022-04-17] Released the pretrained weights and logs for ViT-B and ViTAE-B on MS COCO. The models are trained entirely with PyTorch on GPU.

> [2022-04-16] Release the initial unofficial implementation of ViTDet with ViT-Base model! It obtains 51.1 mAP and 45.5 mAP on detection and segmentation, respectively. The weights and logs will be uploaded soon. 

> Applications of ViTAE Transformer include: [image classification](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Image-Classification) | [object detection](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Object-Detection) | [semantic segmentation](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Semantic-Segmentation) | [animal pose estimation](https://github.com/ViTAE-Transformer/ViTAE-Transformer/tree/main/Animal-Pose-Estimation) | [remote sensing](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing) | [matting](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Matting)

## Usage

We use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments.

```bash
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTDet.git
cd ViTDet
pip install -v -e .
```

After installing the two repos, install timm and einops:

```bash
pip install timm==0.4.9 einops
```
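Because the experiments pin specific versions, it can help to check the environment before training. The following is a minimal sketch rather than part of the repo. The version numbers come from the README, while the distribution names are assumptions (a source build of mmcv with ops may register as `mmcv-full`, for instance), and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions stated in the README; the distribution names are assumptions.
REQUIRED = {"torch": "1.9.0", "mmcv-full": "1.3.9", "timm": "0.4.9"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(required, lookup=installed_version):
    """Return {name: (wanted, found)} for every package that does not match."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        # Drop local build tags such as "+cu111" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in check_versions(REQUIRED).items():
    print(f"{name}: want {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package matched. Pre-release builds, such as those shipped in some containers, will be reported as mismatches and need a manual look.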

Download the pretrained models from [MAE](https://github.com/facebookresearch/mae) or [ViTAE](https://github.com/ViTAE-Transformer/ViTAE-Transformer), and then conduct the experiments by

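```bash
# Replace each ${...} placeholder with your own value.

# for single machine
bash tools/dist_train.sh ${CONFIG_FILE} ${NUM_GPUS} --cfg-options model.pretrained=${PRETRAINED_PATH}

# for multiple machines
python -m torch.distributed.launch --nnodes ${NUM_MACHINES} --node_rank ${NODE_RANK} --nproc_per_node ${GPUS_PER_MACHINE} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} tools/train.py ${CONFIG_FILE} --cfg-options model.pretrained=${PRETRAINED_PATH} --launcher pytorch
```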
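For multi-machine runs, the same command has to be issued on every node with a different `--node_rank`. The sketch below assembles that command per node so the ranks stay consistent. The flags are the ones shown above; the function name, address, port, node count, and checkpoint path are hypothetical values chosen for illustration.

```python
import shlex


def build_launch_cmd(nnodes, node_rank, nproc_per_node, master_addr, master_port,
                     config, pretrained):
    """Assemble the multi-machine launch command for one node as an argv list."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "tools/train.py", config,
        "--cfg-options", f"model.pretrained={pretrained}",
        "--launcher", "pytorch",
    ]


# Hypothetical setup: 4 nodes, 8 GPUs each, made-up address and checkpoint path.
for rank in range(4):
    cmd = build_launch_cmd(4, rank, 8, "10.0.0.1", 29500,
                           "configs/ViTDet/ViTDet-ViT-Base-100e.py",
                           "pretrained/mae_vit_base.pth")
    print(shlex.join(cmd))
```

Each printed line is the command for one machine; only the `--node_rank` value differs between them.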

## Todo

This repo currently contains modifications including:

- using LN for the convolutions in RPN and heads
- using large-scale jitter for augmentation
- using RPE from MViT
- using longer training epochs and 1024 test size
- using global attention layers
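As an illustration of the large-scale jitter item in the list above, the sketch below computes the resized shape and the crop or pad needed to reach a fixed square input. It is a sketch under stated assumptions, not the repo's pipeline: the 0.1 to 2.0 scale range is a commonly used setting, the 1024 target mirrors the test size mentioned above, and both function names are hypothetical.

```python
import random


def lsj_resized_shape(height, width, target=1024, scale_range=(0.1, 2.0), rng=random):
    """Sample a jitter scale and return (new_h, new_w, scale) before crop/pad."""
    scale = rng.uniform(*scale_range)
    # Fit the image inside a (scale * target) square while keeping its aspect ratio.
    factor = min(scale * target / height, scale * target / width)
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return new_h, new_w, scale


def crop_or_pad_amount(size, target=1024):
    """Return (crop, pad) in pixels needed along one axis to reach `target`."""
    return max(size - target, 0), max(target - size, 0)


# Seeded generator so repeated runs sample the same scale.
new_h, new_w, scale = lsj_resized_shape(480, 640, rng=random.Random(0))
print(new_h, new_w, crop_or_pad_amount(new_h), crop_or_pad_amount(new_w))
```

Scales above 1 produce an image larger than the target, which is then cropped; scales below 1 produce a smaller image, which is padded.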

There are other things to do:

- [ ] Implement the conv blocks for global information communication

- [ ] Tune the models for Cascade RCNN 

- [ ] Train ViT models for the LVIS dataset

- [ ] Train ViTAE model with the ViTDet framework

## Acknowledge

We acknowledge the excellent implementations from [mmdetection](https://github.com/open-mmlab/mmdetection), [MAE](https://github.com/facebookresearch/mae), [MViT](https://github.com/facebookresearch/mvit), and [BEiT](https://github.com/microsoft/unilm/tree/master/beit).

## Citing ViTDet

```
@article{Li2022ExploringPV,
  title={Exploring Plain Vision Transformer Backbones for Object Detection},
  author={Yanghao Li and Hanzi Mao and Ross B. Girshick and Kaiming He},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.16527}
}
```

For ViTAE and ViTAEv2, please refer to:

```
@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
```
