Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danczs/Visformer
- Host: GitHub
- URL: https://github.com/danczs/Visformer
- Owner: danczs
- Created: 2021-04-12T03:21:28.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-10T05:36:58.000Z (almost 2 years ago)
- Last Synced: 2024-08-04T03:11:38.485Z (4 months ago)
- Language: Python
- Size: 102 KB
- Stars: 132
- Watchers: 7
- Forks: 21
- Open Issues: 6
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_vision_transformer - code
README
# Visformer
![pytorch](https://img.shields.io/badge/pytorch-v1.7.0-green.svg?style=plastic)

## Introduction
This is a PyTorch implementation of the Visformer models. This project is based on the training code in [DeiT](https://github.com/facebookresearch/deit) and the tools in [timm](https://github.com/rwightman/pytorch-image-models).

## Usage
Clone the repository:
```bash
git clone https://github.com/danczs/Visformer.git
```
Install PyTorch, timm, and einops:
```bash
pip install -r requirements.txt
```
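To verify the environment after installation, a quick import check can help (just an illustration; the exact pinned versions live in requirements.txt):
```python
# Quick sanity check that the dependencies installed from requirements.txt import correctly.
import torch
import timm
import einops

print("torch:", torch.__version__)
print("timm:", timm.__version__)
print("einops:", einops.__version__)
print("CUDA available:", torch.cuda.is_available())
```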
## Data Preparation
The expected layout of the ImageNet data:
```bash
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
```
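This is the standard class-per-subdirectory layout consumed by `torchvision.datasets.ImageFolder`, which DeiT-style data pipelines build on. As a quick sanity check of the prepared data (an illustrative sketch, not part of the training code):
```python
import torch
from torchvision import datasets, transforms

# A minimal eval-style transform; the actual training pipeline uses the DeiT-style augmentations.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder expects exactly the class-subdirectory layout shown above.
train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
print("classes found:", len(train_set.classes))

loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
images, labels = next(iter(loader))
print(images.shape)  # expected: torch.Size([64, 3, 224, 224])
```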
## Network Training
Visformer_small
```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save
```

Visformer_tiny
```bash
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save
```

Visformer V2 models
```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model swin_visformer_small_v2 --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model swin_visformer_tiny_v2 --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
```
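The launcher builds the model by name through timm's registry, as in DeiT. For a quick forward-pass check outside distributed training, something like the sketch below should work; it assumes the repo's `models.py` registers the `visformer_*` architectures with timm, and the shapes are illustrative:
```python
import torch
import timm

import models  # noqa: F401  # assumes the repo's models.py registers the visformer_* architectures with timm

# Build the small model with ImageNet-1k settings (1000 classes, 224x224 input).
model = timm.create_model("visformer_small", pretrained=False, num_classes=1000)
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # expected: torch.Size([1, 1000])
```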
Model performance:

| model | top-1 (%) | FLOPs (G) | parameters (M) |
|:-------------------:|:---------:|:---------:|:-------------:|
| Visformer_tiny | 78.6 | 1.3 | 10.3 |
| Visformer_tiny_V2 | 79.6 | 1.3 | 9.4 |
| Visformer_small | 82.2 | 4.9 | 40.2 |
| Visformer_small_V2 | 83.0 | 4.3 | 23.6 |
| Visformer_medium_V2 | 83.6 | 8.5 | 44.5 |

Pre-trained models:
| model | model | log | top-1 (%) |
|:-------------------------------------------------:|:----------:|:---------------------------------------------------------------------------------------------------:|:---------:|
| Visformer_small (original) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/visformer_small.pth) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/log_visformer_small.txt) | 82.21 |
| Visformer_small (+ Swin for downstream tasks) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/swin_visformer_small.pth) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/log_swin_visformer_small.txt) | 82.34 |
| Visformer_small_v2 (+ Swin for downstream tasks) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/swin_visformer_small_v2.pth) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/log_swin_visformer_small_v2.txt) | 83.00 |
| Visformer_medium_v2 (+ Swin for downstream tasks) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/swin_visofrmer_medium.pth) | [github](https://github.com/danczs/Visformer/releases/download/v1.0.0/log_visformer_medium.txt) | 83.62 |

(In some logs, the model is only tested during the last 50 epochs to save training time.)
[More information about Visformer V2](https://arxiv.org/abs/2104.12533).
## Object Detection on COCO
The standard self-attention is not efficient for high-resolution inputs, so we simply replace it with Swin attention for object detection. Therefore, Swin Transformer is our direct baseline.
### Mask R-CNN
| Backbone | sched | box mAP | mask mAP | params | FLOPs | FPS |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T |1x| 42.6 | 39.3 | 48 | 267 | 14.8 |
| Visformer-S | 1x| 43.0 | 39.6 | 60 | 275 | 13.1|
| VisformerV2-S | 1x| 44.8 | 40.7 | 43 | 262 | 15.2 |
|Swin-T |3x + MS| 46.0 | 41.6 | 48 | 367 | 14.8 |
| VisformerV2-S | 3x + MS| 47.8 | 42.5 | 43 | 262 | 15.2 |

### Cascade Mask R-CNN
| Backbone | sched | box mAP | mask mAP | params | FLOPs | FPS |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T |1x + MS| 48.1 | 41.7 | 86 | 745 | 9.5 |
| VisformerV2-S |1x + MS| 49.3 | 42.3 | 81 | 740 | 9.6 |
| Swin-T |3x + MS| 50.5 | 43.7 | 86 | 745 | 9.5 |
| VisformerV2-S |3x + MS| 51.6 | 44.1 | 81 | 740 | 9.6 |

This repo only contains the key files for object detection ('./ObjectDetction'). [Swin-Visformer-Object-Detection](https://github.com/danczs/Swin-Visformer-Object-Detection) is the full detection project.
## Pre-trained Model
Because of our institution's policy, we cannot release the pre-trained models directly. Thankfully, @[hzhang57](https://github.com/hzhang57) and @[developer0hye](https://github.com/developer0hye) provide [Visformer_small](https://drive.google.com/drive/folders/18GpH1SeVOsq3_2QGTA5Z_3O1UFtKugEu?usp=sharing) and [Visformer_tiny](https://drive.google.com/file/d/1LLBGbj7-ok1fDvvMCab-Fn5T3cjTzOKB/view?usp=sharing) models trained by themselves.
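To evaluate one of these checkpoints (or the files from the releases table above) outside of main.py, a minimal loading sketch follows; the `"model"` key layout is an assumption based on DeiT-style checkpoints, so adjust if a given file is a plain state dict:
```python
import torch
import timm

import models  # noqa: F401  # registers the visformer_* architectures with timm (assumption, as above)

model = timm.create_model("visformer_small", pretrained=False, num_classes=1000)

# DeiT-style checkpoints usually wrap the weights under a "model" key; fall back to a plain state dict.
checkpoint = torch.load("visformer_small.pth", map_location="cpu")
state_dict = checkpoint["model"] if "model" in checkpoint else checkpoint
model.load_state_dict(state_dict)
model.eval()
```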
## Automatic Mixed Precision (amp)

In the original version of Visformer, amp can cause NaN values. We find that the overflow comes from the attention matrix:
```python
scale = head_dim ** -0.5
attn = ( q @ k.transpose(-2,-1) ) * scale
```
To avoid overflow, we pre-normalize q and k, which normalizes 'attn' by 'head_dim' overall instead of 'head_dim ** 0.5':
```python
scale = head_dim ** -0.5
attn = (q * scale) @ (k.transpose(-2,-1) * scale)
```
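In fp16 the largest representable value is 65504, so the raw q @ kᵀ product can overflow before the scale is applied, whereas pre-scaling both operands makes every entry of the matmul output head_dim times smaller. A small illustrative comparison of the two formulations (not repo code; run in fp32 just to compare magnitudes):
```python
import torch

torch.manual_seed(0)
head_dim = 64
scale = head_dim ** -0.5

# Moderately large activations, as can appear during training; fp32 here just to compare magnitudes.
q = torch.randn(1, 6, 196, head_dim) * 8
k = torch.randn(1, 6, 196, head_dim) * 8

# Original formulation: the un-scaled product q @ k^T is what has to fit into fp16 (max 65504).
raw = q @ k.transpose(-2, -1)
print("max |q @ k^T| before scaling:", raw.abs().max().item())

# Pre-normalized formulation: q and k are each scaled by head_dim ** -0.5 first,
# so every entry of the matmul result is head_dim times smaller than above.
pre = (q * scale) @ (k.transpose(-2, -1) * scale)
print("max |(q * scale) @ (k^T * scale)|:", pre.abs().max().item())
```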
Amp training:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
```
This change won't degrade the training performance.

Using amp for the original pre-trained models:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --eval --resume /path/to/weights --amp
```

## Citing
```bibtex
@inproceedings{chen2021visformer,
title={Visformer: The vision-friendly transformer},
author={Chen, Zhengsu and Xie, Lingxi and Niu, Jianwei and Liu, Xuefeng and Wei, Longhui and Tian, Qi},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={589--598},
year={2021}
}
```