
# CoaT: Co-Scale Conv-Attentional Image Transformers

## Introduction
This repository contains the official code and pretrained models for [CoaT: Co-Scale Conv-Attentional Image Transformers](http://arxiv.org/abs/2104.06399). It introduces (1) a co-scale mechanism that enables fine-to-coarse, coarse-to-fine, and cross-scale attention modeling, and (2) an efficient conv-attention module that realizes relative position encoding within factorized attention.

*(Figure: Model Accuracy)*

For more details, please refer to [CoaT: Co-Scale Conv-Attentional Image Transformers](http://arxiv.org/abs/2104.06399) by [Weijian Xu*](https://weijianxu.com/), [Yifan Xu*](https://yfxu.com/), [Tyler Chang](https://tylerachang.github.io/), and [Zhuowen Tu](https://pages.ucsd.edu/~ztu/).
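To make the conv-attention idea concrete, below is a minimal PyTorch sketch of factorized attention combined with a depth-wise-convolutional relative position term. It is *not* the repository implementation (which additionally handles class tokens, multi-scale convolutional position encodings, and the co-scale interaction across branches); all module and tensor names here are illustrative.

```python
# Minimal sketch of conv-attention (illustrative only, not the repo's code).
import torch
import torch.nn as nn

class FactorizedConvAttention(nn.Module):
    def __init__(self, dim, num_heads=8, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depth-wise convolution over the value map provides the relative position term.
        self.pos_conv = nn.Conv2d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens on an h x w grid, N == h * w (class token omitted for brevity).
        B, N, C = x.shape
        q, k, v = (self.qkv(x)
                   .reshape(B, N, 3, self.num_heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))  # each: (B, heads, N, head_dim)

        # Factorized attention: softmax over keys, then aggregate values (linear in N).
        k = k.softmax(dim=2)
        context = k.transpose(-2, -1) @ v          # (B, heads, head_dim, head_dim)
        factor_att = (q * self.scale) @ context    # (B, heads, N, head_dim)

        # Convolutional relative position encoding: depth-wise conv on V, gated by Q.
        v_img = v.transpose(2, 3).reshape(B, C, h, w)
        pos = (self.pos_conv(v_img)
               .reshape(B, self.num_heads, self.head_dim, N)
               .transpose(2, 3))                   # (B, heads, N, head_dim)

        out = factor_att + q * pos
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```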

## Performance
1. Classification (ImageNet dataset)

| Name | Acc@1 | Acc@5 | #Params |
| --- | --- | --- | --- |
| CoaT-Lite Tiny | 77.5 | 93.8 | 5.7M |
| CoaT-Lite Mini | 79.1 | 94.5 | 11M |
| CoaT-Lite Small | 81.9 | 95.5 | 20M |
| CoaT-Lite Medium | 83.6 | 96.7 | 45M |
| CoaT Tiny | 78.3 | 94.0 | 5.5M |
| CoaT Mini | 81.0 | 95.2 | 10M |
| CoaT Small | 82.1 | 96.1 | 22M |

2. Instance Segmentation (Mask R-CNN w/ FPN on COCO dataset)

| Name | Schedule | Bbox AP | Segm AP |
| --- | --- | --- | --- |
| CoaT-Lite Mini | 1x | 41.4 | 38.0 |
| CoaT-Lite Mini | 3x | 42.9 | 38.9 |
| CoaT-Lite Small | 1x | 45.2 | 40.7 |
| CoaT-Lite Small | 3x | 45.7 | 41.1 |
| CoaT Mini | 1x | 45.1 | 40.6 |
| CoaT Mini | 3x | 46.5 | 41.8 |
| CoaT Small | 1x | 46.5 | 41.8 |
| CoaT Small | 3x | 49.0 | 43.7 |

3. Object Detection (Deformable-DETR on COCO dataset)

| Name | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- |
| CoaT-Lite Small | 47.0 | 66.5 | 51.2 | 28.8 | 50.3 | 63.3 |
| CoaT Small | 48.4 | 68.5 | 52.4 | 30.1 | 51.8 | 63.8 |

## Changelog
12/12/2021: Code and pre-trained checkpoints for Deformable-DETR with CoaT Small backbone are released.

12/07/2021: Training commands for CoaT-Lite Medium (384x384) are released.

12/06/2021: Pre-trained checkpoints for CoaT-Lite Medium (384x384) are released.

12/05/2021: Training scripts for CoaT Small and CoaT-Lite Medium are released.

09/27/2021: Code and pre-trained checkpoints for instance segmentation with MMDetection are released.

08/27/2021: Pre-trained checkpoints for CoaT Small and CoaT-Lite Medium are released.

05/19/2021: Pre-trained checkpoints for Mask R-CNN benchmark with CoaT-Lite Small backbone are released.

05/19/2021: Code and pre-trained checkpoints for Deformable-DETR with CoaT-Lite Small backbone are released.

05/11/2021: Pre-trained checkpoints for CoaT-Lite Small are released.

05/09/2021: Pre-trained checkpoints for Mask R-CNN benchmark with CoaT Mini backbone are released.

05/06/2021: Pre-trained checkpoints for CoaT Mini are released.

05/02/2021: Pre-trained checkpoints for CoaT Tiny are released.

04/25/2021: Code and pre-trained checkpoints for Mask R-CNN benchmark with CoaT-Lite Mini backbone are released.

04/23/2021: Pre-trained checkpoints for CoaT-Lite Mini are released.

04/22/2021: Code and pre-trained checkpoints for CoaT-Lite Tiny are released.

## Usage
The following usage instructions cover the classification task with CoaT models. For other tasks, please follow the corresponding README, such as [instance segmentation](./tasks/mmdet/README.md) and [object detection](./tasks/Deformable-DETR/README.md).

### Environment Preparation
1. Set up a new conda environment and activate it.
```bash
# Create an environment with Python 3.8.
conda create -n coat python==3.8
conda activate coat
```

2. Install required packages.
```bash
# Install PyTorch 1.7.1 w/ CUDA 11.0.
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

# Install timm 0.3.2.
pip install timm==0.3.2

# Install einops.
pip install einops
```

### Code and Dataset Preparation
1. Clone the repo.
```bash
git clone https://github.com/mlpc-ucsd/CoaT
cd CoaT
```

2. Download the ImageNet dataset (ILSVRC 2012) and extract it. An optional sanity check follows the commands below.
```bash
# Create dataset folder.
mkdir -p ./data/ImageNet

# Download the dataset (not shown here) and copy the files (assume the download path is in $DATASET_PATH).
cp $DATASET_PATH/ILSVRC2012_img_train.tar $DATASET_PATH/ILSVRC2012_img_val.tar $DATASET_PATH/ILSVRC2012_devkit_t12.tar.gz ./data/ImageNet

# Extract the dataset.
python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='train')"
python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='val')"
# After the extraction, you should observe `train` and `val` folders under ./data/ImageNet.
```
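If you want to double-check the extraction, here is a minimal sketch using torchvision, mirroring the one-liners above:

```python
# Optional sanity check (sketch): confirm the extracted ImageNet split loads.
from torchvision.datasets import ImageNet

val = ImageNet('./data/ImageNet', split='val')
print(len(val), 'validation images')   # ILSVRC 2012 val has 50,000 images
print(len(val.classes), 'classes')     # expect 1,000
```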

### Evaluate Pre-trained Checkpoint

We provide the CoaT checkpoints pre-trained on the ImageNet dataset.

| Name | Acc@1 | Acc@5 | #Params | SHA-256 (first 8 chars) | URL |
| --- | --- | --- | --- | --- | --- |
| CoaT-Lite Tiny | 77.5 | 93.8 | 5.7M | e88e96b0 |[model](https://vcl.ucsd.edu/coat/pretrained/coat_lite_tiny_e88e96b0.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_lite_tiny_e88e96b0.txt) |
| CoaT-Lite Mini | 79.1 | 94.5 | 11M | 6b4a8ae5 |[model](https://vcl.ucsd.edu/coat/pretrained/coat_lite_mini_6b4a8ae5.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_lite_mini_6b4a8ae5.txt) |
| CoaT-Lite Small | 81.9 | 95.5 | 20M | 8d362f48 |[model](https://vcl.ucsd.edu/coat/pretrained/coat_lite_small_8d362f48.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_lite_small_8d362f48.txt) |
| CoaT-Lite Medium | 83.6 | 96.7 | 45M | a750cd63 |[model](https://vcl.ucsd.edu/coat/pretrained/coat_lite_medium_a750cd63.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_lite_medium_a750cd63.txt) |
| CoaT-Lite Medium (384x384) | 84.5 | 97.1 | 45M | f9129688 |[model](https://vcl.ucsd.edu/coat/pretrained/coat_lite_medium_384x384_f9129688.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_lite_medium_384x384_f9129688.txt) |
| CoaT Tiny | 78.3 | 94.0 | 5.5M | c6efc33c |[model](https://vcl.ucsd.edu/coat/pretrained/coat_tiny_c6efc33c.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_tiny_c6efc33c.txt) |
| CoaT Mini | 81.0 | 95.2 | 10M | 40667eec |[model](https://vcl.ucsd.edu/coat/pretrained/coat_mini_40667eec.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_mini_40667eec.txt) |
| CoaT Small | 82.1 | 96.1 | 22M | 7479cf9b |[model](https://vcl.ucsd.edu/coat/pretrained/coat_small_7479cf9b.pth), [log](https://vcl.ucsd.edu/coat/pretrained/coat_small_7479cf9b.txt) |

The following commands provide an example (CoaT-Lite Tiny) of evaluating a pre-trained checkpoint.
```bash
# Download the pretrained checkpoint.
mkdir -p ./output/pretrained
wget http://vcl.ucsd.edu/coat/pretrained/coat_lite_tiny_e88e96b0.pth -P ./output/pretrained
sha256sum ./output/pretrained/coat_lite_tiny_e88e96b0.pth # Make sure it matches the SHA-256 hash (first 8 characters) in the table.

# Evaluate.
# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_pretrained ./output/pretrained/coat_lite_tiny_e88e96b0.pth
# It should output results similar to "Acc@1 77.504 Acc@5 93.814" at the very end.
```
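If you prefer to run a quick single-image inference in Python instead of the evaluation script, the sketch below shows one way to do it. It makes several assumptions (adjust to your checkout): the repo's CoaT model definitions are importable and registered with timm under names like `coat_lite_tiny`, and the checkpoint stores its weights under a DeiT-style `model` key; the import path is a placeholder, not a documented API.

```python
# Hedged sketch: minimal single-image inference with a downloaded checkpoint.
import torch
import timm
from PIL import Image
from torchvision import transforms

import src.models.coat  # placeholder import that registers CoaT variants with timm

model = timm.create_model('coat_lite_tiny', pretrained=False)
ckpt = torch.load('./output/pretrained/coat_lite_tiny_e88e96b0.pth', map_location='cpu')
model.load_state_dict(ckpt.get('model', ckpt))  # DeiT-style checkpoints keep weights under 'model'
model.eval()

# Standard ImageNet preprocessing for 224x224 evaluation.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=-1)
print(probs.topk(5))
```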

**Note**: For CoaT-Lite Medium with 384x384 input, we use the following command for evaluation:
```bash
# Evaluation command for CoaT-Lite Medium (384x384).
bash ./scripts/eval_extra_args.sh coat_lite_medium coat_lite_medium_384x384_pretrained ./output/pretrained/coat_lite_medium_384x384_f9129688.pth --batch-size 128 --input-size 384
```

### Train
The following commands provide an example (CoaT-Lite Tiny, 8 GPUs) of training a CoaT model.
```bash
# Usage: bash ./scripts/train.sh [model name] [output folder]
bash ./scripts/train.sh coat_lite_tiny coat_lite_tiny
```
**Note**: Some training hyperparameters for CoaT Small and CoaT-Lite Medium are different from the default settings:
```bash
# Training command for CoaT Small.
bash ./scripts/train_extra_args.sh coat_small coat_small --batch-size 128 --drop-path 0.2 --no-model-ema --warmup-epochs 20 --clip-grad 5.0

# Training command for CoaT-Lite Medium.
bash ./scripts/train_extra_args.sh coat_lite_medium coat_lite_medium --batch-size 128 --drop-path 0.3 --no-model-ema --warmup-epochs 20 --clip-grad 5.0

# Training command for CoaT-Lite Medium (384x384).
bash ./scripts/train_extra_args.sh coat_lite_medium coat_lite_medium_384x384 \
--resume ./output/pretrained/coat_lite_medium_a750cd63.pth \
--resume_only_state \
--batch-size 32 \
--drop-path 0.2 \
--no-model-ema \
--warmup-epochs 0 \
--clip-grad 5.0 \
--input-size 384 \
--lr 5e-6 \
--min-lr 5e-6 \
--weight-decay 1e-8 \
--epochs 6 \
--save_freq 1
```

### Evaluate
The following commands provide an example (CoaT-Lite Tiny) of evaluating a checkpoint after training.
```bash
# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_eval ./output/coat_lite_tiny/checkpoints/checkpoint0299.pth
```

## Citation
```
@InProceedings{Xu_2021_ICCV,
    author    = {Xu, Weijian and Xu, Yifan and Chang, Tyler and Tu, Zhuowen},
    title     = {Co-Scale Conv-Attentional Image Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {9981-9990}
}
```

## License
This repository is released under the Apache License 2.0. The license can be found in the [LICENSE](LICENSE) file.

## Acknowledgment
Thanks to [DeiT](https://github.com/facebookresearch/deit) and [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) for their clear and data-efficient implementations of [ViT](https://openreview.net/forum?id=YicbFdNTTy). Thanks also to [lucidrains' implementation](https://github.com/lucidrains/lambda-networks) of [Lambda Networks](https://openreview.net/forum?id=xTJEN-ggl1b) and to [CPVT](https://github.com/Meituan-AutoML/CPVT).