# A Close Look at Spatial Modeling: From Attention to Convolution [[arXiv]](https://arxiv.org/abs/2212.12552)

by [Xu Ma](https://ma-xu.github.io/), [Huan Wang](http://huanwang.tech/), [Can Qin](https://canqin.tech/), [Kunpeng Li](https://kunpengli1994.github.io/), [Xingchen Zhao](https://www.xingchenzhao.com/), [Jie Fu](https://bigaidream.github.io/), [Yun Fu](http://www1.ece.neu.edu/~yunfu/)






----

## Motivation



Figure 1: Attention map visualizations of Vision Transformers. For each pair, we show the **query point** and its corresponding **attention map** (from the last block and last head). Images and query points were randomly selected for illustration. The color bar on the right indicates the values of the normalized attention maps.

:eyes: :bangbang: **Observations & Motivations:**
* :small_orange_diamond: **`Query-irrelevant behavior.`** The attention maps consistently show query-irrelevant (and even head-irrelevant) behavior. Visually, the attention maps appear nearly identical for each tested model and image, regardless of the query patch. This departs from the design philosophy of self-attention, in which each patch should exhibit a distinct attention map, and indicates that a global context may be concealed behind the attention mechanism.
* :small_orange_diamond: **`Sparse attention & Convolution helps.`** The attention weights (see ViT-B, ViT-L, and DeiT-B) are relatively sparse, indicating that only a few patches dominate the attention. By introducing knowledge from convolution, the attention weights (see DeiT-B-Distill) are largely smoothed, and performance improves significantly as well (83.4% top-1 accuracy for DeiT-B-Distill vs. 81.8% for DeiT-B on the ImageNet-1K validation set).

## Solution: From Attention to Convolution



Figure 2: Illustration of an FCViT block. Following MetaFormer, FCViT treats each block as a combination of a token-mixer and a channel-mixer, with residual connections and layer normalization (LN). In the token-mixer, we dynamically integrate the global context with the input tokens via the token-global similarity. A depth-wise convolution is employed to fuse local information. To improve the generalization ability of the global context, we introduce a competition-driven information bottleneck structure.
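For intuition, here is a minimal PyTorch sketch of such a block, based only on the caption above. The module names, the cosine-similarity formulation, the sigmoid gating, and the bottleneck ratio are assumptions for illustration, not the authors' exact implementation:

```python
# Illustrative sketch of a MetaFormer-style block as described in Figure 2.
# Names, the similarity formulation, and the bottleneck ratio are assumptions,
# not the official FCViT code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FCViTBlockSketch(nn.Module):
    def __init__(self, dim, mlp_ratio=4, bottleneck_ratio=0.25):
        super().__init__()
        hidden = int(dim * bottleneck_ratio)
        self.norm1 = nn.LayerNorm(dim)
        # Depth-wise convolution to fuse local information.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Information bottleneck applied to the global context (assumed form).
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixer (standard MLP).
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def token_mixer(self, x):
        # x: (B, N, C) tokens on an HxW grid.
        B, N, C = x.shape
        H = W = int(N ** 0.5)
        global_ctx = self.bottleneck(x.mean(dim=1, keepdim=True))                    # (B, 1, C)
        sim = torch.sigmoid(F.cosine_similarity(x, global_ctx.expand_as(x), dim=-1))  # token-global similarity, (B, N)
        x = x + sim.unsqueeze(-1) * global_ctx                                        # inject global context per token
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x)                                                            # local fusion
        return x.flatten(2).transpose(1, 2)

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```

A quick shape check: `FCViTBlockSketch(64)(torch.randn(2, 196, 64))` returns a tensor of shape `(2, 196, 64)`.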



Figure 3: Visual comparisons of the `FCViT-B12 similarity` and the `ViT-B attention map`. We plot all outputs of the last block for both models (8 groups for FCViT and 12 heads for ViT). Compared to ViT, the results indicate that: **1) FCViT focuses more on the objects**; **2) FCViT presents more diversity than multi-head attention**, whose attention maps from different heads are nearly identical.

----

## Image Classification
### 1. Requirements

torch>=1.7.0; torchvision>=0.8.0; pyyaml; [apex-amp](https://github.com/NVIDIA/apex) (if you want to use fp16); [timm](https://github.com/rwightman/pytorch-image-models) (`pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a`)
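
For convenience, the dependencies above can be installed as follows (apex-amp is optional and only needed for fp16 training):

```bash
pip install "torch>=1.7.0" "torchvision>=0.8.0" pyyaml
pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a
```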

Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this [script](https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4).

```
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
```
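
A quick sanity check of this layout can be run with `torchvision.datasets.ImageFolder`; the snippet below is only an illustrative sketch and is not part of this repo:

```python
# Sanity-check sketch (not part of this repo): confirm the ImageNet folders are readable.
from torchvision import datasets

train_set = datasets.ImageFolder("/path/to/imagenet/train")
val_set = datasets.ImageFolder("/path/to/imagenet/val")
print(f"train: {len(train_set)} images, {len(train_set.classes)} classes")
print(f"val:   {len(val_set)} images, {len(val_set.classes)} classes")
```

Both splits should report 1000 classes.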

### 2. FCViT Models

| Model | #Params | Image resolution | Top-1 Acc. (%) | Download |
| :--- | :---: | :---: | :---: | :---: |
| FCViT-tiny | 4.6M | 224 | 74.9 | [download](https://drive.google.com/drive/folders/1YSa8tkXkUQT94mgo-L7q4pv5KiHRTAaR?usp=sharing) |
| FCViT-B12 | 14M | 224 | 80.9 | [download](https://drive.google.com/drive/folders/1QuyalIGhJeD2pxcVxR0_gJNrk2mZ8WEb?usp=sharing) |
| FCViT-B24 | 25.7M | 224 | 82.5 | [download](https://drive.google.com/drive/folders/1II2v1rhNe9sgLJtoSR-cgSh2mpPQMt4t?usp=sharing) |
| FCViT-B48 | 49.1M | 224 | 83.6 | [download](https://drive.google.com/drive/folders/16joP1cQwbx4oICL-WbPC1SqGNoc4-NZm?usp=sharing) |
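
The downloaded checkpoints can also be loaded programmatically. The sketch below follows common timm conventions; the `models` import, the registered name `fcvit_tiny`, and the `'state_dict'` key are assumptions about this repo rather than verified details:

```python
# Illustrative sketch: the model factory name and checkpoint format are assumptions.
import torch
from timm import create_model

import models  # assumed: importing this repo's models registers the FCViT variants with timm

model = create_model("fcvit_tiny", pretrained=False)
ckpt = torch.load("/path/to/checkpoint.pth.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # assumption: weights may be nested under 'state_dict'
model.load_state_dict(state_dict)
model.eval()
```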

### 3. Validation

To evaluate our FCViT models, run:

```bash
MODEL=fcvit_tiny #{tiny, b12, b24, b48}
python3 validate.py /path/to/imagenet --model $MODEL -b 128 --checkpoint {/path/to/checkpoint}
```

### 4. Train
We show how to train FCViT on 8 GPUs. The learning rate scales with the total batch size as lr = bs / 1024 * 1e-3.
For convenience, with a total batch size of 1024 the learning rate is set to 1e-3 (for a batch size of 1024, setting the learning rate to 2e-3 sometimes yields better performance).

```bash
MODEL=fcvit_tiny # fcvit_{tiny, b12, b24, b48}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.1, 0.2] corresponding to models [tiny, b12, b24, b48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
--model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp
```
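
As a concrete instance of the lr = bs / 1024 * 1e-3 rule above, changing the number of GPUs only changes the total batch size and hence `--lr`; for example (values derived from the rule, not separately tuned):

```bash
# Example: 4 GPUs x 128 images/GPU = 512 total batch size, so lr = 512 / 1024 * 1e-3 = 5e-4.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./distributed_train.sh 4 /path/to/imagenet \
--model $MODEL -b 128 --lr 5e-4 --drop-path $DROP_PATH --apex-amp
```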

### 5. Detection and Segmentation

For detection and segmentation tasks, please see here: [[detection & instance segmentation]](./detection) and [[semantic segmentation]](./segmentation).

----

## Acknowledgment
Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful works.

[poolformer](https://github.com/sail-sg/poolformer), [pytorch-image-models](https://github.com/rwightman/pytorch-image-models), [mmdetection](https://github.com/open-mmlab/mmdetection), [mmsegmentation](https://github.com/open-mmlab/mmsegmentation).

----

## Citation
```
@article{ma2022fcvit,
  author  = {Ma, Xu and Wang, Huan and Qin, Can and Li, Kunpeng and Zhao, Xingchen and Fu, Jie and Fu, Yun},
  title   = {A Close Look at Spatial Modeling: From Attention to Convolution},
  journal = {arXiv preprint arXiv:2212.12552},
  year    = {2022},
}
```