[CVPR'23] AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
https://github.com/wgcban/adamae
- Host: GitHub
- URL: https://github.com/wgcban/adamae
- Owner: wgcban
- License: MIT
- Created: 2022-11-15T03:01:29.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-02T14:44:41.000Z (9 months ago)
- Last Synced: 2024-05-13T22:52:48.490Z (6 months ago)
- Language: Python
- Homepage: https://www.wgcban.com/research/adamae
- Size: 19.9 MB
- Stars: 63
- Watchers: 3
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# [CVPR'23] *Ada*MAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
:book: Paper: [`CVPR'23`](https://openaccess.thecvf.com/content/CVPR2023/papers/Bandara_AdaMAE_Adaptive_Masking_for_Efficient_Spatiotemporal_Learning_With_Masked_Autoencoders_CVPR_2023_paper.pdf) and [``arXiv``](https://arxiv.org/abs/2211.09120v1)
Our paper (AdaMAE) has been accepted for presentation at CVPR'23.
### :bulb: Contributions:
- We propose *Ada*MAE, a novel, adaptive, and end-to-end trainable token sampling strategy for MAEs that takes into account the spatiotemporal properties of all input tokens to sample fewer but informative tokens (a minimal sampling sketch follows this list).
- We empirically show that *Ada*MAE samples more tokens from high-spatiotemporal-information regions of the input, resulting in meaningful representations for downstream tasks.
- We demonstrate the efficiency of *Ada*MAE in terms of performance and GPU memory against random *patch*, *tube*, and *frame* sampling by conducting a thorough ablation study on the SSv2 dataset.
- We show that our *Ada*MAE outperforms the state-of-the-art (SOTA) by $0.7\%$ and $1.1\%$ top-1 accuracy on $SSv2$ and $Kinetics-400$, respectively.
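The sampling idea in the first bullet can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name ``adaptive_sample``, the token count, and the masking ratio are placeholders for the example.

```python
import torch

def adaptive_sample(probs: torch.Tensor, num_visible: int):
    """probs: (B, N) categorical probabilities over all N spatiotemporal tokens."""
    # Sampling *without replacement* favours high-information regions while
    # keeping the visible set diverse.
    visible_idx = torch.multinomial(probs, num_visible, replacement=False)
    visible_idx, _ = visible_idx.sort(dim=-1)
    # Boolean mask over all tokens: True = masked (to be reconstructed), False = visible.
    mask = torch.ones_like(probs, dtype=torch.bool)
    mask.scatter_(-1, visible_idx, False)
    return visible_idx, mask

# Example (assumed numbers): 1568 tokens for a 16-frame 224x224 clip with
# 2x16x16 tubes, keeping 5% of tokens visible.
probs = torch.softmax(torch.randn(2, 1568), dim=-1)
visible_idx, mask = adaptive_sample(probs, num_visible=int(0.05 * 1568))
```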
### Method
![mask-vis-1](figs/adamae-intro-fig.jpeg)

### Adaptive mask visualizations from $SSv2$ (samples from the 50th epoch)
*(Visualization grid, two clips per row, each with columns: Video, Pred., Error, CAT, Mask.)*
### Adaptive mask visualizations from $K400$ (samples from the 50th epoch)
*(Visualization grid, two clips per row, each with columns: Video, Pred., Error, CAT, Mask.)*
### A comparison
Comparison of our adaptive masking with the existing random *patch*, *tube*, and *frame* masking at a masking ratio of 80\%. Our adaptive masking selects more tokens from regions with high spatiotemporal information and only a small number of tokens from the background.
![mask-type-comp](figs/adamae-mask-types.jpeg)
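For contrast with the figure above, here is a hedged sketch of random *tube* masking in the VideoMAE style, where one random spatial mask is drawn per clip and repeated across every temporal slice regardless of content. The shapes and the 80\% ratio follow the figure; the function name and grid size are illustrative assumptions.

```python
import torch

def random_tube_mask(batch, t, h, w, mask_ratio=0.8):
    """Return a (B, t*h*w) boolean mask where True = masked."""
    num_spatial = h * w
    num_masked = int(mask_ratio * num_spatial)
    # One random permutation of spatial positions per clip; the first
    # `num_masked` positions are masked, independent of video content.
    ids = torch.rand(batch, num_spatial).argsort(dim=-1)
    spatial_mask = torch.zeros(batch, num_spatial, dtype=torch.bool)
    spatial_mask.scatter_(-1, ids[:, :num_masked], True)
    # The same spatial mask is copied to every temporal index -> a "tube".
    return spatial_mask.unsqueeze(1).expand(batch, t, num_spatial).reshape(batch, -1)

tube_mask = random_tube_mask(batch=2, t=8, h=14, w=14, mask_ratio=0.8)
```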
## Ablation experiments on SSv2 dataset:
We use ViT-Base as the backbone for all experiments. MHA $(D=2, d=384)$ denotes our adaptive token sampling network with a depth of two and an embedding dimension of $384$. All pre-trained models are evaluated following the evaluation protocol described in Sec. 4. The default choice of our *Ada*MAE is highlighted in gray. The GPU memory consumption is reported for a batch size of 16 on a single GPU.
![ssv2-ablations](figs/adamae-ablations.png)
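For readers who want a concrete picture of the MHA $(D=2, d=384)$ entry, the following is a rough PyTorch sketch of a depth-two, 384-dimensional attention stack that maps token embeddings to a categorical distribution over tokens. The layer types, input dimension, and head count are assumptions for illustration; the actual module is defined in this repository.

```python
import torch
import torch.nn as nn

class TokenSampler(nn.Module):
    """Assigns one sampling probability to every spatiotemporal token."""
    def __init__(self, in_dim=768, d=384, depth=2, heads=6):
        super().__init__()
        self.proj = nn.Linear(in_dim, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=4 * d,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.score = nn.Linear(d, 1)

    def forward(self, tokens):               # tokens: (B, N, in_dim) patch embeddings
        x = self.blocks(self.proj(tokens))
        logits = self.score(x).squeeze(-1)    # (B, N): one score per token
        return logits.softmax(dim=-1)         # categorical sampling probabilities

probs = TokenSampler()(torch.randn(2, 1568, 768))  # could feed adaptive_sample(...) above
```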
# Pre-training *Ada*MAE & fine-tuning:
- We closely follow the [VideoMAE](https://github.com/MCG-NJU/VideoMAE.git) pre-training recipe, but with our *adaptive masking* instead of *tube masking*. To pre-train *Ada*MAE, please follow the steps in [``DATASET.md``](readme/DATASET.md) and [``PRETRAIN.md``](readme/PRETRAIN.md).
- To check the performance of pre-trained *Ada*MAE, please follow the steps in [``DATASET.md``](readme/DATASET.md) and [``FINETUNE.md``](readme/FINETUNE.md).
- To set up the conda environment, please refer to [``INSTALL.md``](readme/INSTALL.md).
# Pre-trained model weights
- Download the pre-trained model weights for SSv2 and K400 datasets [``here``](https://github.com/wgcban/adamae/releases/tag/v1).
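A minimal, hedged example of loading one of those checkpoints with plain PyTorch is shown below; the file name and the key under which the state dict is stored are assumptions, so adjust them to the actual release asset.

```python
import torch

# Hypothetical file name; use the asset downloaded from the v1 release page.
ckpt = torch.load("adamae_ssv2_pretrain.pth", map_location="cpu")

# Checkpoints are often nested under a key such as "model" or "module";
# fall back to the raw object if it is already a plain state dict.
state_dict = ckpt.get("model", ckpt.get("module", ckpt)) if isinstance(ckpt, dict) else ckpt

# `model` would be the ViT-Base backbone built from this repository;
# strict=False tolerates heads that exist only at fine-tuning time.
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
```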
## Acknowledgement:
Our AdaMAE codebase is based on the implementation of the VideoMAE paper. We thank the authors of [VideoMAE](https://github.com/MCG-NJU/VideoMAE.git) for making their code available to the public.

## Citation:
```
@InProceedings{Bandara_2023_CVPR,
author = {Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M.},
title = {AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14507-14517}
}
```