# [CVPR 2023] Official Implementation of VideoMAE V2

> [**VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking**](https://arxiv.org/abs/2303.16727)<br>
> [Limin Wang](http://wanglimin.github.io/), [Bingkun Huang](https://github.com/congee524), [Zhiyu Zhao](https://github.com/JerryFlymi), [Zhan Tong](https://scholar.google.com/citations?user=6FsgWBMAAAAJ), [Yinan He](https://dblp.org/pid/93/7763.html), [Yi Wang](https://scholar.google.com.hk/citations?hl=zh-CN&user=Xm2M8UwAAAAJ), [Yali Wang](https://scholar.google.com/citations?user=hD948dkAAAAJ), and [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl)<br>
> Nanjing University, Shanghai AI Lab, CAS

## News
**[2024.09.19]** Checkpoints have been migrated to Hugging Face. You can obtain weights from [VideoMAEv2-hf](https://huggingface.co/OpenGVLab/VideoMAE2/tree/main).
**[2023.05.29]** VideoMAE V2-g features for the THUMOS14 and FineAction datasets are now available; see [TAD.md](docs/TAD.md).
**[2023.05.11]** Testing of our distilled models is now supported in MMAction2 (dev version)! See [PR#2460](https://github.com/open-mmlab/mmaction2/pull/2460).
**[2023.05.11]** The feature extraction script for TAD datasets has been released! See instructions at [TAD.md](docs/TAD.md).
**[2023.04.19]** ViT-giant model weights have been released! You can get the download links from [MODEL_ZOO.md](docs/MODEL_ZOO.md).
**[2023.04.18]** Code and the distilled models (vit-s & vit-b) have been released!
**[2023.04.03]** ~~Code and models will be released soon.~~

## Model Zoo
Model weights are now available in [MODEL_ZOO.md](docs/MODEL_ZOO.md), including the distilled models summarized in the table below.
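The checkpoints are also mirrored on Hugging Face (see the 2024.09.19 news item above). As a minimal sketch, a checkpoint can be fetched programmatically with `huggingface_hub`; the filename below is a hypothetical placeholder, so check the actual file listing at [VideoMAEv2-hf](https://huggingface.co/OpenGVLab/VideoMAE2/tree/main) before running this.

```python
import torch
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Repo id taken from the news item above; the filename is a placeholder --
# check https://huggingface.co/OpenGVLab/VideoMAE2/tree/main for real names.
ckpt_path = hf_hub_download(
    repo_id="OpenGVLab/VideoMAE2",
    filename="vit_b_k710_dl_from_giant.pth",  # hypothetical
)

# Checkpoints are plain PyTorch files; load on CPU first. Note that the
# state dict may be nested under a key such as "module" or "model".
state = torch.load(ckpt_path, map_location="cpu")
print(type(state), list(state)[:5])
```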
| Model | Dataset | Teacher Model | \#Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 |
| :-----: | :-----: | :-----------: | :-----: | :--------: | :--------: | :--------: |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 |

## Installation
Please follow the instructions in [INSTALL.md](docs/INSTALL.md).
## Data Preparation
Please follow the instructions in [DATASET.md](docs/DATASET.md) for data preparation.
## Pre-training
The pre-training instruction is in [PRETRAIN.md](docs/PRETRAIN.md).
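For orientation: the paper's dual masking scheme applies tube masking to the encoder input (the same random spatial mask repeated across every temporal slot, at a very high ratio such as 90%), and additionally has the decoder reconstruct only a subset of tokens rather than all of them, which is what makes billion-parameter pre-training affordable. Below is a minimal sketch of tube mask generation only; it is not the repository's actual implementation (see [PRETRAIN.md](docs/PRETRAIN.md) and the masking code in this repo for that).

```python
import torch

def tube_mask(t_tokens: int, hw_tokens: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Sample one 2D spatial mask and repeat it across all temporal token
    slots, so the same patches are hidden in every frame ("tubes").

    Returns a bool tensor of shape (t_tokens * hw_tokens,); True = masked.
    """
    num_masked = int(mask_ratio * hw_tokens)
    # Random permutation of spatial positions; mask the first num_masked.
    perm = torch.randperm(hw_tokens)
    mask_2d = torch.zeros(hw_tokens, dtype=torch.bool)
    mask_2d[perm[:num_masked]] = True
    # Repeat the same spatial mask for every temporal slot.
    return mask_2d.repeat(t_tokens)

# Example: 16 frames with temporal patch size 2 -> 8 temporal slots;
# a 224x224 input with 16x16 patches -> 14 * 14 = 196 spatial positions.
mask = tube_mask(t_tokens=8, hw_tokens=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean())  # torch.Size([1568]), ~0.9
```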
## Fine-tuning
The fine-tuning instruction is in [FINETUNE.md](docs/FINETUNE.md).
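The `16x5x3` entry in the Model Zoo table denotes the test protocol used for the reported accuracies: 16 frames per clip, 5 temporal clips, and 3 spatial crops, with the 15 clip-level predictions averaged into one video-level score. A minimal sketch of that aggregation, assuming a generic `model` callable that maps a batch of clips to logits (placeholder, not this repo's evaluation code):

```python
import torch

@torch.no_grad()
def video_level_logits(model, views: torch.Tensor) -> torch.Tensor:
    """Average clip-level predictions into a video-level score.

    views: (num_clips * num_crops, C, T, H, W) -- e.g. 5 * 3 = 15 views,
    each with T = 16 frames, matching the `16x5x3` protocol above.
    """
    logits = model(views)      # (15, num_classes)
    return logits.mean(dim=0)  # (num_classes,)

# Usage sketch: `model` and `views` stand in for a fine-tuned
# VideoMAE V2 classifier and the 15 sampled views of one video:
#   pred = video_level_logits(model, views).argmax().item()
```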
## Citation
If you find this repository useful, please cite it with one of the following BibTeX entries.
```latex
@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
    title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
    author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
    year          = {2023},
    eprint        = {2303.16727},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}
```