# [CVPR 2023] Official Implementation of VideoMAE V2

> [**VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking**](https://arxiv.org/abs/2303.16727)<br>
> [Limin Wang](http://wanglimin.github.io/), [Bingkun Huang](https://github.com/congee524), [Zhiyu Zhao](https://github.com/JerryFlymi), [Zhan Tong](https://scholar.google.com/citations?user=6FsgWBMAAAAJ), [Yinan He](https://dblp.org/pid/93/7763.html), [Yi Wang](https://scholar.google.com.hk/citations?hl=zh-CN&user=Xm2M8UwAAAAJ), [Yali Wang](https://scholar.google.com/citations?user=hD948dkAAAAJ), and [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl)<br>
> Nanjing University, Shanghai AI Lab, CAS

## News
**[2024.09.19]** Checkpoints have been migrated to Hugging Face. You can obtain weights from [VideoMAEv2-hf](https://huggingface.co/OpenGVLab/VideoMAE2/tree/main).
**[2023.05.29]** VideoMAE V2-g features for the THUMOS14 and FineAction datasets are now available; see [TAD.md](docs/TAD.md).
**[2023.05.11]** Testing of our distilled models is now supported in MMAction2 (dev version)! See [PR#2460](https://github.com/open-mmlab/mmaction2/pull/2460).
**[2023.05.11]** The feature extraction script for TAD datasets has been released! See instructions at [TAD.md](docs/TAD.md).
**[2023.04.19]** ViT-giant model weights have been released! You can get the download links from [MODEL_ZOO.md](docs/MODEL_ZOO.md).
**[2023.04.18]** Code and the distilled models (vit-s & vit-b) have been released!
**[2023.04.03]** ~~Code and models will be released soon.~~

## Model Zoo
Model weights are now available in [MODEL_ZOO.md](docs/MODEL_ZOO.md), including the distilled models summarized in the table below.
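The checkpoints are also mirrored on Hugging Face (see the 2024.09.19 news item above). As a minimal sketch, a checkpoint can be fetched programmatically with `huggingface_hub`; the filename below is a hypothetical placeholder, so check the actual file listing at [VideoMAEv2-hf](https://huggingface.co/OpenGVLab/VideoMAE2/tree/main) before running this.

```python
import torch
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Repo id taken from the news item above; the filename is a placeholder --
# check https://huggingface.co/OpenGVLab/VideoMAE2/tree/main for real names.
ckpt_path = hf_hub_download(
    repo_id="OpenGVLab/VideoMAE2",
    filename="vit_b_k710_dl_from_giant.pth",  # hypothetical
)

# Checkpoints are plain PyTorch files; load on CPU first. Note that the
# state dict may be nested under a key such as "module" or "model".
state = torch.load(ckpt_path, map_location="cpu")
print(type(state), list(state)[:5])
```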
| Model | Dataset | Teacher Model | \#Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 |
| :-----: | :-----: | :-----------: | :-----: | :--------: | :--------: | :--------: |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 |

## Installation
Please follow the instructions in [INSTALL.md](docs/INSTALL.md).
## Data Preparation
Please follow the instructions in [DATASET.md](docs/DATASET.md) for data preparation.
## Pre-training
The pre-training instruction is in [PRETRAIN.md](docs/PRETRAIN.md).
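For orientation: the paper's dual masking scheme applies tube masking to the encoder input (the same random spatial mask repeated across every temporal slot, at a very high ratio such as 90%), and additionally has the decoder reconstruct only a subset of tokens rather than all of them, which is what makes billion-parameter pre-training affordable. Below is a minimal sketch of tube mask generation only; it is not the repository's actual implementation (see [PRETRAIN.md](docs/PRETRAIN.md) and the masking code in this repo for that).

```python
import torch

def tube_mask(t_tokens: int, hw_tokens: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Sample one 2D spatial mask and repeat it across all temporal token
    slots, so the same patches are hidden in every frame ("tubes").

    Returns a bool tensor of shape (t_tokens * hw_tokens,); True = masked.
    """
    num_masked = int(mask_ratio * hw_tokens)
    # Random permutation of spatial positions; mask the first num_masked.
    perm = torch.randperm(hw_tokens)
    mask_2d = torch.zeros(hw_tokens, dtype=torch.bool)
    mask_2d[perm[:num_masked]] = True
    # Repeat the same spatial mask for every temporal slot.
    return mask_2d.repeat(t_tokens)

# Example: 16 frames with temporal patch size 2 -> 8 temporal slots;
# a 224x224 input with 16x16 patches -> 14 * 14 = 196 spatial positions.
mask = tube_mask(t_tokens=8, hw_tokens=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean())  # torch.Size([1568]), ~0.9
```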
## Fine-tuning
The fine-tuning instruction is in [FINETUNE.md](docs/FINETUNE.md).
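The `16x5x3` entry in the Model Zoo table denotes the test protocol used for the reported accuracies: 16 frames per clip, 5 temporal clips, and 3 spatial crops, with the 15 clip-level predictions averaged into one video-level score. A minimal sketch of that aggregation, assuming a generic `model` callable that maps a batch of clips to logits (placeholder, not this repo's evaluation code):

```python
import torch

@torch.no_grad()
def video_level_logits(model, views: torch.Tensor) -> torch.Tensor:
    """Average clip-level predictions into a video-level score.

    views: (num_clips * num_crops, C, T, H, W) -- e.g. 5 * 3 = 15 views,
    each with T = 16 frames, matching the `16x5x3` protocol above.
    """
    logits = model(views)      # (15, num_classes)
    return logits.mean(dim=0)  # (num_classes,)

# Usage sketch: `model` and `views` stand in for a fine-tuned
# VideoMAE V2 classifier and the 15 sampled views of one video:
#   pred = video_level_logits(model, views).argmax().item()
```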
## Citation
If you find this repository useful, please cite it with one of the following BibTeX entries.
```latex
@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
    title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
    author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
    year          = {2023},
    eprint        = {2303.16727},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}
```