{"id":13429510,"url":"https://github.com/MCG-NJU/VideoMAE","last_synced_at":"2025-03-16T03:31:47.241Z","repository":{"id":37305286,"uuid":"473059611","full_name":"MCG-NJU/VideoMAE","owner":"MCG-NJU","description":"[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training","archived":false,"fork":false,"pushed_at":"2023-12-08T13:44:48.000Z","size":560,"stargazers_count":1354,"open_issues_count":42,"forks_count":136,"subscribers_count":16,"default_branch":"main","last_synced_at":"2024-10-27T07:32:24.676Z","etag":null,"topics":["action-recognition","mae","masked-autoencoder","neurips-2022","pytorch","self-supervised-learning","transformer","video-analysis","video-representation-learning","video-transformer","video-understanding","vision-transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2203.12602","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MCG-NJU.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-23T06:15:50.000Z","updated_at":"2024-10-26T19:12:24.000Z","dependencies_parsed_at":"2023-02-18T06:01:07.338Z","dependency_job_id":null,"html_url":"https://github.com/MCG-NJU/VideoMAE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCG-NJU%2FVideoMAE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCG-NJU%2FVideoMAE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCG-NJU%2FVideoMAE/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/MCG-NJU%2FVideoMAE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MCG-NJU","download_url":"https://codeload.github.com/MCG-NJU/VideoMAE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822310,"owners_count":20353496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["action-recognition","mae","masked-autoencoder","neurips-2022","pytorch","self-supervised-learning","transformer","video-analysis","video-representation-learning","video-transformer","video-understanding","vision-transformer"],"created_at":"2024-07-31T02:00:40.980Z","updated_at":"2025-03-16T03:31:46.902Z","avatar_url":"https://github.com/MCG-NJU.png","language":"Python","funding_links":[],"categories":["2 Foundation Models","其他_机器视觉","Python","MIM for CV Downstream Tasks","多模态预训练 (Multimodal Pre-training)","Frameworks and Libraries","Video and Long-Context Multimodality"],"sub_categories":["2.2 Vision Foundation Models","网络服务_其他","Video Rrepresentation","视频 (Video)","Video Action Recognition","Models and systems"],"readme":"# Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).\n\n![VideoMAE Framework](figs/videomae.jpg)\n\n[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)\u003cbr\u003e\n[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/models?other=videomae)[![Hugging Face 
Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayakpaul/video-classification-ucf101-subset)[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/video_classification.ipynb)\u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-masked-autoencoders-are-data-1/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=videomae-masked-autoencoders-are-data-1)\u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-masked-autoencoders-are-data-1/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=videomae-masked-autoencoders-are-data-1)\u003cbr\u003e[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-masked-autoencoders-are-data-1/action-recognition-on-ava-v2-2)](https://paperswithcode.com/sota/action-recognition-on-ava-v2-2?p=videomae-masked-autoencoders-are-data-1)\u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-masked-autoencoders-are-data-1/self-supervised-action-recognition-on-ucf101)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-ucf101?p=videomae-masked-autoencoders-are-data-1)\u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-masked-autoencoders-are-data-1/self-supervised-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-hmdb51?p=videomae-masked-autoencoders-are-data-1)\n\n\u003e [**VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training**](https://arxiv.org/abs/2203.12602)\u003cbr\u003e\n\u003e [Zhan 
Tong](https://github.com/yztongzhan), [Yibing Song](https://ybsong00.github.io/), [Jue Wang](https://juewang725.github.io/), [Limin Wang](http://wanglimin.github.io/)\u003cbr\u003eNanjing University, Tencent AI Lab\n\n## 📰 News\n**[2023.4.18]** 🎈Everyone can download **Kinetics-400**, which is used in VideoMAE, from [this link](https://opendatalab.com/Kinetics-400).\u003cbr\u003e\n**[2023.4.18]** Code and pre-trained models of [VideoMAE V2](https://arxiv.org/abs/2303.16727) have been released! Check and enjoy this [repo](https://github.com/OpenGVLab/VideoMAEv2)!\u003cbr\u003e\n**[2023.4.17]** We propose **[EVAD](https://arxiv.org/abs/2304.08451)**, an **end-to-end Video Action Detection** framework.\u003cbr\u003e\n**[2023.2.28]** Our [VideoMAE V2](https://arxiv.org/abs/2303.16727) is accepted by **CVPR 2023**! 🎉\u003cbr\u003e\n**[2023.1.16]** Code and pre-trained models for **Action Detection** in VideoMAE are [available](https://github.com/MCG-NJU/VideoMAE-Action-Detection)! \u003cbr\u003e\n**[2022.12.27]** 🎈Everyone can download extracted **VideoMAE** features of **THUMOS**, **ActivityNet**, **HACS** and **FineAction** from [InternVideo](https://github.com/OpenGVLab/InternVideo/tree/main/Downstream/Temporal-Action-Localization#to-reproduce-our-results-of-internvideo).\u003cbr\u003e\n**[2022.11.20]** 👀 VideoMAE is integrated into [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayakpaul/video-classification-ucf101-subset) and [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/video_classification.ipynb), supported by [@Sayak Paul](https://github.com/sayakpaul).\u003cbr\u003e\n**[2022.10.25]** 👀 VideoMAE is integrated into [MMAction2](https://github.com/open-mmlab/mmaction2/tree/dev-1.x/configs/recognition/videomae), the results  on Kinetics-400 can be reproduced successfully. 
\u003cbr\u003e\n**[2022.10.20]** The pre-trained models and scripts of **ViT-S** and **ViT-H** are available! \u003cbr\u003e\n**[2022.10.19]** The pre-trained models and scripts on **UCF101** are [available](MODEL_ZOO.md#UCF101)! \u003cbr\u003e\n**[2022.9.15]** VideoMAE is accepted by **NeurIPS 2022** as a **spotlight** presentation! 🎉 \u003cbr\u003e\n**[2022.8.8]** 👀 VideoMAE is integrated into **official** [🤗HuggingFace Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) now! [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/models?other=videomae)\u003cbr\u003e\n**[2022.7.7]**  We have updated new results on downstream AVA 2.2 benchmark. Please refer to our [paper](https://arxiv.org/abs/2203.12602) for details. \u003cbr\u003e\n**[2022.4.24]**  Code and pre-trained models are available now! \u003cbr\u003e\n**[2022.3.24]** ~~Code and pre-trained models will be released here.~~ Welcome to **watch** this repository for the latest updates.\n\n## ✨ Highlights\n\n### 🔥 Masked Video Modeling for Video Pre-Training\n\nVideoMAE performs the task of masked video modeling for video pre-training. We propose the **extremely high** masking ratio (90%-95%) and **tube masking** strategy to create a challenging task for self-supervised video pre-training.\n\n### ⚡️ A Simple, Efficient and Strong Baseline in SSVP\n\nVideoMAE uses the simple masked autoencoder and **plain ViT** backbone to perform video self-supervised learning. Due to the extremely high masking ratio, the pre-training time of VideoMAE is **much shorter** than contrastive learning methods (**3.2x** speedup). 
VideoMAE can serve as **a simple but strong baseline** for future research in self-supervised video pre-training.\n\n### 😮 High performance, but NO extra data required\n\nVideoMAE works well for video datasets of different scales and can achieve **87.4%** on Kinetics-400, **75.4%** on Something-Something V2, **91.3%** on UCF101, and **62.6%** on HMDB51. To the best of our knowledge, VideoMAE is the **first** to achieve state-of-the-art performance on these four popular benchmarks with **vanilla ViT** backbones **without needing** any extra data or pre-trained models.\n\n## 🚀 Main Results\n\n### ✨ Something-Something V2\n\n|  Method  | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |\n| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |\n| VideoMAE |  ***no***  |  ViT-S   |  224x224   |         16x2x3          | 66.8  | 90.3  |\n| VideoMAE |  ***no***  |  ViT-B   |  224x224   |         16x2x3          | 70.8  | 92.4  |\n| VideoMAE |  ***no***  |  ViT-L   |  224x224   |         16x2x3          | 74.3  | 94.6  |\n| VideoMAE |  ***no***  |  ViT-L   |  224x224   |         32x1x3          | 75.4  | 95.2  |\n\n### ✨ Kinetics-400\n\n|  Method  | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |\n| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |\n| VideoMAE |  ***no***  |  ViT-S   |  224x224   |         16x5x3          | 79.0  | 93.8  |\n| VideoMAE |  ***no***  |  ViT-B   |  224x224   |         16x5x3          | 81.5  | 95.1  |\n| VideoMAE |  ***no***  |  ViT-L   |  224x224   |         16x5x3          | 85.2  | 96.8  |\n| VideoMAE |  ***no***  |  ViT-H   |  224x224   |         16x5x3          | 86.6  | 97.1  |\n| VideoMAE |  ***no***  |  ViT-L   |  320x320   |         32x4x3          | 86.1  | 97.3  |\n| VideoMAE |  ***no***  |  ViT-H   |  320x320   |         32x4x3          | 87.4  | 97.6  |\n\n### ✨ AVA 2.2\n\nPlease check the code 
and checkpoints in [VideoMAE-Action-Detection](https://github.com/MCG-NJU/VideoMAE-Action-Detection).\n|  Method  |  Extra Data  | Extra Label | Backbone | #Frame x Sample Rate | mAP  |\n| :------: | :----------: | :---------: | :------: | :------------------: | :--: |\n| VideoMAE | Kinetics-400 |   \u0026cross;   |  ViT-S   |         16x4         | 22.5 |\n| VideoMAE | Kinetics-400 |   \u0026check;   |  ViT-S   |         16x4         | 28.4 |\n| VideoMAE | Kinetics-400 |   \u0026cross;   |  ViT-B   |         16x4         | 26.7 |\n| VideoMAE | Kinetics-400 |   \u0026check;   |  ViT-B   |         16x4         | 31.8 |\n| VideoMAE | Kinetics-400 |   \u0026cross;   |  ViT-L   |         16x4         | 34.3 |\n| VideoMAE | Kinetics-400 |   \u0026check;   |  ViT-L   |         16x4         | 37.0 |\n| VideoMAE | Kinetics-400 |   \u0026cross;   |  ViT-H   |         16x4         | 36.5 |\n| VideoMAE | Kinetics-400 |   \u0026check;   |  ViT-H   |         16x4         | 39.5 |\n| VideoMAE | Kinetics-700 |   \u0026cross;   |  ViT-L   |         16x4         | 36.1 |\n| VideoMAE | Kinetics-700 |   \u0026check;   |  ViT-L   |         16x4         | 39.3 |\n\n### ✨ UCF101 \u0026 HMDB51\n\n|  Method  |  Extra Data  | Backbone | UCF101 | HMDB51 |\n| :------: | :----------: | :------: | :----: | :----: |\n| VideoMAE |   ***no***   |  ViT-B   |  91.3  |  62.6  |\n| VideoMAE | Kinetics-400 |  ViT-B   |  96.1  |  73.3  |\n\n## 🔨 Installation\n\nPlease follow the instructions in [INSTALL.md](INSTALL.md).\n\n## ➡️ Data Preparation\n\nPlease follow the instructions in [DATASET.md](DATASET.md) for data preparation.\n\n## 🔄 Pre-training\n\nThe pre-training instruction is in [PRETRAIN.md](PRETRAIN.md).\n\n## ⤴️ Fine-tuning with pre-trained models\n\nThe fine-tuning instruction is in [FINETUNE.md](FINETUNE.md).\n\n## 📍Model Zoo\n\nWe provide pre-trained and fine-tuned models in [MODEL_ZOO.md](MODEL_ZOO.md).\n\n## 👀 Visualization\n\nWe provide the script for visualization in 
[`vis.sh`](vis.sh).  Colab notebook for better visualization is coming soon.\n\n## ☎️ Contact \n\nZhan Tong: tongzhan@smail.nju.edu.cn\n\n## 👍 Acknowledgements\n\nThanks to [Ziteng Gao](https://sebgao.github.io/), Lei Chen, [Chongjian Ge](https://chongjiange.github.io/), and [Zhiyu Zhao](https://github.com/JerryFlymi) for their kind support.\u003cbr\u003e\nThis project is built upon [MAE-pytorch](https://github.com/pengzhiliang/MAE-pytorch) and [BEiT](https://github.com/microsoft/unilm/tree/master/beit). Thanks to the contributors of these great codebases.\n\n## 🔒 License\n\nThe majority of this project is released under the CC-BY-NC 4.0 license as found in the [LICENSE](https://github.com/MCG-NJU/VideoMAE/blob/main/LICENSE) file. Portions of the project are available under separate license terms: [SlowFast](https://github.com/facebookresearch/SlowFast) and [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) are licensed under the Apache 2.0 license. [BEiT](https://github.com/microsoft/unilm/tree/master/beit) is licensed under the MIT license.\n\n## ✏️ Citation\n\nIf you think this project is helpful, please feel free to leave a star⭐️ and cite our paper:\n\n```\n@inproceedings{tong2022videomae,\n  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},\n  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},\n  booktitle={Advances in Neural Information Processing Systems},\n  year={2022}\n}\n\n@article{videomae,\n  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},\n  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},\n  journal={arXiv preprint arXiv:2203.12602},\n  
year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMCG-NJU%2FVideoMAE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMCG-NJU%2FVideoMAE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMCG-NJU%2FVideoMAE/lists"}