{"id":13429502,"url":"https://github.com/OpenGVLab/VideoMAEv2","last_synced_at":"2025-03-16T03:31:45.946Z","repository":{"id":153560819,"uuid":"623340346","full_name":"OpenGVLab/VideoMAEv2","owner":"OpenGVLab","description":"[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking","archived":false,"fork":false,"pushed_at":"2024-10-08T05:04:52.000Z","size":957,"stargazers_count":510,"open_issues_count":12,"forks_count":58,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-10-27T07:32:20.917Z","etag":null,"topics":["action-detection","action-recognition","cvpr2023","foundation-model","self-supervised-learning","temporal-action-detection","video-understanding"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2303.16727","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-04-04T07:15:23.000Z","updated_at":"2024-10-26T22:44:38.000Z","dependencies_parsed_at":"2024-01-21T03:56:52.995Z","dependency_job_id":null,"html_url":"https://github.com/OpenGVLab/VideoMAEv2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FVideoMAEv2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FVideoMAEv2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FVideoMAEv2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FVideoMAEv2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/VideoMAEv2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822310,"owners_count":20353496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["action-detection","action-recognition","cvpr2023","foundation-model","self-supervised-learning","temporal-action-detection","video-understanding"],"created_at":"2024-07-31T02:00:40.817Z","updated_at":"2025-03-16T03:31:45.940Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","readme":"# [CVPR 2023] Official Implementation of VideoMAE 
![flowchart](misc/VideoMAEv2_flowchart.png)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/spatio-temporal-action-localization-on-ava)](https://paperswithcode.com/sota/spatio-temporal-action-localization-on-ava?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-recognition-on-ava-v2-2)](https://paperswithcode.com/sota/action-recognition-on-ava-v2-2?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/temporal-action-localization-on-fineaction)](https://paperswithcode.com/sota/temporal-action-localization-on-fineaction?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-recognition-in-videos-on-hmdb-51)](https://paperswithcode.com/sota/action-recognition-in-videos-on-hmdb-51?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/temporal-action-localization-on-thumos14)](https://paperswithcode.com/sota/temporal-action-localization-on-thumos14?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-recognition-in-videos-on-ucf101)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-recognition-in-videos-on-something-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something-1?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=videomae-v2-scaling-video-masked-autoencoders)<br>
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videomae-v2-scaling-video-masked-autoencoders/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=videomae-v2-scaling-video-masked-autoencoders)<br>

> [**VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking**](https://arxiv.org/abs/2303.16727)<br>
> [Limin Wang](http://wanglimin.github.io/), [Bingkun Huang](https://github.com/congee524), [Zhiyu Zhao](https://github.com/JerryFlymi), [Zhan Tong](https://scholar.google.com/citations?user=6FsgWBMAAAAJ), [Yinan He](https://dblp.org/pid/93/7763.html), [Yi Wang](https://scholar.google.com.hk/citations?hl=zh-CN&user=Xm2M8UwAAAAJ), [Yali Wang](https://scholar.google.com/citations?user=hD948dkAAAAJ), and [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl)<br>
> Nanjing University, Shanghai AI Lab, CAS<br>

## News

**[2024.09.19]** Checkpoints have been migrated to Hugging Face. You can obtain the weights from [VideoMAEv2-hf](https://huggingface.co/OpenGVLab/VideoMAE2/tree/main).<br>
**[2023.05.29]** VideoMAE V2-g features for the THUMOS14 and FineAction datasets are now available in [TAD.md](docs/TAD.md).<br>
**[2023.05.11]** Testing of our distilled models is now supported in MMAction2 (dev version)! See [PR#2460](https://github.com/open-mmlab/mmaction2/pull/2460).<br>
**[2023.05.11]** The feature extraction script for TAD datasets has been released! See the instructions in [TAD.md](docs/TAD.md).<br>
**[2023.04.19]** ViT-giant model weights have been released! You can get the download links from [MODEL_ZOO.md](docs/MODEL_ZOO.md).<br>
**[2023.04.18]** Code and the distilled models (ViT-S & ViT-B) have been released!<br>
**[2023.04.03]** ~~Code and models will be released soon.~~<br>

## Model Zoo

We provide the model weights in [MODEL_ZOO.md](docs/MODEL_ZOO.md), including the distilled models listed below. In the `#Frame` column, 16x5x3 denotes 16 input frames per clip, evaluated with 5 temporal clips and 3 spatial crops per video at test time. A loading sketch for these checkpoints follows the table.

|   Model   | Dataset |         Teacher Model         | #Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 |
| :-------: | :-----: | :---------------------------: | :----: | :--------: | :--------: | :--------: |
| ViT-small |  K710   | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 |    77.6    |    83.7    |    83.1    |
| ViT-base  |  K710   | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 |    81.5    |    86.6    |    85.9    |
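As a minimal loading sketch, the snippet below downloads a distilled checkpoint from the Hugging Face repository mentioned in the News section and loads it into a fine-tuning model. The checkpoint filename, the timm registry name, and the state-dict key are illustrative assumptions, not verbatim from this repository; use the exact names given in [MODEL_ZOO.md](docs/MODEL_ZOO.md) and the model definitions under `models/`.

```python
import torch
from huggingface_hub import hf_hub_download
from timm.models import create_model

import models  # noqa: F401  (assumed: run from a repo checkout so the ViT variants register with timm)

# Download a distilled checkpoint from the Hugging Face repo linked in the News
# section. The filename below is a placeholder; copy the exact path from
# MODEL_ZOO.md / the VideoMAEv2-hf file listing.
ckpt_path = hf_hub_download(
    repo_id="OpenGVLab/VideoMAE2",
    filename="vit_b_k710_dl_from_giant.pth",  # hypothetical filename
)

checkpoint = torch.load(ckpt_path, map_location="cpu")
# VideoMAE-style checkpoints often nest the weights under "model" or "module".
state_dict = checkpoint.get("model", checkpoint.get("module", checkpoint))

# Build the distilled ViT-base classifier (K710 head, per the Model Zoo table).
# The registry name is an assumption; check models/modeling_finetune.py for the exact one.
model = create_model("vit_base_patch16_224", num_classes=710)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
model.eval()
```

Note that the Top-1 numbers in the table come from the multi-view test protocol (logits averaged over 5 temporal clips and 3 spatial crops per video), so a single-clip forward pass will typically score somewhat lower.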
## Installation

Please follow the instructions in [INSTALL.md](docs/INSTALL.md).

## Data Preparation

Please follow the instructions in [DATASET.md](docs/DATASET.md) for data preparation.

## Pre-training

The pre-training instructions are in [PRETRAIN.md](docs/PRETRAIN.md).

## Fine-tuning

The fine-tuning instructions are in [FINETUNE.md](docs/FINETUNE.md).

## Citation

If you find this repository useful, please cite it with one of the following BibTeX entries.

```latex
@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
      title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year={2023},
      eprint={2303.16727},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```