{"id":13737968,"url":"https://github.com/Alpha-VL/ConvMAE","last_synced_at":"2025-05-08T15:32:13.204Z","repository":{"id":37740853,"uuid":"487770089","full_name":"Alpha-VL/ConvMAE","owner":"Alpha-VL","description":"ConvMAE: Masked Convolution Meets Masked Autoencoders","archived":false,"fork":false,"pushed_at":"2023-03-14T15:17:35.000Z","size":8947,"stargazers_count":483,"open_issues_count":25,"forks_count":41,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-11-15T06:33:02.501Z","etag":null,"topics":["backbone","computer-vision","mae","masked-image-modeling","object-detection","semantic-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Alpha-VL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-05-02T08:27:27.000Z","updated_at":"2024-11-12T09:14:05.000Z","dependencies_parsed_at":"2024-01-07T20:10:54.440Z","dependency_job_id":null,"html_url":"https://github.com/Alpha-VL/ConvMAE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VL%2FConvMAE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VL%2FConvMAE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VL%2FConvMAE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VL%2FConvMAE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alpha-VL","download_url":"https://codeload.github
.com/Alpha-VL/ConvMAE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253096296,"owners_count":21853571,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backbone","computer-vision","mae","masked-image-modeling","object-detection","semantic-segmentation"],"created_at":"2024-08-03T03:02:07.467Z","updated_at":"2025-05-08T15:32:08.973Z","avatar_url":"https://github.com/Alpha-VL.png","language":"Python","funding_links":[],"categories":["Python","Fundamental MIM Methods"],"sub_categories":["MIM for Transformers"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch3\u003e[NeurIPS 2022] MCMAE: Masked Convolution Meets Masked Autoencoders\u003c/h3\u003e\n\n[Peng Gao](https://scholar.google.com/citations?user=miFIAFMAAAAJ\u0026hl=en\u0026oi=ao)\u003csup\u003e1\u003c/sup\u003e, [Teli Ma](https://scholar.google.com/citations?user=arny77IAAAAJ\u0026hl=en\u0026oi=ao)\u003csup\u003e1\u003c/sup\u003e, [Hongsheng Li](https://scholar.google.com/citations?user=BN2Ze-QAAAAJ\u0026hl=en\u0026oi=ao)\u003csup\u003e2\u003c/sup\u003e, [Ziyi Lin](https://scholar.google.com/citations?user=-VOnnzUAAAAJ\u0026hl=en)\u003csup\u003e2\u003c/sup\u003e, [Jifeng Dai](https://scholar.google.com/citations?user=SH_-B_AAAAAJ\u0026hl=en\u0026oi=ao)\u003csup\u003e3\u003c/sup\u003e, [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ\u0026hl=en\u0026oi=ao)\u003csup\u003e1\u003c/sup\u003e,\n\n\u003csup\u003e1\u003c/sup\u003e [Shanghai AI Laboratory](https://www.shlab.org.cn/), \u003csup\u003e2\u003c/sup\u003e [MMLab, 
CUHK](https://mmlab.ie.cuhk.edu.hk/), \u003csup\u003e3\u003c/sup\u003e [Sensetime Research](https://www.sensetime.com/cn).\n\n\u003c/div\u003e\n\n\\* We have changed the project name from **ConvMAE** to **MCMAE**.\n\nThis repo is the official implementation of [MCMAE: Masked Convolution Meets Masked Autoencoders](https://arxiv.org/abs/2205.03892). It currently includes code and models for the following tasks:\n\u003e **ImageNet Pretrain**: See [PRETRAIN.md](PRETRAIN.md).\\\n\u003e **ImageNet Finetune**: See [FINETUNE.md](FINETUNE.md).\\\n\u003e **Object Detection**: See [DETECTION.md](DET/DETECTION.md).\\\n\u003e **Semantic Segmentation**: See [SEGMENTATION.md](SEG/SEGMENTATION.md). \\\n\u003e **Video Classification**: See [VideoConvMAE](https://github.com/Alpha-VL/VideoConvMAE).\n\n## Updates\n\n***14/Mar/2023***\n\nMR-MCMAE (a.k.a. ConvMAE-v2) paper released: [Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking](https://arxiv.org/abs/2303.05475).\n\n***15/Sep/2022***\n\nPaper accepted at NeurIPS 2022.\n\n***9/Sep/2022***\n\nConvMAE-v2 pretrained checkpoints are released.\n\n***21/Aug/2022***\n\n[Official-ConvMAE-Det](https://github.com/OpenGVLab/Official-ConvMAE-Det), which follows the official ViTDet codebase, is released. \n\n***08/Jun/2022***\n\n🚀FastConvMAE🚀: significantly reduces pretraining time (4000 single GPU hours =\u003e 200 single GPU hours). The code is going to be released at [FastConvMAE](https://github.com/Alpha-VL/FastConvMAE).\n\n***27/May/2022***\n\n1. Code for ImageNet-1K pretraining is provided.\n2. Code and models for semantic segmentation are provided.\n\n***20/May/2022***\n\nUpdated results on video classification.\n\n***16/May/2022***\n\nCode and models for COCO object detection and instance segmentation are available.\n\n***11/May/2022***\n\n1. Pretrained models on ImageNet-1K for ConvMAE.\n2. 
Code and models for ImageNet-1K finetuning and linear probing are provided.\n\n***08/May/2022***\n\nThe preprint is public on [arXiv](https://arxiv.org/abs/2205.03892).\n\n## Introduction\nThe ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. \n* We present the strong and efficient self-supervised framework ConvMAE, which is easy to implement yet shows outstanding performance on downstream tasks.\n* ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.\n* ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base.\nOn object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).\n\n\n![tenser](figures/ConvMAE.png)\n\n## Pretrain on ImageNet-1K\nThe following table provides pretrained checkpoints and logs used in the paper.\n| | ConvMAE-Base|\n| :---: | :---: |\n| pretrained checkpoints| [download](https://drive.google.com/file/d/1AEPivXw0A0b_m5EwEi6fg2pOAoDr8C31/view?usp=sharing) |\n| logs | [download](https://drive.google.com/file/d/1Je9ClIGCQP43xC3YURVFPnaMRC0-ax1h/view?usp=sharing) |\n\nThe following results are for ConvMAE-v2 (pretrained for 200 epochs on ImageNet-1k).\n| model | pretrained checkpoints | ft. acc. 
on ImageNet-1k |\n| :---: | :---: | :---: |\n| ConvMAE-v2-Small | [download](https://drive.google.com/file/d/1LqU-0tajhxYMSTN6WVFwiIveFjETVvKb/view?usp=sharing) | 83.6 |\n| ConvMAE-v2-Base  | [download](https://drive.google.com/file/d/1gykVKNDlRn8eiuXk5bUj1PbSnHXFzLnI/view?usp=sharing) | 85.7 |\n| ConvMAE-v2-Large | [download](https://drive.google.com/file/d/1RN3ZseDseWGwuUwrVTkel17_iYFvZL6m/view?usp=sharing) | 86.8 |\n| ConvMAE-v2-Huge  | [download](https://drive.google.com/file/d/1k1OBhNTLzRI9c6ReSgK7_7vqGZr-2Cpd/view?usp=sharing) | 88.0 |\n\n## Main Results on ImageNet-1K\n| Models | #Params(M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1(%) | LIN acc@1(%) | FT logs/weights | LIN logs/weights |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |\n| MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |\n| SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |\n| MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |\n| data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |\n| ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | [log](https://drive.google.com/file/d/1nzAOD5UR3b9QqwD2vMMz0Bx3170sypuy/view?usp=sharing)/[weight](https://drive.google.com/file/d/19F6vQUlITpzNLvXLKi5NRxRLOmKRxqFi/view?usp=sharing) |\n\n\n\n## Main Results on COCO\n### Mask R-CNN\n| Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params(M)| FLOPs(T) | box AP | mask AP | logs/weights |\n| :---: | :---: | :---: |:---: | :---: | :---: | :---: | :---: | :---: |\n| Swin-B | IN21K w/ labels | 90 | 36 | 109 | 0.7 | 51.4 | 45.4 | - | \n| Swin-L | IN21K w/ labels | 90 | 36 | 218 | 1.1 | 52.4 | 46.2 | - | \n| MViTv2-B | IN21K w/ labels | 90 | 36 | 73 | 0.6 | 53.1 | 47.4 | - | \n| MViTv2-L | IN21K w/ labels | 90 | 36 | 239 | 1.3 | 53.6 | 47.5 | - | \n| Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |\n| Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 
53.3 | 47.2 | - |\n| ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |\n| MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |\n| MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |\n| ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | [log](https://drive.google.com/file/d/1vQ9ps-TxeS_8BRfSWZh-X-5Kki7mgIgR/view?usp=sharing)/[weight](https://drive.google.com/file/d/17gy2mlrRVpIlQN9ERSHh98VkHhWINn-m/view?usp=sharing) |\n\n\n\n## Main Results on ADE20K\n### UperNet\n| Models | Pretrain | Pretrain Epochs| Finetune Iters | #Params(M)| FLOPs(T) | mIoU | logs/weights |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |\n| Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |\n| MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | -  |\n| DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | -  |\n| BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | -  |\n| PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | -  |\n| CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | -  |\n| MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | -  |\n| ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | [log](https://drive.google.com/file/d/1N3LEhEd2FLx8777Kn5tVn5gxYiBTz00A/view?usp=sharing)/[weight](https://drive.google.com/file/d/1aQR_CmZBzN2eHWYgzPUDm4ulme-g9cIR/view?usp=sharing)  |\n\n## Main Results on Kinetics-400\n\n|         Models          | Pretrain Epochs |    Finetune Epochs    | #Params(M) | Top1 | Top5 | logs/weights |\n| :---------------------: | :-------------: | :-------------------: | :--------: | :--: | :--: | :----------: |\n|       VideoMAE-B        |       200       |          100          |     87     | 77.8 |      |              |\n|       VideoMAE-B        |       800       |          100          |     87     | 79.4 |      |              |\n|       VideoMAE-B        |      1600       |          100          |     87     | 79.8 |  
    |              |\n|       VideoMAE-B        |      1600       | 100 (w/ Repeated Aug) |     87     | 80.7 | 94.7 |              |\n| SpatioTemporalLearner-B |       800       | 150 (w/ Repeated Aug) |     87     | 81.3 | 94.9 |              |\n|     VideoConvMAE-B      |       200       |          100          |     86     | 80.1 | 94.3 |     Soon     |\n|     VideoConvMAE-B      |       800       |          100          |     86     | 81.7 | 95.1 |     Soon     |\n|   VideoConvMAE-B-MSD    |       800       |          100          |     86     | 82.7 | 95.5 |     Soon     |\n\n## Main Results on Something-Something V2\n\n|       Models       | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |\n| :----------------: | :-------------: | :-------------: | :--------: | :--: | :--: | :----------: |\n|     VideoMAE-B     |       200       |       40        |     87     | 66.1 |      |              |\n|     VideoMAE-B     |       800       |       40        |     87     | 69.3 |      |              |\n|     VideoMAE-B     |       2400      |       40        |     87     | 70.3 |      |              |\n|   VideoConvMAE-B   |       200       |       40        |     86     | 67.7 | 91.2 |     Soon     |\n|   VideoConvMAE-B   |       800       |       40        |     86     | 69.9 | 92.4 |     Soon     |\n| VideoConvMAE-B-MSD |       800       |       40        |     86     | 70.7 | 93.0 |     Soon     |\n\n\n## Getting Started\n### Prerequisites\n* Linux\n* Python 3.7+\n* CUDA 10.2+\n* GCC 5+\n\n### Training and evaluation\n* See [PRETRAIN.md](PRETRAIN.md) for pretraining.\n* See [FINETUNE.md](FINETUNE.md) for pretrained model finetuning and linear probing. 
\n* See [DETECTION.md](DET/DETECTION.md) for using the pretrained backbone with [Mask R-CNN](https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html).\n* See [SEGMENTATION.md](SEG/SEGMENTATION.md) for using the pretrained backbone with [UperNet](https://openaccess.thecvf.com/content_ECCV_2018/html/Tete_Xiao_Unified_Perceptual_Parsing_ECCV_2018_paper.html).\n* See [VideoConvMAE](https://github.com/Alpha-VL/VideoConvMAE) for video classification.\n\n## Visualization\n![tenser](figures/feat_map.JPG)\n\n## Acknowledgement\nThe pretraining and finetuning code of our project is based on [DeiT](https://github.com/facebookresearch/deit) and [MAE](https://github.com/facebookresearch/mae). The object detection and semantic segmentation parts are based on [MIMDet](https://github.com/hustvl/MIMDet) and [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), respectively. Thanks for their wonderful work.\n\n## License\nConvMAE is released under the [MIT License](https://github.com/Alpha-VL/ConvMAE/blob/main/LICENSE).\n\n## Citation\n\n```bibtex\n@article{gao2022convmae,\n  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},\n  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},\n  journal={arXiv preprint arXiv:2205.03892},\n  year={2022}\n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-VL%2FConvMAE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlpha-VL%2FConvMAE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-VL%2FConvMAE/lists"}