{"id":13445230,"url":"https://github.com/sail-sg/poolformer","last_synced_at":"2025-05-16T00:09:16.139Z","repository":{"id":37390081,"uuid":"430541106","full_name":"sail-sg/poolformer","owner":"sail-sg","description":"PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)","archived":false,"fork":false,"pushed_at":"2024-06-01T15:19:56.000Z","size":473,"stargazers_count":1324,"open_issues_count":14,"forks_count":116,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-04-08T14:02:18.534Z","etag":null,"topics":["image-classification","mlp","pooling","pytorch","transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2111.11418","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sail-sg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-22T02:47:03.000Z","updated_at":"2025-04-02T14:26:05.000Z","dependencies_parsed_at":"2024-12-07T23:04:04.586Z","dependency_job_id":"dd1ba6e7-4f83-4148-8c70-35365ffc8397","html_url":"https://github.com/sail-sg/poolformer","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fpoolformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fpoolformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fpoolformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Fpoolformer/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sail-sg","download_url":"https://codeload.github.com/sail-sg/poolformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254442856,"owners_count":22071878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-classification","mlp","pooling","pytorch","transformer"],"created_at":"2024-07-31T05:00:27.186Z","updated_at":"2025-05-16T00:09:11.104Z","avatar_url":"https://github.com/sail-sg.png","language":"Python","funding_links":[],"categories":["3. Perception","Python","Jupyter Notebook","其他_机器视觉"],"sub_categories":["3.1.1 Vision based","网络服务_其他"],"readme":"# PoolFormer: [MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) (CVPR 2022 Oral)\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://arxiv.org/abs/2111.11418\" alt=\"arXiv\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/arXiv-2111.11418-b31b1b.svg?style=flat\" /\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/spaces/akhaliq/poolformer\" alt=\"Hugging Face Spaces\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue\" /\u003e\u003c/a\u003e\n\u003ca href=\"https://colab.research.google.com/drive/1n1UK4ihfiySTWTDuusAhm_6CLm1h4bTj?usp=sharing\" alt=\"Colab\"\u003e\n    \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n---\n:fire: :fire: Our follow-up work \"[MetaFormer Baselines for Vision](https://arxiv.org/abs/2210.13452)\" 
(code: [metaformer](https://github.com/sail-sg/metaformer)) introduces more MetaFormer baselines, including:\n+ **IdentityFormer**, with identity mapping as the token mixer, surprisingly achieves \u003e80% accuracy.\n+ **RandFormer** achieves \u003e81% accuracy by random token mixing, demonstrating that MetaFormer works well with arbitrary token mixers.\n+ **ConvFormer**, with separable convolution as the token mixer, significantly outperforms ConvNeXt by a large margin.\n+ **CAFormer**, with separable convolutions and vanilla self-attention as token mixers, sets a new record on ImageNet-1K.\n\n---\n\n\nThis is a PyTorch implementation of **PoolFormer** proposed in our paper \"[MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)\" (CVPR 2022 Oral).\n\n\n**Note**: Instead of designing a complicated token mixer to achieve SOTA performance, the goal of this work is to demonstrate that the competence of Transformer models largely stems from the general architecture MetaFormer. Pooling/PoolFormer are just the tools to support our claim. \n\n![MetaFormer](https://user-images.githubusercontent.com/49296856/177275244-13412754-3d49-43ef-a8bd-17c0874c02c1.png)\nFigure 1: **MetaFormer and performance of MetaFormer-based models on ImageNet-1K validation set.** \nWe argue that the competence of Transformer/MLP-like models primarily stems from the general architecture MetaFormer rather than from the specific token mixers they are equipped with.\nTo demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing. \nSurprisingly, the resulting model, PoolFormer, consistently outperforms DeiT and ResMLP as shown in (b), which strongly supports the claim that MetaFormer is actually what we need to achieve competitive performance. 
RSB-ResNet in (b) means the results are from “ResNet Strikes Back”, where ResNet is trained with an improved training procedure for 300 epochs.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://user-images.githubusercontent.com/49296856/205430159-54bba545-520e-4ab8-8a77-278d90b54ec4.png\" alt=\"PoolFormer\"/\u003e\n\u003c/p\u003e\n\nFigure 2: (a) **The overall framework of PoolFormer.** (b) **The architecture of the PoolFormer block.** Compared with the Transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.\n\n## Bibtex\n```\n@inproceedings{yu2022metaformer,\n  title={Metaformer is actually what you need for vision},\n  author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={10819--10829},\n  year={2022}\n}\n```\n\n**Detection and instance segmentation on COCO** configs and trained models are [here](detection/).\n\n**Semantic segmentation on ADE20K** configs and trained models are [here](segmentation/).\n\nThe code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is [here](misc/cam_image.py).\n\nThe code to measure MACs is [here](misc/mac_count_with_fvcore.py).\n\n## Image Classification\n### 1. 
Requirements\n\ntorch\u003e=1.7.0; torchvision\u003e=0.8.0; pyyaml; [apex-amp](https://github.com/NVIDIA/apex) (if you want to use fp16); [timm](https://github.com/rwightman/pytorch-image-models) (`pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a`)\n\nData preparation: prepare ImageNet with the following folder structure. You can extract ImageNet with this [script](https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4).\n\n```\n│imagenet/\n├──train/\n│  ├── n01440764\n│  │   ├── n01440764_10026.JPEG\n│  │   ├── n01440764_10027.JPEG\n│  │   ├── ......\n│  ├── ......\n├──val/\n│  ├── n01440764\n│  │   ├── ILSVRC2012_val_00000293.JPEG\n│  │   ├── ILSVRC2012_val_00002138.JPEG\n│  │   ├── ......\n│  ├── ......\n```\n\n\n\n### 2. PoolFormer Models\n\n| Model    |  #Params | Image resolution | #MACs* | Top-1 Acc (%) | Download | \n| :---     |   :---:    |  :---: |  :---: |  :---:  |  :---:  |\n| poolformer_s12  |    12M     |   224  |  1.8G |  77.2  | [here](https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s12.pth.tar) |\n| poolformer_s24 |   21M     |   224 | 3.4G | 80.3  | [here](https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar) |\n| poolformer_s36  |   31M     |   224 | 5.0G | 81.4  | [here](https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s36.pth.tar) |\n| poolformer_m36 |   56M     |   224 | 8.8G | 82.1  | [here](https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m36.pth.tar) |\n| poolformer_m48  |   73M     |   224 | 11.6G | 82.5  | [here](https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m48.pth.tar) | \n\n\nAll the pretrained models can also be downloaded from [BaiDu Yun](https://pan.baidu.com/s/1HSaJtxgCkUlawurQLq87wQ) (password: esac). 
* For convenient comparison with future models, we have updated the MACs to the numbers counted with the [fvcore](https://github.com/facebookresearch/fvcore) library ([example code](misc/mac_count_with_fvcore.py)), which are also reported in the [new arXiv version](https://arxiv.org/abs/2111.11418).\n\n\n#### Web Demo\n\nIntegrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/poolformer)\n\n\n\n#### Usage\nWe also provide a Colab notebook which runs the steps to perform inference with PoolFormer: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1n1UK4ihfiySTWTDuusAhm_6CLm1h4bTj?usp=sharing)\n\n\n### 3. Validation\n\nTo evaluate our PoolFormer models, run:\n\n```bash\nMODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}\npython3 validate.py /path/to/imagenet  --model $MODEL -b 128 \\\n  --pretrained # or --checkpoint /path/to/checkpoint \n```\n\n\n\n### 4. Train\nWe show how to train PoolFormers on 8 GPUs. The relation between learning rate and batch size is lr=bs/1024*1e-3.\nFor convenience, we assume a total batch size of 1024 (8 GPUs with `-b 128` each), so the learning rate is set to 1e-3 (for a batch size of 1024, a learning rate of 2e-3 sometimes gives better performance). \n\n\n```bash\nMODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}\nDROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] corresponding to models [s12, s24, s36, m36, m48]\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \\\n  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp\n```\n\n### 5. 
Visualization\n![gradcam](https://user-images.githubusercontent.com/15921929/201674709-024a5356-42f2-433d-89e7-801c23646211.png)\n\nThe code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is [here](misc/cam_image.py).\n\n\n## Acknowledgment\nOur implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful work.\n\n[pytorch-image-models](https://github.com/rwightman/pytorch-image-models), [mmdetection](https://github.com/open-mmlab/mmdetection), [mmsegmentation](https://github.com/open-mmlab/mmsegmentation).\n\n\nBesides, Weihao Yu would like to thank the TPU Research Cloud (TRC) program for supporting part of the computational resources.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Fpoolformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsail-sg%2Fpoolformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Fpoolformer/lists"}