{"id":17349567,"url":"https://github.com/yan-hao-tian/ConTNet","last_synced_at":"2025-02-26T02:31:54.410Z","repository":{"id":52878563,"uuid":"354468574","full_name":"yan-hao-tian/ConTNet","owner":"yan-hao-tian","description":"This repo contains the code of \"ConTNet: Why not use convolution and transformer at the same time?\"","archived":false,"fork":false,"pushed_at":"2021-05-25T06:15:53.000Z","size":2169,"stargazers_count":95,"open_issues_count":7,"forks_count":14,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-10-16T18:17:42.542Z","etag":null,"topics":["convolution","downstream-tasks","imagenet","transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2104.13497","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yan-hao-tian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-04T06:01:27.000Z","updated_at":"2024-06-08T07:21:26.000Z","dependencies_parsed_at":"2022-08-23T04:31:12.616Z","dependency_job_id":null,"html_url":"https://github.com/yan-hao-tian/ConTNet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yan-hao-tian%2FConTNet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yan-hao-tian%2FConTNet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yan-hao-tian%2FConTNet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yan-hao-tian%2FConTNet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yan-hao-tian","download_url":"http
s://codeload.github.com/yan-hao-tian/ConTNet/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240780754,"owners_count":19856416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolution","downstream-tasks","imagenet","transformer"],"created_at":"2024-10-15T16:56:21.825Z","updated_at":"2025-02-26T02:31:54.292Z","avatar_url":"https://github.com/yan-hao-tian.png","language":"Python","funding_links":[],"categories":["Table of Contents"],"sub_categories":["微软Transformer霸榜模型"],"readme":"# ConTNet\n\n## Introduction\n\n\u003c!-- **ConTNet** (**Con**volution-**T**ransformer Network) is proposed mainly in response to the following two issues: (1) ConvNets lack a large receptive field, limiting the performance of ConvNets on downstream tasks. (2) Transformer-based model is not robust enough and requires special training settings or hundreds of millions of images as the pretrain dataset, thereby limiting their adoption. **ConTNet** combines convolution and transformer alternately, which is very robust and can be optimized like ResNet unlike the recently-proposed transformer-based models (e.g., ViT, DeiT) that are sensitive to hyper-parameters and need many tricks when trained from scratch on a midsize dataset (e.g., ImageNet).\n  --\u003e\n\n**ConTNet** (**Con**volution-**T**ransformer Network) is a neural network built by stacking convolutional layers and transformers alternately. 
This architecture is proposed in response to the following two issues: **(1)** The receptive field of convolution is limited by a local window (3x3), which potentially impairs the performance of ConvNets on downstream tasks. **(2)** Transformer-based models suffer from insufficient robustness; as a result, training requires multiple training tricks and many regularization strategies. In our ConTNet, these drawbacks are alleviated through the combination of convolution and transformer. Two perspectives are offered to understand the motivation. **From the view of ConvNet**, the transformer sub-layer is inserted between any two conv layers to enhance the non-local interactions of ConvNet. **From the view of Transformer**, the presence of convolution layers reintroduces the inductive bias as a cause of under-fitting. Through numerical experiments, we find that ConTNet achieves competitive performance on image recognition and downstream tasks. More notably, ConTNet can be optimized easily, even in the same way as ResNet.\n\u003c!-- ![image](https://user-images.githubusercontent.com/81896692/119272384-2b904e00-bc38-11eb-87a5-193275cc8be2.png) --\u003e\n![image](https://github.com/yan-hao-tian/ConTNet/blob/main/arch5.png)\n![image](https://github.com/yan-hao-tian/ConTNet/blob/main/block2.png)\n![image](https://github.com/yan-hao-tian/ConTNet/blob/main/block3.png)\n## Training \u0026 Validation with this Repo\nWe give an example of single-machine, multi-GPU training.\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial_cont_m --save_best True \n```\nTo validate a model, please add the arg ```--eval ```.\n```\nCUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial --eval ./debug_trial_cont_m/checkpoint_bestTop1.pth\n```\nTo resume 
training, please add the arg ```--resume```.\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial --save_best True --resume ./debug_trial_cont_m/checkpoint_bestTop1.pth\n```\n## Pretrained Weights on ImageNet\nImageNet-pretrained weights are available from [Google Drive][1] or [Baidu Cloud][2](the code is 3k3s).\n\n## Main Results on ImageNet\n\n|  name   |   resolution  |   acc@1   |   #params(M) |   FLOPs(G)   |   model   |\n| ----  |   ----    |   ----    |   ----    |   ----    |   ----    |\n|   Res-18  |   224x224 |  71.5     |   11.7    |   1.8 |       |\n|   ConT-S  |   224x224 |  **74.9** |   10.1    |   1.5 |       |\n|   Res-50  |   224x224 |  77.1     |   25.6    |   4.0 |       |\n|   ConT-M  |   224x224 |  **77.6** |   19.2    |   3.1 |       |\n|   Res-101 |   224x224 |  **78.2** |   44.5    |   7.6 |       |\n|   ConT-B  |   224x224 |   77.9    |   39.6    |   6.4 |       |\n|   DeiT-Ti\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  72.2    |   5.7    |   1.3 |       |\n|   ConT-Ti\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  **74.9**|   5.8    |   0.8 |       |\n|   Res-18\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  73.2     |   11.7    |   1.8 |       |\n|   ConT-S\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  **76.5** |   10.1    |   1.5 |       |\n|   Res-50\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  78.6     |   25.6    |   4.0 |       |\n|   DeiT-S\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  79.8     |   22.1    |   4.6 |       |\n|   ConT-M\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  **80.2** |   19.2    |   3.1 |       |\n|   Res-101\u003csup\u003e*\u003c/sup\u003e |   224x224 |  80.0     |   44.5    |   7.6 |       |\n|   DeiT-B\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  **81.8** |   86.6    |   17.6|       |\n|   ConT-B\u003csup\u003e*\u003c/sup\u003e  |   224x224 |  **81.8** |   39.6    |   
6.4 |       |\n\nNote: \u003csup\u003e*\u003c/sup\u003e indicates training with strong augmentations (auto-augmentation and mixup).\n\n## Main Results on Downstream Tasks\n\nObject detection results on COCO.\n\n| method  | backbone  | #params(M)  | FLOPs(G)  | AP    | AP\u003csub\u003es\u003c/sub\u003e  | AP\u003csub\u003em\u003c/sub\u003e  | AP\u003csub\u003el\u003c/sub\u003e  |\n| ----    | ----      | ----        | ----      | ----  | --------        | -----           | -----           |\n|RetinaNet| Res-50 \u003cbr\u003e ConTNet-M|  32.0 \u003cbr\u003e 27.0  | 235.6 \u003cbr\u003e 217.2  | 36.5 \u003cbr\u003e **37.9**  | 20.4 \u003cbr\u003e **23.0** | 40.3 \u003cbr\u003e **40.6** | 48.1 \u003cbr\u003e **50.4** |\n| FCOS    | Res-50 \u003cbr\u003e ConTNet-M|  32.2 \u003cbr\u003e 27.2  | 242.9 \u003cbr\u003e 228.4  | 38.7 \u003cbr\u003e **40.8**  | 22.9 \u003cbr\u003e **25.1** | 42.5 \u003cbr\u003e **44.6** | 50.1 \u003cbr\u003e **53.0** |\n| Faster R-CNN | Res-50 \u003cbr\u003e ConTNet-M|  41.5 \u003cbr\u003e 36.6  | 241.0 \u003cbr\u003e 225.6  | 37.4 \u003cbr\u003e **40.0**  | 21.2 \u003cbr\u003e **25.4** | 41.0 \u003cbr\u003e **43.0** | 48.1 \u003cbr\u003e **52.0** |\n  \nInstance segmentation results on Cityscapes based on Mask R-CNN.\n| backbone  | AP\u003csup\u003ebb\u003c/sup\u003e | AP\u003csub\u003es\u003c/sub\u003e\u003csup\u003ebb\u003c/sup\u003e | AP\u003csub\u003em\u003c/sub\u003e\u003csup\u003ebb\u003c/sup\u003e | AP\u003csub\u003el\u003c/sub\u003e\u003csup\u003ebb\u003c/sup\u003e | AP\u003csup\u003emk\u003c/sup\u003e | AP\u003csub\u003es\u003c/sub\u003e\u003csup\u003emk\u003c/sup\u003e | AP\u003csub\u003em\u003c/sub\u003e\u003csup\u003emk\u003c/sup\u003e | AP\u003csub\u003el\u003c/sub\u003e\u003csup\u003emk\u003c/sup\u003e |\n| ----      | ----    | ----  | ----  | ----  | ----  | ----  | ----  | ----  |\n| Res-50 \u003cbr\u003e ConT-M  | 38.2 \u003cbr\u003e **40.5**  | 21.9 \u003cbr\u003e **25.1**  | 40.9 \u003cbr\u003e **44.4** | 49.5 \u003cbr\u003e 
**52.7** | 34.7 \u003cbr\u003e **38.1** | 18.3 \u003cbr\u003e **20.9** | 37.4 \u003cbr\u003e **41.0** | 47.2 \u003cbr\u003e **50.3** |\n\nSemantic segmentation results on Cityscapes.\n| model | mIoU  |\n| ----- | ----  |\n|PSP-Res50| 77.12 |\n|PSP-ConTM| **78.28** |\n\n## Bib Citing \n```\n@article{yan2021contnet,\n    title={ConTNet: Why not use convolution and transformer at the same time?},\n    author={Haotian Yan and Zhe Li and Weijian Li and Changhu Wang and Ming Wu and Chuang Zhang},\n    year={2021},\n    journal={arXiv preprint arXiv:2104.13497}\n}\n```\n\n[1]: https://drive.google.com/drive/folders/1ZXu--Bis3LTYLjf2pkmDtZH0TjuWWamO?usp=sharing\n[2]: https://pan.baidu.com/s/1thKK36jTFln1KcAuEkzleg\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyan-hao-tian%2FConTNet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyan-hao-tian%2FConTNet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyan-hao-tian%2FConTNet/lists"}