{"id":13721560,"url":"https://github.com/microsoft/CvT","last_synced_at":"2025-05-07T13:33:16.316Z","repository":{"id":38260618,"uuid":"370805731","full_name":"microsoft/CvT","owner":"microsoft","description":"This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.","archived":false,"fork":false,"pushed_at":"2023-05-16T08:07:07.000Z","size":167,"stargazers_count":518,"open_issues_count":22,"forks_count":117,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-05-10T14:36:27.009Z","etag":null,"topics":["classification","computer-vision","cvt","deep-learning","imagenet"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null}},"created_at":"2021-05-25T19:26:14.000Z","updated_at":"2024-05-08T08:47:08.000Z","dependencies_parsed_at":"2022-07-14T03:20:35.969Z","dependency_job_id":"fb67b6d6-6155-4cd1-a219-b14a04276bdb","html_url":"https://github.com/microsoft/CvT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCvT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCvT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCvT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCvT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://c
odeload.github.com/microsoft/CvT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224605035,"owners_count":17339249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","computer-vision","cvt","deep-learning","imagenet"],"created_at":"2024-08-03T01:01:18.627Z","updated_at":"2024-11-14T10:31:44.967Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["DLA"],"sub_categories":["CvT"],"readme":"# Introduction\nThis is an official implementation of [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808). We present a new architecture, named Convolutional vision Transformers (CvT), that improves Vision Transformers (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) while maintaining the merits of Transformers (e.g. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. 
In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. \n\n![](figures/pipeline.svg)\n\n# Main results\n## Models pre-trained on ImageNet-1k\n| Model  | Resolution | Param | GFLOPs | Top-1 |\n|--------|------------|-------|--------|-------|\n| CvT-13 | 224x224    | 20M   | 4.5    | 81.6  |\n| CvT-21 | 224x224    | 32M   | 7.1    | 82.5  |\n| CvT-13 | 384x384    | 20M   | 16.3   | 83.0  |\n| CvT-21 | 384x384    | 32M   | 24.9   | 83.3  |\n\n## Models pre-trained on ImageNet-22k\n| Model   | Resolution | Param | GFLOPs | Top-1 |\n|---------|------------|-------|--------|-------|\n| CvT-13  | 384x384    | 20M   | 16.3   | 83.3  |\n| CvT-21  | 384x384    | 32M   | 24.9   | 84.9  |\n| CvT-W24 | 384x384    | 277M  | 193.2  | 87.6  |\n\nYou can download all the models from our [model zoo](https://1drv.ms/u/s!AhIXJn_J-blW9RzF3rMW7SsLHa8h?e=blQ0Al).\n\n\n# Quick start\n## Installation\nThis guide assumes that PyTorch and TorchVision are already installed; if not, please follow the [official instructions](https://pytorch.org/) to install them first. \nInstall the dependencies with:\n\n``` sh\npython -m pip install -r requirements.txt --user -q\n```\n\nThe code is developed and tested using pytorch 1.7.1. 
Other versions of pytorch are not fully tested.\n\n## Data preparation\nPlease prepare the data as follows:\n\n``` sh\n|-DATASET\n  |-imagenet\n    |-train\n    | |-class1\n    | | |-img1.jpg\n    | | |-img2.jpg\n    | | |-...\n    | |-class2\n    | | |-img3.jpg\n    | | |-...\n    | |-class3\n    | | |-img4.jpg\n    | | |-...\n    | |-...\n    |-val\n      |-class1\n      | |-img5.jpg\n      | |-...\n      |-class2\n      | |-img6.jpg\n      | |-...\n      |-class3\n      | |-img7.jpg\n      | |-...\n      |-...\n```\n\n\n## Run\nEach experiment is defined by a YAML config file saved under the `experiments` directory, which has a tree structure like this:\n\n``` sh\nexperiments\n|-{DATASET_A}\n| |-{ARCH_A}\n| |-{ARCH_B}\n|-{DATASET_B}\n| |-{ARCH_A}\n| |-{ARCH_B}\n|-{DATASET_C}\n| |-{ARCH_A}\n| |-{ARCH_B}\n|-...\n```\n\nWe provide a `run.sh` script for running jobs on a local machine.\n\n``` sh\nUsage: run.sh [run_options]\nOptions:\n  -g|--gpus \u003c1\u003e - number of gpus to be used\n  -t|--job-type \u003caml\u003e - job type (train|test)\n  -p|--port \u003c9000\u003e - master port\n  -i|--install-deps - If install dependencies (default: False)\n```\n\n### Training on local machine\n\n``` sh\nbash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml\n```\n\nYou can also modify the config parameters from the command line. 
For example, to change the learning rate to 0.1, run:\n``` sh\nbash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TRAIN.LR 0.1\n```\n\nNotes:\n- The checkpoint, model, and log files will be saved in OUTPUT/{dataset}/{training config} by default.\n\n### Testing pre-trained models\n\n``` sh\nbash run.sh -t test --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TEST.MODEL_FILE ${PRETRAINED_MODLE_FILE}\n```\n\n# Citation\nIf you find this work or code helpful in your research, please cite:\n\n```\n@article{wu2021cvt,\n  title={Cvt: Introducing convolutions to vision transformers},\n  author={Wu, Haiping and Xiao, Bin and Codella, Noel and Liu, Mengchen and Dai, Xiyang and Yuan, Lu and Zhang, Lei},\n  journal={arXiv preprint arXiv:2103.15808},\n  year={2021}\n}\n```\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FCvT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2FCvT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FCvT/lists"}