{"id":15293059,"url":"https://github.com/emalagoli92/cvt-tensorflow","last_synced_at":"2025-04-13T12:27:35.953Z","repository":{"id":61611515,"uuid":"552403870","full_name":"EMalagoli92/CvT-TensorFlow","owner":"EMalagoli92","description":"TensorFlow 2.X reimplementation of CvT: Introducing Convolutions to Vision Transformers, Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.","archived":false,"fork":false,"pushed_at":"2023-01-26T17:21:09.000Z","size":199,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-21T19:44:21.605Z","etag":null,"topics":["computer-vision","deep-learning","image-classification","python","pytorch","tensorflow","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EMalagoli92.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-10-16T14:16:33.000Z","updated_at":"2024-05-05T08:07:04.000Z","dependencies_parsed_at":"2023-02-14T20:01:10.010Z","dependency_job_id":null,"html_url":"https://github.com/EMalagoli92/CvT-TensorFlow","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EMalagoli92%2FCvT-TensorFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EMalagoli92%2FCvT-TensorFlow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EMalagoli92%2FCvT-TensorFlow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EMalagoli92%2F
CvT-TensorFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EMalagoli92","download_url":"https://codeload.github.com/EMalagoli92/CvT-TensorFlow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240304560,"owners_count":19780312,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","image-classification","python","pytorch","tensorflow","transformers"],"created_at":"2024-09-30T16:38:47.232Z","updated_at":"2025-02-23T10:32:16.418Z","avatar_url":"https://github.com/EMalagoli92.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n  \u003ca href=\"https://www.tensorflow.org\"\u003e![TensorFlow](https://img.shields.io/badge/TensorFlow-2.X-orange?style=for-the-badge)\u003c/a\u003e \n  \u003ca href=\"https://github.com/EMalagoli92/CvT-TensorFlow/blob/main/LICENSE\"\u003e![License](https://img.shields.io/github/license/EMalagoli92/CvT-TensorFlow?style=for-the-badge)\u003c/a\u003e \n  \u003ca href=\"https://www.python.org\"\u003e![Python](https://img.shields.io/badge/python-%3E%3D%203.9-blue?style=for-the-badge)\u003c/a\u003e  \n  \n\u003c/div\u003e\n\n# CvT-TensorFlow\nTensorFlow 2.X reimplementation of [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808), Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.\n- Exact TensorFlow reimplementation of the official PyTorch repo, including the `timm` modules used by the authors, preserving the structure of models and layers.\n- 
ImageNet pretrained weights ported from the official PyTorch implementation.\n\n## Table of contents\n- [Abstract](#abstract)\n- [Results](#results)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Acknowledgement](#acknowledgement)\n- [Citations](#citations)\n- [License](#license)\n\n\u003cdiv id=\"abstract\"/\u003e\n\n## Abstract\nConvolutional vision Transformer (CvT) improves Vision Transformer (ViT) in \nperformance and efficiency by introducing convolutions into ViT to yield the \nbest of both designs. This is accomplished through two primary modifications: \na hierarchy of Transformers containing a new convolutional token embedding, \nand a convolutional Transformer block leveraging a convolutional projection. \nThese changes introduce desirable properties of convolutional neural networks \n(CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) \nwhile maintaining the merits of Transformers (e.g. dynamic attention, \nglobal context, and better generalization). \nMoreover, the achieved results show that the positional encoding, \na crucial component in existing Vision Transformers, can be safely removed \nin the model, simplifying the design for higher resolution vision tasks.\n\n\n![CvT architecture pipeline](https://raw.githubusercontent.com/EMalagoli92/CvT-TensorFlow/266afd1057827d10f0dfb842f8ef73f5b19e471d/assets/images/pipeline.svg)\n\u003cp align = \"center\"\u003e\u003csub\u003eThe pipeline of the CvT architecture. (a) Overall architecture, showing the hierarchical multi-stage\nstructure facilitated by the Convolutional Token Embedding layer. 
(b) Details of the Convolutional Transformer Block,\nwhich contains the convolution projection as the first layer.\u003c/sub\u003e\u003c/p\u003e\n\n\u003cdiv id=\"results\"/\u003e\n\n## Results\nThe TensorFlow implementation and the ported ImageNet weights have been compared to the official PyTorch implementation on the [ImageNet-V2](https://www.tensorflow.org/datasets/catalog/imagenet_v2) test set.\n\n### Models pre-trained on ImageNet-1K\n| Configuration  | Resolution | Top-1 (Original) | Top-1 (Ported) | Top-5 (Original) | Top-5 (Ported) | #Params |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| CvT-13 | 224x224 | 69.81 | 69.81 | 89.13 | 89.13 | 20M |\n| CvT-13 | 384x384 | 71.31 | 71.31 | 89.97 | 89.97 | 20M |\n| CvT-21 | 224x224 | 71.18 | 71.17 | 89.31 | 89.31 | 32M |\n| CvT-21 | 384x384 | 71.61 | 71.61 | 89.71 | 89.71 | 32M |\n\n\n### Models pre-trained on ImageNet-22K\n| Configuration  | Resolution | Top-1 (Original) | Top-1 (Ported) | Top-5 (Original) | Top-5 (Ported) | #Params |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| CvT-13 | 384x384 | 71.76 | 71.76 | 91.39 | 91.39 | 20M |\n| CvT-21 | 384x384 | 74.97 | 74.97 | 92.63 | 92.63 | 32M |\n| CvT-W24 | 384x384 | 78.15 | 78.15 | 94.48 | 94.48 | 277M |\n\nMax metrics difference: `9e-5`.\n\n\u003cdiv id=\"installation\"/\u003e\n\n## Installation\n- Install from PyPI\n```\npip install cvt-tensorflow\n```\n- Install from GitHub\n```\npip install git+https://github.com/EMalagoli92/CvT-TensorFlow\n```\n- Clone the repo and install the necessary packages\n```\ngit clone https://github.com/EMalagoli92/CvT-TensorFlow.git\ncd CvT-TensorFlow\npip install -r requirements.txt\n```\n\nTested on *Ubuntu 20.04.4 LTS x86_64*, *Python 3.9.7*.\n\n\u003cdiv id=\"usage\"/\u003e\n\n## Usage\n- Define a custom CvT configuration.\n```python\nfrom cvt_tensorflow import CvT\n\n# Define a custom CvT configuration\nmodel = CvT(\n    
in_chans=3,\n    num_classes=1000,\n    classifier_activation=\"softmax\",\n    data_format=\"channels_last\",\n    spec={\n        \"INIT\": \"trunc_norm\",\n        \"NUM_STAGES\": 3,\n        \"PATCH_SIZE\": [7, 3, 3],\n        \"PATCH_STRIDE\": [4, 2, 2],\n        \"PATCH_PADDING\": [2, 1, 1],\n        \"DIM_EMBED\": [64, 192, 384],\n        \"NUM_HEADS\": [1, 3, 6],\n        \"DEPTH\": [1, 2, 10],\n        \"MLP_RATIO\": [4.0, 4.0, 4.0],\n        \"ATTN_DROP_RATE\": [0.0, 0.0, 0.0],\n        \"DROP_RATE\": [0.0, 0.0, 0.0],\n        \"DROP_PATH_RATE\": [0.0, 0.0, 0.1],\n        \"QKV_BIAS\": [True, True, True],\n        \"CLS_TOKEN\": [False, False, True],\n        \"QKV_PROJ_METHOD\": [\"dw_bn\", \"dw_bn\", \"dw_bn\"],\n        \"KERNEL_QKV\": [3, 3, 3],\n        \"PADDING_KV\": [1, 1, 1],\n        \"STRIDE_KV\": [2, 2, 2],\n        \"PADDING_Q\": [1, 1, 1],\n        \"STRIDE_Q\": [1, 1, 1],\n    },\n)\n```\n- Use a predefined CvT configuration.\n```python\nfrom cvt_tensorflow import CvT\n\nmodel = CvT(\n    configuration=\"cvt-21\", data_format=\"channels_last\", classifier_activation=\"softmax\"\n)\nmodel.build((None, 224, 224, 3))\n# summary() prints the table itself and returns None\nmodel.summary()\n```\n```\nModel: \"cvt-21\"\n_________________________________________________________________\n Layer (type)                Output Shape              Param #   \n=================================================================\n stage0 (VisionTransformer)  multiple                  62080     \n                                                                 \n stage1 (VisionTransformer)  multiple                  1920576   \n                                                                 \n stage2 (VisionTransformer)  ((None, 384, 14, 14),     29296128  \n                              (None, 1, 384))                    \n                                                                 \n norm (LayerNorm_)           (None, 1, 384)            768       \n                                                    
             \n head (Linear_)              (None, 1000)              385000    \n                                                                 \n pred (Activation)           (None, 1000)              0         \n                                                                 \n=================================================================\nTotal params: 31,664,552\nTrainable params: 31,622,696\nNon-trainable params: 41,856\n_________________________________________________________________\n```\n- Train the model from scratch.\n```python\n# Example\nmodel.compile(\n    optimizer=\"sgd\",\n    loss=\"sparse_categorical_crossentropy\",\n    metrics=[\"accuracy\", \"sparse_top_k_categorical_accuracy\"],\n)\n# x and y are placeholders for your training images and integer class labels\nmodel.fit(x, y)\n```\n- Use ported ImageNet pretrained weights.\n```python\n# Example\nfrom cvt_tensorflow import CvT\n\n# Use cvt-13-384x384_22k ImageNet pretrained weights\nmodel = CvT(\n    configuration=\"cvt-13\",\n    pretrained=True,\n    pretrained_resolution=384,\n    pretrained_version=\"22k\",\n    classifier_activation=\"softmax\",\n)\n# image is a placeholder for a batch of preprocessed input images\ny_pred = model(image)\n```\n\n\u003cdiv id=\"acknowledgement\"/\u003e\n\n## Acknowledgement\n[CvT](https://github.com/microsoft/CvT) (Official PyTorch implementation)\n\n\n\u003cdiv id=\"citations\"/\u003e\n\n## Citations\n```bibtex\n@article{wu2021cvt,\n  title={Cvt: Introducing convolutions to vision transformers},\n  author={Wu, Haiping and Xiao, Bin and Codella, Noel and Liu, Mengchen and Dai, Xiyang and Yuan, Lu and Zhang, Lei},\n  journal={arXiv preprint arXiv:2103.15808},\n  year={2021}\n}\n```\n\n\u003cdiv id=\"license\"/\u003e\n\n## License\nThis work is made available under the [MIT 
License](https://github.com/EMalagoli92/CvT-TensorFlow/blob/main/LICENSE)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femalagoli92%2Fcvt-tensorflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femalagoli92%2Fcvt-tensorflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femalagoli92%2Fcvt-tensorflow/lists"}