{"id":13721598,"url":"https://github.com/microsoft/Focal-Transformer","last_synced_at":"2025-05-07T13:33:24.625Z","repository":{"id":40615937,"uuid":"384661236","full_name":"microsoft/Focal-Transformer","owner":"microsoft","description":"[NeurIPS 2021 Spotlight] Official code for \"Focal Self-attention for Local-Global Interactions in Vision Transformers\"","archived":false,"fork":false,"pushed_at":"2022-03-27T05:21:56.000Z","size":342,"stargazers_count":554,"open_issues_count":18,"forks_count":61,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-04-30T12:46:31.043Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null}},"created_at":"2021-07-10T09:34:31.000Z","updated_at":"2025-03-29T14:52:53.000Z","dependencies_parsed_at":"2022-07-14T22:46:49.148Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/Focal-Transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFocal-Transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFocal-Transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFocal-Transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFocal-Transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"h
ttps://codeload.github.com/microsoft/Focal-Transformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252887278,"owners_count":21819870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:01:19.073Z","updated_at":"2025-05-07T13:33:24.252Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["DLA","Python"],"sub_categories":["Focal-T"],"readme":"# Focal Transformer \\[NeurIPS 2021 Spotlight\\]\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=focal-self-attention-for-local-global)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=focal-self-attention-for-local-global)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=focal-self-attention-for-local-global)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=focal-self-attention-for-local-global)\n[![PWC](https://img.shields.io
/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=focal-self-attention-for-local-global)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=focal-self-attention-for-local-global)\n\nThis is the official implementation of our [Focal Transformer -- \"Focal Self-attention for Local-Global Interactions in Vision Transformers\"](https://arxiv.org/pdf/2107.00641.pdf), \nby Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.\n\n## Introduction\n\n![focal-transformer-teaser](figures/focal-transformer-teaser.png)\n\nOur Focal Transformer introduces a new self-attention mechanism called **focal self-attention** for vision transformers. \nIn this new mechanism, **each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity**, \nand thus can capture both short- and long-range visual dependencies efficiently and effectively. \n\nWith our Focal Transformers, we achieve superior performance over the state-of-the-art vision Transformers on a range of public benchmarks. \nIn particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve `83.6 and 84.0` Top-1 accuracy, respectively, \non ImageNet classification at 224x224 resolution. \nUsing Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art methods \nfor 6 different object detection methods trained with standard 1x and 3x schedules. 
\nOur largest Focal Transformer yields `58.7/58.9 box mAPs` and `50.9/51.3 mask mAPs` on COCO mini-val/test-dev, \nand `55.4 mIoU` on ADE20K for semantic segmentation.\n\n:film_strip: [Video by The AI Epiphany](https://www.google.com/url?sa=t\u0026rct=j\u0026q=\u0026esrc=s\u0026source=web\u0026cd=\u0026cad=rja\u0026uact=8\u0026ved=2ahUKEwjzk6Wm8NHyAhVCqlsKHYepD9wQtwJ6BAgDEAM\u0026url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DYH319yyeoVw\u0026usg=AOvVaw27s7EE-txctmc6_BwKnnfE)\n\n## Next Generation Architecture\n\nWe have developed [FocalNet](https://arxiv.org/abs/2203.11926), a next-generation architecture built on the focal mechanism. It is much faster and more effective. Check it out at: [https://github.com/microsoft/FocalNet](https://github.com/microsoft/FocalNet)!\n\n## Faster Focal Transformer\n\nAs you may have noticed, although the theoretical GFLOPs of our Focal Transformer are comparable to those of prior work, its wall-clock efficiency lags behind. Therefore, we are releasing a faster version of Focal Transformer, which discards all the rolling and unfolding operations used in our first version.\n\n| Model | Pretrain | Use Conv | Resolution | acc@1 | acc@5 | #params | FLOPs | Throughput (imgs/s) | Checkpoint | Config |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: | :---: | :---: |\n| Focal-T | IN-1K | No | 224 | 82.2 | 95.9 | 28.9M   | 4.9G   | 319 | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-tiny-is224-ws7.pth) | [yaml](configs/focal_tiny_patch4_window7_224.yaml) |\n| Focal-fast-T | IN-1K | Yes  | 224 | 82.4 | 96.0 | 30.2M   | 5.0G   | 483 | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focalv2-tiny-useconv-is224-ws7.pth) | [yaml](configs/focalv2_tiny_useconv_patch4_window7_224.yaml) |\n| Focal-S | IN-1K | No | 224 | 83.6 | 96.2 | 51.1M   | 9.4G   | 192 | 
[download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-small-is224-ws7.pth) |[yaml](configs/focal_small_patch4_window7_224.yaml) |\n| Focal-fast-S | IN-1K | Yes | 224 | 83.6 | 96.4 | 51.5M   | 9.4G  | 293  | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focalv2-small-useconv-is224-ws7.pth) |[yaml](configs/focalv2_small_useconv_patch4_window7_224.yaml) |\n| Focal-B | IN-1K | No | 224 | 84.0 | 96.5 | 89.8M   | 16.4G  | 138 | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-base-is224-ws7.pth) | [yaml](configs/focal_base_patch4_window7_224.yaml) |\n| Focal-fast-B | IN-1K | Yes | 224 | 84.0 | 96.6 | 91.2M   | 16.4G  | 203 | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focalv2-base-useconv-is224-ws7.pth) | [yaml](configs/focalv2_base_useconv_patch4_window7_224.yaml) |\n\n## Benchmarking \n\n### Image Classification Throughput with Image Resolution\n\n| Model | Top-1 Acc. 
| GFLOPs (224x224) | 224x224 | 448x448 | 896x896 |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| DeiT-Small/16 | 79.8 | 4.6 | 939 | 101 | 20 |\n| PVT-Small | 79.8 | 3.8 | 794 | 172 | 31 |\n| CvT-13 | 81.6 | 4.5 | 746 | 125 | 14 |\n| ViL-Small | 82.0 | 5.1 | 397 | 87 | 17 |\n| Swin-Tiny | 81.2 | 4.5 | 760 | 189 | 48 |\n| Focal-Tiny | 82.2 | 4.9 | 319 | 105 | 27 |\n| PVT-Medium | 81.2 | 6.7 | 517 | 111 | 20 |\n| CvT-21 | 82.5 | 7.1 | 480 | 85 | 10 |\n| ViL-Medium | 83.3 | 9.1 | 251 | 53 | 8 |\n| Swin-Small | 83.1 | 8.7 | 435 | 111 | 28 |\n| Focal-Small | 83.6 | 9.4 | 192 | 63 | 17 |\n| ViT-Base/16 | 77.9 | 17.6 | 291 | 57 | 8 |\n| DeiT-Base/16 | 81.8 | 17.6 | 291 | 57 | 8 |\n| PVT-Large | 81.7 | 9.8 | 352 | 77 | 14 |\n| ViL-Base | 83.2 | 13.4 | 145 | 35 | 5 |\n| Swin-Base | 83.4 | 15.4 | 291 | 70 | 17 |\n| Focal-Base | 84.0 | 16.4 | 138 | 44 | 11 |\n\n\n### Image Classification on [ImageNet-1K](https://www.image-net.org/)\n\n| Model | Pretrain | Use Conv | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Config |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: | :---: |\n| Focal-T | IN-1K | No | 224 | 82.2 | 95.9 | 28.9M   | 4.9G   | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-tiny-is224-ws7.pth) | [yaml](configs/focal_tiny_patch4_window7_224.yaml) |\n| Focal-T | IN-1K | Yes  | 224 | 82.7 | 96.1 | 30.8M   | 5.2G   | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-tiny-useconv-is224-ws7.pth) | [yaml](configs/focal_tiny_useconv_patch4_window7_224.yaml) |\n| Focal-S | IN-1K | No | 224 | 83.6 | 96.2 | 51.1M   | 9.4G   | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-small-is224-ws7.pth) |[yaml](configs/focal_small_patch4_window7_224.yaml) |\n| Focal-S | IN-1K | Yes | 224 | 83.8 | 96.5 | 53.1M   | 9.7G   | 
[download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-small-useconv-is224-ws7.pth) |[yaml](configs/focal_small_useconv_patch4_window7_224.yaml) |\n| Focal-B | IN-1K | No | 224 | 84.0 | 96.5 | 89.8M   | 16.4G  | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-base-is224-ws7.pth) | [yaml](configs/focal_base_patch4_window7_224.yaml) |\n| Focal-B | IN-1K | Yes | 224 | 84.2 | 97.1 | 93.3M   | 16.8G  | [download](https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-base-useconv-is224-ws7.pth) | [yaml](configs/focal_base_useconv_patch4_window7_224.yaml) |\n\n### Object Detection and Instance Segmentation on [COCO](https://cocodataset.org/#home)\n\n#### [Mask R-CNN](https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf)\n\n| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP | \n| :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| Focal-T | ImageNet-1K | 1x | 49M | 291G | 44.8 | 41.0 | \n| Focal-T | ImageNet-1K | 3x | 49M | 291G | 47.2 | 42.7 | \n| Focal-S | ImageNet-1K | 1x | 71M | 401G | 47.4 | 42.8 | \n| Focal-S | ImageNet-1K | 3x | 71M | 401G | 48.8 | 43.8 | \n| Focal-B | ImageNet-1K | 1x | 110M | 533G | 47.8 | 43.2 | \n| Focal-B | ImageNet-1K | 3x | 110M | 533G | 49.0 | 43.7 | \n\n#### [RetinaNet](https://openaccess.thecvf.com/content_ICCV_2017/papers/Lin_Focal_Loss_for_ICCV_2017_paper.pdf)\n\n| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | \n| :---: | :---: | :---: | :---: | :---: | :---: |\n| Focal-T | ImageNet-1K | 1x | 39M | 265G | 43.7 |\n| Focal-T | ImageNet-1K | 3x | 39M | 265G | 45.5 | \n| Focal-S | ImageNet-1K | 1x | 62M | 367G | 45.6 | \n| Focal-S | ImageNet-1K | 3x | 62M | 367G | 47.3 | \n| Focal-B | ImageNet-1K | 1x | 101M | 514G | 46.3 | \n| Focal-B | ImageNet-1K | 3x | 101M | 514G | 46.9 | \n\n#### Other detection methods\n\n| Backbone | Pretrain | Method | Lr Schd | 
#params | FLOPs | box mAP | \n| :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| Focal-T | ImageNet-1K | [Cascade Mask R-CNN](https://arxiv.org/abs/1712.00726) | 3x | 87M  | 770G | 51.5 | \n| Focal-T | ImageNet-1K | [ATSS](https://arxiv.org/pdf/1912.02424.pdf)           | 3x | 37M  | 239G | 49.5 |\n| Focal-T | ImageNet-1K | [RepPointsV2](https://arxiv.org/pdf/2007.08508.pdf)    | 3x | 45M  | 491G | 51.2 | \n| Focal-T | ImageNet-1K | [Sparse R-CNN](https://arxiv.org/pdf/2011.12450.pdf)   | 3x | 111M | 196G | 49.0 | \n\n### Semantic Segmentation on [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/)\n\n| Backbone | Pretrain  | Method | Resolution | Iters | #params | FLOPs | mIoU | mIoU (MS) | \n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| Focal-T | ImageNet-1K  | [UPerNet](https://arxiv.org/pdf/1807.10221.pdf) | 512x512 | 160k | 62M  | 998G | 45.8 | 47.0 | \n| Focal-S | ImageNet-1K  | [UPerNet](https://arxiv.org/pdf/1807.10221.pdf) | 512x512 | 160k | 85M | 1130G | 48.0 | 50.0 | \n| Focal-B | ImageNet-1K  | [UPerNet](https://arxiv.org/pdf/1807.10221.pdf) | 512x512 | 160k | 126M | 1354G | 49.0 | 50.5 | \n| Focal-L | ImageNet-22K | [UPerNet](https://arxiv.org/pdf/1807.10221.pdf) | 640x640 | 160k | 240M | 3376G | 54.0 | 55.4 | \n\n## Getting Started\n\n* Please follow [get_started_for_image_classification.md](./classification/get_started.md) to get started with image classification.\n* Please follow [get_started_for_object_detection.md](./detection/get_started.md) to get started with object detection.\n* Please follow [get_started_for_semantic_segmentation.md](./segmentation/get_started.md) to get started with semantic segmentation.\n\n## Citation\n\nIf you find this repo useful for your project, please consider citing it with the following BibTeX entry:\n\n    @misc{yang2021focal,\n        title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, \n        author={Jianwei Yang and Chunyuan Li and Pengchuan 
Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},\n        year={2021},\n        eprint={2107.00641},\n        archivePrefix={arXiv},\n        primaryClass={cs.CV}\n    }\n\n## Acknowledgement\n\nOur codebase is built on [Swin-Transformer](https://github.com/microsoft/Swin-Transformer). We thank the authors for the nicely organized code!\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FFocal-Transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2FFocal-Transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FFocal-Transformer/lists"}