{"id":20305537,"url":"https://github.com/vitae-transformer/qformer","last_synced_at":"2025-05-07T16:10:06.279Z","repository":{"id":149314563,"uuid":"620052669","full_name":"ViTAE-Transformer/QFormer","owner":"ViTAE-Transformer","description":"The official repo for [TPAMI'23] \"Vision Transformer with Quadrangle Attention\"","archived":false,"fork":false,"pushed_at":"2024-04-10T23:27:50.000Z","size":2338,"stargazers_count":211,"open_issues_count":3,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-07T16:09:59.430Z","etag":null,"topics":["attention-mechanism","backbone","classification","deep-learning","object-detection","pose-estimation","semantic-segmentation","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ViTAE-Transformer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-27T23:56:59.000Z","updated_at":"2025-05-06T09:53:16.000Z","dependencies_parsed_at":"2024-04-11T00:31:25.136Z","dependency_job_id":"beb55f46-92c9-422a-b9b8-4cb195fab74a","html_url":"https://github.com/ViTAE-Transformer/QFormer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FQFormer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FQFormer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FQFormer/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ViTAE-Transformer%2FQFormer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ViTAE-Transformer","download_url":"https://codeload.github.com/ViTAE-Transformer/QFormer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252912996,"owners_count":21824066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-mechanism","backbone","classification","deep-learning","object-detection","pose-estimation","semantic-segmentation","vision-transformer"],"created_at":"2024-11-14T17:08:53.298Z","updated_at":"2025-05-07T16:10:06.254Z","avatar_url":"https://github.com/ViTAE-Transformer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e[TPAMI 2023] Vision Transformer with Quadrangle Attention\u003ca href=\"https://arxiv.org/abs/2303.15105\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg\" \u003e\u003c/a\u003e\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n\u003ch4 align=\"center\"\u003eThis is the official repository of the paper \u003ca href=\"https://arxiv.org/abs/2303.15105\"\u003eVision Transformer with Quadrangle Attention\u003c/a\u003e.\u003c/h4\u003e\n\u003ch5 align=\"center\"\u003e\u003cem\u003eQiming Zhang, Jing Zhang, Yufei Xu, and Dacheng Tao\u003c/em\u003e\u003c/h5\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#news\"\u003eNews\u003c/a\u003e |\n  \u003ca 
href=\"#abstract\"\u003eAbstract\u003c/a\u003e |\n  \u003ca href=\"#method\"\u003eMethod\u003c/a\u003e |\n  \u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e |\n  \u003ca href=\"#results\"\u003eResults\u003c/a\u003e |\n  \u003ca href=\"#statement\"\u003eStatement\u003c/a\u003e\n\u003c/p\u003e\n\n# Current applications\n\n\u003e **Classification**: Hierarchical models have been released; plain ones will be released soon.\n\n\u003e **Object Detection**: Will be released soon.\n\n\u003e **Semantic Segmentation**: Will be released soon.\n\n\u003e **Human Pose**: Will be released soon.\n\n\n# News\n***24/01/2024***\n- The code of hierarchical models on classification has been released.\n\n***30/12/2023***\n- The paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), impact factor 24.314.\n\n***27/03/2023***\n- The paper is posted on arXiv! The code will be made publicly available once cleaned up.\n\n# Abstract\n\n\u003cp align=\"left\"\u003eThis repository contains the code, models, and test results for the paper \u003ca href=\"https://arxiv.org/abs/2303.15105\"\u003eVision Transformer with Quadrangle Attention\u003c/a\u003e, which is a substantial extension of our ECCV 2022 paper \u003ca href=\"https://arxiv.org/pdf/2204.08446.pdf\"\u003eVSA\u003c/a\u003e. We extend the window-based attention to a general quadrangle formulation and propose a novel quadrangle attention. We employ an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. 
With minor code modifications and negligible extra computational cost, our QFormer outperforms existing representative (hierarchical and plain) vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation.\n\n# Method\n\u003cfigure\u003e\n\u003cimg src=\"figs/opening.jpg\"\u003e\n\u003cfigcaption align = \"center\"\u003e\u003cb\u003eFig.1 - The comparison of the current design (hand-crafted windows) and quadrangle attention.\u003c/b\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cfigure\u003e\n\u003cimg src=\"figs/pipeline-QA.jpg\"\u003e\n\u003cfigcaption align = \"center\"\u003e\u003cb\u003eFig.2 - The pipeline of our proposed quadrangle attention (QA).\u003c/b\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cfigure\u003e\n\u003cimg src=\"figs/transformation.jpg\"\u003e\n\u003cfigcaption align = \"center\"\u003e\u003cb\u003eFig.3 - The transformation process in quadrangle attention.\u003c/b\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cfigure\u003e\n\u003cimg src=\"figs/model.jpg\"\u003e\n\u003cfigcaption align = \"center\"\u003e\u003cb\u003eFig.4 - The architecture of our plain QFormer\u003csub\u003ep\u003c/sub\u003e (a) and hierarchical QFormer\u003csub\u003eh\u003c/sub\u003e (b).\u003c/b\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n# Usage\n## Requirements\n\n- PyTorch==1.7.1\n- torchvision==0.8.2\n- timm==0.3.2\n\n[Apex](https://github.com/NVIDIA/apex) is optional and can speed up training. 
\n\n```\ngit clone https://github.com/NVIDIA/apex\ncd apex\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n```\n\nOther requirements:\n\n```\npip install opencv-python==4.4.0.46 termcolor==1.1.0 yacs==0.1.8 timm==0.4.9\npip install einops\n```\n\n## Train \u0026 Eval\n\nFor classification on ImageNet-1K, to train from scratch, run:\n\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\\npython -m torch.distributed.launch \\\n  --nnodes ${NNODES} \\\n  --node_rank ${SLURM_NODEID} \\\n  --master_addr ${MHOST} \\\n  --master_port 25901 \\\n  --nproc_per_node 8 \\\n  ./main.py \\\n  --cfg configs/swin/qformer_tiny_patch4_window7_224.yaml \\\n  --data-path ${IMAGE_PATH} \\\n  --batch-size 128 \\\n  --tag 1024-dpr20-coords_lambda1e-1 \\\n  --distributed \\\n  --coords_lambda 1e-1 \\\n  --drop_path_rate 0.2\n```\n\nFor single GPU training, run:\n```\npython ./main.py \\\n  --cfg configs/swin/qformer_tiny_patch4_window7_224.yaml \\\n  --data-path ${IMAGE_PATH} \\\n  --batch-size 128 \\\n  --tag 1024-dpr20-coords_lambda1e-1 \\\n  --coords_lambda 1e-1 \\\n  --drop_path_rate 0.2\n```\n\n\nTo evaluate, run:\n\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\\npython -m torch.distributed.launch \\\n  --nnodes ${NNODES} \\\n  --node_rank ${SLURM_NODEID} \\\n  --master_addr ${MHOST} \\\n  --master_port 25901 \\\n  --nproc_per_node 8 \\\n  ./main.py \\\n  --cfg configs/swin/qformer_tiny_patch4_window7_224.yaml \\\n  --data-path ${IMAGE_PATH} \\\n  --batch-size 128 \\\n  --tag eval \\\n  --distributed \\\n  --resume ${MODEL_PATH} \\\n  --eval\n```\nFor single GPU evaluation, run:\n```\npython ./main.py \\\n  --cfg configs/swin/qformer_tiny_patch4_window7_224.yaml \\\n  --data-path ${IMAGE_PATH} \\\n  --batch-size 128 \\\n  --tag eval \\\n  --resume ${MODEL_PATH} \\\n  --eval\n```\n\n# Results\n# Results on plain models\n\n### Classification results on ImageNet-1K with MAE pretrained models\n| model | resolution | 
acc@1 | Weights \u0026 Logs |\n| :---: | :---: | :---: | :---: |\n| ViT-B + Window attn | 224x224 | 81.2 | \\ | \n| ViT-B + Shifted window | 224x224 | 82.0 | \\ |\n| QFormer\u003csub\u003ep\u003c/sub\u003e-B | 224x224 | 82.9 | Coming soon |\n\n### Detection results on COCO with MAE pretrained models and the Mask RCNN detector, following \u003ca href=\"https://arxiv.org/abs/2203.16527\"\u003eViTDet\u003c/a\u003e\n\n| model | box mAP | mask mAP | Params | Weights \u0026 Logs |\n| :---: | :---: | :---: | :---: | :---: |\n| ViTDet-B | 51.6 | 45.9 | 111M | \\ |\n| QFormer\u003csub\u003ep\u003c/sub\u003e-B | 52.3 | 46.6 | 111M | Coming soon |\n\n### Semantic segmentation results on ADE20k with MAE pretrained models and the UPerNet segmentor\n| model | image size | mIoU | mIoU* | Weights \u0026 Logs |\n| :---: | :---: | :---: | :---: | :---: | \n| ViT-B + window attn | 512x512 | 39.7 | 41.8 | \\ |\n| ViT-B + shifted window attn | 512x512 | 41.6 | 43.6 | \\ |\n| QFormer\u003csub\u003ep\u003c/sub\u003e-B | 512x512 | 43.6 | 45.0 | Coming soon |\n| ViT-B + window attn | 640x640 | 40.2 | 41.5 | \\ |\n| ViT-B + shifted window attn | 640x640 | 42.3 | 43.5 | \\ |\n| QFormer\u003csub\u003ep\u003c/sub\u003e-B | 640x640 | 44.9 | 46.0 | Coming soon |\n\n### Human pose estimation results on COCO with MAE pretrained models, following \u003ca href=\"https://arxiv.org/abs/2204.12484\"\u003eViTPose\u003c/a\u003e\n| attention | model | AP | AP\u003csub\u003e50\u003c/sub\u003e | AR | AR\u003csub\u003e50\u003c/sub\u003e | Weights \u0026 Logs |\n| :---: | :---: | :---: | :---: | :---: |  :---: |  :---: | \n| Window | ViT-B | 66.4 | 87.7 | 72.9 | 91.9 | \\ |\n| Shifted window | ViT-B | 76.4 | 90.9 | 81.6 | 94.5 | \\ |\n| Quadrangle | ViT-B | 77.0 | 90.9 | 82.0 | 94.7| Coming soon |\n| Window + Full | ViT-B | 76.9 | 90.8 | 82.1 | 94.7 | \\ |\n| Shifted window + Full | ViT-B | 77.2 | 90.9 | 82.2 | 94.7 | \\ |\n| Quadrangle + Full | ViT-B | 77.4 | 91.0 | 82.4 | 94.9 | Coming soon |\n\n# Results 
on hierarchical models\n\n### Main Results on ImageNet-1K\n| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Weights \u0026 Logs |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| Swin-T | 224x224 | 81.2 | \\ | \\ | \\ |\n| DW-T | 224x224 | 82.0 | \\ | \\ | \\ |\n| Focal-T | 224x224 | 82.2 | 95.9 | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T | 224x224 | 82.5 | 96.2 | 87.5 | [model](https://1drv.ms/u/s!AimBgYV7JjTlgcoM6jhtunQzDKSzTQ?e=rVap2P) \u0026 [logs](logs/QFormer-T.txt) |\n| Swin-S | 224x224 | 83.2 | 96.2 | \\ | \\ |\n| Focal-S | 224x224 | 83.5 | 96.2 | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-S | 224x224 | 84.0 | 96.8 | 88.6 | [model](https://1drv.ms/u/s!AimBgYV7JjTlgcoN7NjoE1Pza3dU2A?e=XBh2tl) \u0026 [logs](logs/QFormer-S.txt) |\n| Swin-B | 224x224 | 83.4 | 96.5 | \\ | \\ |\n| DW-B | 224x224 | 83.4 | \\ | \\ | \\ |\n| Focal-B | 224x224 | 83.8 | 96.5 | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-B | 224x224 | 84.1 | 96.8 | 88.7 | [model](https://1drv.ms/u/s!AimBgYV7JjTlgcoOX-Wc-CQU_9QDsg?e=xCrdE4) \u0026 [logs](logs/QFormer-B.txt) |\n\n\n## Object Detection Results\n### Mask R-CNN\n\n| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: |\n| Swin-T | ImageNet-1K | 1x | 43.7 | 39.8 | 48M | \\ | \\ | \\ |\n| DAT-T | ImageNet-1K | 1x | 44.4 | 40.4 | 48M | \\ | \\ | \\ |\n| Focal-T | ImageNet-1K | 1x | 44.8 | 41.0 | 49M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T | ImageNet-1K | 1x | 45.9 | 41.5 | 49M | [config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_adamw_1x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_adamw_1x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcp07_aXUvQI1QHbHQ?e=QrE47h) |\n| Swin-T | 
ImageNet-1K | 3x | 46.0 | 41.6 | 48M | \\ | \\ | \\ |\n| DW-T | ImageNet-1K | 3x | 46.7 | 42.4 | 49M | \\ | \\ | \\ |\n| DAT-T | ImageNet-1K | 3x | 47.1 | 42.4 | 48M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T  | ImageNet-1K | 3x | 47.5 | 42.7 | 49M | [config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_adamw_3x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_adamw_3x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcp1T-I8qPj5r0_kGQ?e=gSPRpm) |\n| Swin-S | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | \\ | \\ | \\ |\n| Focal-S | ImageNet-1K | 3x | 48.8 | 43.8 | 71M | \\ | \\ | \\ |\n| DAT-S | ImageNet-1K | 3x | 49.0 | 44.0 | 69M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-S  | ImageNet-1K | 3x | 49.5 | 44.2 | 70M | [config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/mask_rcnn_qformer_small_patch4_window7_mstrain_480-800_adamw_3x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/mask_rcnn_qformer_small_patch4_window7_mstrain_480-800_adamw_3x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcpzjvjlHZgb98ovqA?e=CFrdJA) |\n\n\n### Cascade Mask R-CNN\n\n| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: |\n| Swin-T | ImageNet-1K | 1x | 48.1 | 41.7 | 86M | \\ | \\ | \\ |\n| DAT-T | ImageNet-1K | 1x | 49.1 | 42.5 | 86M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T | ImageNet-1K | 1x | 49.8 | 43.0 | 87M | 
[config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/cascade_mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_1x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/cascade_mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_1x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcsFJx0Df3PlfTeukg?e=BsENkW) |\n| Swin-T | ImageNet-1K | 3x | 50.2 | 43.7 | 86M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T | ImageNet-1K | 3x | 51.4 | 44.7 | 87M | [config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/cascade_mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/cascade_mask_rcnn_qformer_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_1x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcsGNhlt6Fd186OAFw?e=pzhiEt) |\n| Swin-S | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-S | ImageNet-1K | 3x | 52.8 | 45.7 | 108M | [config](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/configs/swin/cascade_mask_rcnn_qformer_small_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py) | [log](https://github.com/RogerZhangzz/qformer-hierarchical-detection/tree/main/logs/cascade_mask_rcnn_qformer_small_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.log) | [onedrive](https://1drv.ms/f/s!AimBgYV7JjTlgcsEvl0J1X2GBdzAKg?e=DdTBw4) |\n\n\n## Semantic Segmentation Results for ADE20k\n### UperNet\n\n| Backbone | Pretrain | Lr Schd | mIoU | mIoU* | #params | config | log | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: |\n| Swin-T | ImageNet-1k | 160k | 44.5 | 45.8 | 60M | \\ | \\ | \\ |\n| DAT-T | ImageNet-1k | 160k | 45.5 | 46.4 | 60M | \\ | \\ | \\ |\n| 
DW-T | ImageNet-1k | 160k | 45.7 | 46.9 | 61M | \\ | \\ | \\ |\n| Focal-T | ImageNet-1k | 160k | 45.8 | 47.0 | 62M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-T | ImageNet-1k | 160k | 46.9 | 48.1 | 61M | Coming soon | Coming soon | Coming soon |\n| Swin-S | ImageNet-1k | 160k | 47.6 | 49.5 | 81M | \\ | \\ | \\ |\n| DAT-S | ImageNet-1k | 160k | 48.3 | 49.8 | 81M | \\ | \\ | \\ |\n| Focal-S | ImageNet-1k | 160k | 48.0 | 50.0 | 61M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-S | ImageNet-1k | 160k | 48.9 | 50.3 | 82M | Coming soon | Coming soon | Coming soon |\n| Swin-B | ImageNet-1k | 160k | 48.1 | 49.7 | 121M | \\ | \\ | \\ |\n| DW-B | ImageNet-1k | 160k | 48.7 | 50.3 | 125M | \\ | \\ | \\ |\n| Focal-B | ImageNet-1k | 160k | 49.0 | 50.5 | 126M | \\ | \\ | \\ |\n| QFormer\u003csub\u003eh\u003c/sub\u003e-B | ImageNet-1k | 160k | 49.5 | 50.6 | 123M | Coming soon | Coming soon | Coming soon |\n\n\n# Statement\nThis project is for research purposes only. For any other questions, please contact [qmzhangzz at hotmail.com](mailto:qmzhangzz@hotmail.com).\n\nThe code base is borrowed from [Swin](https://github.com/microsoft/Swin-Transformer).\n\n## Citing QFormer, VSA and ViTAE\n```\n@article{zhang2023vision,\n  title={Vision Transformer with Quadrangle Attention},\n  author={Zhang, Qiming and Zhang, Jing and Xu, Yufei and Tao, Dacheng},\n  journal={arXiv preprint arXiv:2303.15105},\n  year={2023}\n}\n@inproceedings{zhang2022vsa,\n  title={VSA: learning varied-size window attention in vision transformers},\n  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},\n  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXV},\n  pages={466--483},\n  year={2022},\n  organization={Springer}\n}\n@article{zhang2023vitaev2,\n  title={Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond},\n  author={Zhang, Qiming and Xu, 
Yufei and Zhang, Jing and Tao, Dacheng},\n  journal={International Journal of Computer Vision},\n  pages={1--22},\n  year={2023},\n  publisher={Springer}\n}\n@article{xu2021vitae,\n  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},\n  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},\n  journal={Advances in Neural Information Processing Systems},\n  volume={34},\n  year={2021}\n}\n```\n\n# Our other Transformer works\n\n\u003e **ViTPose**: Please see \u003ca href=\"https://github.com/ViTAE-Transformer/ViTPose\"\u003eBaseline model ViTPose for human pose estimation\u003c/a\u003e;\n\n\u003e **VSA**: Please see \u003ca href=\"https://github.com/ViTAE-Transformer/ViTAE-VSA\"\u003eViTAE-Transformer for Image Classification and Object Detection\u003c/a\u003e;\n\n\u003e **ViTAE \u0026 ViTAEv2**: Please see \u003ca href=\"https://github.com/ViTAE-Transformer/ViTAE-Transformer\"\u003eViTAE-Transformer for Image Classification, Object Detection, and Semantic Segmentation\u003c/a\u003e;\n\n\u003e **Matting**: Please see \u003ca href=\"https://github.com/ViTAE-Transformer/ViTAE-Transformer-Matting\"\u003eViTAE-Transformer for matting\u003c/a\u003e;\n\n\u003e **Remote Sensing**: Please see \u003ca href=\"https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing\"\u003eViTAE-Transformer for Remote Sensing\u003c/a\u003e; \u003ca href=\"https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA\"\u003eAdvancing Plain Vision Transformer Towards Remote Sensing Foundation Model\u003c/a\u003e;\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitae-transformer%2Fqformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvitae-transformer%2Fqformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitae-transformer%2Fqformer/lists"}