{"id":13577543,"url":"https://github.com/microsoft/esvit","last_synced_at":"2025-04-05T19:13:28.432Z","repository":{"id":40955615,"uuid":"382134551","full_name":"microsoft/esvit","owner":"microsoft","description":"EsViT: Efficient self-supervised Vision Transformers","archived":false,"fork":false,"pushed_at":"2023-08-28T01:38:18.000Z","size":1976,"stargazers_count":410,"open_issues_count":15,"forks_count":44,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-03-29T18:07:14.019Z","etag":null,"topics":["self-supervised-learning","vision-transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null}},"created_at":"2021-07-01T19:19:39.000Z","updated_at":"2025-03-11T09:33:41.000Z","dependencies_parsed_at":"2022-09-05T16:41:29.222Z","dependency_job_id":"53290402-c6d8-4c46-972f-f25735b0e591","html_url":"https://github.com/microsoft/esvit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fesvit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fesvit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fesvit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fesvit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/esvit/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247386265,"owners_count":20930619,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["self-supervised-learning","vision-transformers"],"created_at":"2024-08-01T15:01:22.375Z","updated_at":"2025-04-05T19:13:28.401Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"# Efficient Self-Supervised Vision Transformers (EsViT)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/efficient-self-supervised-vision-transformers/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=efficient-self-supervised-vision-transformers)\n\n[[Paper]](https://arxiv.org/abs/2106.09785) [[Slides]](http://chunyuan.li/assets/pdf/esvit_talk_chunyl.pdf)\n\nPyTorch implementation for [EsViT](https://arxiv.org/abs/2106.09785) (accepted in ICLR, 2022), built with two techniques: \n\n- A multi-stage Transformer architecture. Three multi-stage Transformer variants are implemented under the folder [`models`](./models).\n- A non-contrastive region-level matching pre-train task. The region-level matching task is implemented in function `DDINOLoss(nn.Module)`  (Line 648) in [`main_esvit.py`](./main_esvit.py). 
Please use `--use_dense_prediction True`; otherwise only the view-level task is used.\n\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"90%\" alt=\"Efficiency vs accuracy comparison under the linear classification protocol on ImageNet with EsViT\" src=\"./plot/esvit_sota.png\"\u003e\n\u003c/div\u003e\nFigure: Efficiency vs accuracy comparison under the linear classification protocol on ImageNet. Left: Throughput of all SoTA SSL vision systems, circle sizes indicate model parameter counts; Right: performance over varied parameter counts for models with moderate (throughput/#parameters) ratio. Please refer to Section 4.1 for details.\n\n\n## Updates\n\n* [08/19/2022] Organizing ECCV Workshop [*Computer Vision in the Wild (CVinW)*](https://computer-vision-in-the-wild.github.io/eccv-2022/), where two challenges are hosted to evaluate the zero-shot, few-shot and full-shot performance of pre-trained vision models in downstream tasks:\n  - [``*Image Classification in the Wild (ICinW)*''](https://eval.ai/web/challenges/challenge-page/1832/overview) Challenge evaluates on 20 image classification tasks.\n  - [``*Object Detection in the Wild (ODinW)*''](https://eval.ai/web/challenges/challenge-page/1839/overview) Challenge evaluates on 35 object detection tasks.\n\n$\\qquad$ [ \u003cimg src=\"https://computer-vision-in-the-wild.github.io/eccv-2022/static/eccv2022/img/ECCV-logo3.png\" width=10%/\u003e [Workshop]](https://computer-vision-in-the-wild.github.io/eccv-2022/)    $\\qquad$    [\u003cimg src=\"https://evalai.s3.amazonaws.com/media/logos/4e939412-a9c0-46bd-9797-5ba0bd0a9095.jpg\" width=10%/\u003e [IC Challenge] ](https://eval.ai/web/challenges/challenge-page/1832/overview)\n$\\qquad$    [\u003cimg src=\"https://evalai.s3.amazonaws.com/media/logos/3a31ae6e-a990-48fb-b2c3-1e7da9d17a20.jpg\" width=10%/\u003e [OD Challenge] ](https://eval.ai/web/challenges/challenge-page/1839/overview)\n\n\n\n* [06/19/2022] Released the evaluation benchmark used in EsViT. 
It contains 20 downstream image classification tasks. [[ELEVATER Benchmark]](https://computer-vision-in-the-wild.github.io/ELEVATER/) [[Toolkit]](https://github.com/Computer-Vision-in-the-Wild/Elevater_Toolkit_IC)  [[Paper]](https://arxiv.org/abs/2204.08790)\n\n## Pretrained models\nYou can download the full checkpoint (trained with both view-level and region-level tasks, batch size 512, on ImageNet-1K), which contains backbone and projection head weights for both the student and teacher networks. \n\nNote: The data is hosted on Azure Blob Storage; a SAS with Read permission is provided. Please append the following SAS to the end of each link to download: \n```bash\n?sp=r\u0026st=2023-08-28T01:36:35Z\u0026se=3023-08-28T09:36:35Z\u0026sv=2022-11-02\u0026sr=c\u0026sig=coos9vSl4Xk6S6KvqZffkVCUb7Ug%2FFR9cfyc3xacMJI%3D\n```\n\n\n\n\n- EsViT (Swin) with network configurations of increased model capacities, **pre-trained with both view-level and region-level tasks**. ResNet-50 trained with both tasks is shown as a reference.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003earch\u003c/th\u003e\n    \u003cth\u003eparams\u003c/th\u003e\n    \u003cth\u003etasks\u003c/th\u003e\n    \u003cth\u003elinear\u003c/th\u003e\n    \u003cth\u003ek-nn\u003c/th\u003e\n    \u003cth colspan=\"1\"\u003edownload\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003elogs\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eResNet-50\u003c/td\u003e\n    \u003ctd\u003e23M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e75.7%\u003c/td\u003e\n    \u003ctd\u003e71.3%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/resume_from_ckpt0200/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/resume_from_ckpt0200/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/resume_from_ckpt0200/lincls/epoch_last/lr0.01/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/resume_from_ckpt0200/features/epoch0300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e    \n  \u003c/tr\u003e  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-T, W=7)\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e78.0%\u003c/td\u003e\n    \u003ctd\u003e75.7%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/lincls/epoch0300/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/features/epoch0280/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e    \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-S, W=7)\u003c/td\u003e\n    \u003ctd\u003e49M\u003c/td\u003e\n    
\u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e79.5%\u003c/td\u003e\n    \u003ctd\u003e77.7%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/lincls/epoch0300/lr_0.003_n_last_blocks4/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/features/epoch0280/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e   \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-B, W=7)\u003c/td\u003e\n    \u003ctd\u003e87M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e80.4%\u003c/td\u003e\n    \u003ctd\u003e78.9%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/continued_from0200_dense/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/continued_from0200_dense/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/continued_from0200_dense/lincls/epoch0260/lr_0.001_n_last_blocks4/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/continued_from0200_dense/features/epoch0260/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n    \n    \n    \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-T, W=14)\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e78.7%\u003c/td\u003e\n    \u003ctd\u003e77.0%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window14/continued_from0200_dense/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window14/continued_from0200_dense/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window14/continued_from0200_dense/lincls/epoch_last/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window14/continued_from0200_dense/features/epoch0300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n  \u003c/tr\u003e\n  \n\n  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-S, W=14)\u003c/td\u003e\n    \u003ctd\u003e49M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    
\u003ctd\u003e80.8%\u003c/td\u003e\n    \u003ctd\u003e79.1%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14/continued_from0180_dense/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14/continued_from0180_dense/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14/continued_from0180_dense/lincls/epoch0250/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14/continued_from0180_dense/features/epoch0250/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e      \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-B, W=14)\u003c/td\u003e\n    \u003ctd\u003e87M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e81.3%\u003c/td\u003e\n    \u003ctd\u003e79.3%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_nodes4_gpu16_bs8_multicrop_epoch300_dino_aug_window14_lv/continued_from_epoch0200_dense_norm_true/checkpoint_best.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e  \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_nodes4_gpu16_bs8_multicrop_epoch300_dino_aug_window14_lv/continued_from_epoch0200_dense_norm_true/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_nodes4_gpu16_bs8_multicrop_epoch300_dino_aug_window14_lv/continued_from_epoch0200_dense_norm_true/lincls/epoch0240/lr_0.001_n_last_blocks4/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_nodes4_gpu16_bs8_multicrop_epoch300_dino_aug_window14_lv/continued_from_epoch0200_dense_norm_true/features/epoch0240/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n  \u003c/tr\u003e  \n\u003c/table\u003e\n\n\n- EsViT with view-level task only\n\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003earch\u003c/th\u003e\n    \u003cth\u003eparams\u003c/th\u003e\n    \u003cth\u003etasks\u003c/th\u003e\n    \u003cth\u003elinear\u003c/th\u003e\n    \u003cth\u003ek-nn\u003c/th\u003e\n    \u003cth colspan=\"1\"\u003edownload\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003elogs\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eResNet-50\u003c/td\u003e\n    \u003ctd\u003e23M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e75.0%\u003c/td\u003e\n    \u003ctd\u003e69.1%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/lincls/epoch_last/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/resnet/resnet50/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/features/epoch0300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e    \n  \u003c/tr\u003e  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-T, W=7)\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e77.0%\u003c/td\u003e\n    \u003ctd\u003e74.2%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window7/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window7/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window7/lincls/epoch0300/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug_window7/features/epoch0300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e    \n  \u003c/tr\u003e  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-S, W=7)\u003c/td\u003e\n    \u003ctd\u003e49M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e79.2%\u003c/td\u003e\n    \u003ctd\u003e76.9%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/lincls/epoch0300/lr_0.003_n_last_blocks4/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_small/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/features/epoch0280/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e   \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (Swin-B, W=7)\u003c/td\u003e\n    \u003ctd\u003e87M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e79.6%\u003c/td\u003e\n    \u003ctd\u003e77.7%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/lincls/epoch0260/lr_0.001_n_last_blocks4/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_base/bl_lr0.0005_gpu16_bs32_multicrop_epoch300_dino_aug/continued_from0200_dense/features/epoch0260/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n    \n  \u003c/tr\u003e  \n\u003c/table\u003e\n\n\n\n- EsViT (Swin-T, W=7) with different pre-train datasets (view-level task only)\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    
\u003cth\u003earch\u003c/th\u003e\n    \u003cth\u003eparams\u003c/th\u003e\n    \u003cth\u003ebatch size\u003c/th\u003e\n    \u003cth\u003epre-train dataset\u003c/th\u003e\n    \u003cth\u003elinear\u003c/th\u003e\n    \u003cth\u003ek-nn\u003c/th\u003e\n    \u003cth colspan=\"1\"\u003edownload\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003elogs\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003e1024\u003c/td\u003e\n    \u003ctd\u003eImageNet-1K\u003c/td\u003e\n    \u003ctd\u003e77.1%\u003c/td\u003e\n    \u003ctd\u003e73.7%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/lincls/epoch0300/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/features/epoch0280/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e    \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003e1024\u003c/td\u003e\n    \u003ctd\u003eWebVision-v1\u003c/td\u003e\n    \u003ctd\u003e75.4%\u003c/td\u003e  \n    \u003ctd\u003e69.4%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_webvision1_debug/checkpoint.pth\"\u003efull 
ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_webvision1_debug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_webvision1_debug/lincls/epoch_last/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_webvision1_debug/features/epoch0150/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e   \n  \u003c/tr\u003e\n  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003e1024\u003c/td\u003e\n    \u003ctd\u003eOpenImages-v4\u003c/td\u003e \n    \u003ctd\u003e69.6%\u003c/td\u003e   \n    \u003ctd\u003e60.3%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_openimages_v4_debug/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_openimages_v4_debug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_openimages_v4_debug/lincls/epoch_last/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch150_dino_aug_window7_openimages_v4_debug/features/epoch050/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e   \n  \u003c/tr\u003e\n  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003e1024\u003c/td\u003e\n    \u003ctd\u003eImageNet-22K\u003c/td\u003e\n    \u003ctd\u003e73.5%\u003c/td\u003e    \n    \u003ctd\u003e66.1%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug_window7_imagenet22k_debug/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug_window7_imagenet22k_debug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug_window7_imagenet22k_debug/lincls/epoch_last/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/swin/swin_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug_window7_imagenet22k_debug/features/epoch0030/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e   \n  \u003c/tr\u003e  \n\u003c/table\u003e\n\n\n- EsViT with more multi-stage vision Transformer architectures, pre-trained with **V**iew-level and **R**egion-level tasks.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003earch\u003c/th\u003e\n    \u003cth\u003eparams\u003c/th\u003e\n    \u003cth\u003epre-train task\u003c/th\u003e\n    \u003cth\u003elinear\u003c/th\u003e\n    \u003cth\u003ek-nn\u003c/th\u003e\n    \u003cth colspan=\"1\"\u003edownload\u003c/th\u003e\n    \u003cth 
colspan=\"3\"\u003elogs\u003c/th\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (ViL, W=7)\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e77.3%\u003c/td\u003e\n    \u003ctd\u003e73.9%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/vil_mode0/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/vil_mode0/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/lincls/epoch0300/4_last_blocks/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs32_multicrop_epoch300/vil_mode0/features/epoch300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n    \n    \n    \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (ViL, W=7)\u003c/td\u003e\n    \u003ctd\u003e28M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e77.5%\u003c/td\u003e\n    \u003ctd\u003e74.5%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/continued_from0200_dense/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/continued_from0200_dense/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/continued_from0200_dense/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/vil/vil_2262/bl_lr0.0005_gpu16_bs64_multicrop_epoch300/continued_from0200_dense/features/epoch300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n  \u003c/tr\u003e\n  \n\n  \n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (CvT, W=7)\u003c/td\u003e\n    \u003ctd\u003e29M\u003c/td\u003e\n    \u003ctd\u003eV\u003c/td\u003e\n    \u003ctd\u003e77.6%\u003c/td\u003e\n    \u003ctd\u003e74.8%\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/lincls/epoch0300/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/features/epoch300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e      \n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEsViT (CvT, W=7)\u003c/td\u003e\n    \u003ctd\u003e29M\u003c/td\u003e\n    \u003ctd\u003eV+R\u003c/td\u003e\n    \u003ctd\u003e78.5%\u003c/td\u003e\n    \u003ctd\u003e76.7%\u003c/td\u003e\n    \u003ctd\u003e\u003ca 
href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/continued_from0200_dense/checkpoint.pth\"\u003efull ckpt\u003c/a\u003e\u003c/td\u003e  \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/continued_from0200_dense/log.txt\"\u003etrain\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/continued_from0200_dense/lincls/epoch0300/log.txt\"\u003elinear\u003c/a\u003e\u003c/td\u003e \n    \u003ctd\u003e\u003ca href=\"https://chunyleu.blob.core.windows.net/output/ckpts/esvit/cvt/cvt_tiny/bl_lr0.0005_gpu16_bs64_multicrop_epoch300_dino_aug/continued_from0200_dense/features/epoch300/log.txt\"\u003eknn\u003c/a\u003e\u003c/td\u003e       \n  \u003c/tr\u003e  \n\u003c/table\u003e\n\n\n## Pre-training\n\n### One-node training\nTo train on 1 node with 16 GPUs for the Swin-T model:\n```\nPROJ_PATH=your_esvit_project_path\nDATA_PATH=$PROJ_PATH/project/data/imagenet\n\nOUT_PATH=$PROJ_PATH/output/esvit_exp/ssl/swin_tiny_imagenet/\npython -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch swin_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml \n```\n\nThe main training script is [`main_esvit.py`](./main_esvit.py); it runs the training loop and takes the following options (among others) as arguments:\n\n- `--use_dense_prediction`: whether to use the region-matching task in pre-training\n- `--arch`: selects among the sparse self-attention variants of the multi-stage Transformer architecture. 
Example architecture choices for EsViT training include [`swin_tiny`, `swin_small`, `swin_base`, `swin_large`, `cvt_tiny`, `vil_2262`]. The configuration files should be adjusted accordingly; we provide examples below. One may specify the network configuration by editing the `YAML` file under `experiments/imagenet/*/*.yaml`. The default window size is 7; to use a multi-stage architecture with window size 14, please choose YAML files with `window14` in their filenames.\n\n\nTo train on 1 node with 16 GPUs for Convolutional vision Transformer (CvT) models:\n```\npython -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch cvt_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/cvt_v4/s1.yaml\n```\n\nTo train on 1 node with 16 GPUs for Vision Longformer (ViL) models:\n```\npython -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch vil_2262 --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/vil/vil_small/base.yaml MODEL.SPEC.MSVIT.ARCH 'l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0' MODEL.SPEC.MSVIT.MODE 1 MODEL.SPEC.MSVIT.VIL_MODE_SWITCH 0.75\n```\n\n\n### Multi-node training\nTo train on 2 nodes with 16 GPUs each (32 GPUs in total) for the Swin-Small model size:\n```\nOUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14\npython main_evsit_mnodes.py --num_nodes 2 --num_gpus_per_node 16 --data_path $DATA_PATH/train --output_dir $OUT_PATH/continued_from0200_dense --batch_size_per_gpu 16 --arch swin_small 
--zip_mode True --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --cfg experiments/imagenet/swin/swin_small_patch4_window14_224.yaml --use_dense_prediction True --pretrained_weights_ckpt $OUT_PATH/checkpoint0200.pth\n```\n\n## Evaluation\n\n### k-NN and Linear classification on ImageNet\n\n\nTo train a supervised linear classifier on frozen weights on a single node with 4 GPUs, run `eval_linear.py`. To train a k-NN classifier on frozen weights on a single node with 4 GPUs, run `eval_knn.py`. Please specify `--arch`, `--cfg` and `--pretrained_weights` to choose a pre-trained checkpoint. If you want to evaluate the last checkpoint of EsViT with Swin-T, you can run, for example:\n\n\n```\nPROJ_PATH=your_esvit_project_path\nDATA_PATH=$PROJ_PATH/project/data/imagenet\n\nOUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300\nCKPT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint.pth\n\npython -m torch.distributed.launch --nproc_per_node=4 eval_linear.py --data_path $DATA_PATH --output_dir $OUT_PATH/lincls/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --n_last_blocks 4 --num_labels 1000 MODEL.NUM_CLASSES 0\n\npython -m torch.distributed.launch --nproc_per_node=4 eval_knn.py --data_path $DATA_PATH --dump_features $OUT_PATH/features/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml MODEL.NUM_CLASSES 0\n```\n\n\n\n## Analysis/Visualization of correspondence and attention maps\nYou can analyze the learned models by running `python run_analysis.py`. 
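As background, the core matching step behind the correspondence analysis can be sketched as follows (an illustrative snippet of ours, not code from this repository; the function name `match_regions` is hypothetical): for each region-level feature of one augmented view, pick the most similar region of the other view under cosine similarity.

```python
import torch
import torch.nn.functional as F

def match_regions(feat_a, feat_b):
    """For each region feature in view A, return the index of the
    most similar region feature in view B under cosine similarity."""
    a = F.normalize(feat_a, dim=-1)   # (N, D) unit-norm region features, view A
    b = F.normalize(feat_b, dim=-1)   # (M, D) unit-norm region features, view B
    sim = a @ b.t()                   # (N, M) pairwise cosine similarities
    return sim.argmax(dim=-1)         # (N,) best-match indices into view B
```

The repository's `run_analysis.py` handles the full visualization and measurement; this sketch only illustrates the nearest-neighbor matching idea.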
An example of analyzing EsViT (Swin-T) is shown below.\n\nFor an individual image (with path `--image_path $IMG_PATH`), we visualize the attention maps and correspondence of the last layer:\n\n```\npython run_analysis.py --arch swin_tiny --image_path $IMG_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --vis_attention True --vis_correspondence True MODEL.NUM_CLASSES 0 \n```\n\nFor an image dataset (with path `--data_path $DATA_PATH`), we quantitatively measure the correspondence:\n\n```\npython run_analysis.py --arch swin_tiny --data_path $DATA_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --measure_correspondence True MODEL.NUM_CLASSES 0 \n```\n\nFor more examples, please see `scripts/scripts_local/run_analysis.sh`.\n\n## Citation\n\nIf you find this repository useful, please consider giving a star :star: and a citation :beer::\n\n```\n@article{li2021esvit,\n  title={Efficient Self-supervised Vision Transformers for Representation Learning},\n  author={Li, Chunyuan and Yang, Jianwei and Zhang, Pengchuan and Gao, Mei and Xiao, Bin and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},\n  journal={International Conference on Learning Representations (ICLR)},\n  year={2022}\n}\n```\n\n#### Related Projects/Codebase\n\n[[Swin Transformers](https://github.com/microsoft/Swin-Transformer)]  [[Vision Longformer](https://github.com/microsoft/vision-longformer)]  [[Convolutional vision Transformers (CvT)](https://github.com/microsoft/CvT)]  [[Focal Transformers](https://github.com/microsoft/Focal-Transformer)]\n\n#### Acknowledgement\nOur implementation is built partly upon the following packages: [[Dino](https://github.com/facebookresearch/dino)]  [[Timm](https://github.com/rwightman/pytorch-image-models)]\n\n\n\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  
Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fesvit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fesvit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fesvit/lists"}