{"id":13486268,"url":"https://github.com/SwinTransformer/Video-Swin-Transformer","last_synced_at":"2025-03-27T20:32:59.039Z","repository":{"id":39332707,"uuid":"380026780","full_name":"SwinTransformer/Video-Swin-Transformer","owner":"SwinTransformer","description":"This is an official implementation for \"Video Swin Transformers\".","archived":false,"fork":true,"pushed_at":"2023-03-08T07:48:46.000Z","size":43027,"stargazers_count":1415,"open_issues_count":69,"forks_count":198,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-09-27T04:01:49.918Z","etag":null,"topics":["swin-transformer","video-recognition"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2106.13230","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"open-mmlab/mmaction2","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SwinTransformer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-24T19:09:41.000Z","updated_at":"2024-09-27T03:25:56.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/SwinTransformer/Video-Swin-Transformer","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FVideo-Swin-Transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FVideo-Swin-Transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FVideo-Swin-Transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransfor
mer%2FVideo-Swin-Transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SwinTransformer","download_url":"https://codeload.github.com/SwinTransformer/Video-Swin-Transformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222313893,"owners_count":16965407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["swin-transformer","video-recognition"],"created_at":"2024-07-31T18:00:42.870Z","updated_at":"2024-10-30T21:31:29.048Z","avatar_url":"https://github.com/SwinTransformer.png","language":"Python","funding_links":[],"categories":["Uncategorized","Frameworks and Libraries"],"sub_categories":["Uncategorized","Video Action Recognition"],"readme":"# Video Swin Transformer\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=video-swin-transformer)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=video-swin-transformer)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-swin-transformer/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=video-swin-transformer)\n\nBy [Ze Liu](https://github.com/zeliu98/)\\*, [Jia Ning](https://github.com/hust-nj)\\*, [Yue 
Cao](http://yue-cao.me),  [Yixuan Wei](https://github.com/weiyx16), [Zheng Zhang](https://stupidzz.github.io/), [Stephen Lin](https://scholar.google.com/citations?user=c3PYmxUAAAAJ\u0026hl=en) and [Han Hu](https://ancientmooner.github.io/).\n\nThis repo is the official implementation of [\"Video Swin Transformer\"](https://arxiv.org/abs/2106.13230). It is based on [mmaction2](https://github.com/open-mmlab/mmaction2).\n\n## Updates\n\n***06/25/2021*** Initial commits\n\n## Introduction\n\n**Video Swin Transformer** is initially described in [\"Video Swin Transformer\"](https://arxiv.org/abs/2106.13230), which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. 
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (`84.9` top-1 accuracy on Kinetics-400 and `86.1` top-1 accuracy on Kinetics-600 with `~20x` less pre-training data and `~3x` smaller model size) and temporal modeling (`69.6` top-1 accuracy on Something-Something v2).\n\n\n![teaser](figures/teaser.png)\n\n## Results and Models\n\n### Kinetics 400\n\n| Backbone |  Pretrain   | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n|  Swin-T  | ImageNet-1K |  30ep   |     224      |  78.8  |  93.6  |   28M   |  87.9G  |  [config](configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py)  | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_tiny_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1mIqRzk8RILeRsP2KB5T6fg) |\n|  Swin-S  | ImageNet-1K |  30ep   |     224      |  80.6  |  94.5  |   50M   |  165.9G  |  [config](configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py)   | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_small_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1imq7LFNtSu3VkcRjd04D4Q) |\n|  Swin-B  | ImageNet-1K |  30ep   |     224      |  80.6  |  94.6  |   88M   |  281.6G  |  [config](configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py)   | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics400_1k.pth)/[baidu](https://pan.baidu.com/s/1bD2lxGxqIV7xECr1n2slng) |\n|  Swin-B  | ImageNet-22K |  30ep   |     224      |  82.7  |  95.5  |   88M   |  281.6G  |  [config](configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py)   | 
[github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics400_22k.pth)/[baidu](https://pan.baidu.com/s/1CcCNzJAIud4niNPcREbDbQ) |\n\n### Kinetics 600\n\n| Backbone |  Pretrain   | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n|  Swin-B  | ImageNet-22K |  30ep   |     224      |  84.0  |  96.5  |   88M   |  281.6G  |  [config](configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py)   | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics600_22k.pth)/[baidu](https://pan.baidu.com/s/1ZMeW6ylELTje-o3MiaZ-MQ) |\n\n### Something-Something V2\n\n| Backbone |  Pretrain   | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n|  Swin-B  | Kinetics 400 |  60ep  |     224      |  69.6  |  92.7  |   89M   |  320.6G  |  [config](configs/recognition/swin/swin_base_patch244_window1677_sthv2.py)   | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window1677_sthv2.pth)/[baidu](https://pan.baidu.com/s/18MOGf6L3LeUjrLoQEeA52Q) |\n\n**Notes**:\n\n- **Pre-trained image models can be downloaded from [Swin Transformer for ImageNet Classification](https://github.com/microsoft/Swin-Transformer)**.\n- The pre-trained model of SSv2 could be downloaded at [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window1677_kinetics400_22k.pth)/[baidu](https://pan.baidu.com/s/1ZnJuX7-x2BflDKHpuvdLUg).\n- Access code for baidu is `swin`.\n\n## Usage\n\n###  Installation\n\nPlease refer to [install.md](docs/install.md) for installation.\n\nWe also provide docker file [cuda10.1](docker/docker_10.1) ([image 
url](https://hub.docker.com/layers/ninja0/mmdet/pytorch1.7.1-py37-cuda10.1-openmpi-mmcv1.3.3-apex-timm/images/sha256-06d745934cb255e7fdf4fa55c47b192c81107414dfb3d0bc87481ace50faf90b?context=repo)) and [cuda11.0](docker/docker_11.0) ([image url](https://hub.docker.com/layers/ninja0/mmdet/pytorch1.7.1-py37-cuda11.0-openmpi-mmcv1.3.3-apex-timm/images/sha256-79ec3ec5796ca154a66d85c50af5fa870fcbc48357c35ee8b612519512f92828?context=repo)) for convenient usage.\n\n###  Data Preparation\n\nPlease refer to [data_preparation.md](docs/data_preparation.md) for a general overview of data preparation.\nThe supported datasets are listed in [supported_datasets.md](docs/supported_datasets.md).\n\nWe also share our Kinetics-400 annotation files, [k400_val](https://github.com/SwinTransformer/storage/releases/download/v1.0.6/k400_val.txt) and [k400_train](https://github.com/SwinTransformer/storage/releases/download/v1.0.6/k400_train.txt), for better comparison.\n\n### Inference\n```\n# single-gpu testing\npython tools/test.py \u003cCONFIG_FILE\u003e \u003cCHECKPOINT_FILE\u003e --eval top_k_accuracy\n\n# multi-gpu testing\nbash tools/dist_test.sh \u003cCONFIG_FILE\u003e \u003cCHECKPOINT_FILE\u003e \u003cGPU_NUM\u003e --eval top_k_accuracy\n```\n\n### Training\n\nTo train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:\n```\n# single-gpu training\npython tools/train.py \u003cCONFIG_FILE\u003e --cfg-options model.backbone.pretrained=\u003cPRETRAIN_MODEL\u003e [model.backbone.use_checkpoint=True] [other optional arguments]\n\n# multi-gpu training\nbash tools/dist_train.sh \u003cCONFIG_FILE\u003e \u003cGPU_NUM\u003e --cfg-options model.backbone.pretrained=\u003cPRETRAIN_MODEL\u003e [model.backbone.use_checkpoint=True] [other optional arguments]\n```\nFor example, to train a `Swin-T` model on the Kinetics-400 dataset with 8 GPUs, run:\n```\nbash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 
--cfg-options model.backbone.pretrained=\u003cPRETRAIN_MODEL\u003e \n```\n\nTo train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:\n```\n# single-gpu training\npython tools/train.py \u003cCONFIG_FILE\u003e --cfg-options load_from=\u003cPRETRAIN_MODEL\u003e [model.backbone.use_checkpoint=True] [other optional arguments]\n\n# multi-gpu training\nbash tools/dist_train.sh \u003cCONFIG_FILE\u003e \u003cGPU_NUM\u003e --cfg-options load_from=\u003cPRETRAIN_MODEL\u003e [model.backbone.use_checkpoint=True] [other optional arguments]\n```\nFor example, to train a `Swin-B` model on the SSv2 dataset with 8 GPUs, run:\n```\nbash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=\u003cPRETRAIN_MODEL\u003e\n```\n\n**Note:** `use_checkpoint` is used to save GPU memory. Please refer to [this page](https://pytorch.org/docs/stable/checkpoint.html) for more details.\n\n\n### Apex (optional):\nWe use apex for mixed precision training by default. 
To install apex, use our provided docker or run:\n```\ngit clone https://github.com/NVIDIA/apex\ncd apex\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n```\nIf you would like to disable apex, comment out the following code block in the [configuration files](configs/recognition/swin):\n```\n# do not use mmcv version fp16\nfp16 = None\noptimizer_config = dict(\n    type=\"DistOptimizerHook\",\n    update_interval=1,\n    grad_clip=None,\n    coalesce=True,\n    bucket_size_mb=-1,\n    use_fp16=True,\n)\n```\n\n## Citation\nIf you find our work useful in your research, please cite:\n\n```\n@article{liu2021video,\n  title={Video Swin Transformer},\n  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},\n  journal={arXiv preprint arXiv:2106.13230},\n  year={2021}\n}\n\n@article{liu2021Swin,\n  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},\n  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},\n  journal={arXiv preprint arXiv:2103.14030},\n  year={2021}\n}\n```\n\n## Other Links\n\n\u003e **Image Classification**: See [Swin Transformer for Image Classification](https://github.com/microsoft/Swin-Transformer).\n\n\u003e **Object Detection**: See [Swin Transformer for Object Detection](https://github.com/SwinTransformer/Swin-Transformer-Object-Detection).\n\n\u003e **Semantic Segmentation**: See [Swin Transformer for Semantic Segmentation](https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation).\n\n\u003e **Self-Supervised Learning**: See [MoBY with Swin 
Transformer](https://github.com/SwinTransformer/Transformer-SSL).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSwinTransformer%2FVideo-Swin-Transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSwinTransformer%2FVideo-Swin-Transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSwinTransformer%2FVideo-Swin-Transformer/lists"}