{"id":19217000,"url":"https://github.com/hustvl/tevit","last_synced_at":"2025-04-13T06:37:31.653Z","repository":{"id":41103604,"uuid":"472185849","full_name":"hustvl/TeViT","owner":"hustvl","description":"Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral","archived":false,"fork":false,"pushed_at":"2023-03-04T01:13:21.000Z","size":59203,"stargazers_count":239,"open_issues_count":10,"forks_count":18,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-03T20:37:37.916Z","etag":null,"topics":["instance-segmentation","video-instance-segmentation","video-understanding"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2204.08412","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hustvl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-03-21T04:27:53.000Z","updated_at":"2024-12-12T16:57:54.000Z","dependencies_parsed_at":"2024-01-14T04:06:37.618Z","dependency_job_id":null,"html_url":"https://github.com/hustvl/TeViT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hustvl%2FTeViT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hustvl%2FTeViT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hustvl%2FTeViT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hustvl%2FTeViT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/hustvl","download_url":"https://codeload.github.com/hustvl/TeViT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248675351,"owners_count":21143763,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["instance-segmentation","video-instance-segmentation","video-understanding"],"created_at":"2024-11-09T14:19:48.763Z","updated_at":"2025-04-13T06:37:31.624Z","avatar_url":"https://github.com/hustvl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Temporally Efficient Vision Transformer for Video Instance Segmentation\n\n\u003e [**Temporally Efficient Vision Transformer for Video Instance Segmentation**](https://arxiv.org/abs/2204.08412) (CVPR 2022, Oral)\n\u003e\n\u003e by [Shusheng Yang](https://bit.ly/shushengyang_googlescholar)\u003csup\u003e1,3\u003c/sup\u003e, [Xinggang Wang](https://xinggangw.info/)\u003csup\u003e1 :email:\u003c/sup\u003e, [Yu Li](https://yu-li.github.io/)\u003csup\u003e4\u003c/sup\u003e, [Yuxin Fang](https://bit.ly/YuxinFang_GoogleScholar)\u003csup\u003e1\u003c/sup\u003e, [Jiemin Fang](https://jaminfong.cn/)\u003csup\u003e1,2\u003c/sup\u003e, [Wenyu Liu](http://eic.hust.edu.cn/professor/liuwenyu/)\u003csup\u003e1\u003c/sup\u003e, [Xun Zhao](https://scholar.google.com.hk/citations?user=KF-uZFYAAAAJ)\u003csup\u003e3\u003c/sup\u003e, [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ\u0026hl=en)\u003csup\u003e3\u003c/sup\u003e.\n\u003e \n\u003e \u003csup\u003e1\u003c/sup\u003e [School of EIC, 
HUST](http://eic.hust.edu.cn/English/Home.htm), \u003csup\u003e2\u003c/sup\u003e [AIA, HUST](http://english.aia.hust.edu.cn/), \u003csup\u003e3\u003c/sup\u003e [ARC Lab, Tencent PCG](https://arc.tencent.com/en/index), \u003csup\u003e4\u003c/sup\u003e [IDEA](https://idea.edu.cn/en).\n\u003e\n\u003e (\u003csup\u003e:email:\u003c/sup\u003e) corresponding author.\n\u003e\n\n\u003cimg src=\"resources/gif/0b97736357.gif\" width=\"33%\"/\u003e\u003cimg src=\"resources/gif/00f88c4f0a.gif\" width=\"33%\"/\u003e\u003cimg src=\"resources/gif/2e21c7e59b.gif\" width=\"33%\"/\u003e\n\u003cimg src=\"resources/gif/4b1a561480.gif\" width=\"33%\"/\u003e\u003cimg src=\"resources/gif/49fcb27427.gif\" width=\"33%\"/\u003e\u003cimg src=\"resources/gif/91eb6cb6dc.gif\" width=\"33%\"/\u003e\n\u003c/br\u003e\n\n* This repo provides code, models, and training/inference recipes for **TeViT** (Temporally Efficient Vision Transformer for Video Instance Segmentation).\n* TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon query-based instance segmentation methods, i.e., `QueryInst`.\n* We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. 
These two designs fully utilize both frame-level and instance-level temporal context information and obtain strong temporal modeling capacity with negligible extra computational cost.\n\n\u003c/br\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"100%\" alt=\"Overall Arch\" src=\"resources/tevit.png\"\u003e\n\u003c/div\u003e\n\n\u003c!-- \u003c/br\u003e --\u003e\n\n\u003c!-- \u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"90%\" alt=\"Overall Arch\" src=\"resources/tevit_vis.png\"\u003e\n\u003c/div\u003e --\u003e\n\n\u003c!-- \u003c/br\u003e --\u003e\n\n## Models and Main Results\n\n* We provide both checkpoints and codalab server submissions on the `YouTube-VIS-2019` dataset.\n\nName | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission\n--- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:\n[TeViT_MsgShifT](configs/tevit/tevit_msgshift.py) | 46.3 | 70.6 | 50.9 | 45.2 | 54.3 | 161.83 M | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_msgshift.pth) | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_msgshift.zip)\n[TeViT_MsgShifT_MST](configs/tevit/tevit_msgshift_mstrain.py) | 46.9 | 70.1 | 52.9 | 45.0 | 53.4 | 161.83 M | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_msgshift_mstrain.pth) | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_msgshift_mstrain.zip)\n* Because training is unstable, we conducted multiple runs; each checkpoint above is the best one among its runs. 
The average performance is reported in our paper.\n* Besides the basic models, we also provide TeViT with `ResNet-50` and `Swin-L` backbones; these models are also trained on the `YouTube-VIS-2019` dataset.\n* MST denotes multi-scale training.\n\nName | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission\n--- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:\n[TeViT_R50](configs/tevit/tevit_r50.py) | 42.1 | 67.8 | 44.8 | 41.3 | 49.9 | 172.3 M | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_r50.pth) | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_r50.zip)\n[TeViT_Swin-L_MST](configs/tevit/tevit_swin-l_mstrain.py) | 56.8 | 80.6 | 63.1 | 52.0 | 63.3 | 343.86 M | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_swin-l_mstrain.pth) | [link](https://github.com/hustvl/Storage/releases/download/v1.1.0/tevit_swin-l_mstrain.zip)\n\n* Due to backbone limitations, TeViT models with `ResNet-50` and `Swin-L` backbones use the `STQI Head` only (i.e., without our proposed `messenger shift mechanism`).\n* With `Swin-L` as the backbone network, we apply more instance queries (i.e., from 100 to 300) and stronger data augmentation strategies. 
Both of them can further boost the final performance.\n\n## Installation\n\n### Prerequisites\n\n* Linux\n* Python 3.7+\n* CUDA 10.2+\n* GCC 5+\n\n### Prepare\n\n* Clone the repository locally:\n\n```bash\ngit clone https://github.com/hustvl/TeViT.git\n```\n\n* Create a conda virtual environment and activate it:\n```bash\nconda create --name tevit python=3.7.7\nconda activate tevit\n```\n\n* Install the YTVOS version of the API from [youtubevos/cocoapi](https://github.com/youtubevos/cocoapi):\n```\npip install \"git+https://github.com/youtubevos/cocoapi.git#egg=pycocotools\u0026subdirectory=PythonAPI\"\n```\n\n* Install Python requirements\n```\ntorch==1.9.0\ntorchvision==0.10.0\nmmcv==1.4.8\npip install -r requirements.txt\n```\n\n* Please follow [Docs](https://mmdetection.readthedocs.io/en/v2.21.0/get_started.html) to install `MMDetection`\n```bash\npython setup.py develop\n```\n\n* Download the ```YouTube-VIS 2019``` dataset from [here](https://youtube-vos.org/dataset/vis/), and organize it as follows:\n```\nTeViT\n├── data\n│   ├── youtubevis\n│   │   ├── train\n│   │   │   ├── 003234408d\n│   │   │   ├── ...\n│   │   ├── val\n│   │   │   ├── ...\n│   │   ├── annotations\n│   │   │   ├── train.json\n│   │   │   ├── valid.json\n```\n\n## Inference\n\n```bash\npython tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT\n```\nAfter inference, the predicted results are stored in ```results.json```; submit this file to the [evaluation server](https://competitions.codalab.org/competitions/20128) to get the final performance.\n\n## Training\n\n* Download the COCO pretrained `QueryInst` with PVT-B1 backbone from [here](https://github.com/hustvl/Storage/releases/download/v1.1.0/queryinst_pvtv2-b1_fpn_mstrain_480-800_3x_coco.pth).\n* Train TeViT with 8 GPUs:\n```bash\n./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT\n```\n* Train TeViT with multi-scale data 
augmentation:\n```bash\n./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT\n```\n* The whole training process takes about **three hours** with 8 Tesla V100 GPUs.\n* To train TeViT with `ResNet-50` or `Swin-L` backbone, please download the COCO pretrained weights from [`QueryInst`](https://github.com/hustvl/QueryInst).\n\n## Acknowledgement :heart:\n\nThis code is mainly based on [```mmdetection```](https://github.com/open-mmlab/mmdetection) and [```QueryInst```](https://github.com/hustvl/QueryInst). Thanks for their awesome work and great contributions to the computer vision community!\n\n## Citation\n\nIf you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :\n\n```BibTeX\n@inproceedings{yang2022tevit,\n  title={Temporally Efficient Vision Transformer for Video Instance Segmentation},\n  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},\n  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},\n  year      =   {2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhustvl%2Ftevit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhustvl%2Ftevit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhustvl%2Ftevit/lists"}