{"id":13706307,"url":"https://github.com/swintransformer/ait","last_synced_at":"2025-04-11T21:04:00.381Z","repository":{"id":65305728,"uuid":"585632173","full_name":"SwinTransformer/AiT","owner":"SwinTransformer","description":null,"archived":false,"fork":false,"pushed_at":"2023-06-30T02:08:08.000Z","size":1680,"stargazers_count":104,"open_issues_count":10,"forks_count":8,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-11T21:03:33.750Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SwinTransformer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-05T17:05:29.000Z","updated_at":"2025-02-12T08:13:48.000Z","dependencies_parsed_at":"2024-11-13T14:45:56.401Z","dependency_job_id":null,"html_url":"https://github.com/SwinTransformer/AiT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FAiT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FAiT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FAiT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SwinTransformer%2FAiT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SwinTransformer","download_url":"https://codeload.github.com/SwinTransformer/AiT/tar.gz/refs/heads/master","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":248480437,"owners_count":21110937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T22:00:54.000Z","updated_at":"2025-04-11T21:03:59.876Z","avatar_url":"https://github.com/SwinTransformer.png","language":"Python","funding_links":[],"categories":["Papers"],"sub_categories":[],"readme":"# All in Tokens: Unifying Output Space of Visual Tasks via Soft Token\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-tokens-unifying-output-space-of-visual/monocular-depth-estimation-on-nyu-depth-v2)](https://paperswithcode.com/sota/monocular-depth-estimation-on-nyu-depth-v2?p=all-in-tokens-unifying-output-space-of-visual)\n\nBy [Jia Ning](https://scholar.google.com/citations?user=hW0AexsAAAAJ\u0026hl=en)\\*, [Chen Li](https://github.com/LC-Edward)\\*, [Zheng Zhang*](https://stupidzz.github.io/), [Zigang Geng](https://scholar.google.com/citations?user=MdFYVoAAAAAJ\u0026hl=zh-CN), [Qi Dai](https://scholar.google.com/citations?user=NSJY12IAAAAJ), [Kun He](https://scholar.google.com/citations?user=YTQnGJsAAAAJ\u0026hl=en), [Han Hu](https://ancientmooner.github.io/)\n\n## Introduction\n**AiT** is initially described in [arxiv](https://arxiv.org/pdf/2301.02229.pdf), which is a framework to unify the output space of visual tasks. We demonstrate a single unified model that simultaneously handles two typical visual tasks of instance segmentation and depth estimation, which have discrete/fixed-length and continuous/varied-length outputs, respectively. 
We propose several new techniques that take into account the particularity of visual tasks: 1) Soft tokens. We employ soft tokens to represent the task output. Unlike hard tokens in the common VQ-VAE which are assigned one-hot to discrete codebooks/vocabularies, the soft tokens are assigned softly to the codebook embeddings. Soft tokens can improve the accuracy of both the next token inference and decoding the task output; 2) Mask augmentation. Many visual tasks have corruption, undefined or invalid values in label annotations, i.e., occluded area of depth maps. We show that a mask augmentation technique can greatly benefit these tasks. With these new techniques and other designs, we show that the proposed general-purpose task solver can perform both instance segmentation and depth estimation well. Particularly, we achieve 0.275 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark.\n\n![teaser](figures/teaser.png)\n\n## Results and Models\n### Results on COCO instance segmentation\n| \u003cdiv style=\"width: 100pt\"\u003e Model | Box AP| Mask AP| VQ-VAE Model | Task-Solver Model|\n|:-------------------:|:-------:|:-------:|:-------:|:-------:|\n| [AiT(SwinV2-B)](ait/configs/swinv2b_640reso_inssegonly.py) | 43.3 | 34.2 | [vqvae_insseg.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_insseg.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) | [model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_insseg_swinv2b.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D)|\n| [AiT(SwinV2-B) w/o soft token](ait/configs/swinv2b_640reso_inssegonly_wosoft.py) | 43.6 | 31.1(-3.1) | 
[vqvae_insseg.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_insseg.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) | [model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_insseg_swinv2b_wosoft.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n\n\n### Results on NYUv2 depth estimation\n| \u003cdiv style=\"width: 100pt\"\u003e Model\u003c/div\u003e | D1 | D2 | D3 | Abs Rel | RMSE | Log10 | VQ-VAE \u003cbr\u003e Model | Task-Solver \u003cbr\u003e Model |\n|:-------------------:|:-------:|:-------:|:--------:|:--------:|:--------:|:-------:|:-------:|:-------:|\n| [AiT(SwinV2-B)](ait/configs/swinv2b_480reso_depthonly.py) | 0.934 | 0.991 | 0.998 | 0.087 | 0.305 | 0.037 | [vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |[model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_depth_swinv2b_ar.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n| [AiT-P(SwinV2-B)](ait/configs/swinv2b_480reso_parallel_depthonly.py) | 0.940 | 0.992 | 0.998 | 0.085 | 0.301 | 0.036 | [vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) | 
[model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_depth_swinv2b_parallel.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n| [AiT(SwinV2-B) w/o soft token](ait/configs/swinv2b_480reso_depthonly_wosoft.py) | 0.932 | 0.991 | 0.998 | 0.089 | 0.318 | 0.038 | [vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) | [model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_depth_swinv2b_ar_wosoft.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n| [AiT(SwinV2-L)](ait/configs/swinv2l_480reso_depthonly.py) | 0.949 | 0.993 | 0.999 | 0.079 | 0.284 | 0.034 | [vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |[model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_depth_swinv2l_ar.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n| [AiT-P(SwinV2-L)](ait/configs/swinv2l_480reso_parallel_depthonly.py) | 0.954 | 0.994 | 0.999 | 0.076 | 0.275 | 0.033 | 
[vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) | [model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_depth_swinv2l_parallel.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) |\n\n### Joint training results on COCO and NYUv2\n| \u003cdiv style=\"width: 100pt\"\u003e Model\u003c/div\u003e | Box AP| Mask AP| RMSE | VQ-VAE Model | Task-Solver \u003cbr\u003e Model |\n|:-------------------:|:-------:|:-------:|:-------:|:-------:|:-------:|\n| [AiT(SwinV2-B)](ait/configs/swinv2b_640reso_joint.py) | 42.2 | 34.1 | 0.310 | [vqvae_depth.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_depth.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D)/[vqvae_insseg.pt](https://msravcghub.blob.core.windows.net/ait-release/vae/vqvae_insseg.pt?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D)  | [model](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_joint_swinv2b.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D)|\n\n\n## Usage\n\n### Installation\nWe recommend using pytorch\u003e=1.10, other packages can be found in requirements.txt. 
To install boundary-iou-api, use the following command:\n```bash\ngit clone https://github.com/bowenc0221/boundary-iou-api \u0026\u0026 cd boundary-iou-api \u0026\u0026 pip install -e .\n```\n\n### Data/Pre-trained Model Preparation\n1. Download the [NYU Depth V2](https://github.com/vinvino02/GLPDepth) dataset, the [COCO](https://cocodataset.org/#download) dataset, and our preprocessed box-cropped binary instance masks, named [maskcoco](https://msravcghub.blob.core.windows.net/ait-release/data/maskcoco.tar?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D), then organize the data according to the following directory structure:\n\n```plain\nAiT\n├── ait\n├── vae\n├── data\n│   ├── coco\n│   │   ├── annotations\n│   │   ├── train2017\n│   │   ├── val2017\n│   │   ├── test2017\n│   ├── maskcoco\n│   ├── nyu_depth_v2\n```\n2. Create the data links using the following commands:\n\n```bash\n# Run from the AiT root; a relative target is resolved from the directory containing the link.\nln -s ../data ait/data\nln -s ../data vae/data\n```\n\n3. Download the pre-trained backbone models [swin_v2_base_densesimmim.pth](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/swin_v2_base_densesimmim.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D) and [swin_v2_large_densesimmim.pth](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/swin_v2_large_densesimmim.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D). 
\n\n\n### Training\n#### Training VQ-VAE on depth estimation\n```bash\ncd vae\npython -m torch.distributed.launch --nproc_per_node=${N_GPUS} train_depth_vqvae_dist.py configs/depth/ait_depth_vqvae.py --cfg-options \u003ccustom-configs\u003e\n```\n#### Training VQ-VAE on instance segmentation\n\n```bash\ncd vae\npython -m torch.distributed.launch --nproc_per_node=${N_GPUS} train_insseg_vqvae_dist.py configs/insseg/ait_insseg_vqvae.py --cfg-options \u003ccustom-configs\u003e\n```\n\n#### Training task-solver on depth estimation\n```bash\ncd ait\n\n# Train auto-regressive model\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_480reso_depthonly.py --cfg-options model.backbone.init_cfg.checkpoint=swin_v2_base_densesimmim.pth model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt # for AR training\n\n# Train parallel model\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_480reso_parallel_depthonly.py --cfg-options model.backbone.init_cfg.checkpoint=swin_v2_base_densesimmim.pth model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt # for parallel training\n```\n\n#### Training task-solver on object detection\n```bash\ncd ait\npython -m torch.distributed.launch --nproc_per_node=16 --nnodes=2 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} code/train.py configs/swinv2b_640reso_detonly.py --cfg-options model.backbone.init_cfg.checkpoint=swin_v2_base_densesimmim.pth\n```\n\n**Note:** To save training cost, we initialize the instance segmentation and joint-training models from a pre-trained object detection model. Please download it ([ait_det_swinv2b_wodec.pth](https://msravcghub.blob.core.windows.net/ait-release/checkpoint/ait_det_swinv2b_wodec.pth?sv=2021-10-04\u0026spr=https%2Chttp\u0026st=2023-06-30T01%3A47%3A00Z\u0026se=2026-01-01T01%3A47%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=bwb8Tpfpk2FfZxsilGa4Oc5vKEZiifZK4xs%2F6RuWF9E%3D)) 
before training on instance segmentation or in the joint training setting.\n\n#### Training task-solver on instance segmentation\n```bash\npython -m torch.distributed.launch --nproc_per_node=16 code/train.py configs/swinv2b_640reso_inssegonly.py --cfg-options model.backbone.init_cfg.checkpoint=swin_v2_base_densesimmim.pth model.task_heads.insseg.vae_cfg.pretrained=vqvae_insseg.pt load_from=ait_det_swinv2b_wodec.pth\n```\n\n#### Joint training on instance segmentation and depth estimation\n```bash\npython -m torch.distributed.launch --nproc_per_node=16 --nnodes=4 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} code/train.py configs/swinv2b_640reso_joint.py --cfg-options model.backbone.init_cfg.checkpoint=swin_v2_base_densesimmim.pth model.task_heads.insseg.vae_cfg.pretrained=vqvae_insseg.pt model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt load_from=ait_det_swinv2b_wodec.pth\n```\n\n### Inference\n#### Evaluate on depth estimation\n```bash\ncd ait\n\n# Evaluating auto-regressive model\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_480reso_depthonly.py --cfg-options model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt --eval \u003cmodel_checkpoint\u003e\n\n# Evaluating parallel model\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_480reso_parallel_depthonly.py --cfg-options model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt --eval \u003cmodel_checkpoint\u003e\n```\n\n#### Evaluate on instance segmentation\n```bash\ncd ait\n\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_640reso_inssegonly.py --cfg-options model.task_heads.insseg.vae_cfg.pretrained=vqvae_insseg.pt --eval \u003cmodel_checkpoint\u003e\n```\n\n#### Evaluate on both depth estimation and instance segmentation\n```bash\ncd ait\n\npython -m torch.distributed.launch --nproc_per_node=8 code/train.py configs/swinv2b_640reso_joint.py --cfg-options 
model.task_heads.insseg.vae_cfg.pretrained=vqvae_insseg.pt model.task_heads.depth.vae_cfg.pretrained=vqvae_depth.pt --eval \u003cmodel_checkpoint\u003e\n```\n\n\n## Citation\n```\n@article{ning2023all,\n  title={All in Tokens: Unifying Output Space of Visual Tasks via Soft Token},\n  author={Ning, Jia and Li, Chen and Zhang, Zheng and Geng, Zigang and Dai, Qi and He, Kun and Hu, Han},\n  journal={arXiv preprint arXiv:2301.02229},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswintransformer%2Fait","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswintransformer%2Fait","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswintransformer%2Fait/lists"}