{"id":13563890,"url":"https://github.com/NVlabs/GroupViT","last_synced_at":"2025-04-03T20:32:21.523Z","repository":{"id":37648680,"uuid":"470779379","full_name":"NVlabs/GroupViT","owner":"NVlabs","description":"Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.","archived":false,"fork":false,"pushed_at":"2022-05-10T15:39:52.000Z","size":8429,"stargazers_count":713,"open_issues_count":38,"forks_count":53,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-08-01T13:30:17.945Z","etag":null,"topics":["image-text-matching","semantic-segmentation","transformers","zero-shot-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2202.11094","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-16T23:10:08.000Z","updated_at":"2024-08-01T12:34:53.000Z","dependencies_parsed_at":"2022-07-18T08:13:05.045Z","dependency_job_id":null,"html_url":"https://github.com/NVlabs/GroupViT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FGroupViT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FGroupViT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FGroupViT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FGroupViT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVlabs","download_url":"https://codeloa
d.github.com/NVlabs/GroupViT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223030776,"owners_count":17076500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-text-matching","semantic-segmentation","transformers","zero-shot-learning"],"created_at":"2024-08-01T13:01:24.303Z","updated_at":"2024-11-04T16:31:30.616Z","avatar_url":"https://github.com/NVlabs.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# GroupViT: Semantic Segmentation Emerges from Text Supervision\n\nGroupViT is a framework for learning semantic segmentation purely from text captions without\nusing any mask supervision. It learns to perform bottom-up hierarchical spatial grouping of \nsemantically-related visual regions. 
This repository is the official implementation of GroupViT \nintroduced in the paper:\n\n[**GroupViT: Semantic Segmentation Emerges from Text Supervision**](https://arxiv.org/abs/2202.11094),\n[*Jiarui Xu*](https://jerryxu.net),\n[*Shalini De Mello*](https://research.nvidia.com/person/shalini-gupta),\n[*Sifei Liu*](https://research.nvidia.com/person/sifei-liu),\n[*Wonmin Byeon*](https://wonmin-byeon.github.io/),\n[*Thomas Breuel*](http://www.tmbdev.net/),\n[*Jan Kautz*](https://research.nvidia.com/person/jan-kautz),\n[*Xiaolong Wang*](https://xiaolonw.github.io/),\nCVPR 2022.\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/github_arch.gif\" width=\"85%\"\u003e\n\n\u003c/div\u003e\n\n## Visual Results\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/github_voc.gif\" width=\"32%\"\u003e\n\u003cimg src=\"figs/github_ctx.gif\" width=\"32%\"\u003e\n\u003cimg src=\"figs/github_coco.gif\" width=\"32%\"\u003e\n\u003c/div\u003e\n\n## Links\n* [Jiarui Xu's Project Page](https://jerryxu.net/GroupViT/) (with additional visual results)\n* [arXiv Page](https://arxiv.org/abs/2202.11094)\n\n\n\u003c/div\u003e\n\n## Citation\n\nIf you find our work useful in your research, please cite:\n\n```latex\n@article{xu2022groupvit,\n  author    = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},\n  title     = {GroupViT: Semantic Segmentation Emerges from Text Supervision},\n  journal   = {arXiv preprint arXiv:2202.11094},\n  year      = {2022},\n}\n```\n\n## Environmental Setup\n\n* Python 3.7\n* PyTorch 1.8\n* webdataset 0.1.103\n* mmsegmentation 0.18.0\n* timm 0.4.12\n\nInstructions:\n\n```shell\nconda create -n groupvit python=3.7 -y\nconda activate groupvit\nconda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge\npip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html\npip install mmsegmentation==0.18.0\npip install 
webdataset==0.1.103\npip install timm==0.4.12\ngit clone https://github.com/NVIDIA/apex\ncd apex \u0026\u0026 pip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\npip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf\npip install nltk ftfy regex tqdm\n```\n\n## Demo\n\n* Integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/xvjiarui/GroupViT)\n\n* Run the demo on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Mwtz6ojiThWWdRrpAZTLlLs6w3T9Fr6x)\n\n* To run the demo from the command line:\n\n```shell\npython demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output\n```\n  The output is saved in `demo/output/`.\n\n## Benchmark Results\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth\u003eZero-shot Classification\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eZero-shot Segmentation\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003econfig\u003c/td\u003e\n    \u003ctd\u003eImageNet\u003c/td\u003e\n    \u003ctd\u003ePascal VOC\u003c/td\u003e\n    \u003ctd\u003ePascal Context\u003c/td\u003e\n    \u003ctd\u003eCOCO\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eGCC + YFCC (\u003ca href=\"configs/group_vit_gcc_yfcc_30e.yml\"\u003ecfg\u003c/a\u003e)\u003c/td\u003e\n    \u003ctd\u003e43.7\u003c/td\u003e\n    \u003ctd\u003e52.3\u003c/td\u003e\n    \u003ctd\u003e22.4\u003c/td\u003e\n    
\u003ctd\u003e24.3\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eGCC + RedCaps (\u003ca href=\"configs/group_vit_gcc_redcap_30e.yml\"\u003ecfg\u003c/a\u003e)\u003c/td\u003e\n    \u003ctd\u003e51.6\u003c/td\u003e\n    \u003ctd\u003e50.8\u003c/td\u003e\n    \u003ctd\u003e23.7\u003c/td\u003e\n    \u003ctd\u003e27.5\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\nPre-trained weights `group_vit_gcc_yfcc_30e-879422e0.pth` and `group_vit_gcc_redcap_30e-3dd09a76.pth` for these models are provided by Jiarui Xu [here](https://github.com/xvjiarui/GroupViT#benchmark-results). \n\n## Data Preparation\n\nDuring training, we use [webdataset](https://webdataset.github.io/webdataset/) for scalable data loading.\nTo convert image text pairs into the webdataset format, we use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess the dataset.\n\nFor inference, we use [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.\n\nThe overall file structure is as follows:\n\n```shell\nGroupViT\n├── local_data\n│   ├── gcc3m_shards\n│   │   ├── gcc-train-000000.tar\n│   │   ├── ...\n│   │   ├── gcc-train-000436.tar\n│   ├── gcc12m_shards\n│   │   ├── gcc-conceptual-12m-000000.tar\n│   │   ├── ...\n│   │   ├── gcc-conceptual-12m-001943.tar\n│   ├── yfcc14m_shards\n│   │   ├── yfcc14m-000000.tar\n│   │   ├── ...\n│   │   ├── yfcc14m-001888.tar\n│   ├── redcap12m_shards\n│   │   ├── redcap12m-000000.tar\n│   │   ├── ...\n│   │   ├── redcap12m-001211.tar\n│   ├── imagenet_shards\n│   │   ├── imagenet-val-000000.tar\n│   │   ├── ...\n│   │   ├── imagenet-val-000049.tar\n│   ├── VOCdevkit\n│   │   ├── VOC2012\n│   │   │   ├── JPEGImages\n│   │   │   ├── SegmentationClass\n│   │   │   ├── ImageSets\n│   │   │   │   ├── Segmentation\n│   │   ├── VOC2010\n│   │   │   ├── JPEGImages\n│   │   │   
├── SegmentationClassContext\n│   │   │   ├── ImageSets\n│   │   │   │   ├── SegmentationContext\n│   │   │   │   │   ├── train.txt\n│   │   │   │   │   ├── val.txt\n│   │   │   ├── trainval_merged.json\n│   │   ├── VOCaug\n│   │   │   ├── dataset\n│   │   │   │   ├── cls\n│   ├── coco\n│   │   ├── images\n│   │   │   ├── train2017\n│   │   │   ├── val2017\n│   │   ├── annotations\n│   │   │   ├── train2017\n│   │   │   ├── val2017\n```\n\nThe instructions for preparing each dataset are as follows.\n\n### GCC3M\n\nPlease download the training split annotation file from [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) and name it as `gcc3m.tsv`.\n\nThen run `img2dataset` to download the image text pairs and save them in the webdataset format.\n```\nsed -i '1s/^/caption\\turl\\n/' gcc3m.tsv\nimg2dataset --url_list gcc3m.tsv --input_format \"tsv\" \\\n            --url_col \"url\" --caption_col \"caption\" --output_format webdataset \\\n            --output_folder local_data/gcc3m_shards \\\n            --processes_count 16 --thread_count 64 \\\n            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \\\n            --enable_wandb True --save_metadata False --oom_shard_count 6\nrename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*\n```\nPlease refer to [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.\n\n### GCC12M\n\nPlease download the annotation file from [Conceptual Caption 12M](https://github.com/google-research-datasets/conceptual-12m) and name it as `gcc12m.tsv`.\n\nThen run `img2dataset` to download the image text pairs and save them in the webdataset format.\n```\nsed -i '1s/^/caption\\turl\\n/' gcc12m.tsv\nimg2dataset --url_list gcc12m.tsv --input_format \"tsv\" \\\n            --url_col \"url\" --caption_col \"caption\" --output_format webdataset \\\n            --output_folder local_data/gcc12m_shards \\\n            
--processes_count 16 --thread_count 64 \\\n            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \\\n            --enable_wandb True --save_metadata False --oom_shard_count 6\nrename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*\n```\nPlease refer to [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.\n\n### YFCC14M\nPlease follow the [CLIP Data Preparation](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) instructions to download the YFCC14M subset.\n```\nwget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2\nbunzip2 yfcc100m_subset_data.tsv.bz2\n```\n\nThen run the preprocessing script to create the subset sql db and annotation tsv files. This may take a while.\n```\npython convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv\n```\nThis script will create two files: an SQLite db called `yfcc100m_dataset.sql` and an annotation tsv file called `yfcc14m_dataset.tsv`.\n\nThen follow the [YFCC100M Download Instruction](https://gitlab.com/jfolz/yfcc100m/-/tree/master) to download the dataset and its metadata file.\n```\npip install git+https://gitlab.com/jfolz/yfcc100m.git\nmkdir -p yfcc100m_meta\npython -m yfcc100m.convert_metadata . 
-o yfcc100m_meta --skip_verification\nmkdir -p yfcc100m_zip\npython -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip\n```\n\nFinally, convert the dataset into the webdataset format.\n```\npython convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards\n```\n\n### RedCaps12M\n\nPlease download the annotation file from [RedCaps](https://redcaps.xyz/).\n```\nwget -O redcaps_v1.0_annotations.zip \"https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1\"\nunzip redcaps_v1.0_annotations.zip\n```\n\nThen run the preprocessing script and `img2dataset` to download the image text pairs and save them in the webdataset format.\n```\npython convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16\nimg2dataset --url_list ~/data/redcaps12m/ --input_format \"parquet\" \\\n            --url_col \"URL\" --caption_col \"TEXT\" --output_format webdataset \\\n            --output_folder local_data/redcap12m_shards \\\n            --processes_count 16 --thread_count 64 \\\n            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \\\n            --enable_wandb True --save_metadata False --oom_shard_count 6\nrename -d 's/^/redcap12m-/' local_data/redcap12m_shards/*\n```\n\n### ImageNet\n\nPlease follow the [webdataset ImageNet Example](https://github.com/tmbdev-archive/webdataset-examples/blob/master/makeshards.py) to convert ImageNet into the webdataset format.\n\n### Pascal VOC\n\nPlease follow the [MMSegmentation Pascal VOC Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc) instructions to download and set up the Pascal VOC dataset.\n\n### Pascal Context\n\nPlease refer to the [MMSegmentation Pascal Context Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context) instructions to download and set up the Pascal Context dataset.\n\n### COCO\n\n[COCO dataset](https://cocodataset.org/) 
is an object detection dataset with instance segmentation annotations.\nTo evaluate GroupViT, we combine all the instance masks of a category together and generate semantic segmentation maps.\nTo generate the semantic segmentation maps, please follow [MMSegmentation's documentation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k) to download the COCO-Stuff-164k dataset first and then run the following:\n\n```shell\npython convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/coco/\n```\n\n## Run Experiments\n\n### Pre-train\n\nTrain on a single node:\n\n```shell\n(node0)$ ./tools/dist_launch.sh main_group_vit.py /path/to/config $GPUS_PER_NODE\n```\n\nFor example, to train on a node with 8 GPUs, run:\n```shell\n(node0)$ ./tools/dist_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 8\n```\n\nTrain on multiple nodes:\n\n```shell\n(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR\n(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR\n```\n\nFor example, to train on two nodes with 8 GPUs each, run:\n\n```shell\n(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 0 2 8 tcp://node0\n(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0\n```\n\nWe used 16 NVIDIA V100 GPUs for pre-training (in 2 days) in our paper.\n\n### Zero-shot Transfer to Image Classification\n\n#### ImageNet\n\n```shell\n./tools/dist_launch.sh main_group_vit.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --eval\n```\n\n### Zero-shot Transfer to Semantic Segmentation\n\n#### Pascal VOC\n\n```shell\n./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint\n```\n\n#### Pascal Context\n\n```shell\n./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS 
--resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/pascal_context.py\n```\n\n#### COCO\n\n```shell\n./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/coco.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVlabs%2FGroupViT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVlabs%2FGroupViT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVlabs%2FGroupViT/lists"}