{"id":19426946,"url":"https://github.com/khanrc/tcl","last_synced_at":"2025-04-24T17:31:18.299Z","repository":{"id":67212303,"uuid":"572344292","full_name":"khanrc/tcl","owner":"khanrc","description":"Official implementation of TCL (CVPR 2023)","archived":false,"fork":false,"pushed_at":"2023-05-11T08:54:42.000Z","size":3829,"stargazers_count":110,"open_issues_count":1,"forks_count":6,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-03T08:37:36.840Z","etag":null,"topics":["cvpr","open-world","segmentation","semantic-segmentation","zero-shot"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/khanrc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-30T04:08:26.000Z","updated_at":"2025-03-28T09:55:46.000Z","dependencies_parsed_at":"2024-06-19T02:59:08.097Z","dependency_job_id":"54e810fe-540b-423f-a5be-c6c0fbb4445c","html_url":"https://github.com/khanrc/tcl","commit_stats":null,"previous_names":["khanrc/tcl"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khanrc%2Ftcl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khanrc%2Ftcl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khanrc%2Ftcl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khanrc%2Ftcl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/khanrc","download_url":"https://codeload.github.
com/khanrc/tcl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250674295,"owners_count":21469193,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cvpr","open-world","segmentation","semantic-segmentation","zero-shot"],"created_at":"2024-11-10T14:09:46.041Z","updated_at":"2025-04-24T17:31:17.023Z","avatar_url":"https://github.com/khanrc.png","language":"Python","readme":"# TCL: Text-grounded Contrastive Learning (CVPR'23)\n\nOfficial PyTorch implementation of [**Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs**](https://arxiv.org/abs/2212.00785), *Junbum Cha, Jonghwan Mun, Byungseok Roh*, CVPR 2023.\n\n**T**ext-grounded **C**ontrastive **L**earning (TCL) is an open-world semantic segmentation framework using only image-text pairs. TCL enables a model to learn region-text alignment without train-test discrepancy.\n\n[**Demo page**](https://huggingface.co/spaces/khanrc/tcl) is available. 
Since this demo runs on a free HuggingFace CPU space, inference may take around 5-10 seconds.\n\n\u003cdiv align=\"center\"\u003e\n\u003cfigure\u003e\n  \u003cimg alt=\"\" src=\"./assets/method.jpg\"\u003e\n\u003c/figure\u003e\n\u003c/div\u003e\n\n\n## Results\n\nTCL can perform segmentation on both (a, c) existing segmentation benchmarks and (b) arbitrary concepts, such as proper nouns and free-form text, in images in the wild.\n\n\u003cdiv align=\"center\"\u003e\n\u003cfigure\u003e\n  \u003cimg alt=\"\" src=\"./assets/main.jpg\"\u003e\n\u003c/figure\u003e\n\u003c/div\u003e\n\n\u003cbr/\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Additional examples in PASCAL VOC \u003c/summary\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/examples-voc.jpg\" width=\"800\" /\u003e\n\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Additional examples in the wild \u003c/summary\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/examples-in-the-wild.jpg\" width=\"800\" /\u003e\n\u003c/p\u003e\n\u003c/details\u003e\n\n\n## Dependencies\n\nWe used PyTorch 1.12.1 and torchvision 0.13.1.\n\n```bash\npip install -U openmim\nmim install mmcv-full==1.6.2 mmsegmentation==0.27.0\npip install -r requirements.txt\n```\n\nNote that the order of packages in the requirements file roughly reflects how important it is to match each pinned version.\nWe recommend using the same versions for at least `webdataset`, `mmsegmentation`, and `timm`.\n\n\n## Datasets\n\nNote that much of this section is adapted from the [data preparation section of the GroupViT README](https://github.com/NVlabs/GroupViT#data-preparation).\n\nWe use [webdataset](https://webdataset.github.io/webdataset/) as a scalable data format for training and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation evaluation.\n\nThe overall file structure is as follows:\n\n```shell\nTCL\n├── data\n│   ├── gcc3m\n│   │   ├── gcc-train-000000.tar\n│   │   ├── ...\n│   ├── gcc12m\n│ 
  │   ├── cc-000000.tar\n│   │   ├── ...\n│   ├── cityscapes\n│   │   ├── leftImg8bit\n│   │   │   ├── train\n│   │   │   ├── val\n│   │   ├── gtFine\n│   │   │   ├── train\n│   │   │   ├── val\n│   ├── VOCdevkit\n│   │   ├── VOC2012\n│   │   │   ├── JPEGImages\n│   │   │   ├── SegmentationClass\n│   │   │   ├── ImageSets\n│   │   │   │   ├── Segmentation\n│   │   ├── VOC2010\n│   │   │   ├── JPEGImages\n│   │   │   ├── SegmentationClassContext\n│   │   │   ├── ImageSets\n│   │   │   │   ├── SegmentationContext\n│   │   │   │   │   ├── train.txt\n│   │   │   │   │   ├── val.txt\n│   │   │   ├── trainval_merged.json\n│   │   ├── VOCaug\n│   │   │   ├── dataset\n│   │   │   │   ├── cls\n│   ├── ade\n│   │   ├── ADEChallengeData2016\n│   │   │   ├── annotations\n│   │   │   │   ├── training\n│   │   │   │   ├── validation\n│   │   │   ├── images\n│   │   │   │   ├── training\n│   │   │   │   ├── validation\n│   ├── coco_stuff164k\n│   │   ├── images\n│   │   │   ├── train2017\n│   │   │   ├── val2017\n│   │   ├── annotations\n│   │   │   ├── train2017\n│   │   │   ├── val2017\n```\n\nThe instructions for preparing each dataset are as follows.\n\n### Training datasets\n\nIn training, we use Conceptual Captions 3M and 12M. 
We use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess the datasets.\n\n#### GCC3M\n\nPlease download the training split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.\n\nThen run `img2dataset` to download the image-text pairs and save them in the webdataset format.\n```bash\nsed -i '1s/^/caption\\turl\\n/' gcc3m.tsv\nimg2dataset --url_list gcc3m.tsv --input_format \"tsv\" \\\n            --url_col \"url\" --caption_col \"caption\" --output_format webdataset \\\n            --output_folder data/gcc3m \\\n            --processes_count 16 --thread_count 64 \\\n            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \\\n            --enable_wandb True --save_metadata False --oom_shard_count 6\nrename -d 's/^/gcc-train-/' data/gcc3m/*\n```\nPlease refer to the [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.\n\n#### GCC12M\n\nPlease download the annotation file from [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.\n\nThen run `img2dataset` to download the image-text pairs and save them in the webdataset format.\n```bash\nsed -i '1s/^/caption\\turl\\n/' gcc12m.tsv\nimg2dataset --url_list gcc12m.tsv --input_format \"tsv\" \\\n            --url_col \"url\" --caption_col \"caption\" --output_format webdataset \\\n            --output_folder data/gcc12m \\\n            --processes_count 16 --thread_count 64 \\\n            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \\\n            --enable_wandb True --save_metadata False --oom_shard_count 6\nrename -d 's/^/cc-/' data/gcc12m/*\n```\nPlease refer to the [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.\n\n\n### Evaluation datasets\n\nIn the paper, we use 8 
benchmarks: (i) w/ background: PASCAL VOC, PASCAL Context, and COCO-Object, and (ii) w/o background: PASCAL VOC20, PASCAL Context59, COCO-Stuff, Cityscapes, and ADE20k.\nSince some benchmarks share data sources (e.g., VOC20 and VOC), we need to prepare 5 datasets: PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k.\n\nPlease download and set up the [PASCAL VOC](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc), [PASCAL Context](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context), [COCO-Stuff164k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k), [Cityscapes](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#cityscapes), and [ADE20k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#ade20k) datasets following the [MMSegmentation data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md).\n\n#### COCO Object\n\nThe COCO-Object dataset uses only the object classes from the COCO-Stuff164k dataset, collected from its instance segmentation annotations.\nRun the following command to convert the instance segmentation annotations to semantic segmentation annotations:\n\n```shell\npython convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/\n```\n\n\n## Training\n\nWe use 16 and 8 NVIDIA V100 GPUs for the main and ablation experiments, respectively.\n\n### Single node\n\n```bash\ntorchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --cfg ./configs/tcl.yml\n```\n\n### Multi node\n\n```bash\ntorchrun --rdzv_endpoint=$HOST:$PORT --nproc_per_node=auto --nnodes=$NNODES --node_rank=$RANK main.py --cfg ./configs/tcl.yml\n```\n\n## Evaluation\n\nWe provide [an official checkpoint](https://github.com/kakaobrain/tcl/releases/download/v1.0.0/tcl.pth) to reproduce the main results of our paper.\n\n- 
Zero-shot transfer to semantic segmentation (Table 2):\n\n```bash\ntorchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval\n```\n\n- Evaluation without PAMR (Table 3 in Appendix):\n\n```bash\ntorchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval \\\n    --opts evaluate.pamr=false evaluate.bg_thresh=0.5\n```\n\nNote that we use a `bg_thresh` of 0.4 with PAMR and 0.5 without PAMR, since we observed that PAMR tends to reduce the foreground area.\n\n\n## Citation\n\n```bibtex\n@inproceedings{cha2022tcl,\n  title={Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs},\n  author={Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  year={2023}\n}\n```\n\n\n## License\n\nThis project is released under the [MIT license](./LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhanrc%2Ftcl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkhanrc%2Ftcl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhanrc%2Ftcl/lists"}