Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
cliptrase
https://github.com/leaves162/CLIPtrase
- Host: GitHub
- URL: https://github.com/leaves162/CLIPtrase
- Owner: leaves162
- License: mit
- Created: 2024-07-10T06:54:46.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-07-14T08:51:04.000Z (6 months ago)
- Last Synced: 2024-07-14T09:56:55.028Z (6 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 7.49 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-Segment-Anything
README
# CLIPTrase
## [ECCV24] Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
## 1. Introduction
> CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects.
Experiments show that our method is 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. Full paper and supplementary materials: [arXiv](https://arxiv.org/abs/2407.08268)
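As a rough intuition for the recalibrated self-correlation idea, here is a minimal sketch in PyTorch. It is an illustrative approximation, not the paper's implementation: the function name and the softmax temperature are assumptions, and the actual recalibration in CLIPtrase is more involved.
```
import torch
import torch.nn.functional as F

def patch_self_correlation(patch_feats, temperature=0.07):
    """Illustrative sketch (not the paper's exact code).

    patch_feats: (N, D) patch embeddings from CLIP's ViT image encoder.
    Returns patch features re-expressed as correlation-weighted mixtures
    of the patches they are most similar to.
    """
    # Cosine self-correlation between every pair of patches: (N, N)
    normed = F.normalize(patch_feats, dim=-1)
    corr = normed @ normed.t()

    # Softmax over the correlations gives per-patch aggregation weights.
    # The temperature value here is an arbitrary assumption for the sketch.
    weights = torch.softmax(corr / temperature, dim=-1)

    # Aggregate: each patch becomes a weighted average of correlated patches.
    return weights @ patch_feats

# Toy usage: 196 patches (14x14 for ViT-B/16 at 224x224), 512-d features.
refined = patch_self_correlation(torch.randn(196, 512))
print(refined.shape)  # torch.Size([196, 512])
```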
### 1.1. Global Patch
![global patch](/images/reason.png)
### 1.2. Model Architecture
![model architecture](/images/model.svg)
## 2. Code
### 2.1. Environments
+ Base environment: pytorch==1.12.1, torchvision==0.13.1 (CUDA 11.3)
```
python -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
```
+ Detectron2 version: additionally install detectron2==0.6
```
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
```
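After installing, a quick sanity check (an optional snippet, not part of the repository) can confirm the expected versions and that CUDA is visible:
```
import torch
import torchvision
import detectron2

# Expected: torch 1.12.1+cu113, torchvision 0.13.1+cu113, detectron2 0.6
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())
```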
### 2.2. Data preparation
+ We follow the detectron2 format for the datasets. For the specific preprocessing steps, refer to [MaskFormer](https://github.com/facebookresearch/MaskFormer/blob/main/datasets/README.md) and [SimSeg](https://github.com/MendelXu/zsseg.baseline).
+ Update `configs/dataset_cfg.py` with your own paths. The expected layout is:
```
datasets/
--coco/
----...
----val2017/
----stuffthingmaps_detectron2/
------val2017/
--VOC2012/
----...
----images_detectron2/
------val/
----annotations_detectron2/
------val/
--pcontext/
----...
----val/
------image/
------label/
--pcontext_full/
----...
----val/
------image/
------label/
--ADEChallengeData2016/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
--ADE20K_2021_17_01/
----...
----images/
------validation/
----annotations_detectron2/
------validation/
```
+ You can also use your own dataset. Make sure that it provides `image` and `gt` files, and that the value of each pixel in the gt image is its corresponding label (see the sketch below).
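For a custom dataset, a small check along these lines can confirm that images and ground-truth masks pair up and that gt pixel values are label indices. This is only a sketch: `datasets/my_dataset`, the `image`/`gt` folder names, and the `.png` mask extension are placeholders for your own layout.
```
import os
import numpy as np
from PIL import Image

root = "datasets/my_dataset"              # placeholder: your dataset root
image_dir = os.path.join(root, "image")   # placeholder: image folder
label_dir = os.path.join(root, "gt")      # placeholder: ground-truth folder

for name in sorted(os.listdir(image_dir)):
    stem = os.path.splitext(name)[0]
    gt_path = os.path.join(label_dir, stem + ".png")  # assumed mask extension
    assert os.path.exists(gt_path), f"missing gt for {name}"
    gt = np.array(Image.open(gt_path))
    # Each pixel value in the gt image should be its class label index.
    print(name, "labels present:", np.unique(gt))
```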
### 2.3. Global patch demo
+ We provide a demo of the global patch in the notebook `global_patch_demo.ipynb`, where you can visualize the global patch phenomenon mentioned in our paper.

### 2.4. Training-free OVSS
+ Running with a single GPU
```
python clip_self_correlation.py
```
+ Running with multiple GPUs in the detectron2 version
Update: we provide a detectron2 framework version. The CLIP weights with modified state keys can be found [here](https://drive.google.com/file/d/1mZtNhYCJzL1jDfc4oO6e7rqbKiKSBGz9/view?usp=drive_link); download them and put them in the `outputs` folder.
Note: The results of the d2 version are slightly different from those in the paper due to differences in preprocessing and resolution.
```
python -W ignore train_net.py --eval-only --config-file configs/clip_self_correlation.yaml --num-gpus 4 OUTPUT_DIR your_output_path MODEL.WEIGHTS your_model_path
```
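Conceptually, the training-free assignment behind both commands compares each recalibrated patch feature against the CLIP text embeddings of the class names and keeps the best match. The snippet below is an illustrative sketch of that final step with random tensors, not the scripts' actual code:
```
import torch
import torch.nn.functional as F

def assign_patch_labels(patch_feats, text_feats):
    """Assign each patch the class whose text embedding is most similar.

    patch_feats: (N, D) refined patch features from the image encoder.
    text_feats:  (C, D) CLIP text embeddings of the class prompts.
    Returns an (N,) tensor of predicted class indices.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = patch_feats @ text_feats.t()    # (N, C) cosine similarities
    return sim.argmax(dim=-1)

# Toy usage: 196 patches, 21 classes (e.g. VOC with background).
labels = assign_patch_labels(torch.randn(196, 512), torch.randn(21, 512))
print(labels.shape)  # torch.Size([196])
```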
+ Results: single 3090, CLIP-B/16, evaluated in 9 settings on COCO, ADE, PASCAL CONTEXT, and VOC.
Our results do not use any post-processing such as DenseCRF.
The first six columns (coco171, voc20, pc59, pc459, ade150, adefull) are evaluated without background; the last three (coco80, voc21, pc60) with background.

| Resolution | Metric | coco171 | voc20 | pc59 | pc459 | ade150 | adefull | coco80 | voc21 | pc60 |
|------------|--------|---------|-------|------|-------|--------|---------|--------|-------|------|
| 224 | pAcc  | 38.9  | 89.68 | 58.94 | 44.18 | 38.57 | 25.45 | 50.08 | 78.63 | 52.14 |
| 224 | mAcc  | 44.47 | 91.4  | 57.08 | 21.53 | 39.17 | 18.78 | 62.5  | 84.11 | 56.08 |
| 224 | fwIoU | 26.87 | 82.49 | 45.28 | 35.22 | 27.96 | 18.99 | 38.19 | 67.67 | 37.61 |
| 224 | mIoU  | 22.84 | 80.95 | 33.83 | 9.36  | 16.35 | 6.31  | 43.56 | 50.88 | 29.87 |
| 336 | pAcc  | 40.14 | 89.51 | 60.15 | 45.61 | 39.92 | 26.73 | 50.01 | 79.93 | 53.21 |
| 336 | mAcc  | 45.09 | 91.77 | 57.47 | 21.26 | 37.75 | 17.99 | 62.55 | 85.24 | 56.43 |
| 336 | fwIoU | 27.96 | 82.15 | 46.64 | 36.66 | 29.17 | 20.3  | 38.24 | 69.1  | 38.76 |
| 336 | mIoU  | 24.06 | 81.2  | 34.92 | 9.95  | 17.04 | 5.89  | 44.84 | 53.04 | 30.79 |
## Citation
+ If you find this project useful, please consider citing:
```
@InProceedings{shao2024explore,
title={Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation},
author={Tong Shao and Zhuotao Tian and Hang Zhao and Jingyong Su},
booktitle={European Conference on Computer Vision},
organization={Springer},
year={2024}
}
```