{"id":13429605,"url":"https://github.com/xmed-lab/CLIP_Surgery","last_synced_at":"2025-03-16T03:31:53.390Z","repository":{"id":153099902,"uuid":"626501079","full_name":"xmed-lab/CLIP_Surgery","owner":"xmed-lab","description":"CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks","archived":false,"fork":false,"pushed_at":"2025-03-01T08:40:56.000Z","size":17407,"stargazers_count":395,"open_issues_count":1,"forks_count":25,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-01T09:26:02.075Z","etag":null,"topics":["clip","explainability","interpretability","multilabel","multimodal","open-vocabulary","sam","segment-anything","segmentation","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xmed-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-11T15:36:31.000Z","updated_at":"2025-03-01T08:41:00.000Z","dependencies_parsed_at":"2025-02-13T14:37:22.675Z","dependency_job_id":null,"html_url":"https://github.com/xmed-lab/CLIP_Surgery","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmed-lab%2FCLIP_Surgery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmed-lab%2FCLIP_Surgery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmed-lab%2FCLIP_Surgery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/xmed-lab%2FCLIP_Surgery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xmed-lab","download_url":"https://codeload.github.com/xmed-lab/CLIP_Surgery/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822309,"owners_count":20353496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","explainability","interpretability","multilabel","multimodal","open-vocabulary","sam","segment-anything","segmentation","vision-transformer"],"created_at":"2024-07-31T02:00:42.414Z","updated_at":"2025-03-16T03:31:53.383Z","avatar_url":"https://github.com/xmed-lab.png","language":"Jupyter Notebook","funding_links":[],"categories":["2 Foundation Models","Paper List"],"sub_categories":["2.3 Multimodal Foundation Models","Follow-up Papers"],"readme":"# A closer look at the explainability of Contrastive language-image pre-training ([Pattern Recognition](https://www.sciencedirect.com/science/article/abs/pii/S003132032500069X?via%3Dihub))\nEarly version: CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks ([arxiv](https://arxiv.org/abs/2304.05653))\n\n## Introduction\n\nThis work focuses on the explainability of CLIP via its raw predictions. We identify two problems about CLIP's explainability: opposite visualization and noisy activations. Then we propose the CLIP Surgery, which does not require any fine-tuning or additional supervision. 
It greatly improves the explainability of CLIP and enhances downstream open-vocabulary tasks such as multi-label recognition, semantic segmentation, interactive segmentation (specifically the Segment Anything Model, SAM), and multimodal visualization. Currently, we offer a simple demo for interpretability analysis and for converting text to point prompts for SAM. The remaining code, including evaluation and other tasks, will be released later.\n\nOpposite visualization is due to wrong relations in self-attention:\n![image](figs/fig1.jpg)\n\nNoisy activations are due to redundant features across labels:\n![image](figs/fig2.jpg)\n\nOur visualization results:\n![image](figs/fig3.jpg)\n\nText2Points to guide SAM:\n![image](figs/fig4.jpg)\n\nMultimodal visualization:\n![image](figs/fig5.jpg)\n\nSegmentation results:\n![image](figs/fig6.jpg)\n\nMultilabel results:\n![image](figs/fig7.jpg)\n\n## Demo\n\nFirst, install SAM and download the model checkpoint:\n```\npip install git+https://github.com/facebookresearch/segment-anything.git\nwget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth\n```\n\nThen explain CLIP via the Jupyter demo [\"demo.ipynb\"](https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb).\nOr use the Python file:\n```\npython demo.py\n```\n(Note: the demo's results differ slightly from those of the experimental code; specifically, the demo does not use apex amp fp16, which makes it easier to use.)\n\n## Cite\n```\n@article{LI2025111409,\ntitle = {A closer look at the explainability of Contrastive language-image pre-training},\njournal = {Pattern Recognition},\nvolume = {162},\npages = {111409},\nyear = {2025},\nissn = {0031-3203},\ndoi = {https://doi.org/10.1016/j.patcog.2025.111409},\nurl = {https://www.sciencedirect.com/science/article/pii/S003132032500069X},\nauthor = {Yi Li and Hualiang Wang and Yiqun Duan and Jiheng Zhang and Xiaomeng 
Li}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmed-lab%2FCLIP_Surgery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxmed-lab%2FCLIP_Surgery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmed-lab%2FCLIP_Surgery/lists"}