{"id":19694578,"url":"https://github.com/baaivision/diva","last_synced_at":"2025-10-08T06:45:44.601Z","repository":{"id":250805123,"uuid":"835528905","full_name":"baaivision/DIVA","owner":"baaivision","description":"[ICLR 2025] Diffusion Feedback Helps CLIP See Better","archived":false,"fork":false,"pushed_at":"2025-01-23T05:37:10.000Z","size":2595,"stargazers_count":270,"open_issues_count":4,"forks_count":14,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-04T04:41:13.588Z","etag":null,"topics":["clip","diffusion","visual-perception"],"latest_commit_sha":null,"homepage":"https://rubics-xuan.github.io/DIVA/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baaivision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T03:01:03.000Z","updated_at":"2025-03-28T10:48:40.000Z","dependencies_parsed_at":"2024-08-07T15:31:41.441Z","dependency_job_id":"08d0c324-1fbe-40a5-95e8-7ff33b5f3132","html_url":"https://github.com/baaivision/DIVA","commit_stats":null,"previous_names":["baaivision/diva"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baaivision%2FDIVA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baaivision%2FDIVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baaivision%2FDIVA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baaivision%2FDIVA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/baaivision","download_url":"https://codeload.github.com/baaivision/DIVA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248654251,"owners_count":21140268,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","diffusion","visual-perception"],"created_at":"2024-11-11T19:23:36.690Z","updated_at":"2025-10-08T06:45:39.574Z","avatar_url":"https://github.com/baaivision.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align='center'\u003e\n\n\u003ch2\u003e\u003ca href=\"https://arxiv.org/abs/2407.20171\"\u003eDiffusion Feedback Helps CLIP See Better\u003c/a\u003e\u003c/h2\u003e\n\n[Wenxuan Wang](https://scholar.google.com/citations?user=75OyC-oAAAAJ\u0026hl=zh-CN)\u003csup\u003e1,2,3*\u003c/sup\u003e, [Quan Sun](https://scholar.google.cz/citations?user=pVKiHdEAAAAJ\u0026hl=zh-CN\u0026oi=ao)\u003csup\u003e3*\u003c/sup\u003e, [Fan Zhang](https://scholar.google.cz/citations?hl=zh-CN\u0026user=VsJ39HMAAAAJ\u0026view_op=list_works\u0026sortby=pubdate)\u003csup\u003e3\u003c/sup\u003e, [Yepeng Tang](https://scholar.google.cz/citations?user=CAC_4OUAAAAJ\u0026hl=zh-CN\u0026oi=ao)\u003csup\u003e4\u003c/sup\u003e, [Jing Liu](https://scholar.google.com/citations?user=sOI-S7oAAAAJ\u0026hl=zh-CN)\u003csup\u003e1,2\u003c/sup\u003e, [Xinlong Wang](https://scholar.google.com/citations?hl=zh-CN\u0026user=DPz0DjYAAAAJ\u0026view_op=list_works\u0026sortby=pubdate/)\u003csup\u003e3\u003c/sup\u003e\n \n\u003csup\u003e1\u003c/sup\u003e[CASIA](http://english.ia.cas.cn/), 
\u003csup\u003e2\u003c/sup\u003e[UCAS](https://english.ucas.ac.cn/), \u003csup\u003e3\u003c/sup\u003e[BAAI](https://www.baai.ac.cn/english.html), \u003csup\u003e4\u003c/sup\u003e[BJTU](https://en.bjtu.edu.cn/) \u003cbr\u003e\u003csup\u003e*\u003c/sup\u003e Equal Contribution \u003cbr\u003e\n\n\n\u003c/div\u003e\n\n\n## ⏰ Schedule\n\n### [2025-01-23] Our [paper](https://arxiv.org/abs/2407.20171) has been accepted to ICLR 2025 ! 💥\n### [2024-08-07] We release [CLIP model weights](https://huggingface.co/BAAI/DIVA) ! 💥  \n### [2024-08-05] We release [training \u0026 evaluation code](https://github.com/baaivision/DIVA) ! 💥  \n### [2024-07-30] Our [paper](https://arxiv.org/abs/2407.20171) is released on arXiv ! 💥\n\n\n## 💡 Motivation\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/introduction.png\" alt=\"overview\" width=\"800\" /\u003e\n\u003c/p\u003e\n\nIn this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., 3-7% ↑), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. 
Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.\n\n\n## 🤖 Architecture\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/methodology.png\" alt=\"overview\" width=\"800\" /\u003e\n\u003c/p\u003e\n\nGiven an image, the CLIP model encodes its visual features as the main part of the condition; the generative diffusion model then predicts the added noise, taking the noisy image and the condition as input. We optimize CLIP's representation by maximizing the image likelihood with the diffusion loss via generative feedback.\n\n\n## 🔨 Installation\nClone this repository and install the required packages:\n\n```shell\ngit clone https://github.com/baaivision/DIVA.git\ncd DIVA\nmkdir -p outputs logs datasets pretrained_weights/CLIP pretrained_weights/SD\n\nconda create -n diva python=3.9\nconda activate diva\npip install -r requirements.txt\n```\nCore packages: \n- [PyTorch](https://pytorch.org/) version 2.0.0\n- [open-clip-torch](https://github.com/mlfoundations/open_clip) version 2.24.0\n- [timm](https://github.com/rwightman/pytorch-image-models) version 0.9.8\n\n\n## 🍹 Preparation for DIVA's Generative Fine-tuning\n\n### Data Acquisition\nFor data preparation, please refer to [img2dataset](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) and [MMVP](https://github.com/tsb0601/MMVP/tree/main) for the training and evaluation data used in this work. After collecting the datasets, put them into the `datasets/` folder created during installation. 
\n\n### Pre-trained Weight Downloading\nTo obtain the pre-trained weights, please refer to [OpenAI ViT-L-14/224\u0026336](https://github.com/openai/CLIP/blob/main/clip/clip.py), [MetaCLIP ViT-L/H-14](https://github.com/facebookresearch/metaclip), [SigLIP ViT-SO-14/224](https://huggingface.co/timm/ViT-SO400M-14-SigLIP), [SigLIP ViT-SO-14/384](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384), [DFN ViT-H-14/224](https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14), [DFN ViT-H-14/378](https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378) and [SD-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) for the discriminative CLIP models and the diffusion model that provides generative feedback. After downloading the weights, move the CLIP weights to `pretrained_weights/CLIP/` and the diffusion weights to `pretrained_weights/SD/`.\n\n### Code Modification\nTo support DIVA's condition design, some source code in the installed [CLIP](https://github.com/openai/CLIP), [OpenCLIP](https://github.com/mlfoundations/open_clip) and [timm](https://github.com/huggingface/pytorch-image-models/) packages needs to be modified.\n\nFor OpenAI CLIP, use the content in our provided `condition/OpenAICLIP_for_clip_model.py` to replace the content in `Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/clip/model.py`.\n\nFor MetaCLIP and DFN, use the content in our provided `condition/MetaCLIP_for_openclip_transformer.py` and `condition/DFN_for_openclip_transformer.py` to replace the content in `Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/open_clip/transformer.py`, respectively.\n\nFor SigLIP, use the content in our provided `condition/SigLIP_for_timm_models_visiontransformer.py` to replace the content in `Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/timm/models/vision_transformer.py`.\n\n\n## 🍻 Quick Start for Training \u0026 Evaluation\n\nAfter all the above preparation 
steps, you can start training DIVA with one of the following commands: \n```shell\n# For OpenAICLIP\nbash DIVA_for_OpenAICLIP.sh\n\n# For MetaCLIP\nbash DIVA_for_MetaCLIP.sh\n\n# For SigLIP\nbash DIVA_for_SigLIP.sh\n\n# For DFN\nbash DIVA_for_DFN.sh\n```\n\n## Model Zoo\n\n| Method               | Image Size | Params (M) | Average Score |\n|----------------------|------------|------------|---------------|\n| [OpenAI ViT-L-14](https://huggingface.co/BAAI/DIVA/blob/main/OpenAICLIP/OpenAI-ViT-L-14-224.pth)      | 224²       | 427.6      | 25.9 (+6.6)   |\n| [OpenAI ViT-L-14](https://huggingface.co/BAAI/DIVA/blob/main/OpenAICLIP/OpenAI-ViT-L-14-336.pth)      | 336²       | 427.9      | 25.2 (+5.2)   |\n| [MetaCLIP ViT-L-14](https://huggingface.co/BAAI/DIVA/blob/main/MetaCLIP/MetaCLIP-ViT-L-14.pth)    | 224²       | 427.6      | 27.4 (+3.7)   |\n| [MetaCLIP ViT-H-14](https://huggingface.co/BAAI/DIVA/blob/main/MetaCLIP/MetaCLIP-ViT-H-14.pth)    | 224²       | 986.1      | 31.9 (+6.7)   |\n| [SigLIP ViT-SO-14](https://huggingface.co/BAAI/DIVA/blob/main/SigLIP/SigLIP-ViT-SO-14-224.pth)     | 224²       | 877.4      | 40.7 (+2.9)   |\n| [SigLIP ViT-SO-14](https://huggingface.co/BAAI/DIVA/blob/main/SigLIP/SigLIP-ViT-SO-14-384.pth)     | 384²       | 878.0      | 38.5 (+1.5)   |\n| [DFN ViT-H-14](https://huggingface.co/BAAI/DIVA/blob/main/DFN/DFN-ViT-H-14-224.pth)        | 224²       | 986.1      | 43.7 (+4.4)   |\n| [DFN ViT-H-14](https://huggingface.co/BAAI/DIVA/blob/main/DFN/DFN-ViT-H-14-378.pth)         | 378²       | 986.7      | 37.8 (+3.0)   |\n\n\nNote that, due to randomness in the condition design during training and in the selection of local patch tokens during inference for OpenAI CLIP, the scores obtained on the MMVP-VLM benchmark with our provided OpenAI CLIP weights might differ from the results reported in our paper. 
In that case, we recommend rerunning with several different random seeds if the scores do not meet expectations. \n\n## 🎨 Visualization\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/qualitative_mmvp.png\" alt=\"scene\" width=\"900\" /\u003e\n\u003c/p\u003e\n\n\n## 💙 Acknowledgement\nDIVA is built upon the awesome [Diffusion-TTA](https://github.com/mihirp1998/Diffusion-TTA), [MMVP](https://github.com/tsb0601/MMVP), [CLIP](https://github.com/openai/CLIP), [OpenCLIP](https://github.com/mlfoundations/open_clip), and [timm](https://github.com/huggingface/pytorch-image-models/). \n\n## 📝 Citation\n```bib\n@article{wang2024diffusion,\n      title={Diffusion Feedback Helps CLIP See Better},\n      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},\n      journal={arXiv preprint arXiv:2407.20171},\n      year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaaivision%2Fdiva","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaaivision%2Fdiva","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaaivision%2Fdiva/lists"}