{"id":13488716,"url":"https://github.com/eric-ai-lab/Discffusion","last_synced_at":"2025-03-28T01:37:28.078Z","repository":{"id":217434072,"uuid":"719474456","full_name":"eric-ai-lab/Discffusion","owner":"eric-ai-lab","description":"Official repo for the TMLR paper \"Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners\"","archived":false,"fork":false,"pushed_at":"2024-04-27T06:13:35.000Z","size":8706,"stargazers_count":27,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-31T00:40:09.828Z","etag":null,"topics":["diffusion-models","discriminative-learning","few-shot-learning","vision-and-language"],"latest_commit_sha":null,"homepage":"https://sites.google.com/view/discffusion","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eric-ai-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-16T08:44:58.000Z","updated_at":"2024-10-21T20:59:21.000Z","dependencies_parsed_at":"2024-04-27T06:27:46.430Z","dependency_job_id":"ea6c0ded-656f-4050-a38e-bf4b07540510","html_url":"https://github.com/eric-ai-lab/Discffusion","commit_stats":null,"previous_names":["eric-ai-lab/dsd"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eric-ai-lab%2FDiscffusion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eric-ai-lab%2FDiscffusion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eric-ai-lab%2FDiscffu
sion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eric-ai-lab%2FDiscffusion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eric-ai-lab","download_url":"https://codeload.github.com/eric-ai-lab/Discffusion/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245952919,"owners_count":20699561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","discriminative-learning","few-shot-learning","vision-and-language"],"created_at":"2024-07-31T18:01:20.614Z","updated_at":"2025-03-28T01:37:28.073Z","avatar_url":"https://github.com/eric-ai-lab.png","language":"Python","readme":"# Discffusion\nThis is the code implementation for the paper: \"Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners\".\nThe project is developed based on HuggingFace Diffusers.\n\n[Project Page](https://sites.google.com/view/discffusion)\n\u003cdiv align=center\u003e  \n\u003cimg src='assets/teaser.png' width=\"50%\"\u003e\n\u003c/div\u003e\n\n\nDiffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? 
To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via efficient attention-based prompt learning to perform image-text matching. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.\n\n\n\n## Env Setup\n\n```\nconda create -n dsd python=3.9\nconda activate dsd\ncd diffusers\npip install -e .\ncd ..\npip install -r requirements.txt\n```\n\n\n## Quick Play\n### Notebook\nWe provide a [Jupyter notebook](demo.ipynb) to try the pipeline and visualize the results.\n\n\n### Gradio Demo\nYou can also launch the Gradio demo and upload your own images by running:\n```bash\npython playground.py\n```\n\n![ui](assets/ui.png)\n\n\n## Dataset Setup\n\n### ComVG\nDownload the images from the official Visual Genome website.\nPut the Visual_Genome folder in the same directory as the Discffusion repo.\nDownload comvg_train.csv from [link](https://drive.google.com/drive/folders/1GOxInYPaVTZrFkhune-PN8LnV_weSIm6?usp=sharing) and put it in the [data](data) folder.\nAlternatively, download com_vg.zip from [link](https://drive.google.com/drive/folders/1GOxInYPaVTZrFkhune-PN8LnV_weSIm6?usp=sharing), unzip it, and put it in the same directory as the Discffusion repo.\n\n\n\n### RefCOCO\nWe use the RefCOCOg split.\nDownload it from the RefCOCOg GitHub repo: https://github.com/lichengunc/refer.\nPut refcocog in the same directory as the Discffusion repo.\n\n\n\n### VQA\nDownload the images from the official VQAv2 website.\nPut vqav2 in the same directory as the Discffusion repo.\nDownload vqa_text_train.csv from 
[link](https://drive.google.com/drive/folders/1GOxInYPaVTZrFkhune-PN8LnV_weSIm6?usp=sharing) and put it in the [data](data) folder.\n\n\n\nThe repository structure is:\n\n```\n├── DSD/\n│   ├── data/\n│   │   ├── comvg_train.csv\n│   │   ├── vqa_text_train.csv\n│   │   └── ...\n├── refcocog/\n│   ├── images/\n│   │   ├── train2014/\n│   │   │   ├── COCO_train2014_000000000009.jpg\n│   │   │   └── ...\n│   │   └── ...\n│   ├── refs(google).p\n│   └── instances.json\n├── vqav2/\n│   ├── images/\n│   │   └── ...\n│   ├── train2014/\n│   │   └── ...\n│   └── val2014/\n│       └── ...\n└── Visual_Genome/\n    ├── VG_100K/\n    └── vg_concept/\n        ├── 180.jpg\n        ├── 411.jpg\n        ├── 414.jpg\n        ├── 1251.jpg\n        └── ...\n```\n\n\n\n\n\n\n## Experiments\n### Test\n#### Download pretrained checkpoints\n| ComVG | Refcocog | VQA |\n|:---:|:---:|:---:|\n| [Download](https://drive.google.com/drive/folders/13-v3zShNMVpURBceqJu5T6iGXDBJqL6p?usp=sharing) | [Download](https://drive.google.com/drive/folders/13-v3zShNMVpURBceqJu5T6iGXDBJqL6p?usp=sharing) | [Download](https://drive.google.com/drive/folders/13-v3zShNMVpURBceqJu5T6iGXDBJqL6p?usp=sharing) |\n\n\n#### Run inference with pretrained checkpoints\nComVG\n```\naccelerate config\naccelerate launch dsd_infer.py --val_data ComVG_obj --batchsize 16 --sampling_time_steps 40 --output_dir downloaded_checkpoints\naccelerate launch dsd_infer.py --val_data ComVG_verb --batchsize 16 --sampling_time_steps 40 --output_dir 
downloaded_checkpoints\naccelerate launch dsd_infer.py --val_data ComVG_sub --batchsize 16 --sampling_time_steps 40 --output_dir downloaded_checkpoints\n```\n\n`downloaded_checkpoints` is the path to your downloaded checkpoint folder, e.g. FOLDER_NAME/checkpoint-500000.\n\n\nRefcocog\n```\naccelerate config\naccelerate launch dsd_infer.py --val_data Refcocog --batchsize 16 --sampling_time_steps 10 --output_dir downloaded_checkpoints\n```\n\nVQAv2\n```\naccelerate config\naccelerate launch dsd_infer.py --val_data vqa_binary --batchsize 16 --sampling_time_steps 200 --output_dir downloaded_checkpoints\naccelerate launch dsd_infer.py --val_data vqa_other --batchsize 16 --sampling_time_steps 200 --output_dir downloaded_checkpoints\n```\n\n\n\n### Train yourself\n\n```\naccelerate config\naccelerate launch dsd_train.py --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base --train_batch_size 1 --val_batch_size 4 --output_dir PATH --train_data TRAIN_DATA --val_data VAL_DATA --num_train_epochs EPOCH --learning_rate 1e-4\n```\nWe set the accelerate config to use a single GPU for training.\n\n`TRAIN_DATA` currently supports ComVG, Refcocog, and vqa. You can add more datasets in [custom_datasets](./custom_datasets.py).\n\n`VAL_DATA` accepts specific evaluation splits, e.g. ComVG_obj/ComVG_verb/ComVG_sub and vqa_other/vqa_binary.\n\nSet `--bias` to train and run inference using only the cross-attention score from the diffusion model. 
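The cross-attention score mentioned above can be illustrated with a toy sketch. This is not the repository's actual implementation (which pools scores from Stable Diffusion's UNet cross-attention layers across sampling steps); it is a minimal, self-contained sketch on random placeholder features, showing how attention between image patches (queries) and prompt tokens (keys) can be pooled into a scalar image-text matching score:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_score(image_feats, text_feats):
    """Toy image-text matching score from a cross-attention map.

    image_feats: (num_patches, d) image features acting as queries.
    text_feats:  (num_tokens, d)  prompt token features acting as keys.
    Returns a scalar in (0, 1]; higher means a better match.
    """
    d = image_feats.shape[-1]
    attn = softmax(image_feats @ text_feats.T / np.sqrt(d), axis=-1)
    # Pool: how strongly, on average, each patch commits to some token.
    return float(attn.max(axis=-1).mean())

# Hypothetical features: "matched" patches lie near prompt-token vectors,
# "mismatched" patches are unrelated noise.
rng = np.random.default_rng(0)
d, n_tokens, n_patches = 16, 5, 64
text = rng.normal(size=(n_tokens, d))
matched = text[rng.integers(0, n_tokens, size=n_patches)] + 0.1 * rng.normal(size=(n_patches, d))
mismatched = rng.normal(size=(n_patches, d))
assert cross_attention_score(matched, text) > cross_attention_score(mismatched, text)
```

In Discffusion the scores come from the fine-tuned UNet's attention maps, pooled over layers and sampling steps (cf. `--sampling_time_steps`); the simple max-mean pooling here is only a stand-in for illustration.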
 \nFor example: \n```\naccelerate config\naccelerate launch dsd_train.py --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base --train_batch_size 1 --val_batch_size 4 --bias --output_dir ./output --train_data ComVG --val_data ComVG_verb --num_train_epochs 1 --learning_rate 1e-4\n\naccelerate config\naccelerate launch dsd_infer.py --val_data ComVG_obj --bias --batchsize 16 --sampling_time_steps 30 --output_dir YOUR_SAVED_CKPTS\n```\n\n\n\n\n## Acknowledgements\nThe code and dataset are built on [Stable Diffusion](https://github.com/CompVis/stable-diffusion), [Diffusers](https://github.com/huggingface/diffusers), [Diffusion-itm](https://github.com/McGill-NLP/diffusion-itm), [Custom Diffusion](https://github.com/adobe-research/custom-diffusion), [ComCLIP](https://github.com/eric-ai-lab/ComCLIP), and [Refer](https://github.com/lichengunc/refer). We thank the authors for their models and code.\n\n\n## Citation\nIf you find this work useful in your research or applications, please consider citing us!\n```bibtex\n@article{he2023discriminative,\n  title={Discriminative Diffusion Models as Few-shot Vision and Language Learners},\n  author={He, Xuehai and Feng, Weixi and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, William Yang and Wang, Xin Eric},\n  journal={arXiv preprint arXiv:2305.10722},\n  year={2023}\n}\n```\n","funding_links":[],"categories":["Few-Shot"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feric-ai-lab%2FDiscffusion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feric-ai-lab%2FDiscffusion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feric-ai-lab%2FDiscffusion/lists"}