{"id":13488375,"url":"https://github.com/Qrange-group/SUR-adapter","last_synced_at":"2025-03-28T00:33:46.761Z","repository":{"id":164397529,"uuid":"639827451","full_name":"Qrange-group/SUR-adapter","owner":"Qrange-group","description":"ACM MM'23 (oral), SUR-adapter for pre-trained diffusion models can acquire the powerful semantic understanding and reasoning capabilities from large language models to build a high-quality textual semantic representation for text-to-image generation.","archived":false,"fork":false,"pushed_at":"2024-04-24T03:09:31.000Z","size":2120,"stargazers_count":117,"open_issues_count":8,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-10-31T00:39:58.220Z","etag":null,"topics":["adapter","diffusion-models","image-generation","knowledge-distillation","large-language-models","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Qrange-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-12T10:10:38.000Z","updated_at":"2024-10-29T12:08:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"a181dd67-48be-4221-8c4d-65f7bbb0c291","html_url":"https://github.com/Qrange-group/SUR-adapter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qrange-group%2FSUR-adapter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qrange-group%2FSUR-adapter/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qrange-group%2FSUR-adapter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qrange-group%2FSUR-adapter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Qrange-group","download_url":"https://codeload.github.com/Qrange-group/SUR-adapter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245949278,"owners_count":20698913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapter","diffusion-models","image-generation","knowledge-distillation","large-language-models","pytorch"],"created_at":"2024-07-31T18:01:14.617Z","updated_at":"2025-03-28T00:33:41.716Z","avatar_url":"https://github.com/Qrange-group.png","language":"Python","funding_links":[],"categories":["T2I Diffusion Model augmentation"],"sub_categories":[],"readme":"# SUR-adapter \n![GitHub](https://img.shields.io/github/license/gbup-group/DIANet.svg)\n![GitHub](https://img.shields.io/badge/Qrange%20-group-orange)\n\nBy [Shanshan Zhong](https://github.com/zhongshsh) and [Zhongzhan Huang](https://dedekinds.github.io) and [Wushao Wen](https://scholar.google.com/citations?user=FSnLWy4AAAAJ) and [Jinghui Qin](https://github.com/QinJinghui) and [Liang Lin](https://scholar.google.com/citations?user=Nav8m8gAAAAJ\u0026hl=en)\n\nThis repository is the implementation of \"SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models\" [[paper]](https://arxiv.org/abs/2305.05189). 
Our paper has been accepted at the 31st ACM International Conference on Multimedia (ACM MM 2023, Oral).\n\n\n## 🌻 Introduction\n\n**Semantic Understanding and Reasoning** adapter (SUR-adapter) for pre-trained **diffusion models** can acquire the powerful semantic understanding and reasoning capabilities from **large language models** to build a high-quality textual semantic representation for text-to-image generation. \n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/Qrange-group/RAS/assets/62104945/af863827-2ea4-45cb-b3ed-2f98ba0e7d03\"\u003e\n\u003c/p\u003e\n\n## 📣 News\n\n\n2024/02/27 - We have provided a filtered (non-NSFW) version of the SURD dataset [[Google Drive](https://drive.google.com/file/d/1HOikHEXY4_75cafEK3HmqRhPAaSYEeHh/view?usp=drive_link)]. Please try it! \n\n2023/10/20 - We have provided an example checkpoint of SUR-adapter [[Google Drive](https://drive.google.com/drive/folders/1UyC9_AqTezmHXmj4dh0A-9RBKKx_JmJZ?usp=share_link)]. Please try it! \n\n2023/08/19 - We have provided the data scraping code for Civitai. Please take a look at [processing](https://github.com/Qrange-group/SUR-adapter/blob/main/data_collect/processing.ipynb).\n\n## 🏇 TODO\n\n- [x] data collection script\n- [x] pretrained model\n- [x] dataset\n\n## 🌻 Quick Training\n\n(1) Clone the code. \n\n```sh\ngit clone https://github.com/Qrange-group/SUR-adapter\n```\n```sh\ncd SUR-adapter\n```\n\n(2) Prepare the environment.\n\nIf **PyTorch** is not installed, you can install it through the [official website guide](https://pytorch.org/get-started/locally). 
For example, if `nvidia-smi` reports that your `CUDA Version` is 11.1, you can install PyTorch with the following command:\n```sh\npip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu111\n```\n\nThen install `diffusers` following the [guide](https://huggingface.co/docs/diffusers/installation).\n```sh\npip install diffusers[\"torch\"]\n```\n\nFinally, install the relevant packages.\n```sh\npip install -r requirements.txt\n```\n\n(3) Download the dataset and vectors.\n\n```sh\ngdown --fuzzy 'https://drive.google.com/file/d/1HOikHEXY4_75cafEK3HmqRhPAaSYEeHh/view?usp=sharing'\nunzip SURD.zip\nmkdir -p prompt2vec/13B\ngdown --fuzzy 'https://drive.google.com/file/d/1u6K3uvTr7G58I_i98PkPitzp1jDiXLLX/view?usp=sharing' -O prompt2vec/13B\n```\n\n\n(4) Run the following command in a shell, where `0` is the GPU id. If you encounter a CUDA out-of-memory error, you can look for a solution in the [documentation](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/fp16). \n\n```sh\nsh run.sh 0\n```\n\n**Quick Training** uses only about 5200 MiB of GPU memory. If your GPU has enough memory, you can increase the batch size or disable mixed precision. The parameters of `run.sh` are described below; the details can be found in `SUR_adapter_train.py`. 
\n\n```sh\nexport CUDA=$1               # GPU id\nexport LLM=\"13B\"             # size of LLM\nexport LLM_LAYER=39          # LLM layer whose vectors are used\nexport MODEL_NAME=\"runwayml/stable-diffusion-v1-5\"  # pre-trained diffusion model\nexport INFO=\"test\"           # helps to identify the checkpoints\nexport OUTPUT_DIR=\"fp16\"     # helps to identify the checkpoints\nexport TRAIN_DIR=\"SURD\"      # dataset\nexport SAVE_STEP=100         # save a checkpoint every SAVE_STEP steps\nexport BATCH_SIZE=1          # batch size\n\n# please see https://huggingface.co/docs/diffusers/v0.16.0/en/training/text2image to get more details of training args\n# note: nothing may follow a line-continuation backslash, not even whitespace\nCUDA_VISIBLE_DEVICES=$CUDA accelerate launch SUR_adapter_train.py \\\n  --mixed_precision=\"fp16\" \\\n  --info=$INFO \\\n  --pretrained_model_name_or_path=$MODEL_NAME \\\n  --dataset_name=$TRAIN_DIR \\\n  --output_dir=$OUTPUT_DIR \\\n  --llm=$LLM \\\n  --llm_layer=$LLM_LAYER \\\n  --checkpointing_steps=$SAVE_STEP \\\n  --train_batch_size=$BATCH_SIZE \\\n  --resolution=512 --center_crop --random_flip \\\n  --gradient_accumulation_steps=4 \\\n  --gradient_checkpointing \\\n  --max_train_steps=5000 \\\n  --learning_rate=1e-05 \\\n  --prompt_weight=1e-05 \\\n  --llm_weight=1e-05 \\\n  --adapter_weight=1e-01 \\\n  --max_grad_norm=1 \\\n  --lr_scheduler=\"constant\" --lr_warmup_steps=0\n```\n\n## 🌻 Dataset Declaration\n\n### Non-NSFW Version \n\nAs our original dataset SURD contains some sexually explicit images and others unsuitable for dissemination, we use the [nsfw toolkit](https://github.com/rockyzhengwu/nsfw) to filter SURD. [nsfw](https://github.com/rockyzhengwu/nsfw) categorizes images into five groups: `porn`, `hentai`, `sexy`, `neutral`, and `drawings` (for more details, refer to [description](https://github.com/alex000kim/nsfw_data_scraper?tab=readme-ov-file#description)). 
We retain only images labeled `neutral` or `drawings`, so that they are safe for the workplace; these form the work-appropriate version of SURD (26121 samples).\n\n### Updating Dataset\n\nYou can try to collect more up-to-date data from the internet. We have provided the data scraping code for [Civitai](https://civitai.com). Please take a look at [processing](https://github.com/Qrange-group/SUR-adapter/blob/main/data_collect/processing.ipynb). Afterward, prepare the dataset in the format of `SURD`. If you run into problems, the [datasets documentation](https://huggingface.co/docs/datasets/create_dataset) has more details. \n\n❣ **Warning** ❣: The dataset SURD proposed in our work is collected from [Lexica](https://lexica.art) ([license](https://lexica.art/license)), [Civitai](https://civitai.com) ([license](https://github.com/civitai/civitai/blob/main/LICENSE)), and [Stable Diffusion Online](https://stablediffusionweb.com) ([license](https://huggingface.co/spaces/CompVis/stable-diffusion-license)). The licenses point out that if the dataset is used for commercial purposes, there may be certain legal risks. If it is to be used for commercial purposes, please contact the relevant website or author for authorization.\n\n \n\n## 🌻 Prompt2vec\n\nWe use [LLaMA](https://github.com/facebookresearch/llama), a collection of foundation language models ranging from 7B to 65B parameters, as the large language model (LLM) for knowledge distillation. Specifically, we save the vector representations of simple prompts from the `i`-th layer of the LLM, which serve as the text understanding used to fine-tune diffusion models. 
If you want to output the vectors from [LLaMA](https://github.com/facebookresearch/llama), we recommend focusing on the [following two lines](https://github.com/facebookresearch/llama/blob/main/llama/model.py#L234-L235) of [LLaMA](https://github.com/facebookresearch/llama).\n\n```python\n        for layer in self.layers:\n            h = layer(h, start_pos, freqs_cis, mask)\n```\n\nThe data format for prompt2vec is as follows. \n\n```\n{\n  \"prompt\": torch.tensor,\n}\n```\n\nOnce your prompt2vec `.pt` file is ready, save it to the `prompt2vec` folder. For example, you can save the prompt vectors from the fortieth layer of LLaMA (13B) to `prompt2vec/13B/39.pt`. \n\n## 🌻 Inference\n\nRun `demo.ipynb`, which contains the following code.\n\n```python\nimport os\nos.environ['CUDA_VISIBLE_DEVICES']='0'  # GPU id\n\nfrom SUR_adapter_pipeline import SURStableDiffusionPipeline\nimport torch\nfrom SUR_adapter import Adapter\n\n# Load the trained adapter checkpoint\nadapter_path = \"checkpoints/runwayml_fp16/test_llm13B_llml39_lr1e-05_llmw1e-05_promptw1e-05_adapterw0.1/adapter_checkpoint1000.pt\"\nadapter=Adapter().to(\"cuda\")\nadapter.load_state_dict(torch.load(adapter_path))\n# Recover the adapter weight from the `adapterw...` part of the checkpoint directory name\nadapter.adapter_weight = float(adapter_path.split(\"adapterw\")[-1].split('/')[0])\n\nmodel_path = \"runwayml/stable-diffusion-v1-5\"\npipe = SURStableDiffusionPipeline.from_pretrained(model_path, adapter=adapter)\npipe.to(\"cuda\")\n# Replace the safety checker with a pass-through that returns the images unchanged\npipe.safety_checker = lambda images, clip_input: (images, False)\n\nimage = pipe(prompt='An aristocratic maiden in medieval attire with a headdress of brilliant feathers').images[0]\nimage.show()\n```\n\n## 🌸 Citation\n\n```\n@inproceedings{zhong2023adapter,\n  title={Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models},\n  author={Zhong, Shanshan and Huang, Zhongzhan and Wen, Wushao and Qin, Jinghui and Lin, Liang},\n  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},\n  pages={567--578},\n  year={2023}\n}\n```\n\n## 💖 Acknowledgments\n\nMany 
thanks to [huggingface](https://github.com/huggingface) for their [diffusers](https://github.com/huggingface/diffusers) library for image generation. I love open source. \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQrange-group%2FSUR-adapter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FQrange-group%2FSUR-adapter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQrange-group%2FSUR-adapter/lists"}