{"id":14769226,"url":"https://github.com/ZichengDuan/EZIGen","last_synced_at":"2025-09-14T07:30:44.087Z","repository":{"id":256889032,"uuid":"856730265","full_name":"ZichengDuan/EZIGen","owner":"ZichengDuan","description":"[BMVC 2025] Official implementation for paper EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance","archived":false,"fork":false,"pushed_at":"2025-08-21T04:25:50.000Z","size":65287,"stargazers_count":105,"open_issues_count":1,"forks_count":11,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-08-21T06:48:58.603Z","etag":null,"topics":["aigc","computer-vision","deep-learning","diffusers","diffusion-models","generative-ai"],"latest_commit_sha":null,"homepage":"https://zichengduan.github.io/pages/EZIGen/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZichengDuan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-13T05:14:19.000Z","updated_at":"2025-07-30T04:21:39.000Z","dependencies_parsed_at":"2024-09-13T16:42:41.923Z","dependency_job_id":"0c42d257-306f-49c6-8c47-615a086bc588","html_url":"https://github.com/ZichengDuan/EZIGen","commit_stats":null,"previous_names":["zichengduan/ezigen"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ZichengDuan/EZIGen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZichengDuan%2FEZIGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZichengDuan%2FEZIGen/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZichengDuan%2FEZIGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZichengDuan%2FEZIGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZichengDuan","download_url":"https://codeload.github.com/ZichengDuan/EZIGen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZichengDuan%2FEZIGen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275076531,"owners_count":25401314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-14T02:00:10.474Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aigc","computer-vision","deep-learning","diffusers","diffusion-models","generative-ai"],"created_at":"2024-09-16T13:00:31.444Z","updated_at":"2025-09-14T07:30:44.074Z","avatar_url":"https://github.com/ZichengDuan.png","language":"Python","funding_links":[],"categories":["New Concept Learning"],"sub_categories":[],"readme":"\n\n\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eEZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance\u003c/h1\u003e\n\u003cdiv\u003e\n    \u003ca href='https://zichengduan.github.io' target='_blank'\u003eZicheng Duan\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e;\n    \u003ca 
href='https://scholar.google.com/citations?user=uOii3uEAAAAJ\u0026hl=zh-CN' target='_blank'\u003eYuxuan Ding\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e;\n    \u003ca href='https://scholar.google.com/citations?hl=zh-CN\u0026user=tlhShPsAAAAJ' target='_blank'\u003eChenhui Gou\u003csup\u003e3\u003c/sup\u003e\u003c/a\u003e;\n    \u003ca href='https://www.linkedin.com/in/ziqin-zhou-6408051b0/?originalSubdomain=au' target='_blank'\u003eZiqin Zhou\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e;\n    \u003ca href='https://www.ethansmith2000.com/' target='_blank'\u003eEthan Smith\u003csup\u003e4\u003c/sup\u003e\u003c/a\u003e;\n    \u003ca href='https://scholar.google.com/citations?hl=en\u0026user=Y2xu62UAAAAJ\u0026view_op=list_works\u0026sortby=pubdate' target='_blank'\u003eLingqiao Liu\u003csup\u003e1,*\u003c/sup\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003csup\u003e1\u003c/sup\u003eAIML, University of Adelaide |\n\u003csup\u003e2\u003c/sup\u003eXidian University |\n\u003csup\u003e3\u003c/sup\u003eMonash University |\n\u003csup\u003e4\u003c/sup\u003eLeonardo.AI \n\n\n[![arxiv](https://img.shields.io/badge/arxiv-EZIGen-red)](https://arxiv.org/abs/2409.08091)\n[![Demo](https://img.shields.io/badge/Project_Page-EZIGen-green)](https://zichengduan.github.io/pages/EZIGen/index.html)\n[![Library](https://img.shields.io/badge/Library-Diffusers-blue)](https://github.com/huggingface/diffusers)\n\n\u003c/div\u003e\n\n![dataset](misc/first5.jpg)\n# Abstract\nZero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image, requiring the model to effectively incorporate both sources of guidance. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. 
In this study, we uncover key insights into achieving a high-quality balance between subject identity preservation and text-following, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, that employs two main components: a carefully crafted subject image encoder based on the pretrained UNet of the Stable Diffusion model, and a process that balances the two guidances by separating their dominance stages and revisiting certain time steps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data.\n\n\n# Overall Structure\n![dataset](misc/main4.jpg)\n\n# TODO List\n- [x] Demo pages\n- [x] Inference code and checkpoint\n- [x] Training code\n- [ ] SDXL support!\n- [ ] Flux support!!\n\n# Installation\nClone this repo\n```\ngit clone git@github.com:ZichengDuan/EZIGen.git\ncd EZIGen\n```\n\nPrepare Conda environment\n```\nconda create -n ezigen python=3.10 -y \u0026\u0026 conda activate ezigen\n```\n\nInstall PyTorch\n```\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118\n```\n\nBuild Diffusers from source\n```\nwget https://github.com/huggingface/diffusers/archive/refs/tags/v0.30.1.zip\nunzip v0.30.1.zip\ncd diffusers-0.30.1\npip install . \u0026\u0026 cd .. \u0026\u0026 rm v0.30.1.zip\n```\n\nInstall remaining dependencies\n```\npip install -r requirements.txt\n```\n\n\n# Inference\nWe provide inference code for both subject-driven generation tasks and subject-driven image editing. 
Example results can be found in the `outputs` folder.\n\n## Download pre-trained checkpoints\nDownload the checkpoint (`checkpoint-200000.zip`) from [Google Drive](https://drive.google.com/file/d/1uucz9IQFT2NbwLvazdnOX2vAROn-0nxU/view?usp=sharing) and unzip it to a local folder.\n\n\nThen open `configs/infer_config.yaml` and set the checkpoint folder path (e.g. `checkpoint-200000/`).\n\n## Personalized image generation\nThe script for subject-driven generation and human content generation is provided in `infer_generation.sh`:\n```\n# infer_generation.sh\npython infer.py \\\n    --config configs/infer_config.yaml \\\n    --guidance_scale 7.5 \\\n    --seed 3154 \\\n    --split_ratio 0.4 \\\n    --infer_steps 50 \\\n    --sim_threshold 0.99 \\\n    --target_prompt \"a dog in police outfit\" \\\n    --subject_prompt \"a dog\" \\\n    --subject_img_path \"example_images/subjects/dog6.png\" \\\n    --output_root \"outputs/\" \\\n    # --num_interations 6\n```\nSome explanations for the arguments:\n1. `split_ratio=0.4` means that the last 40% of timesteps are used for Appearance Transfer and the first 60% for the Layout Generation Process. The value ranges from 0 to 1; a larger value allocates more steps to Appearance Transfer.\n\n2. `sim_threshold` is the CLIP similarity threshold for autostop. `subject_prompt` acts as a placeholder; however, we recommend entering the correct class name of the subject image for the best subject feature extraction.\n\n3. 
`# --num_interations 6`: `num_interations` is set to -1 by default, which defers to the autostop mechanism (a minimum of 3 and a maximum of 10 iterations); to fix the number of iterations, uncomment this line and set the desired value.\n\nExample subjects are provided in `example_images/subjects`.\n\n## Personalized image editing\nThe script for subject-driven image editing is provided in `infer_editing.sh`:\n```\n# infer_editing.sh\npython infer.py \\\n    --config configs/infer_config.yaml \\\n    --guidance_scale 7.5 \\\n    --seed 3154 \\\n    --split_ratio 0.4 \\\n    --infer_steps 50 \\\n    --sim_threshold 0.99 \\\n    --target_prompt \"a woman\" \\\n    --subject_prompt \"a woman\" \\\n    --subject_img_path \"example_images/subjects/lifeifei.png\" \\\n    --output_root \"outputs/\" \\\n    --foreground_mask_path example_images/source_images_with_masks/woman_mask.png \\\n    --source_image_path example_images/source_images_with_masks/woman.png \\\n    --do_editing \\\n    # --num_interations 6\n```\nSome explanations for the arguments:\n1. `source_image_path`: the path to the source RGB image for editing.\n\n2. 
`foreground_mask_path`: the path to a 3-channel mask, with foreground as (255, 255, 255) and background as (0, 0, 0), that marks the area of the source image to edit. The mask should have the same height and width as the source image.\n\nExample inputs are provided in `example_images/source_images_with_masks`.\n\n## Integration with off-the-shelf image generators\nYou can take a generated image from any off-the-shelf image generator and edit it with `infer_editing.sh`. Example results from FLUX are shown below:\n![dataset](misc/integration_flux.jpg)\n\n\n\n# Training\nYou can also start your own training by following the instructions below:\n\n## Prepare training datasets\nDownload the YoutubeVIS2019 dataset (training split) from this link: https://competitions.codalab.org/competitions/20128#participate-get_data\n\nDownload the COCO2014 dataset (train/val splits) from this link: https://cocodataset.org/#download\n\nExtract the data to local folders and configure the corresponding paths in `configs/train_config.yaml`.\n\n## Start training\nAfter preparing the datasets, start DDP training with Hugging Face Accelerate:\n```\nsh train.sh\n```\nAlternatively, you can run the training with plain Python on a single GPU:\n```\npython train.py --config configs/train_config.yaml\n```\nThe checkpoint folders (e.g. checkpoint-5000) and TensorBoard logs are automatically saved to the `output_dir`; the saved checkpoints can then be used for inference.\n\n# Training details\nIn the default setting, with 200k samples, the training takes about 4 hours on 8 A100-40G GPUs, and 26 hours on 1 A100-40G GPU, with a batch size equal to 1 on each device. 
The provided checkpoint was trained on a single GPU, hence the checkpoint suffix `200000`; when training on multiple devices, the checkpoint suffix would be `num_samples / num_GPUs`.\n\n# Acknowledgements\nThanks to [AnyDoor](https://github.com/ali-vilab/AnyDoor) for providing the YoutubeVIS dataset scripts; shout out to this great work!\n\n# Citation\nIf you find this codebase useful for your research, please cite as follows:\n```\n@article{duan2024ezigen,\n  title={EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance},\n  author={Duan, Zicheng and Ding, Yuxuan and Gou, Chenhui and Zhou, Ziqin and Smith, Ethan and Liu, Lingqiao},\n  journal={arXiv preprint arXiv:2409.08091},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZichengDuan%2FEZIGen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FZichengDuan%2FEZIGen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZichengDuan%2FEZIGen/lists"}