{"id":13487916,"url":"https://github.com/TencentARC/SmartEdit","last_synced_at":"2025-03-27T23:32:03.275Z","repository":{"id":211261162,"uuid":"727778641","full_name":"TencentARC/SmartEdit","owner":"TencentARC","description":"Official code of SmartEdit [CVPR-2024 Highlight]","archived":false,"fork":false,"pushed_at":"2024-06-21T11:29:00.000Z","size":1778,"stargazers_count":309,"open_issues_count":19,"forks_count":11,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-03-23T14:08:32.985Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-05T14:58:15.000Z","updated_at":"2025-03-18T06:43:28.000Z","dependencies_parsed_at":"2024-10-30T23:31:42.817Z","dependency_job_id":"d1ff309b-2f90-4bff-9d0a-6da4e926d416","html_url":"https://github.com/TencentARC/SmartEdit","commit_stats":null,"previous_names":["tencentarc/smartedit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FSmartEdit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FSmartEdit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FSmartEdit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FSmartEdit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":
"https://codeload.github.com/TencentARC/SmartEdit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245944020,"owners_count":20697945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T18:01:06.400Z","updated_at":"2025-03-27T23:32:03.239Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","readme":"\u003c!-- ## \u003cdiv align=\"center\"\u003e\u003cb\u003ePhotoMaker\u003c/b\u003e\u003c/div\u003e --\u003e\n\u003cp align=\"center\"\u003e \u003cimg src=\"https://yuzhou914.github.io/SmartEdit/assets/Logo.jpg\" height=100\u003e \u003c/p\u003e\n\u003cdiv align=\"center\"\u003e\n  \n## SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (CVPR-2024 Highlight)\n[[Paper](https://arxiv.org/abs/2312.06739)]\n[[Project Page](https://yuzhou914.github.io/SmartEdit/)]\n[Demo] \u003cbe\u003e\n\u003c/div\u003e\n\n🔥🔥 2024.04. SmartEdit is released!\n\n🔥🔥 2024.04. SmartEdit is selected as highlight by CVPR-2024!\n\n🔥🔥 2024.02. SmartEdit is accepted by CVPR-2024!\n\nIf you are interested in our work, please star ⭐ our project. 
\n\u003cbr\u003e\n\n### SmartEdit Framework\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://yuzhou914.github.io/SmartEdit/assets/2-SmartEdit.jpg\"\u003e\n\u003c/p\u003e\n\n\n### SmartEdit on Understanding Scenarios\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://yuzhou914.github.io/SmartEdit/assets/3-Understanding.jpg\"\u003e\n\u003c/p\u003e\n\n### SmartEdit on Reasoning Scenarios\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://yuzhou914.github.io/SmartEdit/assets/4-Reasoning.jpg\"\u003e\n\u003c/p\u003e\n\n\n### Dependencies and Installation\n        pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118\n        pip install -r requirements.txt \n        git clone https://github.com/Dao-AILab/flash-attention.git\n        cd flash-attention\n        pip install . --no-build-isolation\n        cd ..\n\n### Training model preparation\n- Please put the prepared checkpoints in the `checkpoints` folder.\n- Prepare Vicuna-1.1-7B/13B checkpoint: please download [Vicuna-1.1-7B](https://huggingface.co/lmsys/vicuna-7b-v1.1) and [Vicuna-1.1-13B](https://huggingface.co/lmsys/vicuna-13b-v1.1) from the links.\n- Prepare LLaVA-1.1-7B/13B checkpoint: please follow the [LLaVA instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to prepare LLaVA-1.1-7B/13B weights.\n- Prepare InstructDiffusion checkpoint: please download [InstructDiffusion(v1-5-pruned-emaonly-adaption-task.ckpt)](https://github.com/cientgu/InstructDiffusion/tree/main) and the repo from the link. 
Download them first and use `python convert_original_stable_diffusion_to_diffusers.py --checkpoint_path \"./checkpoints/InstructDiffusion/v1-5-pruned-emaonly-adaption-task.ckpt\" --original_config_file \"./checkpoints/InstructDiffusion/configs/instruct_diffusion.yaml\" --dump_path \"./checkpoints/InstructDiffusion_diffusers\"`.\n\n### Training dataset preparation\n- Please put the prepared datasets in the `dataset` folder.\n- Prepare CC12M dataset: https://storage.googleapis.com/conceptual_12m/cc12m.tsv.\n- Prepare InstructPix2Pix and MagicBrush datasets: these two datasets, [InstructPix2Pix](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) and [MagicBrush](https://huggingface.co/datasets/osunlp/MagicBrush), are available on Hugging Face. Download them first and use `python process_HF.py` to convert them from \"parquet\" files to \"arrow\" files.\n- Prepare RefCOCO, GRefCOCO and COCOStuff datasets: please follow [InstructDiffusion](https://github.com/cientgu/InstructDiffusion/tree/main/dataset) to prepare them.\n- Prepare LISA ReasonSeg dataset: please follow [LISA](https://github.com/dvlab-research/LISA#dataset) to prepare it.\n- Prepare our synthetic editing dataset: please download it from this [link](https://drive.google.com/drive/folders/1SMkQe1U9av4YNML5wqOLN7crLiNs0aTF).\n\n### Stage-1: textual alignment with CC12M\n- Use the script to train:\n\n        bash scripts/TrainStage1_7b.sh\n        bash scripts/TrainStage1_13b.sh\n- Then, use the script to run inference:\n\n        python test/TrainStage1_inference.py --model_name_or_path \"./checkpoints/vicuna-7b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-7B-v1\" --save_dir './checkpoints/stage1_CC12M_alignment_7b/Results-100000' --pretrain_model \"./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin\" --get_orig_out --LLaVA_version \"v1.1-7b\"\n        python test/TrainStage1_inference.py --model_name_or_path \"./checkpoints/vicuna-13b-v1-1\" --LLaVA_model_path 
\"./checkpoints/LLaVA-13B-v1\" --save_dir './checkpoints/stage1_CC12M_alignment_13b/Results-100000' --pretrain_model \"./checkpoints/stage1_CC12M_alignment_13b/embeddings_qformer/checkpoint-150000.bin\" --get_orig_out --LLaVA_version \"v1.1-13b\"\n\n### Stage-2: SmartEdit training\n- Use the script to train first:\n\n        bash scripts/MLLMSD_7b.sh\n        bash scripts/MLLMSD_13b.sh\n- Then, use the script to train:\n\n        bash scripts/SmartEdit_7b.sh\n        bash scripts/SmartEdit_13b.sh\n\n### Inference\n- Please download [SmartEdit-7B](https://huggingface.co/TencentARC/SmartEdit-7B) and [SmartEdit-13B](https://huggingface.co/TencentARC/SmartEdit-13B) checkpoints and put them in file `checkpoints`\n- Please download [Reason-Edit evaluation benchmark](https://drive.google.com/drive/folders/1QGmye23P3vzBBXjVj2BuE7K3n8gaWbyQ) and put it in file `dataset`\n\n- Use the script to inference on understanding and reasoning scenes:\n\n        python test/DS_SmartEdit_test.py --is_understanding_scenes True --model_name_or_path \"./checkpoints/vicuna-7b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-7B-v1\" --save_dir './checkpoints/SmartEdit-7B/Understand-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-7B\" --sd_qformer_version \"v1.1-7b\" --resize_resolution 256\n        python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path \"./checkpoints/vicuna-7b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-7B-v1\" --save_dir './checkpoints/SmartEdit-7B/Reason-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-7B\" --sd_qformer_version \"v1.1-7b\" --resize_resolution 256\n        python test/DS_SmartEdit_test.py --is_understanding_scenes True --model_name_or_path \"./checkpoints/vicuna-13b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-13B-v1\" --save_dir './checkpoints/SmartEdit-13B/Understand-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-13B\" --sd_qformer_version \"v1.1-13b\" --resize_resolution 256\n        
python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path \"./checkpoints/vicuna-13b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-13B-v1\" --save_dir './checkpoints/SmartEdit-13B/Reason-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-13B\" --sd_qformer_version \"v1.1-13b\" --resize_resolution 256\n- You can also run inference on reasoning scenes at a different resolution:\n\n        python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path \"./checkpoints/vicuna-7b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-7B-v1\" --save_dir './checkpoints/SmartEdit-7B/Reason-384-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-7B\" --sd_qformer_version \"v1.1-7b\" --resize_resolution 384\n        python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path \"./checkpoints/vicuna-13b-v1-1\" --LLaVA_model_path \"./checkpoints/LLaVA-13B-v1\" --save_dir './checkpoints/SmartEdit-13B/Reason-384-15000' --steps 15000 --total_dir \"./checkpoints/SmartEdit-13B\" --sd_qformer_version \"v1.1-13b\" --resize_resolution 384\n\n### Explanation of new tokens:\n- The original vocabulary size of LLaMA-1.1 (both 7B and 13B) is 32000, while that of LLaVA-1.1 (both 7B and 13B) is 32003, which adds 32000=\"\u003cim_patch\u003e\", 32001=\"\u003cim_start\u003e\", 32002=\"\u003cim_end\u003e\". In SmartEdit, we keep \"\u003cim_start\u003e\" and \"\u003cim_end\u003e\" from LLaVA and remove \"\u003cim_patch\u003e\". In addition, we add one special token called \"img\" for the system message to generate an image, and 32 tokens that summarize image and text information for the conversation system (\"\u003cimg_0\u003e...\u003cimg_31\u003e\"). Therefore, the vocabulary size of SmartEdit is 32035, where \"img\"=32000, \"\u003cim_start\u003e\"=32001, \"\u003cim_end\u003e\"=32002, and the 32 new tokens are 32003~32034. 
Only the 32 new tokens are effective embeddings for QFormer.\n- We explain the meaning of the new embeddings here to avoid misunderstanding; there is no need to merge LoRA weights after you download the SmartEdit checkpoints. If you downloaded the SmartEdit checkpoints before 2024.4.28, please re-download only the checkpoints in the LLM-15000 folder. Besides, when preparing [LLaVA checkpoints](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md), you must first convert the LLaMA-delta-weight, since it is under policy protection, and LLaVA fine-tunes the whole LLaMA weights.\n\n### Metrics Evaluation\n- Use the script to compute metrics on Reason-Edit (256x256 resolution):\n \n        python test/metrics_evaluation.py --edited_image_understanding_dir \"./checkpoints/SmartEdit-7B/Understand-15000\" --edited_image_reasoning_dir \"./checkpoints/SmartEdit-7B/Reason-15000\"\n        python test/metrics_evaluation.py --edited_image_understanding_dir \"./checkpoints/SmartEdit-13B/Understand-15000\" --edited_image_reasoning_dir \"./checkpoints/SmartEdit-13B/Reason-15000\"\n\n### Todo List\n- [ ] Release checkpoints that could conduct \"add\" functionality (e.g., \"Add a smaller elephant\").\n\n### Contact\nFor any questions, feel free to email yuzhouhuang@link.cuhk.edu.cn and lb.xie@siat.ac.cn\n\n### Citation\t\n```\n@inproceedings{huang2024smartedit,\n  title={Smartedit: Exploring complex instruction-based image editing with multimodal large language models},\n  author={Huang, Yuzhou and Xie, Liangbin and Wang, Xintao and Yuan, Ziyang and Cun, Xiaodong and Ge, Yixiao and Zhou, Jiantao and Dong, Chao and Huang, Rui and Zhang, Ruimao and others},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={8362--8371},\n  year={2024}\n}\n```\n","funding_links":[],"categories":["Text Guided Image Editing","Paper List"],"sub_categories":["Follow-up 
Papers"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentARC%2FSmartEdit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTencentARC%2FSmartEdit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentARC%2FSmartEdit/lists"}