{"id":13861707,"url":"https://github.com/apple/ml-mgie","last_synced_at":"2025-04-13T22:27:55.600Z","repository":{"id":220728234,"uuid":"732203523","full_name":"apple/ml-mgie","owner":"apple","description":null,"archived":false,"fork":false,"pushed_at":"2024-03-15T21:35:14.000Z","size":6252,"stargazers_count":3853,"open_issues_count":5,"forks_count":252,"subscribers_count":62,"default_branch":"main","last_synced_at":"2024-10-29T15:38:22.592Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-15T23:04:40.000Z","updated_at":"2024-10-25T03:01:16.000Z","dependencies_parsed_at":"2024-08-05T06:13:52.252Z","dependency_job_id":null,"html_url":"https://github.com/apple/ml-mgie","commit_stats":null,"previous_names":["apple/ml-mgie"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mgie","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mgie/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mgie/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mgie/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apple","download_url":"https://codeload.github.com/apple/ml-mgie/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":248790350,"owners_count":21161996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T06:01:28.515Z","updated_at":"2025-04-13T22:27:55.552Z","avatar_url":"https://github.com/apple.png","language":"Python","funding_links":[],"categories":["Python","HarmonyOS","Repos","排行榜 [2025-03-18]"],"sub_categories":["Windows Manager"],"readme":"# Guiding Instruction-based Image Editing via Multimodal Large Language Models\nThis repo contains the code for [Guiding Instruction-based Image Editing via Multimodal Large Language Models](https://arxiv.org/abs/2309.17102) (ICLR'24 Spotlight)\n\n## Overview\nMGIE is an implementation of \u003cbr\u003e\n\"[Guiding Instruction-based Image Editing via Multimodal Large Language Models](https://arxiv.org/abs/2309.17102)\" \u003cbr\u003e\n[Tsu-Jui Fu](https://scholar.google.com/citations?user=7QRDcC0AAAAJ), [Wenze Hu](https://scholar.google.com/citations?user=0YPYs5UAAAAJ), [Xianzhi Du](https://scholar.google.com/citations?user=l1hP40AAAAAJ), [William Yang Wang](https://scholar.google.com/citations?user=gf8Ms_8AAAAJ), [Yinfei Yang](https://scholar.google.com/citations?user=kvDbu90AAAAJ), and [Zhe Gan](https://scholar.google.com/citations?user=E64XWyMAAAAJ) \u003cbr\u003e\nin International Conference on Learning Representations (**ICLR**) 2024\n\n\u003cimg src='./mgie.png' width='70%' /\u003e\n\nInstruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. 
However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate **how MLLMs facilitate edit instructions** and present MLLM-Guided Image Editing (MGIE). MGIE learns to **derive expressive instructions** and provides explicit guidance. The editing model **jointly captures this visual imagination and performs manipulation** through end-to-end training.\n\n## Requirements\n```\nconda create -n mgie python=3.10 -y\nconda activate mgie\nconda update -n base -c defaults conda setuptools -y\nconda install -c conda-forge git git-lfs ffmpeg vim htop ninja gpustat -y\nconda clean -a -y\n\npip install -U pip cmake cython==0.29.36 pydantic==1.10 numpy\npip install -U gdown pydrive2 wget jupyter jupyterlab jupyterthemes ipython\npip install -U sentencepiece transformers diffusers tokenizers datasets gradio==3.37 accelerate evaluate git+https://github.com/openai/CLIP.git\npip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl\npip install -U deepspeed\n\n# git clone this repo\ncd ml-mgie\ngit submodule update --init --recursive\ncd LLaVA\npip install -e .\npip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl\npip install -U ninja flash-attn==1.0.2\npip install -U pydrive2 gdown wget\n\ncd ..\ncp mgie_llava.py LLaVA/llava/model/llava.py\ncp mgie_train.py LLaVA/llava/train/train.py\n```\n\n## Quick Start\nPut official 
[LLaVA-7B](https://huggingface.co/liuhaotian/LLaVA-Lightning-7B-delta-v1-1) in [_ckpt/LLaVA-7B-v1](_ckpt) and download pre-trained [ckpt](https://docs-assets.developer.apple.com/ml-research/models/mgie/mgie_7b.tar.gz) (on IPr2Pr + MagicBrush) in [_ckpt/mgie_7b](_ckpt)\n```\ndemo.ipynb\n```\n\u003cimg src='./demo.png' width='50%' /\u003e\n\nNotices: Apple's rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.\n\n## Usage\n### Data\nDownload CLIP-filtered [IPr2Pr](https://github.com/timothybrooks/instruct-pix2pix) and process (including summarized expressive instruction) in [_data](_data)\n```\nprocess_data.ipynb\n```\nThere are [examples](_data) to help prepare the data\n\n### Train\nPut [Vicuna-7B](https://huggingface.co/lmsys/vicuna-7b-delta-v1.1) and [LLaVA-7B](https://huggingface.co/liuhaotian/LLaVA-Lightning-7B-delta-v1-1) in [_ckpt/vicuna-7b-v1.1](_ckpt) and [_ckpt/LLaVA-7B-v1](_ckpt)\n```\nWANDB_DISABLED='true' torchrun --nnodes=1 --nproc_per_node=8 --master_port=7122 LLaVA/llava/train/train_mem.py --model_name_or_path ./_ckpt/vicuna-7b-v1.1 --version v1 --vision_tower openai/clip-vit-large-patch14 --mm_vision_select_layer -2 --mm_use_im_start_end True --bf16 True --output_dir _snapshot/mgie --num_train_epochs 40 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 --dataloader_num_workers 2 --gradient_accumulation_steps 1 --evaluation_strategy 'no' --save_strategy 'steps' --save_steps 2000 --save_total_limit 10 --learning_rate 5e-4 --weight_decay 0. 
--warmup_ratio 0.03 --lr_scheduler_type 'cosine' --logging_steps 1 --tf32 True --model_max_length 512 --gradient_checkpointing True --lazy_preprocess True\n```\n\n### Inference\nExtract trained ckpt in [_ckpt/mgie_7b](_ckpt)\n```\nextract_ckpt.ipynb\n```\nRun our demo\n```\ndemo.ipynb\n```\n\n## Citation\n```\n@inproceedings{fu2024mgie,\n  author = {Tsu-Jui Fu and Wenze Hu and Xianzhi Du and William Yang Wang and Yinfei Yang and Zhe Gan}, \n  title = {{Guiding Instruction-based Image Editing via Multimodal Large Language Models}}, \n  booktitle = {International Conference on Learning Representations (ICLR)}, \n  year = {2024} \n}\n```\n\n## Acknowledgement\n+ [LLaVA](https://github.com/haotian-liu/LLaVA/tree/7ace501183c4bdec6052ec1a30039cdc3242a67c): the codebase we built upon\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-mgie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapple%2Fml-mgie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-mgie/lists"}