{"id":30643885,"url":"https://github.com/bytedance-seed/vincie","last_synced_at":"2025-10-29T04:37:31.857Z","repository":{"id":311586057,"uuid":"1010928426","full_name":"ByteDance-Seed/VINCIE","owner":"ByteDance-Seed","description":"Official code for VINCIE: Unlocking In-context Image Editing from Video","archived":false,"fork":false,"pushed_at":"2025-09-06T06:50:35.000Z","size":17190,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-06T08:22:39.495Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ByteDance-Seed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-30T03:50:56.000Z","updated_at":"2025-09-06T06:50:39.000Z","dependencies_parsed_at":"2025-08-25T11:45:40.102Z","dependency_job_id":"587272e5-b65c-4477-89b6-984de2d13042","html_url":"https://github.com/ByteDance-Seed/VINCIE","commit_stats":null,"previous_names":["bytedance-seed/vincie"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ByteDance-Seed/VINCIE","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FVINCIE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FVINCIE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FVINCIE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FVINCIE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ByteDance-Seed","download_url":"https://codeload.github.com/ByteDance-Seed/VINCIE/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FVINCIE/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015720,"owners_count":26085748,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-31T03:09:01.566Z","updated_at":"2025-10-13T14:20:43.438Z","avatar_url":"https://github.com/ByteDance-Seed.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VINCIE: Unlocking In-context Image Editing from Video\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://vincie2025.github.io/\"\u003e\n    \u003cimg\n      src=\"https://img.shields.io/badge/VINCIE-Website-0A66C2?logo=safari\u0026logoColor=white\"\n      alt=\"VINCIE Website\"\n    /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2506.10941\"\u003e\n    \u003cimg\n      src=\"https://img.shields.io/badge/VINCIE-Paper-red?logo=arxiv\u0026logoColor=red\"\n      alt=\"VINCIE Paper on ArXiv\"\n    /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/ByteDance-Seed/VINCIE\"\u003e\n            \u003cimg \n              alt=\"Github\" src=\"https://img.shields.io/badge/VINCIE-Codebase-536af5?color=536af5\u0026logo=github\"\n              alt=\"VINCIE Codebase\"\n            /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/collections/ByteDance-Seed/vincie-6864cc2e3116d82e4a83a17c\"\u003e\n    \u003cimg \n        src=\"https://img.shields.io/badge/VINCIE-Models-yellow?logo=huggingface\u0026logoColor=yellow\" \n        alt=\"VINCIE Models\"\n    /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/datasets/leigangqu/VINCIE-10M\"\u003e\n    \u003cimg \n        src=\"https://img.shields.io/badge/VINCIE-Dataset-yellow?logo=huggingface\u0026logoColor=yellow\" \n        alt=\"VINCIE-10M Dataset\"\n    /\u003e\n  \u003c/a\u003e\n   \u003c!-- \u003ca href=\"https://huggingface.co/spaces/ByteDance-Seed/VINCIE-3B\"\u003e\n    \u003cimg \n        src=\"https://img.shields.io/badge/VINCIE-Space-orange?logo=huggingface\u0026logoColor=yellow\" \n        alt=\"VINCIE Space\"\n    /\u003e\n  \u003c/a\u003e --\u003e\n\u003c/p\u003e\n\n\u003e [Leigang Qu](https://leigang-qu.github.io/), [Feng Cheng](https://klauscc.github.io/), [Ziyan Yang](https://ziyanyang.github.io/), [Qi Zhao](https://kevinz8866.github.io/), [Shanchuan Lin](https://scholar.google.com/citations?user=EDWUw7gAAAAJ\u0026hl=en), [Yichun Shi](https://seasonsh.github.io/), [Yicong Li](https://yl3800.github.io/), [Wenjie Wang](https://wenjiewwj.github.io/), [Tat-Seng Chua](https://www.chuatatseng.com/), [Lu Jiang](http://www.lujiang.info/index.html)\n\u003e \n\u003e In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (*e.g.*, segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/teaser.jpeg\" width=\"95%\"\u003e\u003c/p\u003e\n\n\n## News\n- **6 Sep, 2025:** Released the [VINCIE-3B checkpoint](https://huggingface.co/ByteDance-Seed/VINCIE-3B) (full attention).\n- **25 Aug, 2025:** Released the official [website](https://vincie2025.github.io/) and the inference code.\n- **23 Aug, 2025:** Released the [VINCIE-10M dataset](https://huggingface.co/datasets/leigangqu/VINCIE-10M). \n- **12 Jun, 2025:** Released the [VINCIE technical report](https://arxiv.org/abs/2506.10941) . \n\n\n## Quick Start\n\n1️⃣  Set up environment\n```bash\ngit clone https://github.com/ByteDance-Seed/VINCIE\ncd VINCIE\nconda create -n vincie python=3.10 -y\nconda activate vincie\npip install -r requirements.txt\npip install flash_attn==2.6.3 --no-build-isolation\n```\n\n2️⃣  Download pretrained checkpoint\n```python\nfrom huggingface_hub import snapshot_download\n\nsave_dir = \"ckpt/VINCIE-3B\"\nrepo_id = \"ByteDance-Seed/VINCIE-3B\"\ncache_dir = save_dir + \"/cache\"\n\nsnapshot_download(cache_dir=cache_dir,\n  local_dir=save_dir,\n  repo_id=repo_id,\n  local_dir_use_symlinks=False,\n  resume_download=True\n)\n\n```\n\n\n## Inference for Multi-turn Image Editing\n```bash\nturn1=\"Lower the pineapple beside her face, and change it to a smaller one.\"\nturn2=\"Add a crown to the woman's head. \"\nturn3=\"Change the woman’s expression so that she is laughing.\"\nturn4=\"Change the background to a pastel gradient of blue and lavender.\"\nturn5=\"Add a colorful bird hovering above the crown.\"\ninput_img=assets/woman_pineapple.png\noutput_dir=output/woman_pineapple\n\npython main.py configs/generate.yaml \\\n    generation.positive_prompt.image_path=\"[\\\"$input_img\\\"]\" \\\n    generation.positive_prompt.prompts=\"[\\\"$turn1\\\", \\\"$turn2\\\", \\\"$turn3\\\", \\\"$turn4\\\", \\\"$turn5\\\"]\" \\\n    generation.output.dir=$output_dir\n```\n\n## Inference for Multi-concept Composition\n```bash\np1=\"\u003cIMG1\u003e: \"; p2=\"\u003cIMG2\u003e: \"; p3=\"\u003cIMG3\u003e: \"; p4=\"\u003cIMG4\u003e: \"; p5=\"\u003cIMG5\u003e: \"\np6=\"Based on \u003cIMG0\u003e, \u003cIMG1\u003e, \u003cIMG2\u003e, \u003cIMG3\u003e, \u003cIMG4\u003e, and \u003cIMG5\u003e, A smiling multi-generational family including the father in \u003cIMG0\u003e, mother in \u003cIMG1\u003e, son in \u003cIMG2\u003e, daughter in \u003cIMG3\u003e, dog in \u003cIMG4\u003e, and cat in \u003cIMG5\u003e,  poses for a portrait amidst the sunlit trees and ferns of a forest. Output \u003cIMG6\u003e: \"\nimg0=\"./assets/father.png\"; img1=\"./assets/mother.png\"; img2=\"./assets/son.png\"; img3=\"./assets/daughter.png\"; img4=\"./assets/dog1.png\"; img5=\"./assets/cat.png\"; \noutput_dir=output/family\n\npython main.py configs/generate.yaml \\\n    generation.pad_img_placehoder=False \\\n    generation.positive_prompt.image_path=\"[\\\"$img0\\\", \\\"$img1\\\", \\\"$img2\\\", \\\"$img3\\\", \\\"$img4\\\", \\\"$img5\\\"]\" \\\n    generation.positive_prompt.prompts=\"[\\\"$p1\\\", \\\"$p2\\\", \\\"$p3\\\", \\\"$p4\\\", \\\"$p5\\\", \\\"$p6\\\"]\" \\\n    generation.output.dir=$output_dir\n```\n\n\n## Citation\n\n```bibtex\n@article{qu2025vincie,\n  title   = {VINCIE: Unlocking In-context Image Editing from Video},\n  author  = {Qu, Leigang and Cheng, Feng and Yang, Ziyan and Zhao, Qi and Lin, Shanchuan and Shi, Yichun and Li, Yicong and Wang, Wenjie and Chua, Tat-Seng and Jiang, Lu},\n  journal = {arXiv preprint arXiv:2506.10941},\n  year    = {2025}\n}\n```\n\n## License\nThis project is licensed under the [Apache-2.0 License](LICENSE), subject to any intellectual property rights in the model owned by ByteDance. The text encoder of the model is adapted from [Qwen-14B](https://huggingface.co/Qwen/Qwen-14B) and your use of that model must comply with its license. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance-seed%2Fvincie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytedance-seed%2Fvincie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance-seed%2Fvincie/lists"}