{"id":19845959,"url":"https://github.com/vchitect/videobooth","last_synced_at":"2025-04-07T15:09:36.647Z","repository":{"id":214718125,"uuid":"725555872","full_name":"Vchitect/VideoBooth","owner":"Vchitect","description":"[CVPR2024] VideoBooth: Diffusion-based Video Generation with Image Prompts","archived":false,"fork":false,"pushed_at":"2024-06-09T14:51:10.000Z","size":13660,"stargazers_count":292,"open_issues_count":6,"forks_count":11,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-03-31T12:09:05.707Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vchitect.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-30T11:49:01.000Z","updated_at":"2025-03-13T08:58:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"c8e4b749-3b30-4a6f-aa27-4532adeb2dbd","html_url":"https://github.com/Vchitect/VideoBooth","commit_stats":null,"previous_names":["vchitect/videobooth"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVideoBooth","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVideoBooth/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVideoBooth/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVideoBooth/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vchitect","download_url":"https://c
odeload.github.com/Vchitect/VideoBooth/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675606,"owners_count":20977377,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T13:09:49.947Z","updated_at":"2025-04-07T15:09:36.627Z","avatar_url":"https://github.com/Vchitect.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VideoBooth\n\n\u003c!-- [![arXiv](https://img.shields.io/badge/arXiv-2311.99999-b31b1b.svg)](https://arxiv.org/abs/2311.99999) --\u003e\n[![Paper](https://img.shields.io/badge/cs.CV-Paper-b31b1b?logo=arxiv\u0026logoColor=red)](xxxx)\n[![Project Page](https://img.shields.io/badge/VideoBooth-Website-green?logo=googlechrome\u0026logoColor=green)](https://vchitect.github.io/VideoBooth-project/)\n[![Video](https://img.shields.io/badge/YouTube-Video-c4302b?logo=youtube\u0026logoColor=red)](https://youtu.be/10DxH1JETzI)\n[![Visitor](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FVchitect%2FVideoBooth\u0026count_bg=%23FFA500\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=hits\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n\n\nThis repository will contain the implementation of the following paper:\n\u003e **VideoBooth: Diffusion-based Video Generation with Image Prompts**\u003cbr\u003e\n\u003e [Yuming Jiang](https://yumingj.github.io/), [Tianxing Wu](https://tianxingwu.github.io/), [Shuai Yang](https://williamyang1991.github.io/), [Chenyang Si](https://chenyangsi.top/), 
[Dahua Lin](http://dahua.site/), [Yu Qiao](https://scholar.google.com.sg/citations?user=gFtI-8QAAAAJ\u0026hl=en), [Chen Change Loy](https://www.mmlab-ntu.com/person/ccloy/), [Ziwei Liu](https://liuziwei7.github.io/)\u003cbr\u003e\n\nFrom [MMLab@NTU](https://www.mmlab-ntu.com/) affiliated with S-Lab, Nanyang Technological University and Shanghai AI Laboratory.\n\n## Overview\nVideoBooth generates videos containing the subjects specified in the image prompts.\n![overall_structure](./assets/teaser.png)\n\n\n## Installation\n\n1. Clone the repository.\n\n```shell\ngit clone https://github.com/Vchitect/VideoBooth.git\ncd VideoBooth\n```\n\n2. Install the environment.\n\n```shell\nconda env create -f environment.yml\nconda activate videobooth\n```\n\n3. Download the pretrained models ([Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4), [VideoBooth](https://huggingface.co/yumingj/VideoBooth_models/tree/main)) and place them in the folder `./pretrained_models/`.\n\n\n## Inference\n\nWe provide one example for running inference.\n\n``` shell\npython sample_scripts/sample.py --config sample_scripts/configs/panda.yaml\n```\n\nIf you want to use your own image, you need to segment the object first. 
We use [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) to segment the subject from images.\n\n## Training\n\nVideoBooth is trained in a coarse-to-fine manner.\n\n### Stage 1: Coarse Stage Training\n\n``` shell\nsrun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage1.py \\\n--model TAVU \\\n--num-frames 16 \\\n--dataset WebVideoImageStage1  \\\n--frame-interval 4 \\\n--ckpt-every 1000 \\\n--clip-max-norm 0.1 \\\n--global-batch-size 16 \\\n--reg-text-weight 0 \\\n--results-dir ./results \\\n--pretrained-t2v-model path-to-t2v-model \\\n--global-mapper-path path-to-elite-global-model\n```\n\n### Stage 2: Fine Stage Training\n\n``` shell\nsrun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage2.py \\\n--model TAVU \\\n--num-frames 16 \\\n--dataset WebVideoImageStage2  \\\n--frame-interval 4 \\\n--ckpt-every 1000 \\\n--clip-max-norm 0.1 \\\n--global-batch-size 16 \\\n--reg-text-weight 0 \\\n--results-dir ./results \\\n--pretrained-t2v-model path-to-t2v-model \\\n--global-mapper-path path-to-stage1-model\n```\n\n## Dataset Preparation\n\nYou can download our proposed dataset from [Hugging Face](https://huggingface.co/datasets/yumingj/VideoBoothDataset).\n\n```shell\n# merge the split zip files into a single archive\nzip -F webvid_parsing_2M_split.zip --out single-archive.zip\n\n# replace path-to-webvid-parsing with the path extracted here\nunzip single-archive.zip\n\n# replace path-to-videobooth-subset with the path extracted here\nunzip webvid_parsing_videobooth_subset.zip\n```\n\n\n## Citation\n\nIf you find our repo useful for your research, please consider citing our paper:\n\n```bibtex\n@article{jiang2023videobooth,\n    author = {Jiang, Yuming and Wu, Tianxing and Yang, Shuai and Si, Chenyang and Lin, Dahua and Qiao, Yu and Loy, Chen Change and Liu, Ziwei},\n    title = {VideoBooth: Diffusion-based Video Generation with Image Prompts},\n    year = 
{2023}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Fvideobooth","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvchitect%2Fvideobooth","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Fvideobooth/lists"}