{"id":19845970,"url":"https://github.com/vchitect/vlogger","last_synced_at":"2025-04-05T18:11:02.187Z","repository":{"id":217782143,"uuid":"743870556","full_name":"Vchitect/Vlogger","owner":"Vchitect","description":"[CVPR2024] Make Your Dream A Vlog","archived":false,"fork":false,"pushed_at":"2024-03-19T12:19:56.000Z","size":49676,"stargazers_count":422,"open_issues_count":16,"forks_count":46,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-29T17:11:11.191Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vchitect.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-16T06:57:06.000Z","updated_at":"2025-03-12T01:38:55.000Z","dependencies_parsed_at":"2024-02-21T12:54:10.273Z","dependency_job_id":null,"html_url":"https://github.com/Vchitect/Vlogger","commit_stats":null,"previous_names":["zhuangshaobin/vlogger","vchitect/vlogger"],"tags_count":1,"template":true,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVlogger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVlogger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVlogger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FVlogger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vchitect","download_url":"https://codeload.github.com/Vchitect/Vlogger/tar.gz/refs/heads/main","host":{"name"
:"GitHub","url":"https://github.com","kind":"github","repositories_count":247378149,"owners_count":20929297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T13:09:55.825Z","updated_at":"2025-04-05T18:11:02.163Z","avatar_url":"https://github.com/Vchitect.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://arxiv.org/abs/2401.09414\"\u003e\n\u003cimg width=\"743\" alt=\"image\" src=\"https://github.com/zhuangshaobin/Vlogger/assets/24236723/2885982e-5b18-48b3-97b1-966298329350\"\u003e\n\u003c/a\u003e\n\n[Shaobin Zhuang](https://github.com/zhuangshaobin), [Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ), [Xinyuan Chen†](https://scholar.google.com/citations?user=3fWSC8YAAAAJ), [Yaohui Wang†](https://scholar.google.com/citations?user=R7LyAb4AAAAJ), [Ziwei Liu](https://scholar.google.com/citations?user=lc45xlcAAAAJ), [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ\u0026hl), [Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)\n\n[![arXiv](https://img.shields.io/badge/arXiv-2401.09414-b31b1b.svg)](https://arxiv.org/abs/2401.09414)\n[![Project Page](https://img.shields.io/badge/Vlogger-Website-green)](https://zhuangshaobin.github.io/Vlogger.github.io/)\n[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/GrayShine/Vlogger)\n[![Hugging Face 
Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-yellow)](https://huggingface.co/spaces/GrayShine/Vlogger-ShowMaker)\n[![YouTube Video](https://img.shields.io/badge/YouTube-Video-red)](https://youtu.be/ZRD1-jHbEGk)\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fzhuangshaobin%2FVlogger\u0026count_bg=%23F59352\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=visitors\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n\u003c/div\u003e\n\u003c/div\u003e\n\n\nIn this work, we present **Vlogger**, a generic AI system for generating a **minute**-level video blog (i.e., vlog) of user descriptions. Different from short videos with a few seconds, vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages Large Language Model (LLM) as Director and decomposes a long video generation task of vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, **ShowMaker**, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, the extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. 
More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.\n\n\n\u003cdiv align=\"center\"\u003e\n\u003cvideo src=\"https://github.com/zhuangshaobin/Vlogger/assets/94739615/1e8dd246-d3b9-49e9-8eee-d40b6d8523b9\" controls=\"controls\" width=\"500\" height=\"300\"\u003e\u003c/video\u003e\n\u003cb\u003eA compressed version of the generated \u003ca href=\"https://youtu.be/ZRD1-jHbEGk\"\u003eTeddy Travel\u003c/a\u003e.\u003c/b\u003e\n\u003c/div\u003e\n\n## Usage\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eSetup\u003c/h3\u003e\u003c/summary\u003e\n\n\u003ch4\u003ePrepare Environment\u003c/h4\u003e\n\n```bash\nconda create -n vlogger python==3.10.11\nconda activate vlogger\npip install -r requirements.txt\n```\n\n\u003ch4\u003eDownload our model and T2I base model\u003c/h4\u003e\n\nOur model is based on Stable Diffusion v1.4. Download [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) and [OpenCLIP-ViT-H-14](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) to the ```pretrained``` directory.\nDownload our model (ShowMaker) checkpoint (from [Google Drive](https://drive.google.com/file/d/1pAH73kz2QRfD2Dxk4lL3SrHvLAlWcPI3/view?usp=drive_link) or [Hugging Face](https://huggingface.co/GrayShine/Vlogger/tree/main)) and save it to the ```pretrained``` directory.\n\n\nNow under `./pretrained`, you should be able to see the following:\n```\n├── pretrained\n│   ├── ShowMaker.pt\n│   ├── stable-diffusion-v1-4\n│   ├── OpenCLIP-ViT-H-14\n│   │   ├── ...\n└── └── ├── ...\n        ├── ...\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eInference for LLM planning and reference image generation\u003c/h3\u003e\u003c/summary\u003e\n        \nRun the following command to get the script, actors and protagonist:\n\n```python\npython sample_scripts/vlog_write_script.py\n```\n\n- The generated scripts will be saved in 
```results/vlog/$your_story_dir/script```.\n\n- The generated reference images will be saved in ```results/vlog/$your_story_dir/img```.\n\n- :warning: Enter your OpenAI API key on line 7 of the file ```vlogger/planning_utils/gpt4_utils.py```.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eInference for vlog generation\u003c/h3\u003e\u003c/summary\u003e\n        \nRun the following command to get the vlog:\n\n```python\npython sample_scripts/vlog_read_script_sample.py\n```\n\n- The generated vlog will be saved in ```results/vlog/$your_story_dir/video```.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eInference for (T+I)2V\u003c/h3\u003e\u003c/summary\u003e\n        \nRun the following command to get the (T+I)2V results:\n\n```python\npython sample_scripts/with_mask_sample.py\n```\n\n- The generated video will be saved in ```results/mask_no_ref```.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eInference for (T+I+Ref)2V\u003c/h3\u003e\u003c/summary\u003e\n        \nRun the following command to get the (T+I+Ref)2V results:\n\n```python\npython sample_scripts/with_mask_ref_sample.py\n```\n\n- The generated video will be saved in ```results/mask_ref```.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ch3\u003eMore Details\u003c/h3\u003e\u003c/summary\u003e\n        \nYou may modify ```configs/with_mask_sample.yaml``` to change the (T+I)2V conditions and modify ```configs/with_mask_ref_sample.yaml``` to change the (T+I+Ref)2V conditions.\nFor example:\n\n- ```ckpt``` specifies the model checkpoint.\n\n- ```text_prompt``` describes the content of the video.\n\n- ```input_path``` specifies the path to the input image.\n\n- ```ref_path``` specifies the path to the reference image.\n\n- ```save_path``` specifies the path where the generated video is saved.\n\u003c/details\u003e\n\n\n\n## Results\n### (T+Ref)2V 
Results\n\u003ctable class=\"center\"\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;width: 50%\" colspan=\"1\"\u003e\u003cb\u003eReference Image\u003c/b\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;width: 50%\" colspan=\"1\"\u003e\u003cb\u003eOutput Video\u003c/b\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"examples/TR2V/image/Egyptian_Pyramids.png\" width=\"250\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Scene Reference\n      \u003c/div\u003e --\u003e\n      \u003cp align=\"center\"\u003eScene Reference\u003c/p\u003e\n  \u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TR2V/video/Fireworks_explode_over_the_pyramids.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Fireworks explode over the pyramids.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eFireworks explode over the pyramids.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"examples/TR2V/image/Great_Wall.png\" width=\"250\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Scene Reference\n      \u003c/div\u003e --\u003e\n      \u003cp align=\"center\"\u003eScene Reference\u003c/p\u003e\n  \u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TR2V/video/The_Great_Wall_burning_with_raging_fire.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        The Great Wall burning with raging fire.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eThe Great Wall burning with raging fire.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"examples/TR2V/image/a_green_cat.png\" 
width=\"250\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Object Reference\n      \u003c/div\u003e --\u003e\n      \u003cp align=\"center\"\u003eObject Reference\u003c/p\u003e\n  \u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TR2V/video/A_cat_is_running_on_the_beach.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        A cat is running on the beach.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eA cat is running on the beach.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003c/table\u003e\n\n### (T+I)2V Results\n\u003ctable class=\"center\"\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;width: 50%\" colspan=\"1\"\u003e\u003cb\u003eInput Image\u003c/b\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;width: 50%\" colspan=\"1\"\u003e\u003cb\u003eOutput Video\u003c/b\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"input/i2v/Underwater_environment_cosmetic_bottles.png\" width=\"400\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TI2V/Underwater_environment_cosmetic_bottles.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Underwater environment cosmetic bottles.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eUnderwater environment cosmetic bottles.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"input/i2v/A_big_drop_of_water_falls_on_a_rose_petal.png\" width=\"400\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TI2V/A_big_drop_of_water_falls_on_a_rose_petal.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        A big 
drop of water falls on a rose petal.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eA big drop of water falls on a rose petal.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"input/i2v/A_fish_swims_past_an_oriental_woman.png\" width=\"400\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TI2V/A_fish_swims_past_an_oriental_woman.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        A fish swims past an oriental woman.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eA fish swims past an oriental woman.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"input/i2v/Cinematic_photograph_View_of_piloting_aaero.png\" width=\"400\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TI2V/Cinematic_photograph_View_of_piloting_aaero.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Cinematic photograph. View of piloting aaero.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eCinematic photograph. 
View of piloting aaero.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"input/i2v/Planet_hits_earth.png\" width=\"400\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/TI2V/Planet_hits_earth.gif\" width=\"400\"\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Planet hits earth.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003ePlanet hits earth.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n\n### T2V Results\n\u003ctable\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;width: 66%\" colspan=\"2\"\u003e\u003cb\u003eOutput Video\u003c/b\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/T2V/A_deer_looks_at_the_sunset_behind_him.gif\"/\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        A deer looks at the sunset behind him.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eA deer looks at the sunset behind him.\u003c/p\u003e\n  \u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/T2V/A_duck_is_teaching_math_to_another_duck.gif\"/\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        A duck is teaching math to another duck.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eA duck is teaching math to another duck.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/T2V/Bezos_explores_tropical_rainforest.gif\"/\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Bezos explores tropical rainforest.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eBezos explores tropical 
rainforest.\u003c/p\u003e\n  \u003c/td\u003e\n  \u003ctd\u003e\n      \u003cimg src=\"examples/T2V/Light_blue_water_lapping_on_the_beach.gif\"/\u003e\n      \u003cbr\u003e\n\u003c!--       \u003cdiv class=\"text\" style=\" text-align:center;\"\u003e\n        Light blue water lapping on the beach.\n      \u003c/div\u003e --\u003e\n          \u003cp align=\"center\"\u003eLight blue water lapping on the beach.\u003c/p\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003c/table\u003e\n\n## BibTeX\n```bibtex\n@article{zhuang2024vlogger,\ntitle={Vlogger: Make Your Dream A Vlog},\nauthor={Zhuang, Shaobin and Li, Kunchang and Chen, Xinyuan and Wang, Yaohui and Liu, Ziwei and Qiao, Yu and Wang, Yali},\njournal={arXiv preprint arXiv:2401.09414},\nyear={2024}\n}\n```\n\n```bibtex\n@article{chen2023seine,\ntitle={SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction},\nauthor={Chen, Xinyuan and Wang, Yaohui and Zhang, Lingjun and Zhuang, Shaobin and Ma, Xin and Yu, Jiashuo and Wang, Yali and Lin, Dahua and Qiao, Yu and Liu, Ziwei},\njournal={arXiv preprint arXiv:2310.20700},\nyear={2023}\n}\n```\n\n```bibtex\n@article{wang2023lavie,\n  title={LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models},\n  author={Wang, Yaohui and Chen, Xinyuan and Ma, Xin and Zhou, Shangchen and Huang, Ziqi and Wang, Yi and Yang, Ceyuan and He, Yinan and Yu, Jiashuo and Yang, Peiqing and others},\n  journal={arXiv preprint arXiv:2309.15103},\n  year={2023}\n}\n```\n\n\n## Disclaimer\nWe disclaim responsibility for user-generated content. The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities. It is prohibited for pornographic, violent and bloody content generation, and to generate content that is demeaning or harmful to people or their environment, culture, religion, etc. Users are solely liable for their actions. 
The project contributors are not legally affiliated with, nor accountable for, users' behavior. Use the generative model responsibly, adhering to ethical and legal standards.\n\n## Contact Us\n**Shaobin Zhuang**: [zhuangshaobin@pjlab.org.cn](mailto:zhuangshaobin@pjlab.org.cn), **Kunchang Li**: [likunchang@pjlab.org.cn](mailto:likunchang@pjlab.org.cn)\n\n**Xinyuan Chen**: [chenxinyuan@pjlab.org.cn](mailto:chenxinyuan@pjlab.org.cn), **Yaohui Wang**: [wangyaohui@pjlab.org.cn](mailto:wangyaohui@pjlab.org.cn)  \n\n## Acknowledgements\nThe code is built upon [SEINE](https://github.com/Vchitect/SEINE), [LaVie](https://github.com/Vchitect/LaVie), [diffusers](https://github.com/huggingface/diffusers) and [Stable Diffusion](https://github.com/CompVis/stable-diffusion). We thank all the contributors for open-sourcing their work. \n\n\n## License\n\nThe code is licensed under Apache-2.0. Model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please contact zhuangshaobin@pjlab.org.cn.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Fvlogger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvchitect%2Fvlogger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Fvlogger/lists"}