{"id":15034321,"url":"https://github.com/ali-vilab/vgen","last_synced_at":"2025-05-14T00:10:33.057Z","repository":{"id":205731869,"uuid":"714940169","full_name":"ali-vilab/VGen","owner":"ali-vilab","description":"Official repo for VGen: a holistic video generation ecosystem for video generation building on diffusion models","archived":false,"fork":false,"pushed_at":"2025-01-10T09:09:13.000Z","size":65580,"stargazers_count":3102,"open_issues_count":114,"forks_count":274,"subscribers_count":33,"default_branch":"main","last_synced_at":"2025-05-06T21:03:48.769Z","etag":null,"topics":["diffusion-models","video-synthesis"],"latest_commit_sha":null,"homepage":"https://i2vgen-xl.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"JacobYuan7/i2vgen-xl","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ali-vilab.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-06T06:50:19.000Z","updated_at":"2025-05-05T16:26:53.000Z","dependencies_parsed_at":"2025-01-09T01:30:34.160Z","dependency_job_id":"2471a74f-1708-4339-8d16-3d260ba112ef","html_url":"https://github.com/ali-vilab/VGen","commit_stats":{"total_commits":66,"total_committers":13,"mean_commits":5.076923076923077,"dds":0.5303030303030303,"last_synced_commit":"463781453b2d8658bce6ae6d3acb0b16016d1e67"},"previous_names":["damo-vilab/i2vgen-xl","ali-vilab/i2vgen-xl","ali-vilab/vgen"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2FVGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2FVGen/ta
gs","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2FVGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2FVGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ali-vilab","download_url":"https://codeload.github.com/ali-vilab/VGen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254044333,"owners_count":22005135,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","video-synthesis"],"created_at":"2024-09-24T20:24:37.837Z","updated_at":"2025-05-14T00:10:28.045Z","avatar_url":"https://github.com/ali-vilab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VGen\n\n\n![figure1](source/VGen.jpg \"figure1\")\n\nVGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. 
This repository includes implementations of the following methods:\n\n\n- [I2VGen-xl: High-quality image-to-video synthesis via cascaded diffusion models](https://i2vgen-xl.github.io)\n- [VideoComposer: Compositional Video Synthesis with Motion Controllability](https://videocomposer.github.io)\n- [Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation](https://higen-t2v.github.io)\n- [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos](https://tf-t2v.github.io)\n- [InstructVideo: Instructing Video Diffusion Models with Human Feedback](https://instructvideo.github.io)\n- [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://dreamvideo-t2v.github.io)\n- [VideoLCM: Video Latent Consistency Model](https://arxiv.org/abs/2312.09109)\n- [Modelscope text-to-video technical report](https://arxiv.org/abs/2308.06571)\n\n\nVGen can produce high-quality videos from input text, images, desired motion, desired subjects, and even provided feedback signals. It also offers a variety of commonly used video generation tools, such as visualization, sampling, training, inference, joint training on images and videos, acceleration, and more. 
\n\n\n\u003ca href='https://i2vgen-xl.github.io/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e \u003ca href='https://arxiv.org/abs/2311.04145'\u003e\u003cimg src='https://img.shields.io/badge/Paper-Arxiv-red'\u003e\u003c/a\u003e [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/damo-vilab/I2VGen-XL) [![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm-dark.svg)](https://huggingface.co/papers/2311.04145) \n[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-a-discussion-sm-dark.svg)](https://huggingface.co/spaces/damo-vilab/I2VGen-XL/discussions) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/XUi0y7dxqEQ)  \u003ca href='https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441039979087.mp4'\u003e\u003cimg src='source/logo.png'\u003e\u003c/a\u003e\n[![Replicate](https://replicate.com/cjwbw/i2vgen-xl/badge)](https://replicate.com/cjwbw/i2vgen-xl/)\n\n\n## 🔥News!!!\n\n\n- __[2025.01]__ We release the code for [metric calculation](./metric/README.MD) used in DreamVideo (**CLIP-T**, **CLIP-I**, **DINO-I**, and **Temporal Consistency**).\n- __[2024.06]__ We release the code and models of [InstructVideo](https://instructvideo.github.io/). InstructVideo enables **LoRA** fine-tuning and inference in VGen. Feel free to use LoRA fine-tuning for other tasks.\n- __[2024.04]__ We release the models of [DreamVideo](https://dreamvideo-t2v.github.io) and ModelScopeT2V V1.5!!! ModelScopeT2V V1.5 is further fine-tuned from ModelScopeT2V for 365k iterations with more data.\n- __[2024.04]__ We release the code and models of [TF-T2V](https://tf-t2v.github.io)! \n- __[2024.04]__ We release the code and models of [VideoLCM](https://arxiv.org/abs/2312.09109)! 
\n- __[2024.03]__ We release the training and inference code of [DreamVideo](https://dreamvideo-t2v.github.io)! \n- __[2024.03]__ We release the code and model of HiGen!!\n- __[2024.01]__ The gradio demo of I2VGen-XL has been completed in [HuggingFace](https://huggingface.co/spaces/damo-vilab/I2VGen-XL), thanks to our colleague @[Wenmeng Zhou](https://github.com/wenmengzhou) and @[AK](https://twitter.com/_akhaliq) for the support, and welcome to try it out.\n- __[2024.01]__ We support running the gradio app locally, thanks to our colleague @[Wenmeng Zhou](https://github.com/wenmengzhou) for the support and @[AK](https://twitter.com/_akhaliq) for the suggestion, and welcome to have a try.\n- __[2024.01]__ Thanks @[Chenxi](https://chenxwh.github.io) for supporting the running of i2vgen-xl on [![Replicate](https://replicate.com/cjwbw/i2vgen-xl/badge)](https://replicate.com/cjwbw/i2vgen-xl/). Feel free to give it a try. \n- __[2024.01]__ The gradio demo of I2VGen-XL has been completed in [Modelscope](https://modelscope.cn/studios/damo/I2VGen-XL/summary), and welcome to try it out.\n- __[2023.12]__ We have open-sourced the code and models for [DreamTalk](https://github.com/ali-vilab/dreamtalk), which can produce high-quality talking head videos across diverse speaking styles using diffusion models.\n- __[2023.12]__ We release [TF-T2V](https://tf-t2v.github.io) that can scale up existing video generation techniques using text-free videos, significantly enhancing the performance of both [Modelscope-T2V](https://arxiv.org/abs/2308.06571) and [VideoComposer](https://videocomposer.github.io) at the same time.\n- __[2023.12]__ We updated the codebase to support higher versions of xformer (0.0.22), torch2.0+, and removed the dependency on flash_attn.\n- __[2023.12]__ We release [InstructVideo](https://instructvideo.github.io/) that can accept human feedback signals to improve VLDM\n- __[2023.12]__ We release the diffusion based expressive talking head generation 
[DreamTalk](https://dreamtalk-project.github.io)\n- __[2023.12]__ We release the high-efficiency video generation method [VideoLCM](https://arxiv.org/abs/2312.09109)\n- __[2023.12]__ We release the code and model of [I2VGen-XL](https://i2vgen-xl.github.io) and the [ModelScope T2V](https://arxiv.org/abs/2308.06571)\n- __[2023.12]__ We release the T2V method [HiGen](https://higen-t2v.github.io) and customizing T2V method [DreamVideo](https://dreamvideo-t2v.github.io).\n- __[2023.12]__ We write an [introduction document](doc/introduction.pdf) for VGen and compare I2VGen-XL with SVD.\n- __[2023.11]__ We release a high-quality I2VGen-XL model, please refer to the [Webpage](https://i2vgen-xl.github.io)\n\n\n## TODO\n- [x] Release the technical papers and webpage of [I2VGen-XL](doc/i2vgen-xl.md)\n- [x] Release the code and pretrained models that can generate 1280x720 videos\n- [x] Release the code and  models of [DreamTalk](https://github.com/ali-vilab/dreamtalk) that can generate expressive talking head\n- [ ] Release the code and pretrained models of [HumanDiff]()\n- [ ] Release models optimized specifically for the human body and faces\n- [ ] Updated version can fully maintain the ID and capture large and accurate motions simultaneously\n- [ ] Release other methods and the corresponding models\n\n\n## Preparation\n\nThe main features of VGen are as follows:\n- Expandability, allowing for easy management of your own experiments.\n- Completeness, encompassing all common components for video generation.\n- Excellent performance, featuring powerful pre-trained models in multiple tasks.\n\n\n### Installation\n\n```\nconda create -n vgen python=3.8\nconda activate vgen\npip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113\npip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple\n```\n\nYou  also need to ensure that your system has installed the `ffmpeg` command. 
If it is not installed, you can install it using the following command:\n```\nsudo apt-get update \u0026\u0026 sudo apt-get install -y ffmpeg libsm6 libxext6\n```\n\n\n### Datasets\n\nWe have provided a **demo dataset** that includes images and videos, along with their lists in ``data``. \n\n*Please note that the demo images used here are for testing purposes and were not included in the training.*\n\n\n### Clone the code\n\n```\ngit clone https://github.com/ali-vilab/VGen.git\ncd VGen\n```\n\n\n## Getting Started with VGen\n\n### (1) Train your text-to-video model\n\n\nEnabling distributed training takes a single command:\n```\npython train_net.py --cfg configs/t2v_train.yaml\n```\n\nIn the `t2v_train.yaml` configuration file, you can specify the data, adjust the video-to-image ratio using `frame_lens`, and try out different diffusion settings.\n\n- Before the training, you can download any of our open-source models for initialization. Our codebase supports custom initialization and `grad_scale` settings, all of which are included in the `Pretrain` item in the YAML file.\n- During the training, you can view the saved models and intermediate inference results in the `workspace/experiments/t2v_train` directory.\n\nAfter the training is completed, you can perform inference on the model using the following command.\n```\npython inference.py --cfg configs/t2v_infer.yaml\n```\nThen you can find the videos you generated in the `workspace/experiments/test_img_01` directory. 
For specific configurations such as data, models, seed, etc., please refer to the `t2v_infer.yaml` file.\n\n\n*If you want to directly load our previously open-sourced [Modelscope T2V model](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main), please refer to [this link](https://github.com/damo-vilab/i2vgen-xl/issues/31).*\n\n\n\u003c!-- \u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4\"\u003e\u003c/video\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4\"\u003e\u003c/video\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e --\u003e\n\n\n\n\n### (2) Run the I2VGen-XL model\n\n(i) Download model and test data:\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')\n```\n\nor you can also download it through HuggingFace (https://huggingface.co/damo-vilab/i2vgen-xl):\n```\n# Make sure you have git-lfs installed (https://git-lfs.com)\ngit lfs install\ngit clone https://huggingface.co/damo-vilab/i2vgen-xl\n```\n\n\n(ii) Run the following command:\n```\npython inference.py --cfg configs/i2vgen_xl_infer.yaml\n```\nor you can run:\n```\npython inference.py --cfg configs/i2vgen_xl_infer.yaml  test_list_path data/test_list_for_i2vgen.txt test_model models/i2vgen_xl_00854500.pth\n```\nThe `test_list_path` represents the input image path and its corresponding caption. 
Please refer to the specific format and suggestions within the demo file `data/test_list_for_i2vgen.txt`. `test_model` is the path for loading the model. In a few minutes, you can retrieve the high-definition video you wish to create from the `workspace/experiments/test_list_for_i2vgen` directory. At present, we find that the current model performs inadequately on **anime images** and **images with a black background** due to the lack of relevant training data. We are consistently working to optimize it.\n\n\n(iii) Run the gradio app locally:\n```\npython gradio_app.py\n```\n\n(iv) Run the model on ModelScope and HuggingFace:\n- [Modelscope](https://modelscope.cn/studios/damo/I2VGen-XL/summary)\n- [HuggingFace](https://huggingface.co/spaces/damo-vilab/I2VGen-XL)\n\n\n\u003cspan style=\"color:red\"\u003eBecause the GIF format compresses video quality, please click 'HERE' below to view the original video.\u003c/span\u003e\n\n\u003ccenter\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN01CCEq7K1ZeLpNQqrWu_!!6000000003219-0-tps-1280-720.jpg\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003c!-- \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4\"\u003e\u003c/video\u003e\t --\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i4/O1CN01hIQcvG1spmQMLqBo0_!!6000000005816-1-tps-1280-704.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eInput Image\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca 
href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i4/O1CN01ZXY7UN23K8q4oQ3uG_!!6000000007236-2-tps-1280-720.png\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003c!-- \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4\"\u003e\u003c/video\u003e\t --\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN01iaSiiv1aJZURUEY53_!!6000000003309-1-tps-1280-704.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eInput Image\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i3/O1CN01NHpVGl1oat4H54Hjf_!!6000000005242-2-tps-1280-720.png\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003c!-- \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4\"\u003e\u003c/video\u003e\t --\u003e\n      \u003c!-- \u003cimage muted=\"true\" height=\"260\" 
src=\"https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif\"\u003e\u003c/image\u003e\t\n       --\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif\"\u003e\u003c/image\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eInput Image\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN01odS61s1WW9tXen21S_!!6000000002795-0-tps-1280-720.jpg\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003c!-- \u003cvideo muted=\"true\" autoplay=\"true\" loop=\"true\" height=\"260\" src=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4\"\u003e\u003c/video\u003e\t --\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i3/O1CN01Jyk1HT28JkZtpAtY6_!!6000000007912-1-tps-1280-704.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eInput Image\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e 
\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n(ii) Run the following command:\n```\npython inference.py --cfg configs/i2vgen_xl_train.yaml\n```\nIn a few minutes, you can retrieve the high-definition video you wish to create from the `workspace/experiments/test_img_01` directory. At present, we find that the current model performs inadequately on **anime images** and **images with a black background** due to the lack of relevant training data. We are consistently working to optimize it.\n\n\n### (3) Run the HiGen model\n\n(i) Download model:\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('iic/HiGen', cache_dir='models/')\n```\nThen you might need the following command to move the checkpoints to the \"models/\" directory:\n```\nmv ./models/iic/HiGen/* ./models/\n```\n\n(ii) Run the following command for text-to-video generation:\n```\npython inference.py --cfg configs/higen_infer.yaml\n```\nIn a few minutes, you can retrieve the videos you wish to create from the `workspace/experiments/text_list_for_t2v_share` directory.\nThen you can execute the following command to perform super-resolution on the generated videos:\n```\npython inference.py --cfg configs/sr600_infer.yaml\n```\nFinally, you can retrieve the high-definition video from the `workspace/experiments/text_list_for_t2v_share` directory.\n\n\u003cspan style=\"color:red\"\u003eDue to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.\u003c/span\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"source/duck.png\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" src=\"source/bat_man.png\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  
\u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/452227605224.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/452015792863.mp4\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n\n### (4) DreamVideo\nOur DreamVideo uses `ModelScopeT2V V1.5` as the base video diffusion model. ModelScopeT2V V1.5 is further fine-tuned from ModelScopeT2V for 365k iterations with more data.\n\n#### Download ModelScopeT2V V1.5 and adapter weights of DreamVideo\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('iic/dreamvideo-t2v', cache_dir='models/')\n```\nThen you might need the following command to move the checkpoints to the \"models/\" directory:\n```\nmv ./models/iic/dreamvideo-t2v/* ./models/\n```\nOr you can download the checkpoint of ModelScopeT2V V1.5 and adapter weights of DreamVideo from this [link](https://modelscope.cn/models/iic/dreamvideo-t2v/files). \n\n#### Training\n(i) Subject Learning\n\nStep 1: learn a textual identity using Textual Inversion.\n```\npython train_net.py --cfg configs/dreamvideo/subjectLearning/dog2_subjectLearning_step1.yaml\n```\n\nStep 2: train an identity adapter by incorporating the learned textual identity.\n```\npython train_net.py --cfg configs/dreamvideo/subjectLearning/dog2_subjectLearning_step2.yaml\n```\n\nTips:\n- Generally, step 1 takes `1500 to 3000` training steps, and step 2 takes `500 to 1000` training steps. 
For certain subjects (like cats, etc.), excessive training may generate unnatural videos, and using text embedding with fewer training steps or reducing the training steps of step 2 may help.\n- For some subjects (like dogs, etc.), setting `use_mask_diffusion` to `True` may achieve better results. Make sure to put the binary masks of the subject into the folder `data/images/custom/YOUR_SUBJECT/masks`, and you can use [SAM](https://github.com/facebookresearch/segment-anything) to obtain these masks.\n\n(ii) Motion Learning\n\nTrain a motion adapter on the given videos.\n```\npython train_net.py --cfg configs/dreamvideo/motionLearning/carTurn_motionLearning.yaml\n```\n\nYou can customize your own configuration files for subject/motion learning.\n\nTips:\n- Generally, motion learning takes `500 to 2000` training steps.\n- Try setting `p_image_zero` from `0 to 0.5` to adjust the effect of appearance guidance during training.\n- Try increasing training steps or increasing the learning rate for single video motion customization to better align the motion pattern.\n\n\n#### Inference\n(i) Subject Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/subject_dog2.yaml\n```\n\n(ii) Motion Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/motion_carTurn.yaml\n```\nFor inference with appearance guidance, make sure to add images of foreground objects (e.g., any image of a bear) to the folder `data/images/motionReferenceImgs` and modify your test file.\n\nTips:\n- Try setting `appearance_guide_strength_cond` and `appearance_guide_strength_uncond` from `0 to 1` to adjust the effect of appearance guidance during inference.\n- We do not use DDIM Inversion by default. However, for single video motion customization, you can try setting `inverse_noise_strength` to `0~0.5` to better align the training video. 
For multi-video motion customization, we recommend setting `inverse_noise_strength` to `0`.\n\n\n(iii) Joint Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/joint_dog2_carTurn.yaml\n```\n\nTips:\n- Try changing `identity_adapter_index` and `motion_adapter_index` for better results. Typically, increasing identity_adapter_index improves identity preservation, while increasing motion_adapter_index enhances motion alignment. Balance the two for optimal results.\n\n\n#### Examples\nWe provide some examples for inference. Before you start, make sure you download the models.\n\n(i) Subject Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/examples/subject_dog2.yaml\n\npython inference.py --cfg configs/dreamvideo/infer/examples/subject_wolf_plushie.yaml\n```\n\n\u003ccenter\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eSubject\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eGenerated Video\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eSubject\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eGenerated Video\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n\u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01wjCCkF1f7jMzZh6pn_!!6000000003960-2-tps-256-256.png\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i2/O1CN016RJQ1u1zEPSfSzfjh_!!6000000006682-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg 
src=\"https://img.alicdn.com/imgextra/i4/O1CN013ekGAg1vwbDDt8N9H_!!6000000006237-2-tps-256-256.png\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i2/O1CN011HQwSb1REglZ81Ki8_!!6000000002080-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003edog\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * eating pizza\"\u003c/br\u003e seed: 2767\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003ewolf plushie\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * running in the forest\" \u003c/br\u003e seed: 2339\u003c/center\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n(ii) Motion Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/examples/motion_carTurn.yaml\n\npython inference.py --cfg configs/dreamvideo/infer/examples/motion_playingGuitar.yaml\n```\n\n\u003ccenter\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eMotion\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eGenerated Video\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eMotion\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\u003cb\u003eGenerated Video\u003c/b\u003e\u003c/center\u003e\u003c/td\u003e\n\u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg 
src=\"https://img.alicdn.com/imgextra/i1/O1CN01r9yeiR1irw83xSXZz_!!6000000004467-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01OKbdZj1sy1b2Nhq5M_!!6000000005834-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01GkeSm31p1S679Hf3J_!!6000000005300-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01okWw0V1s544J9v6BU_!!6000000005714-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a car running on the road\"\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a lion running on the road\"\u003c/br\u003e seed: 8888\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a person is playing guitar\"\u003c/center\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a monkey is playing guitar on Mars\" \u003c/br\u003e seed: 8888\u003c/center\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n(iii) Joint Customization\n```\npython inference.py --cfg configs/dreamvideo/infer/examples/joint_dog2_carTurn.yaml\n\npython inference.py --cfg configs/dreamvideo/infer/examples/joint_dog2_playingGuitar.yaml\n\npython inference.py --cfg configs/dreamvideo/infer/examples/joint_wolf_plushie_carTurn.yaml\n\npython inference.py --cfg configs/dreamvideo/infer/examples/joint_wolf_plushie_playingGuitar.yaml\n```\n\n\u003ccenter\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg 
src=\"https://img.alicdn.com/imgextra/i4/O1CN01uT4pI71EAwWGLx9hP_!!6000000000312-2-tps-256-256.png\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01wjCCkF1f7jMzZh6pn_!!6000000003960-2-tps-256-256.png\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN013ekGAg1vwbDDt8N9H_!!6000000006237-2-tps-256-256.png\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr \u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003edog\u003c/center\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003ewolf plushie\u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i1/O1CN01r9yeiR1irw83xSXZz_!!6000000004467-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01GOVb6p244rHKdTzPJ_!!6000000007338-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i3/O1CN01uwNzW21kLXO82PKjp_!!6000000004667-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a car running on the road\"\u003c/center\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * running on the beach\"\u003c/br\u003e seed: 8888\u003c/center\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * running on the road\"\u003c/br\u003e seed: 
3677\u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01GkeSm31p1S679Hf3J_!!6000000005300-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i1/O1CN01ouNpOm1Ptn4sG4uFV_!!6000000001899-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ccenter\u003e\n      \u003cimg src=\"https://img.alicdn.com/imgextra/i4/O1CN01q7kZra1SpcJezWkrk_!!6000000002296-1-tps-256-256.gif\" /\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a person is playing guitar\"\u003c/center\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * is playing guitar on the moon\"\u003c/br\u003e seed: 8888\u003c/center\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align:center;\"\u003e\u003ccenter\u003e\"a * is playing guitar\"\u003c/br\u003e seed: 6071\u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n#### Metric Calculation\n\nWe provide code that includes the calculation of metrics for `CLIP-T`, `CLIP-I`, `DINO-I`, and `Temporal Consistency`. Please refer to [metric/README.MD](./metric/README.MD) for more details.\n\n\n### (5) Run the TF-T2V (CVPR-2024) model\n\n(i) Download model:\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('iic/tf-t2v', cache_dir='models/')\n```\nThen you might need the following command to move the checkpoints to the \"models/\" directory:\n```\nmv ./models/iic/tf-t2v/* ./models/\n```\n\n(ii) We provide a config file for generating 16-frame video with 448x256 resolution. 
The command is as follows:\n```\npython inference.py --cfg configs/tft2v_t2v_infer.yaml\n```\n(If there are environmental problems during operation, we also provide the environment configuration \"tft2v_environment.yaml\" of TF-T2V for your reference.)\n\n\nIn a few minutes, you can retrieve the videos you wish to create from the `workspace/experiments/text_list_for_tft2v` directory.\nThen you can execute the following command to perform super-resolution on the generated videos:\n```\npython inference.py --cfg configs/tft2v_16frames_sr600_infer.yaml\n```\nFinally, you can retrieve the high-definition video from the `workspace/experiments/text_list_for_tft2v` directory.\n(It should be noted that the super-resolution model only supports 32-frame input, and 16-frame video cannot be used, thus we construct a pseudo 32-frame video by copying frames.)\n\n\u003cspan style=\"color:red\"\u003eDue to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.\u003c/span\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN014OJopR1MFXQ0y9jN1_!!6000000001405-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN01gzSvmL1s7oRR4sBJf_!!6000000005720-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/d0pnU3FxV01sT3ljVGowRFVmeVJOR2R0U3Rya05udGJoZ29BR203QzlURzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      
\u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/d0pnU3FxV01sT3ljVGowRFVmeVJOR2FXNTVHdDdoTndIOHh1R09zejZsQzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n(iii) Additionally, you can run the following command for text-to-video generation (32 frames):\n```\npython inference.py --cfg configs/tft2v_t2v_32frames_infer.yaml\n```\n\nIn a few minutes, you can retrieve the videos you wish to create from the `workspace/experiments/text_list_for_tft2v_32frame` directory.\nThen you can execute the following command to perform super-resolution on the generated videos:\n```\npython inference.py --cfg configs/tft2v_32frames_sr600_infer.yaml\n```\nFinally, you can retrieve the high-definition video from the `workspace/experiments/text_list_for_tft2v_32frame` directory.\n(It should be noted that the super-resolution model only supports 32-frame input, and 16-frame video cannot be used.)\n\n\u003cspan style=\"color:red\"\u003eDue to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.\u003c/span\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i3/O1CN01PN4Gv31ZfGfrf4bI3_!!6000000003221-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i3/O1CN01qMPlvb26JAtAqovud_!!6000000007640-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca 
href=\"https://cloud.video.taobao.com/vod/play/eGhWQVU2UHBJcWRTMXRISFNKcFhnY0Z1dTlkM013M3ZsNDZCUHlad2lrRzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/eGhWQVU2UHBJcWRTMXRISFNKcFhnWDBaTmQxYTc1TzRYS3BFTG1TMXhORzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n\n(iv) Run the following command for compositional video generation like videocomposer (32 frames):\n\n```\npython inference.py --cfg configs/tft2v_vcomposer_32frames_infer.yaml\n```\nIn a few minutes, you can retrieve the videos you wish to create from the `workspace/experiments/vid_list_vcomposer_32frame` directory.\nThen you can execute the following command to perform super-resolution on the generated videos:\n```\npython inference.py --cfg configs/tft2v_vcomposer_32frames_sr600_infer.yaml\n```\nFinally, you can retrieve the high-definition video from the `workspace/experiments/vid_list_vcomposer_32frame` directory.\n\n\n\u003cspan style=\"color:red\"\u003eDue to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.\u003c/span\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i2/O1CN01I4cVz01eHWDQKnC0b_!!6000000003846-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" 
src=\"https://img.alicdn.com/imgextra/i1/O1CN01VWLCS01TEo65Rkv8F_!!6000000002351-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/dkwzYUtUNExsSzQ2cGZRM0N3Z0VYWlczOW5YanFzdkkzYW1hRDZRRGlHNjZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/dkwzYUtUNExsSzQ2cGZRM0N3Z0VYUlNNRHVwN0ZmbndFMkpZeFlEczdsZTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n\n(v) We also provide a config file for generating 16-frame video with 448x256 resolution under the compositional video synthesis setting. 
The command is as follows:\n```\npython inference.py --cfg configs/tft2v_vcomposer_infer.yaml\n```\nYou can also generate a 16-frame video with 896x512 resolution within one model by running:\n```\npython inference.py --cfg configs/tft2v_vcomposer_896x512_infer.yaml\n```\n\nIt should be noted that the super-resolution model only supports 32-frame input, and 16-frame video cannot be used.\n\n\n\n### (6) Run the VideoLCM model\n\n(i) Download models as in TF-T2V (if you have already downloaded them in TF-T2V, skip this step):\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('iic/tf-t2v', cache_dir='models/')\n```\nThen you might need the following command to move the checkpoints to the \"models/\" directory:\n```\nmv ./models/iic/tf-t2v/* ./models/\n```\n\n(ii) Run the following command for text-to-video generation (16 frames with 448x256 resolution):\n```\npython inference.py --cfg configs/videolcm_t2v_infer.yaml\n```\n\nTo generate high-resolution videos (1280x720 resolution), you can run the following command:\n```\npython inference.py --cfg configs/videolcm_t2v_16frames_sr600_infer.yaml\n```\n\n\n\u003cspan style=\"color:red\"\u003eDue to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.\u003c/span\u003e\n\u003ctable\u003e\n\u003ccenter\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage  height=\"260\" src=\"https://img.alicdn.com/imgextra/i2/O1CN01AqHVYW1T2tnVCfd1u_!!6000000002325-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cimage height=\"260\" src=\"https://img.alicdn.com/imgextra/i1/O1CN01a7YqU51GxbpoqiKsv_!!6000000000689-1-tps-320-180.gif\"\u003e\u003c/image\u003e\t\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick 
\u003ca href=\"https://cloud.video.taobao.com/vod/play/TGdxZ2dwem8xVXFtcm1TYjlCMXd6UXFzMVM/dDV1MTltVldlWlVUajlkcTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n    \u003ctd \u003e\u003ccenter\u003e\n      \u003cp\u003eClick \u003ca href=\"https://cloud.video.taobao.com/vod/play/OVZ5THVhZW1hbkZxSm4wRzVNNm0xMDFxbjdoVzR6VjhXWlBoSkNjKw/VjI2UGVsNUp6SlVVQk5YeDlRTjdFeVFFTFA3bmNUSWpPNlRNbHV3R04zTmh3PT0\"\u003eHERE\u003c/a\u003e to view the generated video.\u003c/p\u003e\n    \u003c/center\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/center\u003e\n\u003c/table\u003e\n\u003c/center\u003e\n\n\n(iii) Run the following command for compositional video generation (16 frames with 448x256 resolution):\n```\npython inference.py --cfg configs/videolcm_vcomposer_infer.yaml\n```\n\n### (7) InstructVideo (CVPR 2024)\n\nFeel free to reach out (hj.yuan@zju.edu.cn) if you have questions.\n\n#### Dataset preparation and environment configuration\nInstructVideo is trained on video-text pairs, which saves computational cost during reward fine-tuning.\nIn the paper, we use a small set of WebVid videos to fine-tune our base model.\nThe file list is available at:\n```\ndata/instructvideo/webvid_simple_animals_2_selected_20_train_file_list/00000.txt\n```\nTo compose the training data, filter these videos from your copy of the WebVid dataset. Alternatively, use your own video-text pairs.\n(I tested InstructVideo on WebVid data and some proprietary data. 
Both worked.)\n\nConcerning the environment configuration, you should follow the instructions for [VGen installation](https://github.com/ali-vilab/VGen?tab=readme-ov-file#installation).\n\n\n#### Pre-trained weights preparation\n```\n!pip install modelscope\nfrom modelscope.hub.snapshot_download import snapshot_download\nmodel_dir = snapshot_download('iic/InstructVideo', cache_dir='models/')\n```\nYou need to move the checkpoints to the \"models/\" directory:\n```\nmv ./models/iic/InstructVideo/* ./models/\n```\nNote that `models/model_scope_v1-4_0600000.pth` is the pre-trained base model used in the paper.\nThe fine-tuned model is placed under the folder `models/instructvideo-finetuned`.\n\nYou can access the provided files on the [InstructVideo ModelScope page](https://modelscope.cn/models/iic/InstructVideo/files).\n\n\n#### The inference of InstructVideo\nYou can use the provided fine-tuned checkpoints to generate videos by running the command:\n```\nbash configs/instructvideo/eval_generate_videos.sh\n```\nThis command uses the YAML files under `configs/instructvideo/eval`, which contain caption file paths for generating videos of in-domain animals, new animals and non-animals.\nFeel free to switch among them or replace them with your own captions.\nAlthough we fine-tuned using 20-step DDIM, you can still use 50-step DDIM generation.\n\n#### The reward fine-tuning of InstructVideo\nYou can perform InstructVideo reward fine-tuning by running the command:\n```\nbash configs/instructvideo/train.sh\n```\nSince reward fine-tuning can lead to over-optimization, I strongly recommend regularly checking the generation performance on some evaluation captions (such as the captions indicated in `configs/instructvideo/eval`).\n\n\n### (8) Other methods\n\nIn preparation.\n\n\n## Customize your own approach\n\nOur codebase essentially supports all the commonly used components in video generation. 
You can manage your experiments flexibly by adding corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, VISUAL, DIFFUSION, PRETRAIN`, and keep them compatible with all our open-source algorithms according to your own needs. If you have any questions, feel free to give us your feedback at any time.\n\n\n\n## BibTeX\n\nIf this repo is useful to you, please cite the corresponding technical papers.\n\n\n```bibtex\n@article{wang2023videocomposer,\n  title={Videocomposer: Compositional Video Synthesis with Motion Controllability},\n  author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},\n  journal={NeurIPS},\n  volume={36},\n  year={2023}\n}\n@article{2023i2vgenxl,\n  title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},\n  author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},\n  journal={arXiv preprint arXiv:2311.04145},\n  year={2023}\n}\n@article{wang2023modelscope,\n  title={Modelscope text-to-video technical report},\n  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},\n  journal={arXiv preprint arXiv:2308.06571},\n  year={2023}\n}\n@inproceedings{wei2023dreamvideo,\n  title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},\n  author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},\n  booktitle={CVPR},\n  year={2024}\n}\n@inproceedings{higen,\n  title={Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation},\n  author={Qing, Zhiwu and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Wei, Yujie and Zhang, Yingya and Gao, Changxin and Sang, Nong},\n  booktitle={CVPR},\n  
year={2024}\n}\n@article{wang2023videolcm,\n  title={VideoLCM: Video Latent Consistency Model},\n  author={Wang, Xiang and Zhang, Shiwei and Zhang, Han and Liu, Yu and Zhang, Yingya and Gao, Changxin and Sang, Nong},\n  journal={arXiv preprint arXiv:2312.09109},\n  year={2023}\n}\n@article{ma2023dreamtalk,\n  title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},\n  author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},\n  journal={arXiv preprint arXiv:2312.09767},\n  year={2023}\n}\n@inproceedings{InstructVideo,\n  title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},\n  author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},\n  booktitle={CVPR},\n  year={2024}\n}\n@inproceedings{TFT2V,\n  title={A Recipe for Scaling up Text-to-Video Generation with Text-free Videos},\n  author={Wang, Xiang and Zhang, Shiwei and Yuan, Hangjie and Qing, Zhiwu and Gong, Biao and Zhang, Yingya and Shen, Yujun and Gao, Changxin and Sang, Nong},\n  booktitle={CVPR},\n  year={2024}\n}\n```\n\n\n## Acknowledgement\n\nWe would like to express our gratitude for the contributions of several previous works to the development of VGen. These include, but are not limited to, [Composer](https://arxiv.org/abs/2302.09778), [ModelScopeT2V](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [Stable Diffusion](https://github.com/Stability-AI/stablediffusion), [OpenCLIP](https://github.com/mlfoundations/open_clip), [WebVid-10M](https://m-bain.github.io/webvid-dataset/), [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/), [Pidinet](https://github.com/zhuoinoulu/pidinet) and [MiDaS](https://github.com/isl-org/MiDaS). 
We are committed to building upon these foundations in a way that respects their original contributions.\n\n\n\n## Disclaimer\n\nThis open-source model is trained using the [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for \u003cstrong\u003eRESEARCH/NON-COMMERCIAL USE ONLY\u003c/strong\u003e. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fali-vilab%2Fvgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fali-vilab%2Fvgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fali-vilab%2Fvgen/lists"}