{"id":13488920,"url":"https://github.com/YangLing0818/VideoTetris","last_synced_at":"2025-03-28T02:31:26.491Z","repository":{"id":243158775,"uuid":"811334259","full_name":"YangLing0818/VideoTetris","owner":"YangLing0818","description":"[NeurIPS 2024] VideoTetris: Towards Compositional Text-To-Video Generation","archived":false,"fork":false,"pushed_at":"2024-09-27T02:35:15.000Z","size":29345,"stargazers_count":202,"open_issues_count":5,"forks_count":6,"subscribers_count":19,"default_branch":"main","last_synced_at":"2024-10-31T01:35:04.959Z","etag":null,"topics":["diffusion-models","large-language-models","text-to-video-generation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2406.04277","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YangLing0818.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-06T11:54:20.000Z","updated_at":"2024-10-21T19:24:26.000Z","dependencies_parsed_at":"2024-10-31T01:31:11.269Z","dependency_job_id":"2c1d36ba-249d-4683-8e7e-301df41b92dd","html_url":"https://github.com/YangLing0818/VideoTetris","commit_stats":null,"previous_names":["yangling0818/videotetris"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YangLing0818%2FVideoTetris","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YangLing0818%2FVideoTetris/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YangLing0818%2FVideoTetris/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/YangLing0818%2FVideoTetris/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YangLing0818","download_url":"https://codeload.github.com/YangLing0818/VideoTetris/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245957684,"owners_count":20700317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","large-language-models","text-to-video-generation"],"created_at":"2024-07-31T18:01:24.207Z","updated_at":"2025-03-28T02:31:21.478Z","avatar_url":"https://github.com/YangLing0818.png","language":"Python","funding_links":[],"categories":["Video Generation"],"sub_categories":[],"readme":"\n## ___***VideoTetris: Towards Compositional Text-To-Video Generation***___\n\u003cdiv align=\"left\"\u003e\n \u003ca href='https://arxiv.org/abs/2406.04277'\u003e\u003cimg src='https://img.shields.io/badge/arXiv-2406.04277-b31b1b.svg'\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n \u003ca href='https://videotetris.github.io'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n\nThis repo contains the official implementation of our [VideoTetris](https://arxiv.org/abs/2406.04277) (**NeurIPS 2024**).\n\n\u003e [**VideoTetris: Towards Compositional Text-To-Video Generation**](https://arxiv.org/abs/2406.04277)   \n\u003e [Ye Tian](https://tyfeld.github.io/),\n\u003e [Ling Yang*](https://yangling0818.github.io), \n\u003e [Haotian 
Yang](https://scholar.google.com/citations?user=LH71RGkAAAAJ\u0026hl=en),\n\u003e [Yuan Gao](https://videotetris.github.io/),\n\u003e [Yufan Deng](https://videotetris.github.io/),\n\u003e [Jingmin Chen](https://videotetris.github.io/),\n\u003e [Xintao Wang](https://xinntao.github.io),\n\u003e [Zhaochen Yu](https://videotetris.github.io/),\n\u003e [Pengfei Wan](https://scholar.google.com/citations?user=P6MraaYAAAAJ\u0026hl=en),\n\u003e [Di Zhang](https://openreview.net/profile?id=~Di_ZHANG3),\n\u003e [Bin Cui](https://cuibinpku.github.io/cuibin_cn.html)   \n\u003e (* Equal Contribution and Corresponding Author)\n\u003e \u003cbr\u003ePeking University, Kuaishou Technology\u003cbr\u003e\n\n\n## Introduction\nVideoTetris is a novel framework that enables **compositional text-to-video (T2V) generation**. Specifically, we propose **spatio-temporal compositional diffusion** to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing pipeline that improves the training data with respect to motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation.  
Our demonstrations include **videos of 10 seconds, 30 seconds, and 2 minutes**, and the method can be extended to even longer durations.\n\u003ctable class=\"center\"\u003e\n    \u003ctr\u003e\n    \u003ctd width=100% style=\"border: none\"\u003e\u003cimg src=\"assets/first.png\" style=\"width:100%\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n    \u003ctd width=\"100%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003e\n\u003c/td\u003e\n  \u003c/tr\u003e\n    \u003ctr\u003e\n    \u003ctd width=100% style=\"border: none\"\u003e\u003cimg src=\"assets/secondd.png\" style=\"width:100%\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n    \u003ctd width=\"100%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003e\n\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\n\n## Training and Inference\n\n### Compositional Text-to-Video Generation\nWe provide the inference code of VideoTetris for compositional video generation, based on VideoCrafter2. Download the pretrained model from [Hugging Face](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt) and save it as `checkpoints/base_512_v2/model.ckpt`. Then follow the steps below:\n#### 1. Install Environment via Anaconda (Recommended)\n```bash\ncd short\nconda create -n videocrafter python=3.8.5\nconda activate videocrafter\npip install -r requirements.txt\n```\n\n#### 2. Region Planning\nYou can then plan the regions for the different sub-objects in a JSON file like `prompts/demo_videotetris.json`. Each region is a bounding box defined by its top-left and bottom-right coordinates. You can refer to `prompts/demo_videotetris.json` for an example. 
The final planning JSON should look like:\n```json\n[\n  {\n    \"basic_prompt\": \"A cat on the left and a dog on the right are napping in the sun.\",\n    \"sub_objects\":[\n        \"A cute orange cat.\",\n        \"A cute dog.\"\n    ],\n    \"layout_boxes\":[\n        [0, 0, 0.5, 1],\n        [0.5, 0, 1, 1]\n    ]\n  }\n]\n```\nIn this case, we first define the basic prompt, and then specify the sub-objects and their corresponding regions, resulting in a video with a cat on the left and a dog on the right.\n\n#### 3. Inference of VideoTetris\n```bash\nsh scripts/run_text2video_from_layout.sh\n```\nYou can specify the input JSON file in the `run_text2video_from_layout.sh` script.\n\n\n### Long Video Generation with Progressive Compositional Prompts\n\n#### 1. Install Environment via Anaconda (Recommended)\n```bash\ncd long\nconda create -n st2v python=3.10\nconda activate st2v\npip install -r requirements.txt\n```\n#### 2. Download the Checkpoint\n\nOur VideoTetris-long model, fine-tuned on our filtered dataset, is available on [Hugging Face](https://huggingface.co/tyfeld/VideoTetris-long). You can download the weights into the current directory with:\n```bash\nwget https://huggingface.co/tyfeld/VideoTetris-long/resolve/main/model-step=6000-v1.ckpt\n```\n\n#### 3. Region Planning\n\nYou can then plan the regions for the different sub-objects in a JSON file like `prompts/prompt.json`. For each video chunk, specify the video chunk index, prompt, sub-objects, and layout boxes. \n\n\u003e Video Chunk Meaning: The long video is generated autoregressively, 8 frames per chunk, so a video with 80 frames takes (80-8)/8 = 9 autoregressive rounds after the first 8 frames. Each chunk is the 8 new frames generated in one round.\n\nEach region is a bounding box defined by its top-left and bottom-right coordinates. You can refer to `prompts/prompt.json` for an example. 
The final planning JSON should look like:\n```json\n[\n    {\n        \"video_chunk_index\": 0,\n        \"prompt\": \"A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.\",\n        \"sub_objects\": [\n            \"A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.\"\n        ],\n        \"layout_boxes\":[\n            [0, 0, 1, 1]\n        ]\n    },\n    {\n        \"video_chunk_index\": 4,\n        \"prompt\": \"A cute brown squirrel and a cute white squirrel in Antarctica, on a pile of hazelnuts cinematic\",\n        \"sub_objects\": [\n            \"A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.\",\n            \"A cute white squirrel in Antarctica, on a pile of hazelnuts cinematic.\"\n        ],\n        \"layout_boxes\":[\n            [0.5, 0, 1, 1],\n            [0, 0, 0.5, 1]\n        ]\n    }\n]\n```\n#### 4. Inference of VideoTetris-long\n```bash\ncd t2v_enhanced\npython inference_videotetris.py --num_frames 80\n```\n\n## Example Results\nWe provide only a few example results here; more detailed results can be found on the [project page](https://videotetris.github.io/).\n\u003ctable class=\"center\"\u003e\n    \u003ctr\u003e\n    \u003ctd width=25% style=\"border: none\"\u003e\u003cimg src=\"assets/cat_and_dog.gif\" style=\"width:100%\"\u003e\u003c/td\u003e\n    \u003ctd width=25% style=\"border: none\"\u003e\u003cimg src=\"assets/farmer_and_blacksmith.gif\" style=\"width:100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd width=\"25%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eA cute brown dog on the left and a sleepy cat on the right are napping in the sun. \u003cbr\u003e @16 Frames\u003c/td\u003e\n    \u003ctd width=\"25%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eA cheerful farmer and a hardworking blacksmith are building a barn. 
\u003cbr\u003e @16 Frames\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003ctable class=\"center\"\u003e\n    \u003ctr\u003e\n    \u003ctd width=35% style=\"border: none\"\u003e\u003cimg src=\"assets/1234.gif\" style=\"width:130%\"\u003e\u003c/td\u003e\n    \u003ctd width=35% style=\"border: none\"\u003e\u003cimg src=\"assets/brown2white.gif\" style=\"width:130%\"\u003e\u003c/td\u003e\n  \u003ctr\u003e\n    \u003ctd width=\"35%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eOne cute brown squirrel, on a pile of hazelnuts, cinematic. \u003cbr\u003e ------\u003e  transitions to \u003cbr\u003e\nTwo cute brown squirrels, on a pile of hazelnuts, cinematic. \u003cbr\u003e ------\u003e  transitions to \u003cbr\u003e\nThree cute brown squirrels, on a pile of hazelnuts, cinematic. \u003cbr\u003e ------\u003e  transitions to \u003cbr\u003e\nFour cute brown squirrels, on a pile of hazelnuts, cinematic. \u003cbr\u003e \n @80 Frames\u003c/td\u003e\n    \u003ctd width=\"35%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eA cute brown squirrel, on a pile of hazelnuts, cinematic. \u003cbr\u003e ------\u003e  transitions to \u003cbr\u003e\nA cute brown squirrel and a cute white squirrel, on a pile of hazelnuts, cinematic.  
\u003cbr\u003e\n @240 Frames\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\n\n## Citation\n```\n@article{tian2024videotetris,\n  title={VideoTetris: Towards Compositional Text-to-Video Generation},\n  author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},\n  journal={arXiv preprint arXiv:2406.04277},\n  year={2024}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYangLing0818%2FVideoTetris","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYangLing0818%2FVideoTetris","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYangLing0818%2FVideoTetris/lists"}