{"id":29009906,"url":"https://github.com/tencentarc/ditctrl","last_synced_at":"2025-06-25T15:33:41.823Z","repository":{"id":269574493,"uuid":"907796924","full_name":"TencentARC/DiTCtrl","owner":"TencentARC","description":"[CVPR 2025] Official code of \"DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation\"","archived":false,"fork":false,"pushed_at":"2025-03-30T07:22:03.000Z","size":44979,"stargazers_count":245,"open_issues_count":2,"forks_count":5,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-03-30T08:24:25.207Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-24T12:11:03.000Z","updated_at":"2025-03-30T07:22:07.000Z","dependencies_parsed_at":"2025-03-30T08:31:11.642Z","dependency_job_id":null,"html_url":"https://github.com/TencentARC/DiTCtrl","commit_stats":null,"previous_names":["tencentarc/ditctrl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/DiTCtrl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDiTCtrl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDiTCtrl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDiTCtrl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDiTCt
rl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/DiTCtrl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDiTCtrl/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261901407,"owners_count":23227593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-25T15:33:40.534Z","updated_at":"2025-06-25T15:33:41.786Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation\n\n\n\n**[Minghong Cai\u003csup\u003e1 \u0026dagger;\u003c/sup\u003e](https://onevfall.github.io/personal_page/), \n[Xiaodong Cun\u003csup\u003e2\u003c/sup\u003e](https://vinthony.github.io/academic/), \n[Xiaoyu Li\u003csup\u003e3 \u0026#9993;\u003c/sup\u003e](https://xiaoyu258.github.io/), \n[Wenze Liu\u003csup\u003e1\u003c/sup\u003e](https://openreview.net/profile?id=~Wenze_Liu1), \n[Zhaoyang Zhang\u003csup\u003e3\u003c/sup\u003e](https://zzyfd.github.io/#/), \n[Yong Zhang\u003csup\u003e4\u003c/sup\u003e](https://yzhang2016.github.io/), \n[Ying Shan\u003csup\u003e3\u003c/sup\u003e](https://www.linkedin.com/in/YingShanProfile/), \n[Xiangyu Yue\u003csup\u003e1 \u0026#9993;\u003c/sup\u003e](https://xyue.io/)**\n\u003cbr\u003e\n\u003csup\u003e1\u003c/sup\u003eMMLab, The Chinese University of Hong 
Kong\n\u003csup\u003e2\u003c/sup\u003eGVC Lab, Great Bay University\n\u003csup\u003e3\u003c/sup\u003eARC Lab, Tencent PCG\n\u003csup\u003e4\u003c/sup\u003eTencent AI Lab\n\u003cbr\u003e\n\u0026dagger;: Intern at ARC Lab, Tencent PCG, \u0026#9993;: Corresponding Authors\n\n\u003ca href='https://arxiv.org/abs/2412.18597'\u003e\u003cimg src='https://img.shields.io/badge/ArXiv-2412.18597-red'\u003e\u003c/a\u003e \n\u003ca href='https://onevfall.github.io/project_page/ditctrl/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e\n\n\n\u003cdiv \u003e\n    \u003cimg src=\"assets/teaser.gif\" \u003e\n\u003c/div\u003e\n\n## 📋 News\n- [2024.12.24] Released code and demo on CogVideoX-2B!\n- [2025.1.3] Released code for DiT attention map visualization.\n- [2025.2.27] This paper was accepted to CVPR 2025.\n- [2025.3.26] Updated the arXiv paper to v2.\n\n## 🔆Demo\n\n### Longer Multi-Prompt Text-to-video Generation\n\n![mp_demo](assets/mp_oneline.gif)\n\n\u003cbr\u003e\n\n### Longer Single-Prompt Text-to-video Generation\nOur method also works naturally for longer single-prompt video generation: simply set all of the sequential prompts to the same text. This shows that our method can improve the consistency of long videos generated from a single prompt.\n\n![sp_demo](assets/sp_oneline.gif)\n\n\u003cbr\u003e\n\n### Video Editing\nBy removing the latent blending strategy from DiTCtrl,\nwe can achieve **Word Swap** video editing like [prompt-to-prompt](https://github.com/google/prompt-to-prompt).\nSpecifically, we use only the KV-sharing strategy to share keys and values from the source prompt (P_source) branch,\nso that the synthesized video preserves the original composition \nwhile also reflecting the content of the new prompt P_target.\n\n![word_swap](assets/word_swap_2.gif)\n\nSimilar to [prompt-to-prompt](https://github.com/google/prompt-to-prompt), \nby reweighting the specific columns and rows corresponding to a specified token (e.g. 
\"pink\") \nin the MM-DiT's Text-Video attention and Video-Text attention, \nwe can also achieve the video editing performance of **Reweight**.\n\n![video_reweight](assets/video_reweight_1.gif)\n\n\n## 🎏 Abstract\n\u003cb\u003eTL; DR: \u003cfont color=\"red\"\u003eDiTCtrl\u003c/font\u003e is the first tuning-free approach based on MM-DiT architecture for coherent multi-prompt video generation. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions.\u003c/b\u003e\n\n\n\u003cdetails\u003e\u003csummary\u003eCLICK for the full abstract\u003c/summary\u003e\n\n\n\u003e Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. \nHowever, the current video generation models predominantly focus on single-prompt, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. \nWhile some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. \nTo address these problems, we propose \u003cfont color=\"red\"\u003eDiTCtrl\u003c/font\u003e, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. \nOur key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. \nTo achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. 
\nBased on our careful design, the video generated by \u003cfont color=\"red\"\u003eDiTCtrl\u003c/font\u003e achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. \nBesides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. \nExtensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.\n\u003c/details\u003e\n\n\n\n\n## 🛡 Setup Environment\nOur method is tested with CUDA 12 on a single A100 or V100 GPU.\n\n```bash\ncd DiTCtrl\n\nconda create -n ditctrl python=3.10\nconda activate ditctrl\n\npip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121\n\npip install -r requirements.txt\n\nconda install https://anaconda.org/xformers/xformers/0.0.28.post1/download/linux-64/xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1.tar.bz2\n```\n\nOur environment is similar to that of [CogVideo](https://github.com/THUDM/CogVideo/blob/main/sat/README.md). See its README for more details.\n\n\n## ⚙️ Download CogVideoX-2B Model Weights\n\nFirst, download the CogVideoX-2B model weights as follows (instructions copied from [CogVideoX](https://github.com/THUDM/CogVideo/blob/main/sat/README.md)): \n\n```\ncd sat\nmkdir CogVideoX-2b-sat\ncd CogVideoX-2b-sat\nwget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1\nmv 'index.html?dl=1' vae.zip\nunzip vae.zip\nwget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1\nmv 'index.html?dl=1' transformer.zip\nunzip transformer.zip\n```\n\nArrange the model files in the following structure:\n\n```\nCogVideoX-2b-sat/\n├── transformer\n│   ├── 1000 (or 1)\n│   │   └── mp_rank_00_model_states.pt\n│   └── latest\n└── vae\n    └── 3d-vae.pt\n```\n\nSince model weight files are large, it’s recommended to use `git lfs`.  
\nSee [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing) for `git lfs` installation.\n\n```\ngit lfs install\n```\n\nNext, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning.\n\u003e You may also download the model files from [ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b).\n\n```\ngit clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Hugging Face\n# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from ModelScope\nmkdir t5-v1_1-xxl\nmv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl\n```\n\nThis yields T5 files in safetensors format that can be loaded without error during DeepSpeed fine-tuning.\n\n\n```\n├── added_tokens.json\n├── config.json\n├── model-00001-of-00002.safetensors\n├── model-00002-of-00002.safetensors\n├── model.safetensors.index.json\n├── special_tokens_map.json\n├── spiece.model\n└── tokenizer_config.json\n\n0 directories, 8 files\n\n```\n\n### ❓ FAQ\n\n**Q: I'm getting a `safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge` error. What should I do?**\n\n**A:** This usually means the T5 model was not downloaded correctly. Check the size of the `t5-v1_1-xxl` folder; it should be around **8.9GB**. If it is not, the download may have been affected by Hugging Face network issues. You can download through [hf-mirror](https://hf-mirror.com/) with the following commands:\n\n```\nexport HF_ENDPOINT=https://hf-mirror.com\nhuggingface-cli download THUDM/CogVideoX-2b --local-dir ./CogVideoX-2b\n```\n\n\nFinally, your file structure should look like this:\n\n```bash\nsat/\n├── CogVideoX-2b-sat/\n  ├── transformer\n  ├── CogVideoX-2b\n  ├── t5-v1_1-xxl\n  ├── vae\n├── configs/\n├── inference_case_configs/\n├── run_multi_prompt.sh\n├── run_single_prompt.sh\n├── run_edit_video.sh \n├── sample_video.py\n├── sample_video_edit.py\n├── README.md\n├── LICENSE\n├── ...\n```\n\n## 💫 Get Started\n\n\n### 1. 
Longer Multi-Prompt Text-to-Video\n\n```bash\n  cd sat\n  bash run_multi_prompt.sh\n```\n\n### 2. Longer Single-Prompt Text-to-Video\n\n```bash\n  cd sat\n  bash run_single_prompt.sh\n```\n\n### 3. Video Editing\n\n```bash\n  cd sat\n  bash run_edit_video.sh\n```\n\n### Custom config\n\nTake `run_multi_prompt.sh` as an example:\n\n```bash\ninference_case_config=\"inference_case_configs/multi_prompts/rose.yaml\"\nrun_cmd=\"$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --custom-config $inference_case_config\"\necho ${run_cmd}\neval ${run_cmd}\n```\n\nThe custom config is a config file in the `inference_case_configs` folder. Put your custom config files in that folder; their values **override** the defaults in the `configs/inference.yaml` file.\n\nTake `rose.yaml` as an example:\n\n```yaml\nargs:\n  is_run_isolated: False  # If True, generates the isolated videos without using our method\n  seed: 42\n  output_dir: outputs/multi_prompt_case/rose  # The output directory\n  prompts:     # Put your prompts here to generate multi-prompt long videos\n    - \"A gentle close shot of the same rose petal, where the camera gradually pulls back to reveal the entire unfurling bloom in its perfect symmetry.\"\n    - \"A steady medium shot of the rose, where the camera continues retreating to show the full stem with its leaves and neighboring buds.\"\n    - \"A smooth full shot of the rose bush, where the camera moves further back to encompass the entire garden bed and surrounding flowering plants.\"\n```\nFor more details about the custom config, refer to the `configs/inference.yaml` file.\nRunning the command writes the generated video to the `outputs/multi_prompt_case/rose` folder.\n\n### How to create your own prompts with a Large Language Model\n\n**Single-prompts**: Please refer to the [CogVideoX](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py) 
instruction.\n\n**Multi-prompts**: First, look at the example prompts in the `inference_case_configs/multi_prompts` folder for inspiration. Then use one of the two instruction files in the `prompts_gen_instruction` folder to generate your own multi-prompts. You can try both and iterate with the LLM until you get the prompts you want.\n\n- [Presto](prompts_gen_instruction/presto.md): Modified from Presto's instruction, focusing on realistic cinematographic sequences with natural camera movements and temporal progression (ideal for documentary-style or realistic scenarios).\n- [DiTCtrl](prompts_gen_instruction/ditctrl.md): Our custom instruction for DiTCtrl, emphasizing creative scene transitions and imaginative scenarios (perfect for artistic and fantasy-based video generation).\n\n### How to visualize the attention maps\n\nThe attention map visualization code is also provided. Run it with:\n\n```bash\n  cd sat\n  bash run_visualize.sh\n```\n\n## 🚧 Todo\n\n\n- [x] Release paper on arXiv\n- [x] Release code based on \u003ca href='https://github.com/THUDM/CogVideo'\u003eCogVideoX-2B\u003c/a\u003e\n- [x] Visualization of attention maps\n- [x] Metrics and t-SNE visualization\n- [ ] Diffuser version of DiTCtrl on CogVideoX-2B\n\n\n\n## 😉 Citation\n\n```bibtex\n@article{cai2024ditctrl,\n  title     = {DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation},\n  author    = {Cai, Minghong and Cun, Xiaodong and Li, Xiaoyu and Liu, Wenze and Zhang, Zhaoyang and Zhang, Yong and Shan, Ying and Yue, Xiangyu},\n  journal   = {arXiv:2412.18597},\n  year      = {2024},\n}\n```\n\n## 📚 Acknowledgements\nOur codebase builds on [CogVideoX](https://github.com/THUDM/CogVideo), [MasaCtrl](https://github.com/TencentARC/MasaCtrl), [MimicMotion](https://github.com/Tencent/MimicMotion), [FreeNoise](https://github.com/AILab-CVC/FreeNoise), and [prompt-to-prompt](https://github.com/google/prompt-to-prompt). 
\nThanks to the authors for sharing their awesome codebases! Thanks also to the concurrent training-based work [Presto](https://presto-video.github.io/#gallery) for providing the scene description instruction; the first case is inspired by the scene description from [Presto](https://presto-video.github.io/#gallery). Thanks for the great work!\n\n## License\n\nThis project is released under [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fditctrl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fditctrl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fditctrl/lists"}