{"id":13651183,"url":"https://github.com/ExponentialML/Text-To-Video-Finetuning","last_synced_at":"2025-04-22T22:30:35.801Z","repository":{"id":151546643,"uuid":"617790697","full_name":"ExponentialML/Text-To-Video-Finetuning","owner":"ExponentialML","description":"Finetune ModelScope's Text To Video model using Diffusers 🧨","archived":true,"fork":false,"pushed_at":"2023-12-14T21:59:06.000Z","size":1904,"stargazers_count":664,"open_issues_count":28,"forks_count":107,"subscribers_count":18,"default_branch":"main","last_synced_at":"2024-11-10T02:34:04.263Z","etag":null,"topics":["deep-learning","diffusers","diffusion-models","modelscope","pytorch","stable-diffusion","text-to-video","text2video"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ExponentialML.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-23T05:39:12.000Z","updated_at":"2024-11-02T11:14:00.000Z","dependencies_parsed_at":"2023-10-15T00:32:23.435Z","dependency_job_id":"43370a30-6063-46fa-8abf-6334bf25048f","html_url":"https://github.com/ExponentialML/Text-To-Video-Finetuning","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExponentialML%2FText-To-Video-Finetuning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExponentialML%2FText-To-Video-Finetuning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExponentialML%2FText-To-Video-Finetuning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExponentialML%2FText-To-Video-Finetuning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ExponentialML","download_url":"https://codeload.github.com/ExponentialML/Text-To-Video-Finetuning/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250333863,"owners_count":21413471,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","diffusers","diffusion-models","modelscope","pytorch","stable-diffusion","text-to-video","text2video"],"created_at":"2024-08-02T02:00:46.182Z","updated_at":"2025-04-22T22:30:35.497Z","avatar_url":"https://github.com/ExponentialML.png","language":"Python","readme":"\u003cdiv align=\"center\" width=\"100\" height=\"100\" \u003e\n  \u003cimg src=\"https://github.com/ExponentialML/Text-To-Video-Finetuning/assets/59846140/184f0dce-b77a-45d7-b24d-1814e5b9c314\" /\u003e\n  \u003cdiv align=\"center\" style=\"font-style: italic;\" \u003e\n    \u003ci\u003eVideo Credit: dotsimulate\u003c/i\u003e\n  \u003c/div\u003e\n  \u003cdiv align=\"center\" style=\"font-style: italic;\" 
# Text-To-Video-Finetuning
## Finetune ModelScope's Text To Video model using Diffusers 🧨

## Important Update **2023-12-14**
First of all, a note from me. Thank you all for your support, feedback, and journey through discovering the nascent, innate potential of video diffusion models.

@damo-vilab has released a repository for finetuning all things video diffusion models, and I recommend their implementation over this repository:
https://github.com/damo-vilab/i2vgen-xl

https://github.com/ExponentialML/Text-To-Video-Finetuning/assets/59846140/55608f6a-333a-458f-b7d5-94461c5da8bb

This repository will no longer be updated, but will instead be archived for researchers & builders who wish to bootstrap their projects.
I will be leaving the issues, pull requests, and all related things in place for posterity.

Thanks again!

### Updates
- **2023-7-12**: You can now train a LoRA that is compatible with the [webui extension](https://github.com/kabachuha/sd-webui-text2video)! See instructions [here.](https://github.com/ExponentialML/Text-To-Video-Finetuning#training-a-lora)
- **2023-4-17**: You can now convert your trained models from Diffusers to `.ckpt` format for the A1111 webui. Thanks @kabachuha!
- **2023-4-8**: LoRA training released! Check out `configs/v2/lora_training_config.yaml` for instructions.
- **2023-4-8**: Version 2 is released!
- **2023-3-29**: Added gradient checkpointing support.
- **2023-3-27**: Support for using Scaled Dot-Product Attention for Torch 2.0 users.

## Getting Started

### Requirements & Installation

```bash
git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
cd Text-To-Video-Finetuning
git lfs install
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/
```

## Other Models
Alternatively, you can train starting from other models made by the community.

| Contributor    | Model Name  | Link                                               |
| -------------- | ----------- | -------------------------------------------------- |
| cerspense      | ZeroScope   | https://huggingface.co/cerspense/zeroscope_v2_576w |
| camenduru      | Potat1      | https://huggingface.co/camenduru/potat1            |
| strangeman3107 | animov-512x | https://huggingface.co/strangeman3107/animov-512x  |

### Create Conda Environment (Optional)
It is recommended to install Anaconda.

**Windows Installation:** https://docs.anaconda.com/anaconda/install/windows/

**Linux Installation:** https://docs.anaconda.com/anaconda/install/linux/

```bash
conda create -n text2video-finetune python=3.10
conda activate text2video-finetune
```

### Python Requirements
```bash
pip install -r requirements.txt
```

## Hardware

All code was tested on Python 3.10.9 with Torch versions 1.13.1 and 2.0.

It is **highly recommended** to install Torch >= 2.0. That way, you don't have to install Xformers *or* worry about memory performance.

If you want to use Xformers instead (e.g. on Torch 1.x), you can follow the installation instructions here: https://github.com/facebookresearch/xformers

An RTX 3090 is recommended, but you should be able to train on GPUs with <= 16GB of VRAM with:
- Validation turned off.
- Xformers or Torch 2.0 Scaled Dot-Product Attention.
- Gradient checkpointing enabled.
- Resolution of 256.
- Hybrid LoRA training.
- Training only using LoRA with ranks between 4-16.
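As a quick sanity check that your environment matches these recommendations, here is a minimal sketch (not part of this repository; the printed hints are only suggestions):

```python
# Hypothetical environment check, assuming only that torch is installed.
import torch

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

# Torch >= 2.0 ships Scaled Dot-Product Attention, making Xformers optional.
if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
    print("SDP attention available (Torch >= 2.0).")
else:
    try:
        import xformers  # noqa: F401
        print("Xformers available; enable it in your config.")
    except ImportError:
        print("Neither SDP attention nor Xformers found; expect higher memory usage.")

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.0f} GB VRAM)")
```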
## Preprocessing your data

### Using Captions

You can use caption files when training on images or video. Simply place them into a folder like so:

**Images**: `/images/img.png | /images/img.txt`
**Videos**: `/videos/vid.mp4 | /videos/vid.txt`

Then in your config, make sure to have `-folder` enabled, along with the root directory containing the files.

### Process Automatically

You can automatically caption the videos using the [Video-BLIP2-Preprocessor Script](https://github.com/ExponentialML/Video-BLIP2-Preprocessor).

## Configuration

The configuration uses a YAML config borrowed from the [Tune-A-Video](https://github.com/showlab/Tune-A-Video) repository.

All configuration details are placed in `configs/v2/train_config.yaml`. Each parameter has a definition for what it does.

### How would you recommend I proceed with making a config with my data?

I highly recommend (I did this myself) going to `configs/v2/train_config.yaml`, making a copy of it, and naming it whatever you wish (e.g. `my_train.yaml`).

Then, follow each line and configure it for your specific use case.

The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.

## Training a LoRA

***Please read this section carefully if you are training a LoRA model***

You can also train a LoRA that is compatible with the webui extension.
By default, `lora_version` is set to `'cloneofsimo'`, which was the first LoRA implementation for Stable Diffusion.

You can use the `'cloneofsimo'` version with the `inference.py` file in this repository. It is **not** compatible with the webui.

To train and ***use*** a LoRA with the webui, change the `lora_version` to **"stable_lora"** in your config if you already have one made.

This will train a LoRA compatible with the [A1111 webui extension](https://github.com/kabachuha/sd-webui-text2video).
You can get started with `configs/v2/stable_lora_config.yaml`, where everything is set by default. During and after training, LoRAs will be saved in your outputs directory with the prefix `_webui`.

If you do not choose this setting, you *will not* currently be able to use these in the webui. If you train a Stable LoRA file, you cannot *currently* use it in `inference.py`.

### Continue training a LoRA
To continue training a LoRA, simply set the `lora_path` in your config to the **directory** that contains your LoRA file(s), not an individual file.
Each specific LoRA should have `_unet` or `_text_encoder` in its file name respectively, or else it will not work (a small sanity check for this naming is sketched after the Finetune command below).

You should then be able to resume training from a LoRA model, regardless of which method you use (as long as the trained LoRA matches the version in the config).

### What you cannot do:
- Use LoRA files that were made for SD image models in other trainers.
- Use 'cloneofsimo' LoRAs in another project (unless you build it or create a PR).
- Merge LoRA weights together (yet).

## Finetune.
```bash
python train.py --config ./configs/v2/train_config.yaml
```
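As promised above, here is a minimal sketch (not part of this repository; `check_lora_dir` and the example path are hypothetical) that checks a `lora_path` directory for the `_unet`/`_text_encoder` naming convention before resuming LoRA training:

```python
# Hypothetical helper: verify LoRA file naming before resuming training.
from pathlib import Path

def check_lora_dir(lora_path: str) -> None:
    """Warn about files that lack the `_unet`/`_text_encoder` markers."""
    for f in sorted(Path(lora_path).iterdir()):
        if not f.is_file():
            continue
        if "_unet" in f.name or "_text_encoder" in f.name:
            print(f"OK:      {f.name}")
        else:
            # Per the section above, such files will not work when resuming.
            print(f"WARNING: {f.name} (missing `_unet`/`_text_encoder` marker)")

check_lora_dir("./outputs/train_2023-07-12/lora")  # hypothetical example path
```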
---

## Training Results

With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6.

When finetuning on a single video, you should see results in half as many steps.

After training, you should see your results in your output directory.

By default, it will be placed at the script root under `./outputs/train_<date>`.

From my testing, I recommend:

- Keep the number of sample frames between 4-16. Use long frame generation for inference, *not* training.
- If you have a low VRAM system, you can try single frame training or just use `n_sample_frames: 2`.
- Using a learning rate of about `5e-6` seems to work well in all cases.
- The best quality will always come from training the text encoder. If you're limited on VRAM, disabling it can help.
- Leave some memory free to avoid OOM when saving models during training.

## Running inference
The `inference.py` script can be used to render videos with trained checkpoints.

Example usage:
```bash
python inference.py \
  --model camenduru/potat1 \
  --prompt "a fast moving fancy sports car" \
  --num-frames 60 \
  --window-size 12 \
  --width 1024 \
  --height 576 \
  --sdp
```
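Note that `--window-size` must divide `--num-frames` exactly (12 divides 60 above; see the help text below). A minimal sketch (not part of this repository; `pick_window_size` is a hypothetical helper) for choosing the largest valid window that fits your VRAM budget:

```python
# Hypothetical helper: pick the largest divisor of num_frames <= max_window.
def pick_window_size(num_frames: int, max_window: int) -> int:
    for w in range(min(max_window, num_frames), 0, -1):
        if num_frames % w == 0:
            return w
    return 1  # unreachable fallback; 1 always divides num_frames

print(pick_window_size(60, 16))  # -> 15 (16 does not divide 60, but 15 does)
```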
```
> python inference.py --help

usage: inference.py [-h] -m MODEL -p PROMPT [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR]
                    [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES]
                    [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS]
                    [-g GUIDANCE_SCALE] [-i INIT_VIDEO] [-iw INIT_WEIGHT] [-f FPS]
                    [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] [-rw]

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        HuggingFace repository or path to model checkpoint directory
  -p PROMPT, --prompt PROMPT
                        Text prompt to condition on
  -n NEGATIVE_PROMPT, --negative-prompt NEGATIVE_PROMPT
                        Text prompt to condition against
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save output video to
  -B BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size for inference
  -W WIDTH, --width WIDTH
                        Width of output video
  -H HEIGHT, --height HEIGHT
                        Height of output video
  -T NUM_FRAMES, --num-frames NUM_FRAMES
                        Total number of frames to generate
  -WS WINDOW_SIZE, --window-size WINDOW_SIZE
                        Number of frames to process at once (defaults to full
                        sequence). When less than num_frames, a round robin diffusion
                        process is used to denoise the full sequence iteratively one
                        window at a time. Must divide num_frames exactly!
  -VB VAE_BATCH_SIZE, --vae-batch-size VAE_BATCH_SIZE
                        Batch size for VAE encoding/decoding to/from latents (higher
                        values = faster inference, but more memory usage).
  -s NUM_STEPS, --num-steps NUM_STEPS
                        Number of diffusion steps to run per frame.
  -g GUIDANCE_SCALE, --guidance-scale GUIDANCE_SCALE
                        Scale for guidance loss (higher values = more guidance, but
                        possibly more artifacts).
  -i INIT_VIDEO, --init-video INIT_VIDEO
                        Path to video to initialize diffusion from (will be resized to
                        the specified num_frames, height, and width).
  -iw INIT_WEIGHT, --init-weight INIT_WEIGHT
                        Strength of visual effect of init_video on the output (lower
                        values adhere more closely to the text prompt, but have a less
                        recognizable init_video).
  -f FPS, --fps FPS     FPS of output video
  -d DEVICE, --device DEVICE
                        Device to run inference on (defaults to cuda).
  -x, --xformers        Use XFormers attention, a memory-efficient attention
                        implementation (requires `pip install xformers`).
  -S, --sdp             Use SDP attention, PyTorch's built-in memory-efficient
                        attention implementation.
  -lP LORA_PATH, --lora_path LORA_PATH
                        Path to Low Rank Adaptation checkpoint file (defaults to empty
                        string, which uses no LoRA).
  -lR LORA_RANK, --lora_rank LORA_RANK
                        Size of the LoRA checkpoint's projection matrix (defaults to
                        64).
  -rw, --remove-watermark
                        Post-process the videos with LAMA to inpaint ModelScope's
                        common watermarks.
```
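Since all of these options are plain CLI flags, batch generation is easy to script. A minimal sketch (not part of this repository; the model name and second prompt are just examples):

```python
# Hypothetical batch driver: render one video per prompt via inference.py.
import subprocess

prompts = [
    "a fast moving fancy sports car",
    "a fast moving fancy sports car at night",  # hypothetical second prompt
]

for prompt in prompts:
    subprocess.run(
        [
            "python", "inference.py",
            "--model", "camenduru/potat1",
            "--prompt", prompt,
            "--num-frames", "60",
            "--window-size", "12",
            "--width", "1024",
            "--height", "576",
            "--sdp",
        ],
        check=True,  # stop on the first failed render
    )
```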
## Developing

Please feel free to open a pull request if you have a feature implementation or suggestion! I welcome all contributions.

I've tried to make the code fairly modular, so you can hack away, see how the code works, and see what the implementations do.

## Deprecation
If you want to use the V1 repository, you can use the branch [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/tree/version/first-release).

## Shoutouts

- [Showlab](https://github.com/showlab/Tune-A-Video) and [bryandlee](https://github.com/bryandlee/Tune-A-Video) for their Tune-A-Video contributions that made this much easier.
- [lucidrains](https://github.com/lucidrains) for their implementations around video diffusion.
- [cloneofsimo](https://github.com/cloneofsimo) for their Diffusers implementation of LoRA.
- [kabachuha](https://github.com/kabachuha) for their conversion scripts, training ideas, and webui work.
- [JCBrouwer](https://github.com/JCBrouwer) for the inference implementations.
- [sergiobr](https://github.com/sergiobr) for helpful ideas and bug fixes.

## Citation
If you find this work interesting, consider citing the original [ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571):
```bibtex
@article{ModelScopeT2V,
  title={ModelScope Text-to-Video Technical Report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}
```