{"id":13861776,"url":"https://github.com/guyyariv/TempoTokens","last_synced_at":"2025-07-14T09:33:41.744Z","repository":{"id":196977814,"uuid":"697715713","full_name":"guyyariv/TempoTokens","owner":"guyyariv","description":"This repo contains the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation","archived":false,"fork":false,"pushed_at":"2023-10-26T18:26:39.000Z","size":11225,"stargazers_count":79,"open_issues_count":0,"forks_count":10,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-14T12:49:59.598Z","etag":null,"topics":["ai-art","audio-to-video","audio-visual","deep-learning","diffusion-models","generative-ai","modelscope","pytorch","video-synthesis"],"latest_commit_sha":null,"homepage":"https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guyyariv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-09-28T10:25:03.000Z","updated_at":"2024-04-23T12:47:56.702Z","dependencies_parsed_at":null,"dependency_job_id":"bfd380db-34e5-440e-a4d2-41a0ad157452","html_url":"https://github.com/guyyariv/TempoTokens","commit_stats":null,"previous_names":["guyyariv/tempotokens"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guyyariv%2FTempoTokens","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guyyariv%2FTempoTokens/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guyyariv%2FTemp
oTokens/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guyyariv%2FTempoTokens/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guyyariv","download_url":"https://codeload.github.com/guyyariv/TempoTokens/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225968880,"owners_count":17553153,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-art","audio-to-video","audio-visual","deep-learning","diffusion-models","generative-ai","modelscope","pytorch","video-synthesis"],"created_at":"2024-08-05T06:01:30.055Z","updated_at":"2024-11-22T21:31:08.859Z","avatar_url":"https://github.com/guyyariv.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation\nThis repo contains the official PyTorch implementation of  [*Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation*](https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/)\n\nhttps://github.com/guyyariv/TempoTokens/assets/89798559/753cc371-33a6-4574-b049-0f570f07a389\n\n\n# Abstract\nWe consider the task of generating diverse and realistic videos guided by natural audio samples from\na wide variety of semantic classes. 
For this task, the videos are required to be aligned both\nglobally and temporally with the input audio: globally, the input audio is semantically associated\nwith the entire output video, and temporally, each segment of the input audio is associated with a\ncorresponding segment of that video. We utilize an existing text-conditioned video generation model\nand a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network,\nwhich learns to map the audio-based representation to the input representation expected by the\ntext-to-video generation model. As such, it also enables video generation conditioned on text, audio,\nand, for the first time as far as we can ascertain, on both text and audio.\nWe validate our method extensively on three datasets demonstrating significant semantic diversity\nof audio-video samples and further propose a novel evaluation metric (AV-Align) to assess\nthe alignment of generated videos with input audio samples. AV-Align is based on the detection and\ncomparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches,\nour method generates videos that are better aligned with the input sound, both with respect to\ncontent and temporal axis. 
We also show that videos produced by our method present higher visual\nquality and are more diverse.\n\n\u003ca href=\"https://arxiv.org/abs/2309.16429\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2309.16429-b31b1b.svg\" height=22.5\u003e\u003c/a\u003e\n\u003ca href=\"https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Project\u0026message=Website\u0026color=red\" height=20.5\u003e\u003c/a\u003e \n\n[//]: # ([![Hugging Face Spaces]\u0026#40;https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue\u0026#41;]\u0026#40;https://huggingface.co/spaces/GuyYariv/AudioToken\u0026#41;)\n\n# Installation\n```\ngit clone git@github.com:guyyariv/TempoTokens.git\ncd TempoTokens\npip install -r requirements.txt\n```\nThen initialize an Accelerate environment:\n```angular2html\naccelerate config\n```\nDownload the pre-trained [BEATs](https://github.com/microsoft/unilm/blob/master/beats/BEATs.py) model:\n```\nmkdir -p models/BEATs/ \u0026\u0026 wget -P models/BEATs/ -O \"models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt\" \"https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04\u0026st=2023-03-01T07%3A51%3A05Z\u0026se=2033-03-02T07%3A51%3A00Z\u0026sr=c\u0026sp=rl\u0026sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D\"\n```\n\n# Training\nRun the command for the dataset you want to train on. We trained on [VGGSound](https://huggingface.co/datasets/Loie/VGGSound/tree/main), [Landscape](https://drive.google.com/drive/folders/14A1zaQI5EfShlv3QirgCGeNFzZBzQ3lq), and [AudioSet-Drum](https://www.dropbox.com/s/7ykgybrc8nb3lgf/AudioSet_Drums.zip?dl=0):\n```angular2html\naccelerate launch train.py --config configs/v2/vggsound.yaml\n```\n```angular2html\naccelerate launch train.py --config configs/v2/landscape.yaml\n```\n```angular2html\naccelerate launch train.py --config configs/v2/audioset_drum.yaml\n```\nWe strongly recommend 
reviewing the configuration files and adjusting the parameters to suit your setup.\n\n# Pre-trained weights\nPre-trained weights for the three datasets we trained on are available here: https://drive.google.com/drive/folders/10pRWoq0m5torvMXILmIQd7j9fLPEeHtS\nWe recommend saving the downloaded folders under `models/`.\n\n# Inference\n\nThe ```inference.py``` script generates videos from trained checkpoints.\nAfter training a model with the commands above (or downloading our pre-trained weights),\nyou can generate videos from the datasets we trained on, namely\n[VGGSound](https://huggingface.co/datasets/Loie/VGGSound/tree/main), \n[Landscape](https://drive.google.com/drive/folders/14A1zaQI5EfShlv3QirgCGeNFzZBzQ3lq), \nand [AudioSet-Drum](https://www.dropbox.com/s/7ykgybrc8nb3lgf/AudioSet_Drums.zip?dl=0).\n```angular2html\naccelerate launch inference.py --mapper_weights models/vggsound/learned_embeds.pth --testset vggsound\n```\n```angular2html\naccelerate launch inference.py --mapper_weights models/landscape/learned_embeds.pth --testset landscape\n```\n```angular2html\naccelerate launch inference.py --mapper_weights models/audioset_drum/learned_embeds.pth --testset audioset_drum\n```\nYou can also generate a video from your own audio file:\n```angular2html\naccelerate launch inference.py --mapper_weights models/vggsound/learned_embeds.pth --audio_path /audio/path\n```\n\n```\n\u003e python inference.py --help\n\nusage: inference.py [-h] -m MODEL -p PROMPT [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR]\n                    [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES]\n                    [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS]\n                    [-g GUIDANCE_SCALE] [-i INIT_VIDEO] [-iw INIT_WEIGHT] [-f FPS]\n                    [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] 
[-rw]\n\noptions:\n  -h, --help            show this help message and exit\n  -m MODEL, --model MODEL\n                        HuggingFace repository or path to model checkpoint directory\n  -p PROMPT, --prompt PROMPT\n                        Text prompt to condition on\n  -n NEGATIVE_PROMPT, --negative-prompt NEGATIVE_PROMPT\n                        Text prompt to condition against\n  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\n                        Directory to save output video to\n  -B BATCH_SIZE, --batch-size BATCH_SIZE\n                        Batch size for inference\n  -W WIDTH, --width WIDTH\n                        Width of output video\n  -H HEIGHT, --height HEIGHT\n                        Height of output video\n  -T NUM_FRAMES, --num-frames NUM_FRAMES\n                        Total number of frames to generate\n  -WS WINDOW_SIZE, --window-size WINDOW_SIZE\n                        Number of frames to process at once (defaults to full\n                        sequence). When less than num_frames, a round robin diffusion\n                        process is used to denoise the full sequence iteratively one\n                        window at a time. 
Must be divide num_frames exactly!\n  -VB VAE_BATCH_SIZE, --vae-batch-size VAE_BATCH_SIZE\n                        Batch size for VAE encoding/decoding to/from latents (higher\n                        values = faster inference, but more memory usage).\n  -s NUM_STEPS, --num-steps NUM_STEPS\n                        Number of diffusion steps to run per frame.\n  -g GUIDANCE_SCALE, --guidance-scale GUIDANCE_SCALE\n                        Scale for guidance loss (higher values = more guidance, but\n                        possibly more artifacts).\n  -i INIT_VIDEO, --init-video INIT_VIDEO\n                        Path to video to initialize diffusion from (will be resized to\n                        the specified num_frames, height, and width).\n  -iw INIT_WEIGHT, --init-weight INIT_WEIGHT\n                        Strength of visual effect of init_video on the output (lower\n                        values adhere more closely to the text prompt, but have a less\n                        recognizable init_video).\n  -f FPS, --fps FPS     FPS of output video\n  -d DEVICE, --device DEVICE\n                        Device to run inference on (defaults to cuda).\n  -x, --xformers        Use XFormers attnetion, a memory-efficient attention\n                        implementation (requires `pip install xformers`).\n  -S, --sdp             Use SDP attention, PyTorch's built-in memory-efficient\n                        attention implementation.\n  -lP LORA_PATH, --lora_path LORA_PATH\n                        Path to Low Rank Adaptation checkpoint file (defaults to empty\n                        string, which uses no LoRA).\n  -lR LORA_RANK, --lora_rank LORA_RANK\n                        Size of the LoRA checkpoint's projection matrix (defaults to\n                        64).\n  -rw, --remove-watermark\n                        Post-process the videos with LAMA to inpaint ModelScope's\n                        common watermarks.\n```\n\n# Acknowledgments\nOur code is partially built 
upon [Text-To-Video-Finetuning](https://github.com/ExponentialML/Text-To-Video-Finetuning)\n\n# Cite\nIf you use our work in your research, please cite the following paper:\n```\n@misc{yariv2023diverse,\n      title={Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation}, \n      author={Guy Yariv and Itai Gat and Sagie Benaim and Lior Wolf and Idan Schwartz and Yossi Adi},\n      year={2023},\n      eprint={2309.16429},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n\n# License\nThis repository is released under the MIT license as found in the [LICENSE](LICENSE) file. \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguyyariv%2FTempoTokens","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguyyariv%2FTempoTokens","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguyyariv%2FTempoTokens/lists"}