{"id":14476021,"url":"https://github.com/Bai-YT/ConsistencyTTA","last_synced_at":"2025-08-29T15:32:48.086Z","repository":{"id":220587619,"uuid":"752039360","full_name":"Bai-YT/ConsistencyTTA","owner":"Bai-YT","description":"ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation","archived":false,"fork":false,"pushed_at":"2024-11-20T05:23:01.000Z","size":5711,"stargazers_count":32,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-26T18:32:06.812Z","etag":null,"topics":["audio-generation","audio-processing","consistency-models","diffusion-models","ldm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bai-YT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-02T21:56:20.000Z","updated_at":"2024-11-20T05:23:06.000Z","dependencies_parsed_at":"2024-06-15T01:28:01.752Z","dependency_job_id":"93dd7c8f-0099-46cd-bc4e-a53b61df95e4","html_url":"https://github.com/Bai-YT/ConsistencyTTA","commit_stats":null,"previous_names":["bai-yt/consistencytta"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Bai-YT/ConsistencyTTA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bai-YT%2FConsistencyTTA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bai-YT%2FConsistencyTTA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bai-YT%2FC
onsistencyTTA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bai-YT%2FConsistencyTTA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bai-YT","download_url":"https://codeload.github.com/Bai-YT/ConsistencyTTA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bai-YT%2FConsistencyTTA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272710033,"owners_count":24980352,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-29T02:00:10.610Z","response_time":87,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-generation","audio-processing","consistency-models","diffusion-models","ldm"],"created_at":"2024-09-02T15:01:09.441Z","updated_at":"2025-08-29T15:32:47.015Z","avatar_url":"https://github.com/Bai-YT.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation\n\nThis is the **official** code implementation for the paper \\\n*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \\\nfrom Microsoft Applied Science Group and UC Berkeley \\\nby [Yatong Bai](https://bai-yt.github.io),\n[Trung 
Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),\n[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),\n[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),\nand [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi).\n\n**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]** \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]** \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n**[[Project Homepage](https://consistency-tta.github.io)]** \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \\\n**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]** \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]** \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**\n\n\n## Description\n\n**2024/06 Updates:**\n\n- We have hosted an interactive live demo of ConsistencyTTA at [🤗 Huggingface](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).\n- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! 
We look forward to meeting you in Kos Island.\n- We added a simpler inference-only implementation to the [`easy_inference`](https://github.com/Bai-YT/ConsistencyTTA/tree/main/easy_inference) directory of this repo.\n\nThis work proposes a *consistency distillation* framework to train text-to-audio (TTA)\ngeneration models that only require a single neural network query,\nreducing the computation of the core step of diffusion-based TTA models by a factor of 400.\nBy incorporating *classifier-free guidance* into the distillation framework,\nour models retain diffusion models' impressive generation quality and diversity.\nFurthermore, the non-recurrent differentiable structure of the consistency model allows\nfor end-to-end fine-tuning with novel loss functions such as the CLAP score,\nfurther boosting performance.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"main_figure.png\" alt=\"ConsistencyTTA Results\" title=\"Results\" width=\"450\"/\u003e \u0026nbsp;\u0026nbsp;\n    \u003cvideo width=\"216\" height=\"384\" controls\u003e\n        \u003csource src=\"demo_video.mp4\" type=\"video/mp4\"\u003e\n        Your browser does not support the video tag.\n    \u003c/video\u003e\n\u003c/p\u003e\n\n\n## Getting Started\n\nThis codebase performs training, evaluation, and inference.\nIf you only wish to do inference, there is a simpler implementation at [`easy_inference`](https://github.com/Bai-YT/ConsistencyTTA/tree/main/easy_inference).\n\nThis codebase uses PyTorch as the central implementation tool, with extensive usage of HuggingFace's Accelerator package.\nThe required packages can be found in `environment.yml`.\n\n\n### Model Checkpoints\n\nWe share three model checkpoints:\n- [ConsistencyTTA directly distilled from a diffusion model](\n  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);\n- [ConsistencyTTA fine-tuned by optimizing the CLAP score](\n  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);\n- 
[The diffusion teacher model from which ConsistencyTTA is distilled](\n  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).\n\nThe first two models are capable of high-quality single-step text-to-audio generation. Generations are 10 seconds long.\n\nThese model checkpoints are available on our [Huggingface page](https://huggingface.co/Bai-YT/ConsistencyTTA).\nAfter downloading and unzipping the files, place them in the `saved` directory.\n\n\n### Dataset\n\nConsistencyTTA models are trained on the [AudioCaps](https://audiocaps.github.io) dataset.\nPlease download the dataset following the instructions on their website (we cannot share the data).\n\nThe `.json` files in the `data` directory are used for training and evaluation.\nOnce you have downloaded your version of the data,\nyou should be able to map it to our format using the file IDs provided in the `.json` files.\nPlease modify the file locations in the `.json` files accordingly.\n\n\n## Running and Training ConsistencyTTA\n\n### Quickstart Demo\n\nTo run an interactive demo, in which the model generates audio from the user's input prompts, run the following script:\n```\npython demo.py --original_args saved/ConsistencyTTA/summary.jsonl \\\n    --model saved/ConsistencyTTA/epoch_60/pytorch_model_2.bin --use_ema\n```\n\nSome example prompts include:\n- Food sizzling with some knocking and banging followed by a dog barking.\n- Train diesel engine rumbling and a baby crying.\n\n\n### Training\n\nThe training of our consistency model contains three distillation phases:\n\n1. (Optional) Distill a diffusion model with adjustable guidance strength.\n2. Perform the consistency distillation.\n3. 
(Optional) Optimize the CLAP score to finetune.\n\nThe file `train.sh` contains the training script for all three stages.\nThe trained model checkpoints will be stored in the `/saved` directory.\n\nThe teacher model for our distilled consistency models is based on [TANGO](https://github.com/declare-lab/tango),\na state-of-the-art TTA generation framework based on latent diffusion models.\n\nThe training script should automatically download the AudioLDM weights from [here](https://zenodo.org/record/7600541/files/audioldm-s-full?download=1).\nHowever, if the download is slow or if you face any other issues, then you can:\ni) download the `audioldm-s-full` file from [here](https://huggingface.co/haoheliu/AudioLDM-S-Full/tree/main),\nii) rename it to `audioldm-s-full.ckpt`,\nand iii) keep it in the `/home/user/.cache/audioldm/` directory.\n\nFor fine-tuning and evaluating with CLAP, we use [this](https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt)\nCLAP model checkpoint from [this](https://github.com/LAION-AI/CLAP) repository.\nAfter downloading, place it into the `/ckpt` directory.\n\nOn two Nvidia RTX 6000 Ada GPUs, Stage 1 (40 epochs) should take ~40 hours,\nStage 2 (60 epochs) should take ~80 hours, and Stage 3 (10 epochs) ~30 hours.\n\n\n### Evaluation\n\nTo perform inference using a trained consistency model and evaluate the generated audio clips, please refer to `inference.sh`.\nThe generated audio clips will be stored in the `/outputs` directory.\n\nTo evaluate existing audio generations, use `evaluate_existing.py`.\nAn example script is in `inference.sh`.\n\n\n## Main Experiment Results\n\nOur evaluation metrics include Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores.\n\n|                              | # queries (↓)    | CLAP\u003csub\u003eT\u003c/sub\u003e (↑) | CLAP\u003csub\u003eA\u003c/sub\u003e (↑) | FAD (↓) | FD (↓) | KLD (↓) 
|\n|------------------------------|------------------|----------------------|---------------------|---------|--------|---------|\n| Diffusion (Baseline)         | 400              | 24.57                | 72.79                   | 1.908   | 19.57  | 1.350   |\n| Consistency + CLAP FT (Ours) | 1                | 24.69                | 72.54                   | 2.406   | 20.97  | 1.358   |\n| Consistency (Ours)           | 1                | 22.50                | 72.30                   | 2.575   | 22.08  | 1.354   |\n\n[This Papers with Code benchmark](https://paperswithcode.com/sota/audio-generation-on-audiocaps) demonstrates how our single-step models\nstack up against previous methods, most of which require hundreds of generation steps.\n\n\n## Cite Our Work (BibTeX)\n\n```bibtex\n@inproceedings{bai2024consistencytta,\n  title={ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},\n  author={Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},\n  booktitle = {INTERSPEECH},\n  year = {2024}\n}\n```\n\n\n## Acknowledgement and Trademarks\n\n**Third-Party Code.** The structure of this repository roughly follows [TANGO](https://github.com/declare-lab/tango),\nwhich in turn heavily relies on [Diffusers](https://huggingface.co/docs/diffusers) and [AudioLDM](https://github.com/haoheliu/AudioLDM).\nWe made modifications in the `audioldm`, `audioldm_eval`, and `diffusers` directories for training and evaluating ConsistencyTTA.\nWe sincerely appreciate the authors of these repositories for open-sourcing them.\nPlease refer to `NOTICE.md` for license information.\n\n**Trademarks.** This project may contain trademarks or logos for projects, products, or services.\nAuthorized use of Microsoft trademarks or logos is subject to and must follow\n[Microsoft’s Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of 
Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party’s policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBai-YT%2FConsistencyTTA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBai-YT%2FConsistencyTTA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBai-YT%2FConsistencyTTA/lists"}