{"id":13631835,"url":"https://github.com/Text-to-Audio/Make-An-Audio","last_synced_at":"2025-04-17T22:32:05.453Z","repository":{"id":192213720,"uuid":"654834995","full_name":"Text-to-Audio/Make-An-Audio","owner":"Text-to-Audio","description":"PyTorch Implementation of Make-An-Audio (ICML'23) with a Text-to-Audio Generative Model","archived":false,"fork":false,"pushed_at":"2024-05-22T03:27:58.000Z","size":984,"stargazers_count":727,"open_issues_count":5,"forks_count":107,"subscribers_count":71,"default_branch":"main","last_synced_at":"2024-08-01T22:50:29.039Z","etag":null,"topics":["diffusion-models","latent-diffusion","latent-space","text-to-audio","video-to-audio"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Text-to-Audio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-17T04:56:46.000Z","updated_at":"2024-08-01T12:10:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"23a6e0ad-e898-4fc5-b964-b34346d790a0","html_url":"https://github.com/Text-to-Audio/Make-An-Audio","commit_stats":null,"previous_names":["text-to-audio/make-an-audio"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Text-to-Audio%2FMake-An-Audio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Text-to-Audio%2FMake-An-Audio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Text-to-Audio%2FMake-An-Audio/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/Text-to-Audio%2FMake-An-Audio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Text-to-Audio","download_url":"https://codeload.github.com/Text-to-Audio/Make-An-Audio/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223768657,"owners_count":17199357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","latent-diffusion","latent-space","text-to-audio","video-to-audio"],"created_at":"2024-08-01T22:02:40.157Z","updated_at":"2024-11-08T23:31:43.071Z","avatar_url":"https://github.com/Text-to-Audio.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models\n\n#### Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao\n\nPyTorch Implementation of [Make-An-Audio (ICML'23)](https://arxiv.org/abs/2301.12661): a conditional diffusion probabilistic model capable of generating high fidelity audio efficiently from X modality.\n\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2301.12661)\n[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio)\n[![GitHub Stars](https://img.shields.io/github/stars/Text-to-Audio/Make-An-Audio?style=social)](https://github.com/Text-to-Audio/Make-An-Audio)\n\nWe provide our implementation and pretrained models 
as open source in this repository.\n\nVisit our [demo page](https://text-to-audio.github.io/) for audio samples.\n\n[Text-to-Audio HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio) | [Audio Inpainting HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio_inpaint)\n\n## News\n- Jan, 2023: **[Make-An-Audio](https://arxiv.org/abs/2301.12661)** submitted to arXiv.\n- August, 2023: **[Make-An-Audio](https://arxiv.org/abs/2301.12661) (ICML 2023)** released on GitHub.\n\n## Quick Start\nWe provide an example of how you can generate high-fidelity samples using Make-An-Audio.\n\nTo try it on your own dataset, clone this repo onto a local machine with an NVIDIA GPU, CUDA, and cuDNN, then follow the instructions below.\n\n\n### Supported Datasets and Pretrained Models\n\nDownload the model weights from [Google Drive](https://drive.google.com/drive/folders/1zZTI3-nHrUIywKFqwxlFO6PjB66JA8jI?usp=drive_link).\nDownload the CLAP weights from [Hugging Face](https://huggingface.co/microsoft/msclap/blob/main/CLAP_weights_2022.pth).\n\n```\nDownload:\n    maa1_full.ckpt and put it into ./useful_ckpts  \n    BigVGAN vocoder and put it into ./useful_ckpts  \n    CLAP_weights_2022.pth and put it into ./useful_ckpts/CLAP\n```\nThe directory structure should be:\n```\nuseful_ckpts/\n├── bigvgan\n│   ├── args.yml\n│   └── best_netG.pt\n├── CLAP\n│   ├── config.yml\n│   └── CLAP_weights_2022.pth\n└── maa1_full.ckpt\n```\n\n\n### Dependencies\nSee the requirements in `requirement.txt`.\n\n## Inference with pretrained model\n```bash\npython gen_wav.py --prompt \"a bird chirps\" --ddim_steps 100 --duration 10 --scale 3 --n_samples 1 --save_name \"results\"\n```\n# Train\n## Dataset preparation\nWe cannot provide dataset download links because of copyright restrictions, but we do provide the preprocessing code that generates the mel-spectrograms (melspec).  
\nBefore training, we need to record the dataset information in a tsv file with the following columns: name (an id for each audio), dataset (which dataset the audio belongs to), audio_path (the path of the .wav file), caption (the caption of the audio), and mel_path (the path of the processed melspec file for each audio). We provide a tsv file for the audiocaps test set, ./data/audiocaps_test.tsv, as a sample.\n### Generate the melspec file of audio\nAssume you already have a tsv file that links each caption to its audio_path, that is, the tsv file has \"name\", \"audio_path\", \"dataset\" and \"caption\" columns.\nTo get the melspec of each audio, run the following command, which will save the mels in ./processed\n```bash\npython preprocess/mel_spec.py --tsv_path tmp.tsv --num_gpus 1 --max_duration 10\n```\n## Train variational autoencoder\nAssume we have processed several datasets and saved the .tsv files in data/*.tsv. Replace **data.params.spec_dir_path** in the config file with **data** (the directory that contains the tsv files). Then we can train the VAE with the following command. 
If your machine does not have 8 GPUs, replace the --gpus argument with 0,1,...,gpu_nums\n```bash\npython main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7\n```\nThe training results will be saved in ./logs/\n## Train latent diffusion\nAfter training the VAE, set model.params.first_stage_config.params.ckpt_path in the config file to the path of your trained VAE checkpoint.\nRun the following command to train the diffusion model\n```bash\npython main.py --base configs/train/diffusion.yaml -t --gpus 0,1,2,3,4,5,6,7\n```\nThe training results will be saved in ./logs/\n# Evaluation\n## Generate audiocaps samples\n```bash\npython gen_wavs_by_tsv.py --tsv_path data/audiocaps_test.tsv --save_dir audiocaps_gen\n```\n\n## Calculate FD, FAD, IS, KL\nInstall [audioldm_eval](https://github.com/haoheliu/audioldm_eval) by cloning it:\n```bash\ngit clone git@github.com:haoheliu/audioldm_eval.git\n```\nThen test with:\n```bash\npython scripts/test.py --pred_wavsdir {the directory that saves the audios you generated} --gt_wavsdir {the directory that saves audiocaps test set waves}\n```\n## Calculate CLAP score\n```bash\npython wav_evaluation/cal_clap_score.py --tsv_path {the directory that saves the audios you generated}/result.tsv\n```\n# X-To-Audio\n## Audio2Audio\n```bash\npython scripts/audio2audio.py --prompt \"a bird chirping\" --strength 0.3 --init-audio sample.wav --ckpt useful_ckpts/maa1_full.ckpt --vocoder_ckpt useful_ckpts/bigvgan --config configs/text_to_audio/txt2audio_args.yaml --outdir audio2audio_samples\n```\n\n## Acknowledgements\nThis implementation uses parts of the code from the following GitHub repos:\n[CLAP](https://github.com/LAION-AI/CLAP),\n[Stable Diffusion](https://github.com/CompVis/stable-diffusion),\nas described in our code.\n\n## Citations ##\nIf you find this code useful in your research, please consider citing:\n```bibtex\n@article{huang2023make,\n  title={Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models},\n  author={Huang, Rongjie and 
Huang, Jiawei and Yang, Dongchao and Ren, Yi and Liu, Luping and Li, Mingze and Ye, Zhenhui and Liu, Jinglin and Yin, Xiang and Zhao, Zhou},\n  journal={arXiv preprint arXiv:2301.12661},\n  year={2023}\n}\n```\n\n## Disclaimer ##\nAny organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FText-to-Audio%2FMake-An-Audio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FText-to-Audio%2FMake-An-Audio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FText-to-Audio%2FMake-An-Audio/lists"}