{"id":13456689,"url":"https://github.com/haoheliu/AudioLDM2","last_synced_at":"2025-03-24T11:31:03.160Z","repository":{"id":186779077,"uuid":"674695560","full_name":"haoheliu/AudioLDM2","owner":"haoheliu","description":"Text-to-Audio/Music Generation","archived":false,"fork":false,"pushed_at":"2024-09-29T15:05:21.000Z","size":3910,"stargazers_count":2389,"open_issues_count":65,"forks_count":187,"subscribers_count":46,"default_branch":"main","last_synced_at":"2025-03-19T19:11:48.033Z","etag":null,"topics":["audio-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/haoheliu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-04T14:43:09.000Z","updated_at":"2025-03-19T12:30:20.000Z","dependencies_parsed_at":"2024-06-02T15:50:11.814Z","dependency_job_id":"cff1319f-5849-457b-af21-1c54f181f0d8","html_url":"https://github.com/haoheliu/AudioLDM2","commit_stats":null,"previous_names":["haoheliu/audioldm2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haoheliu%2FAudioLDM2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haoheliu%2FAudioLDM2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haoheliu%2FAudioLDM2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haoheliu%2FAudioLDM2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/haoheliu","download_url":"https://cod
eload.github.com/haoheliu/AudioLDM2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245260773,"owners_count":20586462,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-generation"],"created_at":"2024-07-31T08:01:26.073Z","updated_at":"2025-03-24T11:31:02.563Z","avatar_url":"https://github.com/haoheliu.png","language":"Python","funding_links":[],"categories":["Python","GitHub projects","语音合成","Music Generation"],"sub_categories":["网络服务_其他"],"readme":"# AudioLDM 2\n\n[![arXiv](https://img.shields.io/badge/arXiv-2308.05734-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2308.05734)  [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github\u0026style=flat-square)](https://audioldm.github.io/audioldm2/)  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)  \n\nThis repo currently supports Text-to-Audio (including Music), Text-to-Speech Generation and Super Resolution Inpainting.\n\n\u003chr\u003e\n\n## Change Log\n\n- 2023-08-27: Added two new checkpoints! \n  - 🌟 **48kHz AudioLDM model**: Now we support high-fidelity audio generation! 
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/AudioLDM_48K_Text-to-HiFiAudio_Generation)  \n  - **16kHz improved AudioLDM model**: Trained with more data and optimized model architecture.\n\n## TODO\n\n- [x] Add the text-to-speech checkpoint\n- [x] Open-source the [AudioLDM training code](https://github.com/haoheliu/AudioLDM-training-finetuning).\n- [x] Support the generation of longer audio (\u003e 10s)\n- [x] Optimize the inference speed of the model.\n- [x] Integrate with the Diffusers library (see [🧨 Diffusers](#hugging-face--diffusers))\n- [ ] Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PRs welcome, same logic as [AudioLDMv1](https://github.com/haoheliu/AudioLDM))\n\n## Web APP\n\n1. Prepare the running environment\n\n```shell\nconda create -n audioldm python=3.8; conda activate audioldm\npip3 install git+https://github.com/haoheliu/AudioLDM2.git\ngit clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2\n```\n\n2. Start the web application (powered by Gradio)\n\n```shell\npython3 app.py\n```\n\n3. A link will be printed out. Click the link to open the browser and play.\n\n## Commandline Usage\n\n### Installation\n\nPrepare the running environment\n\n```shell\n# Optional\nconda create -n audioldm python=3.8; conda activate audioldm\n# Install AudioLDM\npip3 install git+https://github.com/haoheliu/AudioLDM2.git\n```\n\nIf you plan to play around with text-to-speech generation, please also make sure you have installed [espeak](https://espeak.sourceforge.net/download.html). 
On Linux, you can install it with:\n\n```shell\nsudo apt-get install espeak\n```\n\n### Run the model in commandline\n\n- Generate a sound effect or music from a text prompt\n\n```shell\naudioldm2 -t \"Musical constellations twinkling in the night sky, forming a cosmic melody.\"\n```\n\n- Generate sound effects or music from a list of text prompts\n\n```shell\naudioldm2 -tl batch.lst\n```\n\n- Generate speech based on (1) the transcription and (2) the description of the speaker\n\n```shell\naudioldm2 -t \"A female reporter is speaking full of emotion\" --transcription \"Wish you have a good day\"\n\naudioldm2 -t \"A female reporter is speaking\" --transcription \"Wish you have a good day\"\n```\n\nText-to-Speech uses the *audioldm2-speech-gigaspeech* checkpoint by default. To run TTS with the LJSpeech pretrained checkpoint, set *--model_name audioldm2-speech-ljspeech*.\n\n## Random Seed Matters\n\nThe model may sometimes perform poorly (the output sounds odd or low quality) after switching to different hardware. In this case, please adjust the random seed to find one that works well on your hardware. \n\n```shell\naudioldm2 --seed 1234 -t \"Musical constellations twinkling in the night sky, forming a cosmic melody.\"\n```\n\n## Pretrained Models\n\nYou can choose the model checkpoint by setting \"model_name\":\n\n```shell\n# CUDA\naudioldm2 --model_name \"audioldm2-full\" --device cuda -t \"Musical constellations twinkling in the night sky, forming a cosmic melody.\"\n\n# MPS\naudioldm2 --model_name \"audioldm2-full\" --device mps -t \"Musical constellations twinkling in the night sky, forming a cosmic melody.\"\n```\n\nWe have seven checkpoints you can choose from:\n\n1. **audioldm2-full** (default): Generates both sound effects and music with the AudioLDM2 architecture.\n2. **audioldm_48k**: This checkpoint can generate high-fidelity sound effects and music.\n3. 
**audioldm_16k_crossattn_t5**: The improved version of [AudioLDM 1.0](https://github.com/haoheliu/AudioLDM).\n4. **audioldm2-full-large-1150k**: Larger version of audioldm2-full. \n5. **audioldm2-music-665k**: Music generation. \n6. **audioldm2-speech-gigaspeech** (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.\n7. **audioldm2-speech-ljspeech**: Text-to-Speech, trained on LJSpeech Dataset.\n\nWe currently support 3 devices:\n\n- cpu\n- cuda\n- mps ( Notice that the computation requires about 20GB of RAM. )\n\n## Other options\n\n```shell\n  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]\n                 [--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]\n                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]\n                 [--seed SEED] [--mode {generation, sr_inpainting}] [-f FILE_PATH]\n\n  optional arguments:\n    -h, --help            show this help message and exit\n    --mode {generation,sr_inpainting}\n                        generation: text-to-audio generation; sr_inpainting: super resolution inpainting \n    -t TEXT, --text TEXT  Text prompt to the model for audio generation\n    -f FILE_PATH, --file_path FILE_PATH\n                        (--mode sr_inpainting): Original audio file for inpainting; Or \n                        (--mode generation): the guidance audio file for generating similar audio, DEFAULT None\n    --transcription TRANSCRIPTION\n                        Transcription used for speech synthesis\n    -tl TEXT_LIST, --text_list TEXT_LIST\n                          A file that contains text prompt to the model for audio generation\n    -s SAVE_PATH, --save_path SAVE_PATH\n                          The path to save model output\n    --model_name 
{audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}\n                          The checkpoint you gonna use\n    -d DEVICE, --device DEVICE\n                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]\n    -b BATCHSIZE, --batchsize BATCHSIZE\n                          Generate how many samples at the same time\n    --ddim_steps DDIM_STEPS\n                          The sampling step for DDIM\n    -dur DURATION, --duration DURATION\n                          The duration of the samples\n    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE\n                          Guidance scale (Large =\u003e better quality and relevancy to text; Small =\u003e better diversity)\n    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT\n                          Automatic quality control. This number control the number of candidates (e.g., generate three audios and choose the best to show you). A Larger value usually lead to better quality with\n                          heavier computation\n    --seed SEED           Change this value (any integer number) will lead to a different generation result.\n```\n\n## Hugging Face 🧨 Diffusers\n\nAudioLDM 2 is available in the Hugging Face [🧨 Diffusers](https://github.com/huggingface/diffusers) library from v0.21.0 \nonwards. 
The official checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/cvssp/audioldm2#checkpoint-details), \nalongside [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2) and \n[example scripts](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb).\n\nThe Diffusers version of the code runs upwards of **3x faster** than the native AudioLDM 2 implementation, and supports \ngenerating audio of arbitrary length.\n\nTo install 🧨 Diffusers and 🤗 Transformers, run:\n\n```bash\npip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate\n```\n\nYou can then load pre-trained weights into the [AudioLDM2 pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2),\nand generate text-conditional audio outputs by providing a text prompt:\n\n```python\nfrom diffusers import AudioLDM2Pipeline\nimport torch\nimport scipy.io.wavfile\n\nrepo_id = \"cvssp/audioldm2\"\npipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)\npipe = pipe.to(\"cuda\")\n\nprompt = \"Techno music with a strong, upbeat tempo and high melodic riffs.\"\naudio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]\n\nscipy.io.wavfile.write(\"techno.wav\", rate=16000, data=audio)\n```\n\nTips for obtaining high-quality generations can be found under the AudioLDM 2 [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#tips),\nincluding the use of prompt engineering and negative prompting.\n\nTips for optimising inference speed can be found in the blog post [AudioLDM 2, but faster ⚡️](https://huggingface.co/blog/audioldm2).\n\n## Cite this work\n\nIf you find this tool useful, please consider citing:\n\n```bibtex\n@article{audioldm2-2024taslp,\n  author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark 
D.},\n  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, \n  title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining}, \n  year={2024},\n  volume={32},\n  pages={2871-2883},\n  doi={10.1109/TASLP.2024.3399607}\n}\n```\n\n```bibtex\n@article{liu2023audioldm,\n  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},\n  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},\n  journal={Proceedings of the International Conference on Machine Learning},\n  year={2023},\n  pages={21450-21474}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaoheliu%2FAudioLDM2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhaoheliu%2FAudioLDM2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaoheliu%2FAudioLDM2/lists"}