# AudioLDM 2
[![arXiv](https://img.shields.io/badge/arXiv-2308.05734-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2308.05734) [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://audioldm.github.io/audioldm2/) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
This repo currently supports Text-to-Audio (including Music), Text-to-Speech Generation, and Super-Resolution Inpainting.
## Change Log
- 2023-08-27: Add two new checkpoints!
  - **48kHz AudioLDM model**: Now we support high-fidelity audio generation! [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/AudioLDM_48K_Text-to-HiFiAudio_Generation)
  - **16kHz improved AudioLDM model**: Trained with more data and optimized model architecture.

## TODO
- [x] Add the text-to-speech checkpoint
- [x] Open-source the [AudioLDM training code](https://github.com/haoheliu/AudioLDM-training-finetuning).
- [x] Support the generation of longer audio (> 10s)
- [x] Optimize the inference speed of the model.
- [x] Integration with the Diffusers library (see [🧨 Diffusers](#hugging-face--diffusers))
- [ ] Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PRs welcome; same logic as [AudioLDMv1](https://github.com/haoheliu/AudioLDM))

## Web APP
1. Prepare running environment
```shell
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
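# Note: pip installs the audioldm2 package itself; the cloned repo is assumed
# to provide app.py, which powers the web app started in the next step.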
```

2. Start the web application (powered by Gradio)
```shell
python3 app.py
```

3. A link will be printed out. Open it in your browser to play with the demo.
## Commandline Usage
### Installation
Prepare running environment
```shell
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
```

If you plan to play around with text-to-speech generation, please also make sure you have installed [espeak](https://espeak.sourceforge.net/download.html). On Linux you can install it with:
```shell
sudo apt-get install espeak
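# On macOS, espeak is assumed to be installable via Homebrew:
# brew install espeak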
```

### Run the model in commandline
- Generate a sound effect or music based on a text prompt
```shell
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```

- Generate sound effects or music from a list of text prompts
```shell
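# batch.lst is a plain text file; the assumed format is one prompt per line, e.g.:
#   A dog barking in the distance
#   Rain falling on a tin roof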
audioldm2 -tl batch.lst
```

- Generate speech based on (1) the transcription and (2) a description of the speaker
```shell
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"
```

Text-to-Speech uses the *audioldm2-speech-gigaspeech* checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set *--model_name audioldm2-speech-ljspeech*, as sketched below.
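A minimal sketch of an LJSpeech TTS call (the prompt and transcription are illustrative, reusing the flags shown above):

```shell
audioldm2 --model_name audioldm2-speech-ljspeech \
  -t "A female reporter is speaking" \
  --transcription "Wish you have a good day"
```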
## Random Seed Matters
The model may sometimes perform poorly (weird-sounding or low-quality output) when you switch to different hardware. In that case, adjust the random seed to find one that works well on your hardware.
```shell
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
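A minimal bash sketch for auditioning a few seeds in one go (the seed values are arbitrary):

```shell
for seed in 1234 42 7; do
  audioldm2 --seed "$seed" -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
done
```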
## Pretrained Models

You can choose a model checkpoint by setting `model_name`:
```shell
# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
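# CPU (assumed invocation; slower, but needs no GPU)
audioldm2 --model_name "audioldm2-full" --device cpu -t "Musical constellations twinkling in the night sky, forming a cosmic melody."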
```

We have seven checkpoints you can choose from:
1. **audioldm2-full** (default): Generates both sound effects and music with the AudioLDM2 architecture.
2. **audioldm_48k**: This checkpoint can generate high-fidelity sound effects and music.
3. **audioldm_16k_crossattn_t5**: The improved version of [AudioLDM 1.0](https://github.com/haoheliu/AudioLDM).
4. **audioldm2-full-large-1150k**: Larger version of audioldm2-full.
5. **audioldm2-music-665k**: Music generation.
6. **audioldm2-speech-gigaspeech** (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.
7. **audioldm2-speech-ljspeech**: Text-to-Speech, trained on the LJSpeech dataset.

We currently support 3 devices:
- cpu
- cuda
- mps (note that the computation requires about 20GB of RAM)

## Other options
```shell
usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
[--model_name {audioldm_48k,audioldm_16k_crossattn_t5,audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
[-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
[--seed SEED] [--mode {generation,sr_inpainting}] [-f FILE_PATH]

optional arguments:
-h, --help show this help message and exit
--mode {generation,sr_inpainting}
generation: text-to-audio generation; sr_inpainting: super resolution inpainting
-t TEXT, --text TEXT Text prompt to the model for audio generation
-f FILE_PATH, --file_path FILE_PATH
(--mode sr_inpainting): Original audio file for inpainting; Or
(--mode generation): the guidance audio file for generating similar audio, DEFAULT None
--transcription TRANSCRIPTION
Transcription used for speech synthesis
-tl TEXT_LIST, --text_list TEXT_LIST
A file that contains text prompts to the model for audio generation
-s SAVE_PATH, --save_path SAVE_PATH
The path to save model output
--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
The checkpoint you are going to use
-d DEVICE, --device DEVICE
The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
-b BATCHSIZE, --batchsize BATCHSIZE
Generate how many samples at the same time
  --ddim_steps DDIM_STEPS
                        The sampling step for DDIM
  -dur DURATION, --duration DURATION
                        The duration of the samples
-gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
Guidance scale (Large => better quality and relevance to text; Small => better diversity)
-n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
heavier computation
--seed SEED           Changing this value (any integer) will lead to a different generation result.
```
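Putting several options together, a minimal sketch (the values are illustrative, combining the flags documented above):

```shell
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody." \
  --duration 20 \
  --ddim_steps 200 \
  -gs 3.5 \
  -n 3 \
  -s ./output
```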
## Hugging Face 🧨 Diffusers

AudioLDM 2 is available in the Hugging Face [🧨 Diffusers](https://github.com/huggingface/diffusers) library from v0.21.0
onwards. The official checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/cvssp/audioldm2#checkpoint-details),
alongside [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2) and
[example scripts](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb).

The Diffusers version of the code runs upwards of **3x faster** than the native AudioLDM 2 implementation, and supports
generating audio of arbitrary length.

To install 🧨 Diffusers and 🤗 Transformers, run:
```bash
pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate
```

You can then load pre-trained weights into the [AudioLDM2 pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2) and generate text-conditional audio outputs by providing a text prompt:

```python
from diffusers import AudioLDM2Pipeline
import torch
import scipy

# Load the pre-trained pipeline in half precision and move it to the GPU
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate 10 seconds of audio with 200 DDIM inference steps
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# Save the generated waveform as a 16kHz WAV file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Tips for obtaining high-quality generations can be found in the AudioLDM 2 [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#tips), including the use of prompt engineering and negative prompting.

Tips for optimising inference speed can be found in the blog post [AudioLDM 2, but faster ⚡️](https://huggingface.co/blog/audioldm2).
## Cite this work
If you find this tool useful, please consider citing:
```bibtex
@article{audioldm2-2024taslp,
author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining},
year={2024},
volume={32},
pages={2871-2883},
doi={10.1109/TASLP.2024.3399607}
}
```

```bibtex
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
  year={2023},
  pages={21450-21474}
}
```