# SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation".
- Paper: [arXiv](https://arxiv.org/abs/2405.18503)
- Demo page: [Audio Samples](https://koichi-saito-sony.github.io/soundctm/)
- Checkpoints: [Hugging Face (currently, only checkpoints are available)](https://huggingface.co/Sony/soundctm)

Contact:
- Koichi SAITO: [email protected]

## Checkpoints
- Download the [teacher model's checkpoints](https://huggingface.co/Sony/soundctm/tree/main/ckpt/teacher) and the [AudioLDM-s-full checkpoint for the VAE + vocoder part](https://huggingface.co/Sony/soundctm/blob/main/ckpt/audioldm-s-full.ckpt) and put them in `soundctm/ckpt` (a possible layout is sketched below).
- [SoundCTM checkpoint](https://huggingface.co/Sony/soundctm/tree/main/soundctm_ckpt) trained on AudioCaps (ema=0.999, 30K training iterations)

For inference, both the [AudioLDM-s-full (VAE decoder + vocoder)](https://huggingface.co/Sony/soundctm/blob/main/ckpt/audioldm-s-full.ckpt) and [SoundCTM](https://huggingface.co/Sony/soundctm/tree/main/soundctm_ckpt) checkpoints are used.
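For orientation, the resulting layout could look like this (a sketch; where exactly the SoundCTM checkpoint goes is an assumption, so follow the Hugging Face pages above):

```
soundctm/
└── ckpt/
    ├── teacher/                # teacher model checkpoints
    ├── audioldm-s-full.ckpt    # AudioLDM-s-full (VAE + vocoder)
    └── soundctm_ckpt/          # SoundCTM checkpoint (assumed location)
```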
## Prerequisites
Install Docker on your server and build the Docker container:
```bash
docker build -t soundctm .
```
Then run the scripts inside the container.
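For example, the container can be started like this (a minimal sketch; `--gpus all` assumes the NVIDIA Container Toolkit is installed, and the mount point is arbitrary):

```bash
# Sketch: interactive container with GPU access and the repo mounted
docker run -it --rm --gpus all -v "$(pwd)":/workspace -w /workspace soundctm bash
```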
## Training
Please see `ctm_train.sh` and `ctm_train.py` and modify the folder paths depending on your environment, then run `bash ctm_train.sh`.
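Purely as a hypothetical illustration (these variable names are assumptions, not necessarily what `ctm_train.sh` actually uses), the edit amounts to pointing a few paths at your local copies:

```bash
# Hypothetical excerpt -- check ctm_train.sh for the real variable names.
train_file="data/train.csv"            # AudioCaps captions and audio paths
teacher_ckpt="ckpt/teacher"            # teacher model checkpoints
vae_ckpt="ckpt/audioldm-s-full.ckpt"   # AudioLDM VAE + vocoder
output_dir="output/soundctm"           # where checkpoints and logs are written
```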
## Inference
Please see `ctm_inference.sh` and `ctm_inference.py` and modify the folder paths depending on your environment, then run `bash ctm_inference.sh`.
## Numerical evaluation
Please see `numerical_evaluation.sh` and `numerical_evaluation.py` and modify the folder paths depending on your environment, then run `bash numerical_evaluation.sh`.
## Dataset
Follow the instructions given in the [AudioCaps repository](https://github.com/cdjkim/audiocaps) for downloading the data.
Data locations need to be specified in `ctm_train.sh`.
You can also see some examples at `data/train.csv`.
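The exact column layout should be taken from `data/train.csv` itself; purely as a hypothetical illustration, a caption-plus-audio-path row could look like:

```
caption,location
"A dog barks while birds chirp in the background",/path/to/audiocaps/train/12345.wav
```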
## WandB for logging
The training code also requires a [Weights & Biases](https://wandb.ai/site) account to log the training outputs and demos. Create an account and log in with:
```bash
$ wandb login
```
Alternatively, you can pass an API key via the environment variable `WANDB_API_KEY`.
(You can obtain an API key from https://wandb.ai/authorize after logging in to your account.)
```bash
$ export WANDB_API_KEY="12345x6789y..."
```
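You can also scope the key to a single command (a sketch; the key value is a placeholder):

```bash
$ WANDB_API_KEY="12345x6789y..." bash ctm_train.sh
```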
## Citation
```
@article{saito2024soundctm,
title={SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation},
author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
journal={arXiv preprint arXiv:2405.18503},
year={2024}
}
```

## Reference
Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.
> https://github.com/sony/ctm
> https://github.com/declare-lab/tango
> https://github.com/haoheliu/AudioLDM
> https://github.com/haoheliu/audioldm_eval