https://github.com/ai16z/ljspeechtools
Tools for making LJSpeech datasets
- Host: GitHub
- URL: https://github.com/ai16z/ljspeechtools
- Owner: ai16z
- License: mit
- Created: 2023-01-07T05:47:42.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-07T17:09:53.000Z (11 months ago)
- Last Synced: 2024-12-18T01:14:01.403Z (23 days ago)
- Language: Python
- Size: 850 KB
- Stars: 21
- Watchers: 4
- Forks: 8
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# LJ Speech Tools
Tools for creating voices for Webaverse.
## What do I need to make a dataset?
30-60 minutes of clean audio (no background sounds, noise, or other speakers). It can be one file or many. The rest will be done for you.
Mono 22 kHz WAV is ideal, but the system will try to convert other formats.
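One way to check whether a file already matches the target format is Python's standard `wave` module. The sketch below is not part of this repository: the helper names `is_target_format` and `ffmpeg_convert_command` are hypothetical, and it assumes that "22 kHz" means 22050 Hz, the rate named later in this README. It only handles PCM WAV input and only builds the ffmpeg command, it does not run it.

```python
import wave

TARGET_RATE = 22050  # Hz; assumed meaning of "22 kHz", per the Complete Pipeline section
TARGET_CHANNELS = 1  # mono


def is_target_format(path):
    """Return True if the WAV file at `path` is mono and sampled at 22050 Hz."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == TARGET_CHANNELS
                and wav.getframerate() == TARGET_RATE)


def ffmpeg_convert_command(src, dst):
    """Build (but do not run) an ffmpeg command converting `src` to mono 22050 Hz."""
    return ["ffmpeg", "-i", src,
            "-ac", str(TARGET_CHANNELS),  # number of audio channels
            "-ar", str(TARGET_RATE),      # audio sample rate
            dst]
```

To actually convert a file, the returned list could be passed to `subprocess.run(cmd, check=True)`.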
# Installation
You will need Python 3 installed on your system.
To install:
```bash
sudo apt install ffmpeg
pip install -r requirements.txt
```
# Pipeline (Easy Mode)
1. Put your files in the `put_audio_files_here` folder
2. Run the Python script:
```sh
python pipeline.py
```
The resulting `dataset.zip` archive is what you need to train a voice.
# Training
A Colab notebook is provided [here](https://colab.research.google.com/drive/1F1BMC18XIJNFvkncdZHZvxd5oEAVsHxY).
Upload your dataset to your Google Drive, then press "run all cells" to run the notebook.
You will need a Google account, but Colab is free. However, upgrading to Colab Pro will make training a lot easier, as it gives you priority access to faster hardware.
# Individual Tools
`pipeline.py` is made from a collection of other scripts, which you can run individually. Speaker separation is not included in the pipeline: it is assumed that you have cleaned, separated tracks, but you can run it as a preprocessing step if you need it.
## Speaker Separation
In our tests, separation usually only worked when two or three speakers with distinctly different voices were speaking.
To remove any wav files that contain audio from someone other than the source speaker:
1. Place some example audio from your speaker in the `target` folder.
2. Place any example audio from speakers who are *not* your speaker in the `ignore` folder.
3. Run `separator.py` with a `--threshold`, probably somewhere between 0.6 and 0.9:
```bash
bash separate.sh
# or
python separator.py --threshold=0.65
```
## Audio Transcription
To transcribe the audio files in the `wavs` folder:
```bash
bash transcribe.sh
# or
python transcriber.py
```
Transcription will create an LJSpeech-compatible `metadata.csv`.
Transcription replaces swearing with `****`. If you want some swearing back, you can run `python swearing.py`. If you don't want swearing in your dataset, you should remove that data entirely, as the asterisks will negatively affect alignment.
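If you decide to drop the censored lines, a small filter over `metadata.csv` can separate them out. This is a sketch rather than one of the repo's scripts: the function name `split_censored` is hypothetical, and it assumes the standard pipe-delimited LJSpeech layout, where each line starts with a file ID followed by `|` and the transcript.

```python
def split_censored(lines):
    """Separate metadata lines into (kept, dropped) by whether the text has '*'."""
    kept, dropped = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Everything after the first '|' is transcript text (assumed layout).
        _, _, text = line.partition("|")
        (dropped if "*" in text else kept).append(line)
    return kept, dropped
```

The file IDs at the start of the dropped lines tell you which wav files to delete alongside them.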
## Get length of audio dataset
```bash
python count_length.py
```
This gives you the total length, as well as the longest and shortest file lengths, of the audio in the `wavs` folder.
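For reference, the same three numbers can be computed with the standard library alone. The following is a minimal sketch, not the contents of `count_length.py`: the function names are hypothetical, and it assumes the folder holds uncompressed PCM WAV files, which is all the `wave` module can read.

```python
import glob
import os
import wave


def wav_duration(path):
    """Length of one WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_lengths(folder="wavs"):
    """Return (total, longest, shortest) durations in seconds for *.wav in folder."""
    durations = [wav_duration(p) for p in glob.glob(os.path.join(folder, "*.wav"))]
    if not durations:
        return 0.0, 0.0, 0.0  # empty folder
    return sum(durations), max(durations), min(durations)
```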
## Split long audio into shorter audio samples
```bash
python audiosplitter.py
```
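To illustrate the idea of cutting at silent points while capping chunk length at 12 seconds, here is a sketch of the planning step only. It does not reproduce `audiosplitter.py`: the function `plan_chunks` is hypothetical, silence detection is assumed to have already produced a list of candidate cut times, and the greedy "latest silence within the limit" rule is one possible strategy among several.

```python
def plan_chunks(silences, duration, max_len=12.0):
    """Greedily pick cut times so that no chunk is longer than max_len seconds.

    silences: candidate cut times in seconds (e.g. midpoints of quiet gaps)
    duration: total length of the recording in seconds
    Returns a list of (start, end) tuples covering the whole recording.
    """
    chunks = []
    start = 0.0
    points = sorted(t for t in silences if 0.0 < t < duration)
    while duration - start > max_len:
        # Latest silent point that keeps this chunk within max_len.
        candidates = [t for t in points if start < t <= start + max_len]
        end = candidates[-1] if candidates else start + max_len  # hard cut if no silence
        chunks.append((start, end))
        start = end
    chunks.append((start, duration))
    return chunks
```

Because this rule always takes the latest usable silence, it favors chunks close to the maximum. A real splitter aiming for a variety of lengths would likely pick cut points differently.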
Most training scripts prefer a variety of audio sample lengths, from 2 to 12 seconds. The splitter will try to find silent points and break the audio into chunks of up to 12 seconds. You can modify the script to taste: just change the 12 to a 10 or whatever you want.
## Prepare dataset
```bash
python make_dataset.py
```
This will create a `train_filelist.txt`, `val_filelist.txt` and `dataset.zip`, which can be uploaded to the SortAnon TalkNet training Colab notebook here: https://github.com/bycloudai/TalkNET-colab
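The exact line format of the two filelists and the split ratio used by `make_dataset.py` are not described here. As a rough sketch of how such a train/validation split can be produced, assuming each line pairs a wav path with its transcript and assuming a 90/10 split (both are assumptions, and `split_filelist` is a hypothetical name):

```python
import random


def split_filelist(lines, val_fraction=0.1, seed=0):
    """Shuffle lines reproducibly and return (train, val) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split each run
    # Keep at least one validation line whenever there is any data.
    n_val = max(1, int(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]
```

Each returned list can then be written out one line per entry.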
You may need to change the file names if you use other notebooks.
## Complete Pipeline
If you are trying to process a voice in a noisy track, you should really bring the audio files into an audio editor, mute any noise or speech from other speakers, and normalize and compress the remaining audio so it all sounds as similar as possible.
Assuming you want to record your own voice and don't need to deal with speaker separation or isolation, here are the steps you would take.
1. Put all of your files in the `wavs` folder.
If they are not wavs, convert them using ffmpeg. The default we are targeting is mono WAV at 22050 Hz.
2. If your files are longer than 12 seconds and not hand-split, run the audiosplitter:
```bash
python audiosplitter.py
```
Verify that the data is good, then delete the contents of the `wavs` folder and move the data_outputs there.
3. Transcribe your dataset:
```bash
bash transcribe.sh
# or
python transcriber.py
```
This will create a `metadata.csv`, which is the standard format of LJSpeech. For most training needs, the `metadata.csv` and `wavs` folder are all you need as input.
### Good Luck!
And thanks to all the hard-working Ponies who took the time to document this. The compendium of knowledge created by the Pony Preservation Project was instrumental in giving these tools shape and form.