Whisper to Normal Speech Conversion with SC-MelGAN and SC-VQ-VAE

Host: GitHub
URL: https://github.com/dwgnr/speech-conversion
Owner: dwgnr
Created: 2021-10-05T12:05:17.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-12-03T19:45:33.000Z (almost 3 years ago)
Last Synced: 2025-07-04T12:07:49.592Z (3 months ago)
Topics: speech-synthesis, voice-conversion
Language: Jupyter Notebook
Homepage: https://th-nuernberg.github.io/speech-conversion-demo/
Size: 3.15 MB
Stars: 14
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# Whisper to Normal Speech Conversion with SC-MelGAN and SC-VQ-VAE

This repository contains the source code for the paper *Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech*.
The goal was to adapt [MelGAN](https://arxiv.org/abs/1910.06711) and [VQ-VAE](https://arxiv.org/abs/1711.00937) systems to convert whispered speech into normal speech.

The MelGAN code used as the basis for this project can be found [here](https://github.com/descriptinc/melgan-neurips).

The VQ-VAE model is based on DeepMind's VQ-VAE implementation (see [here](https://github.com/deepmind/sonnet/blob/v2/sonnet/src/nets/vqvae.py)),
Andrej Karpathy's [implementation](https://github.com/karpathy/deep-vector-quantization), and [this repo](https://github.com/swasun/VQ-VAE-Speech).

The WaveGlow system is a slightly adapted version of the code provided by [NVIDIA](https://github.com/NVIDIA/waveglow).

Please visit our [demo website](https://th-nuernberg.github.io/speech-conversion-demo/) for samples.

## Structure of this Repository

The repo is structured as follows:
```
.
└── speech-conversion
    ├── melgan -> Sources for training MelGAN models
    │   └── mel2wav
    ├── vqvae -> Sources for training VQ-VAE models
    └── waveglow -> Sources for training WaveGlow models
        └── tacotron2
```

## Dataset

The code is designed to be used with the [wTIMIT](http://isle.illinois.edu/sst/data/wTIMIT/index.html) corpus.
The corpus can be downloaded [here](http://ifp-08.ifp.uiuc.edu/protected/wTIMIT) (**Note:** Requires authentication).
The wTIMIT dataset is sampled at 44 kHz and needs to be resampled to 16 kHz.
The 16 kHz setting is hardcoded in several places in this project,
so using a different sample rate without modifying the source code will likely lead to errors.
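
Because the sample rate is hardcoded, it can help to verify the resampled files before training. The following is a minimal sketch and not part of this repository: it uses Python's standard `wave` module and assumes the resampled files are plain PCM WAV files. The function name `find_wrong_sample_rate` and the `wavs/` location are illustrative.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # the sample rate this project expects


def find_wrong_sample_rate(paths, expected=EXPECTED_RATE):
    """Return (path, rate) pairs for files whose sample rate differs from `expected`."""
    mismatches = []
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected:
            mismatches.append((str(path), rate))
    return mismatches


if __name__ == "__main__":
    # Assumption: samples live in wavs/ and use an upper-case .WAV extension.
    for path, rate in find_wrong_sample_rate(sorted(Path("wavs").glob("*.WAV"))):
        print(f"{path}: {rate} Hz (expected {EXPECTED_RATE} Hz)")
```

Any file it reports still needs to be resampled before it is used for training.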

### Preparing the Dataset
Create a directory containing all samples, for example in a `wavs/` subfolder.
You'll need to provide filelists for your training and test data.
A simple way to create these filelists is:
```command
ls wavs/*n.WAV | tail -n+11 > train_files.txt   # start at file 11 so the 10 held-out utterances stay out of training
ls wavs/*w.WAV | head -n10 > test_files.txt
ls wavs/*n.WAV | head -n10 > normal_test_files.txt   # normal test data for WaveGlow
```
Note that we only grab the *whispered utterances* (the ones with "w" at the end) for the test set.
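
The same split can be sketched in Python. This is an illustration rather than code from this repository. It assumes the naming convention implied above (normal utterances end in `n.WAV`, whispered ones in `w.WAV`), and `split_filelists` and `write_filelists` are hypothetical helpers. The first `n_test` normal utterances are held out of the training list.

```python
from pathlib import Path


def split_filelists(paths, n_test=10):
    """Split file paths into train, whispered-test and normal-test lists."""
    names = sorted(str(p) for p in paths)
    normal = [n for n in names if n.endswith("n.WAV")]
    whispered = [n for n in names if n.endswith("w.WAV")]
    return {
        "train_files.txt": normal[n_test:],        # normal utterances after the held-out ones
        "test_files.txt": whispered[:n_test],      # whispered test utterances
        "normal_test_files.txt": normal[:n_test],  # normal test data for WaveGlow
    }


def write_filelists(wav_dir="wavs", out_dir=".", n_test=10):
    """Write one path per line to the three filelists and return their sizes."""
    lists = split_filelists(Path(wav_dir).glob("*.WAV"), n_test)
    for filename, entries in lists.items():
        Path(out_dir, filename).write_text("".join(f"{e}\n" for e in entries))
    return {filename: len(entries) for filename, entries in lists.items()}
```

Python's `sorted` orders by code point, which can differ from the locale-dependent order of `ls`, so the two approaches may not select exactly the same files.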

## Training
See the following scripts for examples of how to train MelGAN, VQ-VAE, and WaveGlow models:
- `train_melgan.sh`
  - Add your own paths to the variables `SAVE_PATH`, `DATA_PATH`, and `LOAD_PATH`
- `train_vqvae.sh`
  - Add your own paths to the variables `SAVE_PATH`, `DATA_PATH`, `LOAD_PATH`, and `WG_PATH`
- `train_waveglow.sh`
  - Create your own config file for the WaveGlow model or use an existing one and point to it via the `--config` flag
  - Note that the original WaveGlow model is incompatible with the Mel spectrogram features generated for MelGAN and VQ-VAE training
  - Hence, a pretrained WaveGlow model **will not yield good results** when it is fed spectrograms generated by VQ-VAE
  - A training script that provides a compatible model can be found in `speech-conversion/waveglow/train_melgan_comapt.py` and is also referenced in `train_waveglow.sh`.

**Note:** The Python scripts need to run with the `-m` [command line flag](https://docs.python.org/3.8/using/cmdline.html#cmdoption-m) and without the `.py` extension (e.g. `python -m app.sub1.mod1`) due to the relative imports used across the sub-packages.
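
To make the note concrete, the sketch below turns a script path into the dotted module name that `python -m` expects. It is an illustration only: `module_name` is a hypothetical helper, and it assumes the command is run from inside the `speech-conversion/` directory, so paths are given relative to it.

```python
from pathlib import PurePosixPath


def module_name(script_path):
    """Convert a relative script path such as 'pkg/sub/mod.py' to 'pkg.sub.mod'."""
    path = PurePosixPath(script_path)
    if path.is_absolute() or path.suffix != ".py":
        raise ValueError(f"expected a relative .py path, got {script_path!r}")
    return ".".join(path.with_suffix("").parts)


if __name__ == "__main__":
    # Script path as given in the Training section, relative to speech-conversion/ (assumed working directory).
    print("python -m " + module_name("waveglow/train_melgan_comapt.py"))
```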

## Inference

Inference can be done with the following scripts:
- `inference_melgan.sh`
- `inference_vqvae.sh`
