Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Hiroshiba/realtime-yukarin
An application for real-time voice conversion
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/Hiroshiba/realtime-yukarin
- Owner: Hiroshiba
- License: mit
- Created: 2018-03-10T17:18:46.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-22T11:30:22.000Z (almost 5 years ago)
- Last Synced: 2024-06-18T08:34:04.074Z (5 months ago)
- Language: Python
- Homepage:
- Size: 1010 KB
- Stars: 331
- Watchers: 17
- Forks: 51
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-voice-conversion
README
# Realtime Yukarin: an application for real-time voice conversion
Realtime Yukarin is an application for real-time voice conversion with a single command.
It requires trained deep learning models and a computer with a GPU.
The source code is open source under the MIT license,
so you can modify it or use it in your own applications, whether commercial or non-commercial.

[Japanese README](./README_jp.md)
## Supported environment
* Windows
* GeForce GTX 1060
* 6GB GPU memory
* Intel Core i7-7700 CPU @ 3.60GHz
* Python 3.6

## Preparation
### Install required libraries
```bash
pip install -r requirements.txt
```

### Prepare trained models
You need two trained models: a first-stage model responsible for voice conversion,
and a second-stage model for enhancing the quality of the converted results.
You can create the first-stage model with [Yukarin](https://github.com/Hiroshiba/yukarin)
and the second-stage model with [Become Yukarin](https://github.com/Hiroshiba/become-yukarin).
Also, for voice pitch conversion, you need frequency statistics files,
which can be created with [Yukarin](https://github.com/Hiroshiba/yukarin).

Here, each filename is assumed to be as follows:
| Content | Filename |
| ---- | ---- |
| Frequency statistics for input voice | `./sample/input_statistics.npy` |
| Frequency statistics for target voice | `./sample/target_statistics.npy` |
| First stage model from [Yukarin](https://github.com/Hiroshiba/yukarin) | `./sample/model_stage1/predictor.npz` |
| First stage's config file | `./sample/model_stage1/config.json` |
| Second stage model from [Become Yukarin](https://github.com/Hiroshiba/become-yukarin) | `./sample/model_stage2/predictor.npz` |
| Second stage's config file | `./sample/model_stage2/config.json` |
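Before running the verification step below, it may help to confirm that all of these files are in place. A minimal sketch (not part of the project; the paths simply mirror the table above):

```python
from pathlib import Path

# Paths from the table above; adjust if you placed the files elsewhere.
required_files = [
    "./sample/input_statistics.npy",
    "./sample/target_statistics.npy",
    "./sample/model_stage1/predictor.npz",
    "./sample/model_stage1/config.json",
    "./sample/model_stage2/predictor.npz",
    "./sample/model_stage2/config.json",
]

missing = [p for p in required_files if not Path(p).exists()]
if missing:
    print("Missing files:", *missing, sep="\n  ")
else:
    print("All required files are in place.")
```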
## Verification

You can verify the prepared files by executing `./check.py`.
The following example converts 5 seconds of voice data from `input.wav` and saves the result to `output.wav`.

```bash
python check.py \
--input_path 'input.wav' \
--input_time_length 5 \
--output_path 'output.wav' \
--input_statistics_path './sample/input_statistics.npy' \
--target_statistics_path './sample/target_statistics.npy' \
--stage1_model_path './sample/model_stage1/predictor.npz' \
--stage1_config_path './sample/model_stage1/config.json' \
--stage2_model_path './sample/model_stage2/predictor.npz' \
--stage2_config_path './sample/model_stage2/config.json'
```
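After the command finishes, you can sanity-check the result without listening to it. A small sketch using Python's standard `wave` module (not part of the project) that confirms `output.wav` is roughly the requested 5 seconds:

```python
import wave

# Read output.wav and report its duration; it should be close to the
# --input_time_length passed to check.py (5 seconds in the example above).
with wave.open("output.wav", "rb") as f:
    duration = f.getnframes() / f.getframerate()
print(f"output.wav duration: {duration:.2f} s")
```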
If you have problems, you can ask questions
on [GitHub Issues](https://github.com/Hiroshiba/realtime-yukarin/issues).

## Run
To perform real-time voice conversion, create a config file `config.yaml` and run `./run.py`.

```bash
python run.py ./config.yaml
```

### Description of config file
```yaml
# Name of the input sound device. Partial match. Details are below.
input_device_name: str

# Name of the output sound device. Partial match. Details are below.
output_device_name: str

# Input sampling rate
input_rate: int

# Output sampling rate
output_rate: int

# frame_period for the acoustic feature
frame_period: int

# Length of voice to convert at one time (seconds).
# If it is too long, delay increases; if it is too short, processing cannot keep up.
buffer_time: float

# Method to calculate the fundamental frequency: world or crepe.
# CREPE needs additional libraries; see requirements.txt for details.
extract_f0_mode: world

# Length of voice to be synthesized at one time (number of samples)
vocoder_buffer_size: int

# Amplitude scaling for input.
# Values greater than 1 increase the amplitude; values less than 1 decrease it.
input_scale: float

# Amplitude scaling for output.
# Values greater than 1 increase the amplitude; values less than 1 decrease it.
output_scale: float

# Silence threshold for input (dB).
# The smaller the value, the more readily input is treated as silence.
input_silent_threshold: float

# Silence threshold for output (dB).
# The smaller the value, the more readily output is treated as silence.
output_silent_threshold: float

# Overlap for encoding (seconds)
encode_extra_time: float

# Overlap for converting (seconds)
convert_extra_time: float

# Overlap for decoding (seconds)
decode_extra_time: float

# Paths of the frequency statistics files
input_statistics_path: str
target_statistics_path: str

# Paths of the trained model files
stage1_model_path: str
stage1_config_path: str
stage2_model_path: str
stage2_config_path: str
```
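For concreteness, here is a hypothetical filled-in `config.yaml`. Every value below (device names, rates, buffer sizes, thresholds) is an illustrative assumption and must be adapted to your hardware and trained models; only the file paths mirror the table above:

```yaml
# Illustrative values only; adapt them to your own devices and models.
input_device_name: 'Microphone'   # partial match against the device name
output_device_name: 'Speaker'     # partial match against the device name
input_rate: 24000
output_rate: 24000
frame_period: 5
buffer_time: 0.5
extract_f0_mode: world
vocoder_buffer_size: 1024
input_scale: 1.0
output_scale: 1.0
input_silent_threshold: -70.0
output_silent_threshold: -70.0
encode_extra_time: 0.1
convert_extra_time: 0.1
decode_extra_time: 0.1
input_statistics_path: './sample/input_statistics.npy'
target_statistics_path: './sample/target_statistics.npy'
stage1_model_path: './sample/model_stage1/predictor.npz'
stage1_config_path: './sample/model_stage1/config.json'
stage2_model_path: './sample/model_stage2/predictor.npz'
stage2_config_path: './sample/model_stage2/config.json'
```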
#### (preliminary knowledge) Name of sound device

In the example below, `Logitech Speaker` is the name of the sound device.
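To find the device names on your machine, you can list them with an audio library. A minimal sketch using the `sounddevice` package (an assumption; the project may use a different audio backend):

```python
import sounddevice

# Print every audio device known to the host APIs, with its index and name.
# Any unique substring of a name (e.g. 'Logitech Speaker') can be used for
# input_device_name / output_device_name, since matching is partial.
for index, device in enumerate(sounddevice.query_devices()):
    kind = []
    if device["max_input_channels"] > 0:
        kind.append("input")
    if device["max_output_channels"] > 0:
        kind.append("output")
    print(f"{index}: {device['name']} ({'/'.join(kind)})")
```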
## License

[MIT License](./LICENSE)