https://github.com/bshall/hubert
HuBERT content encoders for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
- Host: GitHub
- URL: https://github.com/bshall/hubert
- Owner: bshall
- License: mit
- Created: 2021-10-15T17:13:11.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-01T10:08:22.000Z (4 months ago)
- Last Synced: 2025-01-05T22:03:48.706Z (12 days ago)
- Topics: pytorch, representation-learning, speech, voice-conversion
- Language: Python
- Homepage: https://bshall.github.io/soft-vc/
- Size: 457 KB
- Stars: 338
- Watchers: 4
- Forks: 53
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# HuBERT
[![arXiv](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2111.02392)
[![demo](https://img.shields.io/static/v1?message=Audio%20Samples&logo=Github&labelColor=grey&color=blue&logoColor=white&label=%20&style=flat)](https://bshall.github.io/soft-vc/)
[![colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb)

Training and inference scripts for the HuBERT content encoders in [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484).
For more details, see [soft-vc](https://github.com/bshall/soft-vc). Audio samples can be found [here](https://bshall.github.io/soft-vc/). A Colab demo can be found [here](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb).
Fig 1: Architecture of the voice conversion system. a) The discrete content encoder clusters audio features to produce a sequence of discrete speech units. b) The soft content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.
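
The two ideas in the caption can be illustrated with a toy sketch. This is not the repository's implementation: the function names, the 2-D features, and the two centroids are made up for illustration. It assumes discrete units are obtained by assigning each feature frame to its nearest cluster centroid, and that "trained to predict the discrete units" means minimizing a cross-entropy loss against those unit indices.

```python
import math


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    dists = [
        sum((f - c) ** 2 for f, c in zip(frame, centroid))
        for centroid in centroids
    ]
    return min(range(len(centroids)), key=dists.__getitem__)


def discretize(features, centroids):
    """Map a sequence of feature frames to a sequence of discrete unit indices."""
    return [nearest_centroid(frame, centroids) for frame in features]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Hypothetical toy data: three 2-D frames and two cluster centroids.
centroids = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.3]]

units = discretize(features, centroids)
print(units)  # [0, 1, 0]

# A soft encoder would output one logit per unit for every frame and be
# trained to make the discrete unit of that frame the most likely class.
loss = cross_entropy([0.0, 0.0], units[0])  # uniform prediction -> log(2)
```

In this picture, the discrete encoder keeps only the index of the winning cluster, while the soft encoder keeps a full distribution over units for each frame.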
## Example Usage
### Programmatic Usage
```python
import torch, torchaudio

# Load checkpoint (either hubert_soft or hubert_discrete)
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True).cuda()

# Load audio (must be sampled at 16 kHz)
wav, sr = torchaudio.load("path/to/wav")
assert sr == 16000
# Add a batch dimension: (channels, samples) -> (1, channels, samples)
wav = wav.unsqueeze(0).cuda()

# Extract speech units
units = hubert.units(wav)
```

### Script-Based Usage
```
usage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir

Encode an audio dataset.

positional arguments:
  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .flac).
```

## Training
### Step 1: Dataset Preparation
Download and extract the [LibriSpeech](https://www.openslr.org/12) corpus. The training script expects the following tree structure for the dataset directory:
```
│   lengths.json
│
└───wavs
    ├───dev-*
    │   ├───84
    │   ├───...
    │   └───8842
    └───train-*
        ├───19
        ├───...
        └───8975
```

The `train-*` and `dev-*` directories should contain the training and validation splits respectively. Note that there can be multiple `train` and `dev` folders e.g., `train-clean-100`, `train-other-500`, etc. Finally, the `lengths.json` file should contain key-value pairs with the file path and number of samples:
```json
{
    "dev-clean/1272/128104/1272-128104-0000": 93680,
    "dev-clean/1272/128104/1272-128104-0001": 77040
}
```

### Step 2: Extract Discrete Speech Units
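
If the `lengths.json` file from Step 1 has not been created yet, the sketch below shows one way to build it. It is not a script from this repository: `count_samples`, `build_lengths`, and `write_lengths` are hypothetical helpers. The key format (path relative to `wavs`, without the file extension) is inferred from the example above, and reading lengths with the third-party `soundfile` package is an assumption.

```python
import json
from pathlib import Path


def count_samples(path):
    """Return the number of samples per channel in an audio file.

    Assumption: the third-party `soundfile` package is installed.
    """
    import soundfile as sf  # imported lazily so the rest works without it

    return sf.info(str(path)).frames


def build_lengths(wavs_dir, extension=".flac", count_fn=count_samples):
    """Map each audio file (path relative to `wavs_dir`, no extension) to its length."""
    wavs_dir = Path(wavs_dir)
    lengths = {}
    for path in sorted(wavs_dir.rglob(f"*{extension}")):
        key = path.relative_to(wavs_dir).with_suffix("").as_posix()
        lengths[key] = count_fn(path)
    return lengths


def write_lengths(dataset_dir, extension=".flac", count_fn=count_samples):
    """Write `lengths.json` next to the `wavs` directory and return the mapping."""
    dataset_dir = Path(dataset_dir)
    lengths = build_lengths(dataset_dir / "wavs", extension, count_fn)
    with open(dataset_dir / "lengths.json", "w") as f:
        json.dump(lengths, f, indent=4)
    return lengths
```

Calling `write_lengths("path/to/LibriSpeech")` would then write `path/to/LibriSpeech/lengths.json`. Passing a different `count_fn` swaps in another audio backend.
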
Encode LibriSpeech using the HuBERT-Discrete model and `encode.py` script:
```
usage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir

Encode an audio dataset.

positional arguments:
  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .flac).
```

For example:
```
python encode.py discrete path/to/LibriSpeech/wavs path/to/LibriSpeech/discrete
```

At this point the directory tree should look like:
```
│   lengths.json
│
├───discrete
│   ├───...
└───wavs
    ├───...
```

### Step 3: Train the HuBERT-Soft Content Encoder
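
Before launching training, it can help to confirm that every entry in `lengths.json` has a matching file of discrete units. The check below is a minimal sketch, not part of the repository: `find_missing_units` is a hypothetical helper. It assumes the units sit under `discrete/`, mirror the `wavs/` layout, and are saved with a `.npy` suffix. Adjust `units_suffix` if the files are stored differently.

```python
import json
from pathlib import Path


def find_missing_units(dataset_dir, units_subdir="discrete", units_suffix=".npy"):
    """Return the keys of lengths.json that have no matching unit file.

    Assumption: each key "a/b/c" corresponds to "<units_subdir>/a/b/c<units_suffix>".
    """
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "lengths.json") as f:
        lengths = json.load(f)

    units_dir = dataset_dir / units_subdir
    return [
        key
        for key in lengths
        if not (units_dir / (key + units_suffix)).is_file()
    ]
```

An empty list suggests the dataset directory is ready for `train.py`.
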
```
usage: train.py [-h] [--resume RESUME] [--warmstart] [--mask] [--alpha ALPHA] dataset-dir checkpoint-dir

Train HuBERT soft content encoder.

positional arguments:
  dataset-dir      path to the data directory.
  checkpoint-dir   path to the checkpoint directory.

optional arguments:
  -h, --help       show this help message and exit
  --resume RESUME  path to the checkpoint to resume from.
  --warmstart      whether to initialize from the fairseq HuBERT checkpoint.
  --mask           whether to use input masking.
  --alpha ALPHA    weight for the masked loss.
```

## Links
- [Soft-VC repo](https://github.com/bshall/soft-vc)
- [Soft-VC paper](https://ieeexplore.ieee.org/abstract/document/9746484)
- [Official HuBERT repo](https://github.com/pytorch/fairseq)
- [HuBERT paper](https://arxiv.org/abs/2106.07447)

## Citation
If you found this work helpful please consider citing our paper:
```
@inproceedings{
    soft-vc-2022,
    author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
    booktitle={ICASSP},
    title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},
    year={2022}
}
```