# High-Fidelity Neural Phonetic Posteriorgrams
[PyPI](https://pypi.python.org/pypi/ppgs)
[License](https://opensource.org/licenses/MIT)
[Downloads](https://pepy.tech/project/ppgs)

Training, evaluation, and inference of neural phonetic posteriorgrams (PPGs) in PyTorch
[[Paper]](https://www.maxrmorrison.com/pdfs/churchwell2024high.pdf) [[Website]](https://www.maxrmorrison.com/sites/ppgs/)
## Table of contents
- [Installation](#installation)
- [Inference](#inference)
* [Application programming interface (API)](#application-programming-interface-api)
* [`ppgs.from_audio`](#ppgsfrom_audio)
* [`ppgs.from_file`](#ppgsfrom_file)
* [`ppgs.from_file_to_file`](#ppgsfrom_file_to_file)
* [`ppgs.from_files_to_files`](#ppgsfrom_files_to_files)
* [Command-line interface (CLI)](#command-line-interface-cli)
- [Distance](#distance)
- [Interpolate](#interpolate)
- [Edit](#edit)
* [`ppgs.edit.grid.constant`](#ppgseditgridconstant)
* [`ppgs.edit.grid.from_alignments`](#ppgseditgridfrom_alignments)
* [`ppgs.edit.grid.of_length`](#ppgseditgridof_length)
* [`ppgs.edit.grid.sample`](#ppgseditgridsample)
* [`ppgs.edit.reallocate`](#ppgseditreallocate)
* [`ppgs.edit.regex`](#ppgseditregex)
* [`ppgs.edit.shift`](#ppgseditshift)
* [`ppgs.edit.swap`](#ppgseditswap)
- [Sparsify](#sparsify)
- [Training](#training)
* [Download](#download)
* [Preprocess](#preprocess)
* [Partition](#partition)
* [Train](#train)
* [Monitor](#monitor)
* [Evaluate](#evaluate)
- [Citation](#citation)

## Installation
An inference-only installation with our best model is pip-installable
`pip install ppgs`
To perform training, install the training dependencies and FFmpeg.
```bash
pip install ppgs[train]
conda install -c conda-forge ffmpeg
```

If you wish to use the Charsiu representation, download the code,
install both inference and training dependencies, and install
Charsiu as a Git submodule.

```bash
# Clone
git clone git@github.com:interactiveaudiolab/ppgs
cd ppgs/

# Install dependencies
pip install -e .[train]
conda install -c conda-forge ffmpeg

# Download Charsiu
git submodule init
git submodule update
```

## Inference
```python
import ppgs

# Load speech audio at correct sample rate
audio = ppgs.load.audio(audio_file)

# Choose a GPU index to use for inference. Set to None to use CPU.
gpu = 0

# Infer PPGs (named ppg to avoid shadowing the ppgs module)
ppg = ppgs.from_audio(audio, ppgs.SAMPLE_RATE, gpu=gpu)
```

### Application programming interface (API)
#### `ppgs.from_audio`
```python
def from_audio(
    audio: torch.Tensor,
    sample_rate: Union[int, float],
    representation: str = ppgs.REPRESENTATION,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> torch.Tensor:
    """Infer PPGs from audio

    Arguments
        audio
            Batched audio to process
            shape=(batch, 1, samples)
        sample_rate
            Audio sampling rate
        representation
            The representation to use; 'mel' and 'w2v2fb' are currently supported
        checkpoint
            The checkpoint file
        gpu
            The index of the GPU to use for inference

    Returns
        ppgs
            Phonetic posteriorgrams
            shape=(batch, len(ppgs.PHONEMES), frames)
    """
```
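The returned posteriorgram can, for example, be decoded to a frame-level phoneme sequence. A minimal sketch, assuming a local file `speech.wav` (the argmax decoding is illustrative, not part of the API):

```python
import ppgs

# Load and batch audio, then infer PPGs on GPU 0
audio = ppgs.load.audio('speech.wav')
ppg = ppgs.from_audio(audio, ppgs.SAMPLE_RATE, gpu=0)

# Decode the most likely phoneme per frame (illustrative)
indices = ppg[0].argmax(dim=0).tolist()
phonemes = [ppgs.PHONEMES[i] for i in indices]
```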
#### `ppgs.from_file`

```python
def from_file(
    file: Union[str, bytes, os.PathLike],
    representation: str = ppgs.REPRESENTATION,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> torch.Tensor:
    """Infer ppgs from an audio file

    Arguments
        file
            The audio file
        representation
            The representation to use, 'mel' and 'w2v2fb' are currently supported
        checkpoint
            The checkpoint file
        gpu
            The index of the GPU to use for inference

    Returns
        ppgs
            Phonetic posteriorgram
            shape=(len(ppgs.PHONEMES), frames)
    """
```
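A minimal usage sketch, assuming a local file `speech.wav`; note that, unlike `ppgs.from_audio`, the result is unbatched:

```python
import ppgs

# Infer an unbatched PPG directly from a file path
ppg = ppgs.from_file('speech.wav', gpu=0)  # shape=(len(ppgs.PHONEMES), frames)
```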
#### `ppgs.from_file_to_file`

```python
def from_file_to_file(
    audio_file: Union[str, bytes, os.PathLike],
    output_file: Union[str, bytes, os.PathLike],
    representation: str = ppgs.REPRESENTATION,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> None:
    """Infer ppgs from an audio file and save to a torch tensor file

    Arguments
        audio_file
            The audio file
        output_file
            The .pt file to save PPGs
        representation
            The representation to use, 'mel' and 'w2v2fb' are currently supported
        checkpoint
            The checkpoint file
        gpu
            The index of the GPU to use for inference
    """
```
#### `ppgs.from_files_to_files`

```python
def from_files_to_files(
    audio_files: List[Union[str, bytes, os.PathLike]],
    output_files: List[Union[str, bytes, os.PathLike]],
    representation: str = ppgs.REPRESENTATION,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    num_workers: int = 0,
    gpu: Optional[int] = None,
    max_frames: int = ppgs.MAX_INFERENCE_FRAMES
) -> None:
    """Infer ppgs from audio files and save to torch tensor files

    Arguments
        audio_files
            The audio files
        output_files
            The .pt files to save PPGs
        representation
            The representation to use, 'mel' and 'w2v2fb' are currently supported
        checkpoint
            The checkpoint file
        num_workers
            Number of CPU threads for multiprocessing
        gpu
            The index of the GPU to use for inference
        max_frames
            The maximum number of frames on the GPU at once
    """
```
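A minimal batch-processing sketch, assuming a directory `audio/` of WAV files (the glob pattern and output naming are illustrative):

```python
from pathlib import Path

import ppgs

# Pair each audio file with a corresponding .pt output path
audio_files = sorted(Path('audio').glob('*.wav'))
output_files = [file.with_suffix('.pt') for file in audio_files]

# Infer and save PPGs for all files using 4 workers and GPU 0
ppgs.from_files_to_files(audio_files, output_files, num_workers=4, gpu=0)
```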
### Command-line interface (CLI)

```
usage: python -m ppgs
    [-h]
    [--audio_files AUDIO_FILES [AUDIO_FILES ...]]
    [--output_files OUTPUT_FILES [OUTPUT_FILES ...]]
    [--representation REPRESENTATION]
    [--checkpoint CHECKPOINT]
    [--num-workers NUM_WORKERS]
    [--gpu GPU]
    [--max-frames MAX_FRAMES]

arguments:
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
        Paths to input audio files
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
        The one-to-one corresponding output files

optional arguments:
    -h, --help
        Show this help message and exit
    --representation REPRESENTATION
        Representation to use for inference
    --checkpoint CHECKPOINT
        The checkpoint file
    --num-workers NUM_WORKERS
        Number of CPU threads for multiprocessing
    --gpu GPU
        The index of the GPU to use for inference. Defaults to CPU.
    --max-frames MAX_FRAMES
        Maximum number of frames in a batch
```
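For example, the following invocation (file names are illustrative) writes one `.pt` file per input:

```bash
python -m ppgs \
    --audio_files speech1.wav speech2.wav \
    --output_files speech1.pt speech2.pt \
    --gpu 0
```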
## Distance

To compute the proposed normalized Jensen-Shannon divergence pronunciation
distance between two PPGs, use `ppgs.distance()`.

```python
def distance(
    ppgX: torch.Tensor,
    ppgY: torch.Tensor,
    reduction: str = 'mean',
    normalize: bool = True,
    exponent: float = ppgs.SIMILARITY_EXPONENT
) -> torch.Tensor:
    """Compute the pronunciation distance between two aligned PPGs

    Arguments
        ppgX
            Input PPG X
            shape=(len(ppgs.PHONEMES), frames)
        ppgY
            Input PPG Y to compare with PPG X
            shape=(len(ppgs.PHONEMES), frames)
        reduction
            Reduction to apply to the output. One of ['mean', 'none', 'sum'].
        normalize
            Apply similarity-based normalization
        exponent
            Similarity exponent

    Returns
        Normalized Jensen-Shannon divergence between PPGs
    """
```
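The underlying quantity is the Jensen-Shannon divergence, a symmetrized Kullback-Leibler divergence between corresponding frame distributions. A minimal sketch, assuming two time-aligned recordings `speech1.wav` and `speech2.wav` with equal frame counts (file names are illustrative):

```python
import ppgs

# Infer PPGs for two time-aligned utterances
ppgX = ppgs.from_file('speech1.wav', gpu=0)
ppgY = ppgs.from_file('speech2.wav', gpu=0)

# Average per-frame pronunciation distance
pronunciation_distance = ppgs.distance(ppgX, ppgY, reduction='mean')
```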
## Interpolate

```python
def interpolate(
    ppgX: torch.Tensor,
    ppgY: torch.Tensor,
    interp: Union[float, torch.Tensor]
) -> torch.Tensor:
    """Linear interpolation

    Arguments
        ppgX
            Input PPG X
            shape=(len(ppgs.PHONEMES), frames)
        ppgY
            Input PPG Y
            shape=(len(ppgs.PHONEMES), frames)
        interp
            Interpolation values
            scalar float OR shape=(frames,)

    Returns
        Interpolated PPGs
        shape=(len(ppgs.PHONEMES), frames)
    """
```
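For example, a minimal sketch that blends two equal-length PPGs halfway (file names are illustrative):

```python
import ppgs

# Infer two equal-length PPGs
ppgX = ppgs.from_file('speech1.wav', gpu=0)
ppgY = ppgs.from_file('speech2.wav', gpu=0)

# Mix halfway between the two pronunciations
blended = ppgs.interpolate(ppgX, ppgY, 0.5)
```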
## Edit

```python
import ppgs

# Get PPGs to edit
ppg = ppgs.from_file(audio_file, gpu=gpu)

# Constant-ratio time-stretching (slowing down)
grid = ppgs.edit.grid.constant(ppg, ratio=0.8)
slow = ppgs.edit.grid.sample(ppg, grid)

# Stretch to a desired length (e.g., 100 frames)
grid = ppgs.edit.grid.of_length(ppg, 100)
fixed = ppgs.edit.grid.sample(ppg, grid)
```

### `ppgs.edit.grid.constant`
```python
def constant(ppg: torch.Tensor, ratio: float) -> torch.Tensor:
    """Create a grid for constant-ratio time-stretching

    Arguments
        ppg
            Input PPG
        ratio
            Time-stretching ratio; lower is slower

    Returns
        Constant-ratio grid for time-stretching ppg
    """
```

### `ppgs.edit.grid.from_alignments`
```python
def from_alignments(
    source: pypar.Alignment,
    target: pypar.Alignment,
    sample_rate: int = ppgs.SAMPLE_RATE,
    hopsize: int = ppgs.HOPSIZE
) -> torch.Tensor:
    """Create time-stretch grid to convert source alignment to target

    Arguments
        source
            Forced alignment of PPG to stretch
        target
            Forced alignment of target PPG
        sample_rate
            Audio sampling rate
        hopsize
            Hopsize in samples

    Returns
        Grid for time-stretching source PPG
    """
```

### `ppgs.edit.grid.of_length`
```python
def of_length(ppg: torch.Tensor, length: int) -> torch.Tensor:
    """Create time-stretch grid to resample PPG to a specified length

    Arguments
        ppg
            Input PPG
        length
            Target length

    Returns
        Grid of specified length for time-stretching ppg
    """
```

### `ppgs.edit.grid.sample`
```python
def sample(ppg: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Grid-based PPG interpolation

    Arguments
        ppg
            Input PPG
        grid
            Grid of desired length; each item is a float-valued index into ppg

    Returns
        Interpolated PPG
    """
```

### `ppgs.edit.reallocate`
```python
def reallocate(
    ppg: torch.Tensor,
    source: str,
    target: str,
    value: Optional[float] = None
) -> torch.Tensor:
    """Reallocate probability from source phoneme to target phoneme

    Arguments
        ppg
            Input PPG
            shape=(len(ppgs.PHONEMES), frames)
        source
            Source phoneme
        target
            Target phoneme
        value
            Max amount to reallocate. If None, reallocates all probability.

    Returns
        Edited PPG
    """
```

### `ppgs.edit.regex`
```python
def regex(
    ppg: torch.Tensor,
    source_phonemes: List[str],
    target_phonemes: List[str]
) -> torch.Tensor:
    """Regex match and replace (via swap) for phoneme sequences

    Arguments
        ppg
            Input PPG
            shape=(len(ppgs.PHONEMES), frames)
        source_phonemes
            Source phoneme sequence
        target_phonemes
            Target phoneme sequence

    Returns
        Edited PPG
    """
```

### `ppgs.edit.shift`
```python
def shift(ppg: torch.Tensor, phoneme: str, value: float) -> torch.Tensor:
    """Shift probability of a phoneme and reallocate proportionally

    Arguments
        ppg
            Input PPG
            shape=(len(ppgs.PHONEMES), frames)
        phoneme
            Input phoneme
        value
            Maximal shift amount

    Returns
        Edited PPG
    """
```

### `ppgs.edit.swap`
```python
def swap(ppg: torch.Tensor, phonemeA: str, phonemeB: str) -> torch.Tensor:
    """Swap the probabilities of two phonemes

    Arguments
        ppg
            Input PPG
            shape=(len(ppgs.PHONEMES), frames)
        phonemeA
            Input phoneme A
        phonemeB
            Input phoneme B

    Returns
        Edited PPG
    """
```
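A minimal sketch of the probability-editing functions; the specific phoneme labels are illustrative and assume they appear in the inventory `ppgs.PHONEMES`:

```python
import ppgs

# Infer a PPG to edit
ppg = ppgs.from_file('speech.wav', gpu=0)

# Swap the probabilities of two phonemes (labels assumed to be in ppgs.PHONEMES)
swapped = ppgs.edit.swap(ppg, 'l', 'r')

# Move all probability mass from one phoneme to another
reallocated = ppgs.edit.reallocate(ppg, 'l', 'r')
```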
## Sparsify

```python
def sparsify(
    ppg: torch.Tensor,
    method: str = 'percentile',
    threshold: torch.Tensor = torch.Tensor([0.85])
) -> torch.Tensor:
    """Make phonetic posteriorgrams sparse

    Arguments
        ppg
            Input PPG
            shape=(batch, len(ppgs.PHONEMES), frames)
        method
            Sparsification method. One of ['constant', 'percentile', 'topk'].
        threshold
            In [0, 1] for 'constant' and 'percentile'; integer > 0 for 'topk'.

    Returns
        Sparse phonetic posteriorgram
        shape=(batch, len(ppgs.PHONEMES), frames)
    """
```
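For example, a minimal sketch using the default percentile method; note the added batch dimension, since `sparsify` expects batched input:

```python
import ppgs

# Infer a PPG and add a batch dimension, since sparsify expects batched input
ppg = ppgs.from_file('speech.wav', gpu=0)[None]

# Zero out low-probability phonemes with the default percentile threshold
sparse = ppgs.sparsify(ppg)
```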
## Training

### Download
Downloads, unzips, and formats datasets. Stores datasets in `data/datasets/`.
Stores formatted datasets in `data/cache/`.

**N.B.** Common Voice and TIMIT cannot be automatically downloaded. You must
manually download the tarballs and place them in `data/sources/commonvoice`
or `data/sources/timit`, respectively, prior to running the following.

```bash
python -m ppgs.data.download --datasets <datasets>
```
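For example (the dataset identifier is an assumption; use whichever names your checkout accepts):

```bash
# Download and format TIMIT, assuming its tarball is already in data/sources/timit
python -m ppgs.data.download --datasets timit
```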
### Preprocess

Prepares representations for training. Representations are stored
in `data/cache/`.

```
python -m ppgs.preprocess \
    --datasets <datasets> \
    --representations <representations> \
    --gpu <gpu> \
    --num-workers <workers>
```
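For example, using the `w2v2fb` representation documented above (the dataset identifier is an assumption):

```bash
# Precompute w2v2fb input representations for TIMIT on GPU 0 with 4 workers
python -m ppgs.preprocess \
    --datasets timit \
    --representations w2v2fb \
    --gpu 0 \
    --num-workers 4
```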
### Partition

Partitions a dataset. You should not need to run this, as the partitions
used in our work are provided for each dataset in
`ppgs/assets/partitions/`.

```
python -m ppgs.partition --datasets <datasets>
```

### Train
Trains a model. Checkpoints and logs are stored in `runs/`.
```
python -m ppgs.train --config <config> --dataset <dataset> --gpu <gpu>
```

If the config file has been previously run, the most recent checkpoint will
automatically be loaded and training will resume from that checkpoint.

### Monitor
You can monitor training via `tensorboard`.
```
tensorboard --logdir runs/ --port <port> --load_fast true
```

To use the `torchutil` notification system to receive notifications for long
jobs (download, preprocess, train, and evaluate), set the
`PYTORCH_NOTIFICATION_URL` environment variable to a supported webhook as
explained in [the Apprise documentation](https://pypi.org/project/apprise/).

### Evaluate
Performs objective evaluation of phoneme accuracy. Results are stored
in `eval/`.

```
python -m ppgs.evaluate \
    --config <config> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --gpu <gpu>
```

## Citation
### IEEE
C. Churchwell, M. Morrison, and B. Pardo, "High-Fidelity Neural Phonetic Posteriorgrams,"
ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio, April 2024.

### BibTeX
```
@inproceedings{churchwell2024high,
title={High-Fidelity Neural Phonetic Posteriorgrams},
author={Churchwell, Cameron and Morrison, Max and Pardo, Bryan},
booktitle={ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio},
month={April},
year={2024}
}
```