https://github.com/daemon/pytorch-pcen
PyTorch reimplementation of per-channel energy normalization for audio.
- Host: GitHub
- URL: https://github.com/daemon/pytorch-pcen
- Owner: daemon
- License: mit
- Created: 2018-10-15T21:09:36.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-03-29T23:37:21.000Z (almost 6 years ago)
- Last Synced: 2023-08-07T04:07:08.328Z (over 1 year ago)
- Topics: audio, pytorch, speech
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 83
- Watchers: 3
- Forks: 15
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# PyTorch-PCEN
Efficient PyTorch reimplementation of [per-channel energy normalization](https://arxiv.org/pdf/1607.05666.pdf) with Mel
spectrogram features.

## Overview
Robustness to loudness differences between near- and far-field conditions is critical in high-quality speech recognition applications.
Spectrogram energies differ significantly between, say, shouting at arm's length and whispering from a distance.
This can worsen model quality, since the model itself would need to be robust across a wide range of input energies. The
log-compression step in the popular log-Mel transform partially addresses this issue by reducing the dynamic range of the audio;
however, it ignores per-channel energy differences and is static by definition.

[Per-channel energy normalization](https://arxiv.org/pdf/1607.05666.pdf) is one solution to these problems.
It provides a per-channel, trainable front-end in place of the log compression, greatly improving model robustness in keyword spotting systems while remaining resource-efficient and easy to implement.

## Installation and Usage
1. PyTorch and NumPy are required. LibROSA and matplotlib are required only for the example.
2. To install via pip, run `pip install git+https://github.com/daemon/pytorch-pcen`. Otherwise, clone this repository and run `python setup.py install`.
3. To run the example in the module, place a 16 kHz WAV file named `yes.wav` in the current directory. Then run `python -m pcen.pcen`.

The following is a self-contained example of using a streaming PCEN layer:
```python
import pcen
import torch

# 40-dimensional features, 30-millisecond window, 10-millisecond shift; trainable is false by default
transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)
audio = torch.empty(1, 16000).normal_(0, 0.1) # Gaussian noise

# 1600 is an arbitrary chunk size; this step is unnecessary but demonstrates the streaming nature
streaming_chunks = audio.split(1600, 1)
pcen_chunks = [transform(chunk) for chunk in streaming_chunks] # Transform each chunk
transform.reset() # Reset the persistent streaming state
pcen_ = torch.cat(pcen_chunks, 1)
```

## Citation
Wang, Yuxuan, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. [Trainable frontend for robust and far-field keyword spotting](https://arxiv.org/pdf/1607.05666.pdf). In _Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on_, pp. 5670-5674. IEEE, 2017.
```tex
@inproceedings{wang2017trainable,
title={Trainable frontend for robust and far-field keyword spotting},
author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},
booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},
pages={5670--5674},
year={2017},
organization={IEEE}
}
```
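
As a rough reference for what the layer computes, the PCEN operation from the paper cited above can be sketched in plain Python. This is a minimal, dependency-free sketch for readability, not the code in this repository: the function name `pcen_reference`, the default parameter values, and the choice to initialize the smoother with the first frame are all assumptions made for illustration.

```python
def pcen_reference(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a list of frames, each a list of per-channel energies.

    For each channel f and frame t this computes:
        M[t][f]    = (1 - s) * M[t-1][f] + s * E[t][f]
        PCEN[t][f] = (E[t][f] / (eps + M[t][f]) ** alpha + delta) ** r - delta ** r

    The default parameter values are illustrative assumptions.
    """
    if not energies:
        return []
    # Assumption: the smoother starts at the first frame's energies.
    smoother = list(energies[0])
    output = []
    for frame in energies:
        # First-order IIR low-pass filter over time, one state per channel.
        smoother = [(1 - s) * m + s * e for m, e in zip(smoother, frame)]
        # Divide by the smoothed energy (automatic gain control), then
        # apply root compression with an offset.
        output.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoother)
        ])
    return output


# Two channels whose steady energies differ by a factor of 100 (made-up values).
frames = [[1.0, 100.0] for _ in range(5)]
normalized = pcen_reference(frames)
```

In this toy input the two channels differ in energy by a factor of 100, yet their normalized values end up close to each other, which is the per-channel gain normalization the overview describes. In a trainable front-end, parameters such as `alpha`, `delta`, and `r` would be learned per channel rather than fixed as they are here.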