Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/daemon/pytorch-pcen

PyTorch reimplementation of per-channel energy normalization for audio.
https://github.com/daemon/pytorch-pcen

audio pytorch speech

Last synced: 2 months ago
JSON representation

PyTorch reimplementation of per-channel energy normalization for audio.

Host: GitHub
URL: https://github.com/daemon/pytorch-pcen
Owner: daemon
License: mit
Created: 2018-10-15T21:09:36.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-03-29T23:37:21.000Z (almost 6 years ago)
Last Synced: 2023-08-07T04:07:08.328Z (over 1 year ago)
Topics: audio, pytorch, speech
Language: Python
Homepage:
Size: 10.7 KB
Stars: 83
Watchers: 3
Forks: 15
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # PyTorch-PCEN

Efficient PyTorch reimplementation of [per-channel energy normalization](https://arxiv.org/pdf/1607.05666.pdf) with Mel 

spectrogram features.

## Overview

Robustness to loudness differences in near- and far-field conditions is critical in high-quality speech recognition applications. 

Obviously, spectrogram energies differ significantly between, say, shouting at arms-length and whispering from a distance. 

This can worsen model quality, since the model itself would need to be robust across a wide range of input. The 

log-compression step in the popular log-Mel transform partially addresses this issue by reducing the dynamic range of audio; 

however, it ignores per-channel energy differences and is static by definition.

[Per-channel energy normalization](https://arxiv.org/pdf/1607.05666.pdf) is one such solution to the aforementioned problems. 

It provides a per-channel, trainable front-end in place of the log compression, greatly improving model robustness in keyword spotting systems -- all the while being resource-efficient and easy to implement.

## Installation and Usage

1. PyTorch and NumPy are required. LibROSA and matplotlib are required only for the example.

2. To install via pip, run `pip install git+https://github.com/daemon/pytorch-pcen`. Otherwise, clone this repository and run `python setup.py install`.

3. To run the example in the module, place a 16kHz WAV file named `yes.wav` in the current directory. Then, do `python -m pcen.pcen`.

The following is a self-contained example for using a streaming PCEN layer:

```python

import pcen

import torch

# 40-dimensional features, 30-millisecond window, 10-millisecond shift; trainable is false by default

transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)

audio = torch.empty(1, 16000).normal_(0, 0.1) # Gaussian noise

# 1600 is an arbitrary chunk size; This step is unnecessary but demonstrates the streaming nature

streaming_chunks = audio.split(1600, 1)

pcen_chunks = [transform(chunk) for chunk in streaming_chunks] # Transform each chunk

transform.reset() # Reset the persistent streaming state

pcen_ = torch.cat(pcen_chunks, 1)

```

## Citation

Wang, Yuxuan, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. [Trainable frontend for robust and far-field keyword spotting](https://arxiv.org/pdf/1607.05666.pdf). In _Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on_, pp. 5670-5674. IEEE, 2017.

```tex

@inproceedings{wang2017trainable,

  title={Trainable frontend for robust and far-field keyword spotting},

  author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},

  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},

  pages={5670--5674},

  year={2017},

  organization={IEEE}

}

```