https://github.com/facebookresearch/WavAugment
A library for speech data augmentation in time-domain
- Host: GitHub
- URL: https://github.com/facebookresearch/WavAugment
- Owner: facebookresearch
- License: mit
- Archived: true
- Created: 2020-06-26T13:43:30.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-08-30T18:49:46.000Z (over 4 years ago)
- Last Synced: 2025-02-23T00:14:17.023Z (11 months ago)
- Language: Python
- Size: 1.94 MB
- Stars: 655
- Watchers: 25
- Forks: 59
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-diarization - WavAugment
README
# WavAugment
WavAugment performs data augmentation on audio data, represented as [pytorch](https://pytorch.org/) tensors.
It is particularly useful for speech data.
Among others, it implements the augmentations that we found to be most useful for self-supervised learning
(_Data Augmenting Contrastive Learning of Speech Representations in the Time Domain_, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [[arxiv]](https://arxiv.org/abs/2007.00991)):
* Pitch randomization,
* Reverberation,
* Additive noise,
* Time dropout (temporal masking),
* Band reject,
* Clipping
Internally, WavAugment uses [libsox](http://sox.sourceforge.net/libsox.html) and allows interleaving of libsox- and pytorch-based effects.
### Requirements
* Linux or MacOS
* [pytorch](https://pytorch.org/) >= 1.7
* [torchaudio](https://pytorch.org/audio/) >= 0.7
### Installation
To install WavAugment, run the following command:
```bash
git clone git@github.com:facebookresearch/WavAugment.git && cd WavAugment && python setup.py develop
```
### Testing
Requires pytest (`pip install pytest`)
```bash
python -m pytest -v --doctest-modules
```
## Usage
First of all, we provide thoroughly documented [examples](./examples/python), where we demonstrate how a data-augmented dataset interface works. We also provide a Jupyter-based [tutorial](./examples/python/WavAugment_walkthrough.ipynb) [(open in colab)](https://colab.research.google.com/github/facebookresearch/WavAugment/blob/master/examples/python/WavAugment_walkthrough.ipynb) that illustrates how one can apply various useful effects to a piece of speech (recorded over the mic or pre-recorded).
### The `EffectChain`
The central object is the chain of effects, `EffectChain`, which is applied to a `torch.Tensor` to produce another `torch.Tensor`.
A chain can compose multiple effects:
```python
import augment
effect_chain = augment.EffectChain().pitch(100).rate(16_000)
```
Effect parameters coincide with those of [libsox](http://sox.sourceforge.net/libsox.html); however, you can also randomize a parameter by providing a Python `Callable`, and mix callables with ordinary parameters:
```python
import numpy as np
random_pitch_shift = lambda: np.random.randint(-100, +100)
# the pitch will be changed by a shift somewhere between (-100, +100)
effect_chain = augment.EffectChain().pitch("-q", random_pitch_shift).rate(16_000)
```
Here, the flag `-q` makes `pitch` run faster at some expense of quality.
If some parameters are provided by a `Callable`, this `Callable` is invoked every time the `EffectChain` is applied (e.g., to generate random parameters).
### Applying the chain
To apply a chain of effects to a `torch.Tensor`, we write:
```python
output_tensor = augment.EffectChain().pitch(100).rate(16_000).apply(
    input_tensor, src_info=src_info, target_info=target_info)
```
WavAugment expects `input_tensor` to have a shape of (channels, length). As `input_tensor` does not carry important meta-information, such as the sampling rate, we need to provide it manually.
This is done by passing two dictionaries: `src_info` (meta-information about the input format) and `target_info` (the expected format of the output).
At a minimum, we need to set the sampling rate for the input tensor: `{'rate': 16_000}`.
### Example usage
Below is a short example of typical usage:
```python
import augment
import numpy as np
import torchaudio

# load an input file; `test_wav` is the path to a .wav file
x, sr = torchaudio.load(test_wav)

# input signal properties
src_info = {'rate': sr}
# output signal properties
target_info = {'channels': 1,
               'length': 0,  # not known beforehand
               'rate': 16_000}

# Effects are specified as a chain of method calls whose parameters can be
# strings, numbers, or callables; callables are used to generate randomized
# transformations. Call .apply() to run the chain.
random_pitch = lambda: np.random.randint(-400, -200)
y = augment.EffectChain().pitch(random_pitch).rate(16_000).apply(
    x, src_info=src_info, target_info=target_info)
```
## Important notes
It often happens that a command-line invocation of sox changes the effect chain under the hood. To get a better idea of what sox executes internally, you can launch it with the `-V` flag, e.g. by running:
```bash
sox -V tests/test.wav out.wav reverb 0 50 100
```
We will see output along these lines:
```
sox INFO sox: effects chain: input 16000Hz 1 channels
sox INFO sox: effects chain: reverb 16000Hz 2 channels
sox INFO sox: effects chain: channels 16000Hz 1 channels
sox INFO sox: effects chain: dither 16000Hz 1 channels
sox INFO sox: effects chain: output 16000Hz 1 channels
```
This output tells us that the `reverb` effect changes the number of channels, which are then squashed back into 1 channel by the `channels` effect. Sox also adds a `dither` effect to hide processing artifacts.
WavAugment remains explicit and does not add effects under the hood.
If you want to emulate a sox command that decomposes into several effects, we advise consulting the `sox -V` output and applying the effects manually.
Try it out on some files before running a heavy machine-learning job.
## Citation
If you find WavAugment useful in your research, please consider citing:
```
@article{wavaugment2020,
title={Data Augmenting Contrastive Learning of Speech Representations in the Time Domain},
author={Kharitonov, Eugene and Rivi{\`e}re, Morgane and Synnaeve, Gabriel and Wolf, Lior and Mazar{\'e}, Pierre-Emmanuel and Douze, Matthijs and Dupoux, Emmanuel},
journal={arXiv preprint arXiv:2007.00991},
year={2020}
}
```
## Contributing
See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
## License
WavAugment is MIT licensed, as found in the LICENSE file.