https://github.com/jaketae/conformer

PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition
https://github.com/jaketae/conformer

conformer convolution speech-recognition transformer

Last synced: 11 months ago
JSON representation

PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition

Host: GitHub
URL: https://github.com/jaketae/conformer
Owner: jaketae
License: mit
Created: 2021-04-24T09:16:26.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-04-25T04:21:10.000Z (about 5 years ago)
Last Synced: 2025-05-13T02:04:16.179Z (about 1 year ago)
Topics: conformer, convolution, speech-recognition, transformer
Language: Python
Homepage:
Size: 9.77 KB
Stars: 15
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Conformer

PyTorch implementation of [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100) (Gulati et al., 2020).

## Quickstart

Clone this repository.

```
git clone https://github.com/jaketae/conformer.git
```

Navigate to the cloned directory. You can start using the model via

```python
>>> from conformer import ConformerEncoder
>>> model = ConformerEncoder()
```

By default, the model comes with the following parameters:

```python
ConformerEncoder(
num_blocks=6,
d_model=256,
num_heads=4,
max_len=512,
expansion_factor=4,
kernel_size=31,
dropout=0.1,
)
```

## Introduction

The [Transformer](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017) has proven to be immensely successful in various domains, such as machine translation, [language modeling](https://arxiv.org/abs/1810.04805), and more recently, [computer vision](https://arxiv.org/abs/2010.11929). An important reason behind the success of the transformer architecture is self-attention, which allows the model to attend to the entire input sequence to generate rich feature representations.

A more traditional model architecture, convolution neural networks have widely been used in the vision domain. The sliding kernel structure encodes meaningful inductive biases such as translational invariance, making them suitable as local feature extractors.

The Conformer seeks to combine the best of both worlds: global features are extracted by the transformer, whereas local features are learned by the convolution module. The Conformer model has proven to be effective in [automatic speech recognition](https://paperswithcode.com/task/speech-recognition).

## Note

While the original paper used Conformer specifically in the context of ASR, I implemented this model in the hopes of applying it as a general feature encoder in audio generation tasks, such as speech prosody transfer, voice conversation, and singing voice synthesis. For this reason, the current implementation only includes the encoder portion of the Conformer architecture and also lacks components such as downsampling and SpecAugment.

## Credit

This implementation was heavily influenced by [Soohwan Kim's implementation of Conformer](https://github.com/sooftware/conformer). The skewing logic employed in relative positional encoding was inspired by [Prayag Chatha's implementation](https://github.com/chathasphere/pno-ai) of [Music Transformer](https://arxiv.org/abs/1809.04281) (Huang et al., 2018).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jaketae/conformer

Awesome Lists containing this project

README