https://github.com/luluw8071/conformer

Implementation of Conformer Paper for end-to-end speech recognition training in Lightning AI :zap:
comet-ml common-voice-dataset conformer lightning-ai pytorch sox

Host: GitHub
URL: https://github.com/luluw8071/conformer
Owner: LuluW8071
License: mit
Created: 2024-11-09T14:36:44.000Z (5 days ago)
Default Branch: main
Last Pushed: 2024-11-11T17:25:13.000Z (3 days ago)
Last Synced: 2024-11-11T18:28:16.867Z (3 days ago)
Topics: comet-ml, common-voice-dataset, conformer, lightning-ai, pytorch, sox
Language: Jupyter Notebook
Homepage:
Size: 4.38 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# Conformer: Convolution-augmented Transformer for Speech Recognition



![Code in Progress](https://img.shields.io/badge/status-in_progress-yellow.svg) ![License](https://img.shields.io/github/license/LuluW8071/Conformer) ![Open Issues](https://img.shields.io/github/issues/LuluW8071/Conformer) ![Repo Size](https://img.shields.io/github/repo-size/LuluW8071/Conformer) ![Last Commit](https://img.shields.io/github/last-commit/LuluW8071/Conformer)



This repository contains an implementation of the paper __Conformer: Convolution-augmented Transformer for Speech Recognition__, with training scripts that support distributed training across multiple GPU nodes using __Lightning AI :zap:__.

## 📜 Paper & Blogs Review 

- [x] [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)

- [x] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/pdf/2005.08100)

- [x] [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)

- [x] [KenLM](https://kheafield.com/code/kenlm/)

- [x] [Boosting Sequence Generation Performance with Beam Search Language Model Decoding](https://towardsdatascience.com/boosting-your-sequence-generation-performance-with-beam-search-language-model-decoding-74ee64de435a)

## Conformer Model

![Conformer](assets/conformer.png)

### Model Architecture Params

| Model            | Conformer (Small) | Conformer (Medium) | Conformer (Large) |
|------------------|-------------------|--------------------|-------------------|
| Encoder Layers   | 16                | 16                 | 17                |
| Encoder Dim      | 144               | 256                | 512               |
| Attention Heads  | 4                 | 4                  | 8                 |
| Conv Kernel Size | 32                | 32                 | 32                |
| Decoder Layers   | 1                 | 1                  | 1                 |
| Decoder Dim      | 320               | 640                | 640               |
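
The table above can be captured as a small configuration object. The following is a minimal sketch using only the standard library; `ConformerConfig`, `CONFORMER_CONFIGS`, and `head_dim` are hypothetical names for illustration and are not taken from this repository's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConformerConfig:
    """Hyperparameters from the table above (hypothetical container)."""
    encoder_layers: int
    encoder_dim: int
    attention_heads: int
    conv_kernel_size: int
    decoder_layers: int
    decoder_dim: int

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the encoder dimension.
        if self.encoder_dim % self.attention_heads != 0:
            raise ValueError("encoder_dim must be divisible by attention_heads")
        return self.encoder_dim // self.attention_heads


# Values copied from the table; the dictionary keys are illustrative.
CONFORMER_CONFIGS = {
    "small": ConformerConfig(16, 144, 4, 32, 1, 320),
    "medium": ConformerConfig(16, 256, 4, 32, 1, 640),
    "large": ConformerConfig(17, 512, 8, 32, 1, 640),
}

for name, cfg in CONFORMER_CONFIGS.items():
    print(f"{name}: {cfg.encoder_layers} encoder layers, head_dim={cfg.head_dim}")
```

Keeping the three variants in one place makes it easy to check that each encoder dimension divides evenly across its attention heads before building a model.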

## Installation

1. Clone the repository:

   ```bash
   git clone --recursive https://github.com/LuluW8071/Conformer.git
   cd Conformer
   ```

2. Install Required Dependencies

> [!IMPORTANT]
> Before installing the dependencies from `requirements.txt`, make sure the following are installed. \
> __For inference only, **CUDA ToolKit** and **PyTorch CUDA** are not needed, but make sure to install **PyTorch CPU**.__
>
> - [**CUDA ToolKit v11.8/12.1**](https://developer.nvidia.com/cuda-toolkit-archive)
> - [**PyTorch**](https://pytorch.org/)
> - [**SOX**](https://sourceforge.net/projects/sox/)
>     - **For Linux:**
>         ```bash
>         sudo apt update
>         sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
>         ```
>
> - [**PyAudio**](https://people.csail.mit.edu/hubert/pyaudio/)
>     - **For Linux:**
>       ```bash
>       sudo apt-get install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0
>       # libav-tools is no longer packaged on recent Ubuntu/Debian releases; ffmpeg covers it
>       sudo apt-get install ffmpeg
>       pip install pyaudio
>       ```

```bash
pip install -r requirements.txt
```

## Usage

### Audio Conversion

> [!NOTE]
> Pass `--not-convert` to skip audio conversion.

```bash
# --output_format accepts either wav or flac
python3 common_voice.py --file_path file_path/to/validated.tsv --save_json_path file_path/to/save/json -w 4 --percent 10 --output_format wav
```

### Training

> [!IMPORTANT]
> Before training, make sure you have placed your __Comet ML API key__ and __project name__ in the environment variable file `.env`.

To train the __Conformer__ model, use the following command:

```bash
# -g                 Number of GPUs per node for parallel GPU training
# -w                 Number of CPU workers for parallel data loading
# --epochs           Number of total epochs to run
# --batch_size       Size of the batch
# -lr                Learning rate
# --precision        Precision of the training
# --checkpoint_path  Checkpoint path to resume training from
python3 train.py \
    -g 4 \
    -w 8 \
    --epochs 10 \
    --batch_size 64 \
    -lr 2e-5 \
    --precision 16-mixed \
    --checkpoint_path path_to_checkpoint.ckpt
```
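
The `.env` requirement above can be checked before launching a long run. The sketch below parses a simple `KEY=VALUE` file with the standard library and reports missing entries. The key names `API_KEY` and `PROJECT_NAME` are assumptions made for illustration; check the training code for the names it actually reads.

```python
from pathlib import Path

# Hypothetical variable names; the repository may use different ones.
REQUIRED_KEYS = ("API_KEY", "PROJECT_NAME")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    missing = missing_keys(env)
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("Comet ML settings found.")
```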

## Experiment Results

The __Conformer (small)__ model was trained on __Mozilla Common Voice Dataset 7.0__ together with __my personal recordings__ (~1200 hrs) and the __LibriSpeech Train Set__ (~960 hrs). It was validated on a held-out split of the dataset (~100 hrs) and the __LibriSpeech Test Set__ (~10.5 hrs).

#### Train Configuration

| Parameter              | Value                                |
|------------------------|--------------------------------------|
| **GPU Device**         | 1 × L4                               |
| **Batch Size**         | 16                                   |
| **Epochs**             | 26                                   |
| **Optimizer**          | AdamW                                |
| **Learning Rate (lr)** | 1e-4                                 |
| **Scheduler**          | Cosine Annealing with Warmup Restart |
| **Min Learning Rate**  | 3e-5                                 |
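
To illustrate the scheduler named in the table, here is a minimal sketch of a cosine annealing schedule with linear warmup and periodic restarts. The peak and minimum learning rates come from the table; `warmup_steps`, `cycle_steps`, and the choice to warm up from the minimum learning rate are assumptions for illustration, not the values or implementation used in this run.

```python
import math


def lr_at_step(step, max_lr=1e-4, min_lr=3e-5, warmup_steps=500, cycle_steps=5000):
    """Learning rate at `step` for cosine annealing with warmup and restarts.

    max_lr and min_lr come from the table above; warmup_steps and
    cycle_steps are illustrative assumptions.
    """
    pos = step % cycle_steps  # position inside the current cycle; 0 means a restart
    if pos < warmup_steps:
        # Linear warmup from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * pos / warmup_steps
    # Cosine decay from max_lr back down towards min_lr over the rest of the cycle.
    progress = (pos - warmup_steps) / (cycle_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 250, 500, 2750, 4999, 5000):
    print(step, f"{lr_at_step(step):.2e}")
```

Each cycle begins at the minimum rate, climbs to the peak during warmup, and then decays along a half cosine until the next restart.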

#### Loss Curves 

![Loss Curve](assets/train_loss,val_loss%20VS%20step.jpeg)

#### Metric Evaluation

| Dataset | Train Loss | Validation Loss | Greedy WER | Link |
|---------|------------|-----------------|------------|------|
|         | 0.4484     | 0.3119          | 22.94%     | [:link:](https://drive.google.com/uc?id=1XcouMWSncUeNBvGZednuWYK1jdfKisCr) |
|         | ![Status](https://img.shields.io/badge/status-in_progress-yellow.svg) | ![Status](https://img.shields.io/badge/status-in_progress-yellow.svg) | ![Status](https://img.shields.io/badge/status-in_progress-yellow.svg) |  |

> _WER with CTC + KenLM decoding is expected to be __<15%__. Inference with CTC + KenLM is demonstrated in the [notebook](https://github.com/LuluW8071/Conformer/blob/main/notebooks/Conformer_Inference_With_CTC_Decoder.ipynb)._
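
For readers unfamiliar with the "Greedy WER" column, the sketch below shows the two pieces involved: the standard CTC greedy rule (collapse repeated ids, then drop blanks) and word error rate computed as word-level edit distance divided by the reference length. It is a generic illustration in plain Python, not the evaluation code from this repository; the blank index `0` and the example sentences are assumptions.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated ids, then drop blanks (the standard CTC greedy rule)."""
    decoded, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev_row[j]: edit distance between the reference words seen so far and hyp[:j]
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1] / len(ref)


# A blank between the two 3s keeps them as separate symbols.
print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))
# One substitution and one deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```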

## Citations

```bibtex
@misc{gulati2020conformerconvolutionaugmentedtransformerspeech,
      title={Conformer: Convolution-augmented Transformer for Speech Recognition},
      author={Anmol Gulati and James Qin and Chung-Cheng Chiu and Niki Parmar and Yu Zhang and Jiahui Yu and Wei Han and Shibo Wang and Zhengdong Zhang and Yonghui Wu and Ruoming Pang},
      year={2020},
      url={https://arxiv.org/abs/2005.08100},
}
```