https://github.com/luluw8071/deep-speech-2
Implementation of Deep Speech 2 paper with BiGRU and BiLSTM using LibriSpeech Dataset
asr ctc-decode deep-speech hacktoberfest kenlm-toolkit librispeech
- Host: GitHub
- URL: https://github.com/luluw8071/deep-speech-2
- Owner: LuluW8071
- License: mit
- Created: 2024-10-08T04:07:32.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-27T11:57:23.000Z (over 1 year ago)
- Last Synced: 2025-02-28T13:13:15.820Z (over 1 year ago)
- Topics: asr, ctc-decode, deep-speech, hacktoberfest, kenlm-toolkit, librispeech
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/abs/1512.02595
- Size: 2.08 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Deep Speech 2

This repository implements the paper **Deep Speech 2: End-to-End Speech Recognition**, an ASR model for end-to-end speech-to-text transcription. The recurrent layers can be bidirectional LSTM or GRU, and the model is trained on LibriSpeech. Training and experimentation use **Lightning AI ⚡**.

---
## 📜 Paper & Blog Reviews
- ✅ [Gated Recurrent Neural Networks](https://arxiv.org/pdf/1412.3555)
- ✅ [Deep Speech 2: End-to-End Speech Recognition](https://arxiv.org/abs/1512.02595)
- ✅ [KenLM](https://kheafield.com/code/kenlm/)
- ✅ [Boosting Sequence Generation Performance with Beam Search Language Model Decoding](https://towardsdatascience.com/boosting-your-sequence-generation-performance-with-beam-search-language-model-decoding-74ee64de435a)
---
## 🚀 Installation
1. **Clone the repository:**
```bash
git clone https://github.com/LuluW8071/Deep-Speech-2.git
cd Deep-Speech-2
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
Ensure you have `PyTorch` and `Lightning AI` installed.

---
## 📖 Usage
### 🔥 Training
> **Important:** Before training, set your **Comet ML API key** and **project name** in the `.env` file.

To train the **Deep Speech 2** model with the default configuration:
```bash
python3 train.py
```
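
The note above asks for a Comet ML API key and project name in `.env`. Below is a minimal, standard-library-only sketch of a pre-flight check that could be run before training. The key names `COMET_API_KEY` and `COMET_PROJECT_NAME` and both helper functions are assumptions for illustration; use whatever names the repository's `.env` actually expects.

```python
from pathlib import Path

# Hypothetical key names (an assumption, not taken from this repository).
REQUIRED_KEYS = ("COMET_API_KEY", "COMET_PROJECT_NAME")


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(read_env_file(".env"))` returns an empty list when both values are set, so a non-empty result can be used to stop early with a clear message.
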
To customize the training parameters, modify `train.py` or pass arguments:
| Argument | Description | Default |
|----------|-------------|---------|
| `-g`, `--gpus` | Number of GPUs per node | `1` |
| `-w`, `--num_workers` | Number of data loading workers | `4` |
| `-db`, `--dist_backend` | Distributed backend | `'ddp_find_unused_parameters_true'` |
| `-m`, `--model_type` | Type of RNN (`lstm` or `gru`) | `'lstm'` |
| `-cl`, `--resnet_layers` | Number of residual CNN layers | `2` |
| `-nl`, `--rnn_layers` | Number of RNN layers | `3` |
| `-rd`, `--rnn_dim` | RNN hidden size | `512` |
| `--epochs` | Number of training epochs | `50` |
| `--batch_size` | Batch size | `32` |
| `-gc`, `--grad_clip` | Gradient clipping | `0.6` |
| `-lr`, `--learning_rate` | Learning rate | `2e-4` |
| `--precision` | Precision mode | `'16-mixed'` |
| `--checkpoint_path` | Path to checkpoint file | `None` |
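
As an illustration of how the flags in the table fit together, here is a sketch of an `argparse` parser that mirrors the documented names and defaults. It is not the repository's `train.py`; the `build_parser` helper, the argument types, and the `choices` restriction on `--model_type` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults (illustrative)."""
    p = argparse.ArgumentParser(description="Deep Speech 2 training options (sketch)")
    p.add_argument("-g", "--gpus", type=int, default=1, help="Number of GPUs per node")
    p.add_argument("-w", "--num_workers", type=int, default=4, help="Data loading workers")
    p.add_argument("-db", "--dist_backend", type=str,
                   default="ddp_find_unused_parameters_true", help="Distributed backend")
    p.add_argument("-m", "--model_type", choices=("lstm", "gru"), default="lstm",
                   help="Type of RNN")
    p.add_argument("-cl", "--resnet_layers", type=int, default=2,
                   help="Number of residual CNN layers")
    p.add_argument("-nl", "--rnn_layers", type=int, default=3, help="Number of RNN layers")
    p.add_argument("-rd", "--rnn_dim", type=int, default=512, help="RNN hidden size")
    p.add_argument("--epochs", type=int, default=50, help="Number of training epochs")
    p.add_argument("--batch_size", type=int, default=32, help="Batch size")
    p.add_argument("-gc", "--grad_clip", type=float, default=0.6, help="Gradient clipping")
    p.add_argument("-lr", "--learning_rate", type=float, default=2e-4, help="Learning rate")
    p.add_argument("--precision", type=str, default="16-mixed", help="Precision mode")
    p.add_argument("--checkpoint_path", type=str, default=None,
                   help="Path to checkpoint file")
    return p


# Example: switch to a GRU with 5 recurrent layers and keep every other default.
args = build_parser().parse_args(["-m", "gru", "-nl", "5", "--batch_size", "64"])
print(args.model_type, args.rnn_layers, args.batch_size, args.learning_rate)
```

The same overrides on the command line would look like `python3 train.py -m gru -nl 5 --batch_size 64`, assuming `train.py` accepts the flags exactly as listed in the table.
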
---
### 🧊 Export TorchScript Model
```bash
python3 freeze.py --model_checkpoint saved_checkpoint/deepspeech2.ckpt
```
### 🎙️ Inference
To perform inference using a trained model:
```bash
python3 demo.py --model_path optimized_model.pt --share
```
---
## 📊 Experiment Results
The model was trained on the full **LibriSpeech train set** (100 + 360 + 500 = 960 hours) and validated on the **LibriSpeech test set** (~10.5 hours) using **16-bit mixed precision**.

🔗 **Download Checkpoint**: [Google Drive Link](https://drive.google.com/file/d/14J6HhN_Op4c0y-up096eY_6_6D5JLIHb/view?usp=sharing)
### Model Performance
| Model Type | ResCNN Layers | RNN Layers | RNN Dim | Epochs | Batch Size | Grad Clip | LR |
|------------|---------------|------------|---------|--------|------------|-----------|----|
| BiLSTM | 2 | 3 | 512 | 25 | 64 | 0.6 | 2e-4 |
#### 📉 Loss Curves

#### 📝 WER & CER Metrics (Greedy Decoding)
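
For reference, greedy CTC decoding takes the most likely symbol for each frame, collapses consecutive repeats, and drops the blank symbol. WER and CER are the edit distance between hypothesis and reference divided by the reference length, counted over words and characters respectively. The sketch below shows both in plain Python; it is not the repository's code, and the tiny vocabulary and the blank index `0` are assumptions for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (input: per-frame argmax ids)."""
    chars = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # match or substitution
        prev_row = row
    return prev_row[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative vocabulary; index 0 is the CTC blank.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
# The blank between the two "l" frames keeps the double letter from collapsing.
decoded = ctc_greedy_decode([1, 2, 3, 3, 0, 3, 4], vocab)
print(decoded, wer("the cat sat", "the cat sit"), cer("the cat sat", "the cat sit"))
```

Note that both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.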

#### 🔍 Beam Search Decoding
| Word Score | LM Weight | N-gram LM | Beam Size | Beam Threshold |
|------------|-----------|-----------|-----------|----------------|
| -0.26 | 0.3 | 4-gram | 25 | 10 |
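
To make the role of these settings concrete, the sketch below ranks a few candidate transcripts using a scoring rule that is common in lexicon-based CTC beam search: acoustic score plus `lm_weight` times the language-model score plus `word_score` per word. The rule, the `Hypothesis` class, and the candidate scores are assumptions for illustration, not taken from this repository. A real decoder applies the pruning at every time step instead of once at the end.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # summed log-probabilities from the acoustic model
    lm_score: float        # log-probability from the n-gram language model


def combined_score(hyp, lm_weight=0.3, word_score=-0.26):
    """Acoustic score + weighted LM score + a per-word bonus or penalty."""
    n_words = len(hyp.text.split())
    return hyp.acoustic_score + lm_weight * hyp.lm_score + word_score * n_words


def prune(hyps, beam_size=25, beam_threshold=10.0, lm_weight=0.3, word_score=-0.26):
    """Keep hypotheses within beam_threshold of the best score, at most beam_size."""
    scored = sorted(
        ((combined_score(h, lm_weight, word_score), h) for h in hyps),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored:
        return []
    best = scored[0][0]
    kept = [h for score, h in scored if best - score <= beam_threshold]
    return kept[:beam_size]


# Made-up candidates: the second has the better acoustic score,
# but the language model prefers the first.
candidates = [
    Hypothesis("the cat sat", acoustic_score=-12.0, lm_score=-8.0),
    Hypothesis("the cat sad", acoustic_score=-11.8, lm_score=-11.0),
    Hypothesis("thick at sat", acoustic_score=-25.0, lm_score=-15.0),
]
kept = prune(candidates)
for hyp in kept:
    print(f"{combined_score(hyp):.2f}  {hyp.text}")
```

With the values from the table, the language model moves "the cat sat" ahead of the acoustically stronger "the cat sad", and the third candidate is dropped because it falls more than the beam threshold below the best score.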

#### 🔎 Alignments Visualization

---
## 🔗 Citations
```bibtex
@misc{amodei2015deepspeech2endtoend,
  title={Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
  author={Dario Amodei and Rishita Anubhai and Eric Battenberg and Carl Case and others},
  year={2015},
  url={https://arxiv.org/abs/1512.02595}
}
```