
## LAS-Pytorch

This is my PyTorch implementation of the [Listen, Attend and Spell](https://arxiv.org/abs/1508.01211v2) (LAS) Google ASR deep learning model. I used both the Mozilla [Common Voice](https://voice.mozilla.org/en/datasets) dataset and the [LibriSpeech](https://www.openslr.org/12) dataset.

![LAS Network architecture](img/las.png)
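
The heart of the model is the Listener, a pyramidal BLSTM encoder that halves the time resolution at every layer before the attention-based Speller decodes characters. A minimal sketch of one pyramid step as described in the paper (the dimensions and names here are illustrative, not this repo's exact code):

```
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenates pairs of consecutive frames,
    halving the sequence length before the BLSTM (as in the LAS paper)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The input to the BLSTM is two stacked frames, hence input_dim * 2
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        batch, time, feat = x.shape
        if time % 2 == 1:            # drop the last frame if length is odd
            x = x[:, :-1, :]
            time -= 1
        # (batch, time, feat) -> (batch, time // 2, feat * 2)
        x = x.reshape(batch, time // 2, feat * 2)
        output, _ = self.blstm(x)
        return output
```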

The feature transformation is done on the fly while loading the files, thanks to torchaudio.
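
A minimal sketch of how such on-the-fly extraction works with torchaudio (the transform parameters here are assumptions, not necessarily the ones this repo uses):

```
import torchaudio
from torch.utils.data import Dataset

class AudioDataset(Dataset):
    """Loads audio files lazily and computes features at access time."""

    def __init__(self, file_paths, transcripts, sample_rate=16000):
        self.file_paths = file_paths
        self.transcripts = transcripts
        # Mel-spectrogram features, computed per sample when it is loaded
        self.transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=40)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        waveform, _ = torchaudio.load(self.file_paths[idx])
        features = self.transform(waveform)  # (channel, n_mels, time)
        return features.squeeze(0).transpose(0, 1), self.transcripts[idx]
```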

## Results

These are the LER (letter error rate) and loss metrics for 4 epochs of training with a considerably smaller architecture, since my GPU didn't have enough memory: the Listener had 128 hidden units and 2 layers, while the Speller had 256 hidden units, also with 2 layers.

We can see that the model is able to learn from the data we feed it, but it still needs more training and a full-size architecture.
| Letter error rate | Loss |
| :-----------------: | :-------------------: |
| ![LER](img/ler.png) | ![LOSS](img/loss.png) |

If we try to predict a sample of audio, the results currently look like:

`true_y`: ['A', 'N', 'D', '', 'S', 'T', 'I', 'L', 'L', '', 'N', 'O', '', 'A',
'T', 'T', 'E', 'M', 'P', 'T', '', 'B', 'Y', '', 'T', 'H', 'E', '',
'P', 'O']

`pred_y`:['A', 'N', 'D', '', 'T', 'H', 'E', 'L', 'L', '', 'T', 'O', 'T', 'M',
'', 'T', 'E', 'N', 'P', 'T', '', 'O', 'E', '', 'T', 'H', 'E', '',
'S', 'R']

Only the conjunctions are being properly identified, which leads us to think the model needs longer training to learn more specific words.
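
For reference, LER here means character-level edit distance normalized by the reference length; a minimal sketch of the standard Levenshtein formulation (not necessarily the repo's exact implementation):

```
def letter_error_rate(reference, hypothesis):
    """Levenshtein distance between two character sequences,
    normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

# e.g. letter_error_rate(true_y, pred_y) for the sample above
```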

*Will train more and update results here; still looking for cloud compute credits.*

## How to run it

### Requirements
The code is set up to run with both the Mozilla [Common Voice](https://voice.mozilla.org/en/datasets) dataset and the [LibriSpeech](https://www.openslr.org/12) dataset. If you want to run the code, either download the datasets yourself and extract them under `data/`, or run the script `utils/download_data.py`, which will download and extract them in the following format:

### Data
```
data
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── dev-clean/
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   ├── test-clean/
│   └── train-clean-100/
└── mozilla
    ├── dev.tsv
    ├── invalidated.tsv
    ├── mp3/
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    └── validated.tsv
```

So run:
```
# Remove a flag if you want to skip downloading that dataset
$ python utils/download_data.py --libri --common
```
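
As an alternative for LibriSpeech, torchaudio ships a dataset class that can download and extract the archives itself; a minimal sketch (this uses torchaudio's split names and sits outside the repo's own download script):

```
import os
import torchaudio

os.makedirs("data", exist_ok=True)
# Downloads and extracts each split under data/LibriSpeech/
for split in ("train-clean-100", "dev-clean", "test-clean"):
    torchaudio.datasets.LIBRISPEECH("data", url=split, download=True)
```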

Then run the following commands to process and collect all files:

```
# Run from the repository root
$ python utils/prepare_librispeech.py --root <absolute path to dataset>
$ python utils/prepare_common-voice.py --root <absolute path to dataset>
```
This will create a `processed/` folder inside each of the datasets, containing the CSVs with the data necessary for training, along with vocabulary and word count files.
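
As a rough illustration of what the vocabulary step involves, here is a minimal sketch of building character and word counts from a transcript CSV (the column name and layout are assumptions, not the repo's actual schema):

```
import csv
from collections import Counter

def build_vocab(csv_path, text_column="transcript"):
    """Collects character and word counts from a transcript CSV."""
    chars, words = Counter(), Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            text = row[text_column].upper()
            chars.update(text)
            words.update(text.split())
    # Index 0 is commonly reserved for a blank/padding token
    vocab = {c: i + 1 for i, c in enumerate(sorted(chars))}
    return vocab, words
```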

### Training
Execute the training script with the YAML config file for the desired dataset:
```
$ python train.py --config_path config/librispeech-config.yaml
# Or
$ python train.py --config_path config/common_voice-config.yaml
```

Loss and LER will be logged to the `runs/` folder; you can check them by running TensorBoard in the root directory.
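
The logging pattern itself is the standard `torch.utils.tensorboard` one; a minimal, self-contained sketch (the tag names and dummy metric values are assumptions):

```
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/example")  # creates the runs/ folder

# Hypothetical per-epoch metrics; in train.py these come from training
for epoch, (loss, ler) in enumerate([(2.1, 0.9), (1.4, 0.6)]):
    writer.add_scalar("Loss/train", loss, epoch)
    writer.add_scalar("LER/train", ler, epoch)
writer.close()
```

Then launch `tensorboard --logdir runs` from the root directory and open the URL it prints.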