Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/egorsmkv/asr-datasets-cleaner

A pipeline to make ASR datasets better
https://github.com/egorsmkv/asr-datasets-cleaner

asr data-processing datasets lid machine-learning speech-to-text

Last synced: 3 days ago
JSON representation

A pipeline to make ASR datasets better

Host: GitHub
URL: https://github.com/egorsmkv/asr-datasets-cleaner
Owner: egorsmkv
License: apache-2.0
Created: 2024-07-18T11:10:45.000Z (5 months ago)
Default Branch: master
Last Pushed: 2024-07-25T12:03:49.000Z (5 months ago)
Last Synced: 2024-11-01T16:22:19.916Z (about 2 months ago)
Topics: asr, data-processing, datasets, lid, machine-learning, speech-to-text
Language: Python
Homepage:
Size: 88.9 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Awesome Lists containing this project

README

# `asr-datasets-cleaner`

> [!WARNING]
> Currently, this work is in progress.

> This repository contains a pipeline for better ASR training solving these two tasks: **(1)** remove incorrect audio samples from ASR datasets by LID filtering and **(2)** normalize text samples.

Authors:

- Yehor Smoliakov: [@egorsmkv][4] on GitHub, and for private discussions.

## Idea

1. Use https://huggingface.co/facebook/mms-lid-126 to detect the language in audio samples.

2. Use https://github.com/pemistahl/lingua-py to detect the language in text samples.

3. Use https://huggingface.co/skypro1111/mbart-large-50-verbalization to do text normalization
(convert numerals/abbreviations to their textual representation, that is, $5 -> five dollars).

## Details

- We use the *Ukrainian* subset of [YODAS2][1] in our command examples.
- We patch the YODAS2's dataset builder script to download only a part of the dataset.

## Required software

- Python 3.12
- [uv][2]
- [nq][3]
- CUDA device

## Install

```shell
uv venv --python 3.12

source .venv/bin/activate

uv pip install -r requirements.txt

# in development mode
uv pip install -r requirements-dev.txt
```

## Usage

1. Generate a bash file to download required files from [YODAS2][1]:

```shell
python generate_commands.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --max_files 10 > download_dataset.sh
```

2. Download the dataset:

```shell
bash download_dataset.sh
```

3. Convert the dataset to `datasets` format:

Copy the `yodas2_dsbuilder.py` file to your `dataset_dir` directory and rename it as `dataset_dir`. So in the following example, the `dataset_dir` is `uk_yodas2` and the script must be renamed as `uk_yodas2.py`.

Then convert the dataset, it will unarchive files and generate metadata:

```shell
python convert_dataset.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --max_files 10 --cache_dir cache-yodas2-uk000
```

4. Extract utterances:

```shell
python extract_utterances.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 128 > data/uk000.jsonl
```

5. Text LID:

```shell
python text_lid.py --file data/uk000.jsonl --to data/uk000_+tlid.jsonl
```

6. Filter by a language:

```shell
python filter_by_language.py --file data/uk000_+tlid.jsonl --to data/uk000_+only_uk.jsonl --language uk --score 0.95
```

7. Audio LID:

```shell
python audio_lid.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 16 --model_id facebook/mms-lid-126 --file data/uk000_+tlid.jsonl --to data/uk000_+tlid_+alid.jsonl --device cuda:0
```

8. Normalize utterances:

```shell
python normalize_utterances.py --file data/uk000.jsonl --to data/uk000_normalized.jsonl --batch_size 8 --device cuda:0
```

## Examples

0. Go to `examples/`

1. Inference audio samples by the different variants of MMS LID model to see their outputs:

```shell
python audio_lid.py --model_id facebook/mms-lid-126 --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --device cuda:0 > ../mms-checkpoints-test/mms-lid-126.txt
```

2. Inference text samples by lingua-py to see their text language:

```shell
python text_lid.py --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000
```

3. Inference text samples by the MBART model for text normalization:

```shell
python normalize_utterances.py
```

4. Calculate the duration of the dataset:

```shell
python count_durations.py --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 128
```

## Development

```shell
ruff check
ruff format
```

## Misc

MMS has these models for the LID task:

- https://huggingface.co/facebook/mms-lid-4017
- https://huggingface.co/facebook/mms-lid-2048
- https://huggingface.co/facebook/mms-lid-1024
- https://huggingface.co/facebook/mms-lid-512
- https://huggingface.co/facebook/mms-lid-256
- https://huggingface.co/facebook/mms-lid-126

[1]: https://huggingface.co/datasets/espnet/yodas2
[2]: https://github.com/astral-sh/uv
[3]: https://github.com/leahneukirchen/nq
[4]: https://github.com/egorsmkv