https://github.com/coqui-ai/data-checker

🫠 check your data, before you wreck your model

Host: GitHub
URL: https://github.com/coqui-ai/data-checker
Owner: coqui-ai
License: mit
Created: 2021-08-16T10:37:35.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-08-11T14:05:51.000Z (about 3 years ago)
Last Synced: 2025-04-02T17:11:09.612Z (6 months ago)
Language: Python
Homepage:
Size: 41 KB
Stars: 16
Watchers: 5
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# 🫠 data-checker

Code for checking the quality of data for speech-to-text (STT) and text-to-speech (TTS).

### Install with Docker

```
$ git clone https://github.com/coqui-ai/data-checker.git
$ cd data-checker
$ docker build . -t data-checker
```

### Check your install

```
$ docker run data-checker python data_checks.py "/code/data/smoke_test/russian_sample_data/ru.csv" 2
.
.
.
👀 ─ Found 1 pairs in /code/data/smoke_test/russian_sample_data/ru.csv
· First audio file found: ru.wav of type audio/wav
· Checking if audio is readable...
😊 Found no unreadable audiofiles
· Reading audio duration...
👀 ─ Found a total of 0.00 hours of readable data
· Get transcript length...
· Get num feature vectors...
😊 Found no audio clips over 30 seconds in length
😊 Found no transcripts under 10 characters in length
· Get ratio (num_feats / transcript_len)...
😊 Found no offending pairs
· Calculating ratio (num_feats : transcript_len)...
😊 Found no pairs more than 2.0 standard deviations from the mean
🎉 ┬ Saved a total of 0.00 hours of data to BEST dataset
├ Removed a total of 0.00 hours (0.00% of original data)
├ Removed a total of 0 samples (0.00% of original data)
└ Wrote best data to /code/data/smoke_test/russian_sample_data/ru.BEST
```
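
The log above suggests a sequence of filters: drop clips over 30 seconds, drop transcripts under 10 characters, then drop pairs whose `num_feats / transcript_len` ratio lies more than 2.0 standard deviations from the mean. The sketch below illustrates that idea only; it is not the code in `data_checks.py`. The `Pair` type, the `filter_pairs` function, the use of a population standard deviation, and the inclusive thresholds are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, NamedTuple


class Pair(NamedTuple):
    """Hypothetical record for one audio/transcript pair."""
    audio: str
    seconds: float
    num_feats: int
    transcript: str


def filter_pairs(pairs: List[Pair], max_seconds: float = 30.0,
                 min_chars: int = 10, num_std: float = 2.0) -> List[Pair]:
    # Step 1: drop clips that are too long or transcripts that are too short.
    kept = [p for p in pairs
            if p.seconds <= max_seconds and len(p.transcript) >= min_chars]
    if len(kept) < 2:
        return kept  # not enough data to estimate a spread

    # Step 2: compute num_feats / transcript_len for each remaining pair.
    ratios = [p.num_feats / len(p.transcript) for p in kept]
    mu, sigma = mean(ratios), pstdev(ratios)
    if sigma == 0:
        return kept  # all ratios identical, so there are no outliers

    # Step 3: keep pairs whose ratio is within num_std deviations of the mean.
    return [p for p, r in zip(kept, ratios) if abs(r - mu) <= num_std * sigma]
```

A pair with far more feature vectors per character than the rest of the dataset often points to a truncated transcript or a long silence, which is why a ratio-based filter like this can be useful before training.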

### Run on your data

`data-checker` assumes your CSV has two columns: `wav_filename` and `transcript`. Note that you don't actually need to use WAV files, but the header should still be `wav_filename`.

```
$ docker run --mount "type=bind,src=/path/to/my/local/data,dst=/mnt" data-checker python data_checks.py "/mnt/my-data.csv" 2
```
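
If your data is not yet in the two-column layout described above, a short script can produce it. The following is a minimal sketch using only the Python standard library. The `write_manifest` name is hypothetical, and the comma delimiter and UTF-8 encoding are assumptions rather than documented requirements of `data-checker`.

```python
import csv
from typing import Iterable, Tuple


def write_manifest(rows: Iterable[Tuple[str, str]], csv_path: str) -> None:
    """Write (audio_path, transcript) rows under the expected header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # The header names come from the README; the audio need not be WAV.
        writer.writerow(["wav_filename", "transcript"])
        for audio_path, transcript in rows:
            writer.writerow([audio_path, transcript])


if __name__ == "__main__":
    # Example rows; replace with your own audio paths and transcripts.
    write_manifest([("clip1.wav", "hello world, this is a test")], "my-data.csv")
```

Whether audio paths should be absolute or relative to the CSV is not stated here. The smoke test output lists the audio file as `ru.wav`, which suggests bare file names next to the CSV can work, but verify this against your own setup.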
