https://github.com/egorsmkv/cv10-uk-testset-clean

The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦
https://github.com/egorsmkv/cv10-uk-testset-clean

asr automatic-speech-recognition speech speech-recognition speech-to-text ukrainian

Last synced: 8 months ago
JSON representation

The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦

Host: GitHub
URL: https://github.com/egorsmkv/cv10-uk-testset-clean
Owner: egorsmkv
Created: 2022-09-26T12:31:37.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2025-02-20T14:09:54.000Z (9 months ago)
Last Synced: 2025-02-20T14:38:20.453Z (9 months ago)
Topics: asr, automatic-speech-recognition, speech, speech-recognition, speech-to-text, ukrainian
Homepage:
Size: 389 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Citation: CITATION.cff

Awesome Lists containing this project

README

          # The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦

## Overview

This repository contains the archive of [CV10 (test set)][1] with checked Ukrainian transcriptions and audios. All audios have been checked by a human to be sure that they are correct. 

This archive is used to test all ASR models listed here: https://github.com/egorsmkv/speech-recognition-uk

## Hugging Face dataset

- URL: https://huggingface.co/datasets/Yehor/cv10-uk-testset-clean

### Usage

Example with `datasets`:

```python

from datasets import load_dataset

ds = load_dataset('Yehor/cv10-uk-testset-clean')

print(ds)

for row in ds['train']:

  audio = row["audio"]

  sampling_rate = audio["sampling_rate"]

  audio_bytes = audio["array"]

  filename = audio["path"]

  print(len(audio_bytes), sampling_rate, filename)

  print(row["duration"], row["transcription"])

  print('---')

```

Example with `polars`: https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing

## Google Colabs

Use the following colabs to see how you can download this dataset in Python:

`datasets`:

- https://colab.research.google.com/drive/1qqnr5-WkaJi8iqHa_Pmlx7PbbXwXiimD?usp=sharing

`polars`:

- https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing

## Statistics

## Duration statistics

Duration: 4.6 hours

| Metrics | Value |

| ------ | ------ |

| mean | 5.201474 |

| std | 1.764957 |

| min | 1.704 |

| 25% | 3.816 |

| 50% | 4.896 |

| 75% | 6.384 |

| max | 10.536 |

## Download from GitHub

We recommend to use Hugging Face dataset, but in case you need raw dataset, use:

- Audio data: https://github.com/egorsmkv/cv10-uk-testset-clean/releases/download/v1.1/filtered-cv10-test.zip

- Labels list (TAB format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.lst

- Labels list (CSV format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.csv

- Labels list (CSV format) with relative paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_relative.csv

[1]: https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/egorsmkv/cv10-uk-testset-clean

Awesome Lists containing this project

README