https://github.com/sovaai/sova-dataset


# SOVA Dataset

SOVA Dataset is a free, public STT/ASR dataset.

Key facts:
- Russian, English and Chinese languages
- ~ 32 328 hours
- ~ 3,21 TB in `.wav` format

## Dataset composition
|Name|Link|Lang|Hours|Size|Source|Equipment|Annotation|Speech type|Augmentation|Quality|
|-|:-:|-|-|-|-|-|-|-|-|-|
|EngAudiobooksOriginal|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|7 130|743 GB|audiobook|professional|forced alignment|reading|none|95%|
|EngAudiobooksNoisy|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|3 873|310 GB|audiobook|professional|forced alignment|reading|phone calls|95%|
|RuAudiobooksDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|298|30,24 GB|audiobook|unprofessional|manual|reading|none|99%|
|RuDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|101|10,42 GB|audio records|unprofessional|manual|live speech|none|98%|
|RuYoutube|[Download](https://disk.yandex.ru/d/QsnbNTK0yzXSiA "Download")|RU|17 451|1 873 GB|audio records|unprofessional|asr|live speech|none|95%|
|ZhYoutube|[Download](https://disk.yandex.ru/d/zCY5yRvW7PWjvA "Download")|CN|3 475,1|321 GB|audio records|unprofessional|asr|live speech|none|97.83%|
|**TOTAL**|-|-|**32 328,1**|**3 287,66 GB (3,21 TB)**|-|-|-|-|-|-|
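
The hours in the table can be cross-checked and grouped by language with a few lines of Python. This is only an illustrative sketch: the `SUBSETS` list and the `hours_by_language` helper are hypothetical names introduced here, and the figures are copied by hand from the table rather than read from the dataset itself.

```python
from collections import defaultdict

# (subset name, language code, hours), copied by hand from the composition table.
SUBSETS = [
    ("EngAudiobooksOriginal", "EN", 7130.0),
    ("EngAudiobooksNoisy", "EN", 3873.0),
    ("RuAudiobooksDevices", "RU", 298.0),
    ("RuDevices", "RU", 101.0),
    ("RuYoutube", "RU", 17451.0),
    ("ZhYoutube", "CN", 3475.1),
]


def hours_by_language(subsets):
    """Sum the hours of all subsets that share a language code."""
    totals = defaultdict(float)
    for _name, lang, hours in subsets:
        totals[lang] += hours
    return dict(totals)


if __name__ == "__main__":
    totals = hours_by_language(SUBSETS)
    for lang, hours in totals.items():
        print(f"{lang}: {hours:.1f} h")
    print(f"TOTAL: {sum(totals.values()):.1f} h")
```

Grouped this way, the table gives 11 003 hours of English, 17 850 hours of Russian and 3 475,1 hours of Chinese, which add up to the 32 328,1 hours in the TOTAL row.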

## Audio characteristics
* Bit rate mode: constant
* Bit rate: 256 kbps
* Channel(s): 1 channel
* Sample rate: 16.0 kHz
* Bit depth: 16 bit
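
The listed values are consistent with uncompressed PCM audio: 16 000 samples/s × 16 bit × 1 channel = 256 kbps. A minimal sketch for checking a downloaded file against these values with Python's standard `wave` module is shown below. The function names are hypothetical, and the sketch assumes the files are plain PCM `.wav` files that the `wave` module can read.

```python
import wave

# Expected values, taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path):
    """Return True if the WAV header at `path` matches the listed characteristics."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH  # sample width is in bytes
        )
```

For example, `pcm_bit_rate_kbps(16_000, 16, 1)` evaluates to `256.0`, matching the bit rate above, and `matches_expected_format("sample.wav")` (with `sample.wav` standing in for any file from the dataset) reports whether that file's header agrees with the list.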

## Updates
* 08/11/2022: [Release v0.4.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.4.0)
* 10/12/2021: [Release v0.3.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.3.0)
* 22/12/2020: [Release v0.2.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.2.0)
* 24/12/2019: Published dataset with 116 hours.

## Contacts
For any questions, please feel free to contact us at support@sova.ai.

## License

SOVA Dataset is licensed under the [Creative Commons BY 4.0](https://creativecommons.org/licenses/by/4.0/) license by Virtual Assistant, LLC.