https://github.com/sovaai/sova-dataset
- Host: GitHub
- URL: https://github.com/sovaai/sova-dataset
- Owner: sovaai
- License: other
- Created: 2019-12-24T15:50:08.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-11-08T14:55:56.000Z (almost 3 years ago)
- Last Synced: 2024-08-08T23:20:41.370Z (about 1 year ago)
- Topics: audio, audio-data, audio-dataset, audio-datasets, chinese-dataset, corpus, data, dataset, datasets, english-datasets, open-data, open-source, opendata, opensource, russian-datasets, sova-dataset, voice-data, voice-dataset, voice-datasets
- Homepage: https://sova.ai
- Size: 43 KB
- Stars: 115
- Watchers: 14
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# SOVA Dataset
SOVA Dataset is a free, public STT/ASR dataset.
Key facts:
- Russian, English and Chinese languages
- ~ 32 328 hours
- ~ 3,21 TB in `.wav` format

## Dataset composition
|Name|Link|Lang|Hours|Size|Source|Equipment|Annotation|Speech type|Augmentation|Quality|
|-|:-:|-|-|-|-|-|-|-|-|-|
|EngAudiobooksOriginal|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|7 130|743 GB|audiobook|professional|forced alignment|reading|none|95%|
|EngAudiobooksNoisy|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|3 873|310 GB|audiobook|professional|forced alignment|reading|phone calls|95%|
|RuAudiobooksDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|298|30,24 GB|audiobook|unprofessional|manual|reading|none|99%|
|RuDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|101|10,42 GB|audio records|unprofessional|manual|live speech|none|98%|
|RuYoutube|[Download](https://disk.yandex.ru/d/QsnbNTK0yzXSiA "Download")|RU|17 451|1 873 GB|audio records|unprofessional|asr|live speech|none|95%|
|ZhYoutube|[Download](https://disk.yandex.ru/d/zCY5yRvW7PWjvA "Download")|CN|3 475,1|321 GB|audio records|unprofessional|asr|live speech|none|97.83%|
|**TOTAL**|-|-|**32 328,1**|**3 287,66 GB (3,21 TB)**|-|-|-|-|-|-|

## Audio characteristics
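The values listed in this section are consistent with uncompressed PCM audio: 16 000 samples/s × 16 bit × 1 channel = 256 kbps. The sketch below is not part of the dataset tooling; it is a minimal illustration, assuming the downloaded files are standard PCM `.wav` files readable by Python's built-in `wave` module. The function names `pcm_bit_rate_kbps` and `matches_expected_format` are hypothetical helpers introduced here.

```python
import wave

# Expected values, taken from the characteristics listed in this section.
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit


def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path):
    """Return True if the WAV header at `path` matches the expected format.

    Hypothetical helper: assumes `path` points to a PCM WAV file.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


print(pcm_bit_rate_kbps(16_000, 16, 1))  # 256.0
```

Calling `matches_expected_format` on a few files after download is a cheap way to catch truncated or re-encoded audio before training.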
* Bit rate mode: constant
* Bit rate: 256 kbps
* Channel(s): 1 channel
* Sample rate: 16.0 kHz
* Bit depth: 16 bit

## Updates
* 08/11/2022: [Release v0.4.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.4.0)
* 10/12/2021: [Release v0.3.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.3.0)
* 22/12/2020: [Release v0.2.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.2.0)
* 24/12/2019: Published dataset with 116 hours.

## Contacts
For all questions, please feel free to contact us at support@sova.ai.

## License
SOVA Dataset is licensed under the [Creative Commons BY 4.0](https://creativecommons.org/licenses/by/4.0/) license by Virtual Assistant, LLC.