{"id":13935275,"url":"https://github.com/sovaai/sova-dataset","last_synced_at":"2025-07-19T20:31:43.367Z","repository":{"id":62891542,"uuid":"229967273","full_name":"sovaai/sova-dataset","owner":"sovaai","description":null,"archived":false,"fork":false,"pushed_at":"2022-11-08T14:55:56.000Z","size":44,"stargazers_count":115,"open_issues_count":1,"forks_count":7,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-08-08T23:20:41.370Z","etag":null,"topics":["audio","audio-data","audio-dataset","audio-datasets","chinese-dataset","corpus","data","dataset","datasets","english-datasets","open-data","open-source","opendata","opensource","russian-datasets","sova-dataset","voice-data","voice-dataset","voice-datasets"],"latest_commit_sha":null,"homepage":"https://sova.ai","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sovaai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-24T15:50:08.000Z","updated_at":"2024-08-05T01:46:08.000Z","dependencies_parsed_at":"2022-11-08T16:01:32.991Z","dependency_job_id":null,"html_url":"https://github.com/sovaai/sova-dataset","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovaai%2Fsova-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovaai%2Fsova-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovaai%2Fsova-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovaai%2Fsova-dataset/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/sovaai","download_url":"https://codeload.github.com/sovaai/sova-dataset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226666651,"owners_count":17665059,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","audio-data","audio-dataset","audio-datasets","chinese-dataset","corpus","data","dataset","datasets","english-datasets","open-data","open-source","opendata","opensource","russian-datasets","sova-dataset","voice-data","voice-dataset","voice-datasets"],"created_at":"2024-08-07T23:01:32.411Z","updated_at":"2025-07-19T20:31:43.357Z","avatar_url":"https://github.com/sovaai.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# SOVA Dataset\n\nSOVA Dataset is a free public STT/ASR dataset.\n\nKey facts:\n- Russian, English and Chinese languages\n- ~ 32 328 hours\n- ~ 3,21 TB in `.wav` format\n\n## Dataset composition\n|Name||Lang|Hours|Size|Source|Equipment|Annotation|Speech type|Augmentation|Quality|\n|-|:-:|-|-|-|-|-|-|-|-|-|\n|EngAudiobooksOriginal|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw \"Download\")|EN|7\u0026nbsp;130|743\u0026nbsp;Gb|audiobook|professional|forced alignment|reading|none|95%|\n|EngAudiobooksNoisy|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw \"Download\")|EN|3\u0026nbsp;873|310\u0026nbsp;Gb|audiobook|professional|forced alignment|reading|phone calls|95%|\n|RuAudiobooksDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw 
\"Download\")|RU|298|30,24\u0026nbsp;Gb|audiobook|unprofessional|manual|reading|none|99%|\n|RuDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw \"Download\")|RU|101|10,42\u0026nbsp;Gb|audio records|unprofessional|manual|live speech|none|98%|\n|RuYoutube|[Download](https://disk.yandex.ru/d/QsnbNTK0yzXSiA \"Download\")|RU|17\u0026nbsp;451|1\u0026nbsp;873\u0026nbsp;Gb|audio records|unprofessional|asr|live speech|none|95%|\n|ZhYoutube|[Download](https://disk.yandex.ru/d/zCY5yRvW7PWjvA \"Download\")|CN|3\u0026nbsp;475,1|321\u0026nbsp;Gb|audio records|unprofessional|asr|live speech|none|97.83%|\n|**TOTAL**|-|-|**32\u0026nbsp;328,1**|**3\u0026nbsp;287,66\u0026nbsp;Gb**\u003cbr\u003e**(3,21\u0026nbsp;TB)**|-|-|-|-|-|-|\n\n## Audio characteristics\n* Bit rate mode: constant\n* Bit rate: 256 kbps\n* Channel(s): 1 channel\n* Sample rate: 16.0 kHz\n* Bit depth: 16 bit\n\n## Updates\n* 08/11/2022: [Release v0.4.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.4.0)\n* 10/12/2021: [Release v0.3.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.3.0)\n* 22/12/2020: [Release v0.2.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.2.0)\n* 24/12/2019: Published dataset with 116 hours.\n\n## Contacts\nFor any questions, please feel free to contact us at \u003ca href=\"mailto:support@sova.ai?subject=SOVA Dataset\"\u003esupport@sova.ai\u003c/a\u003e\n\n## License\n\nSOVA Dataset is licensed under the [Creative Commons BY 4.0](https://creativecommons.org/licenses/by/4.0/) license by Virtual Assistant, LLC.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsovaai%2Fsova-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsovaai%2Fsova-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsovaai%2Fsova-dataset/lists"}