https://github.com/coqui-ai/open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Host: GitHub
URL: https://github.com/coqui-ai/open-speech-corpora
Owner: coqui-ai
License: mit
Created: 2019-01-31T14:57:39.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-06-06T11:33:44.000Z (about 2 years ago)
Last Synced: 2026-01-12T18:48:33.667Z (6 months ago)
Topics: speech-emotion-recognition, speech-processing, speech-recognition, speech-separation, speech-synthesis, speech-to-text, stt, text-to-speech, tts, voice-activity-detection, voice-cloning, voice-recognition
Homepage:
Size: 139 KB
Stars: 1,382
Watchers: 55
Forks: 149
Open Issues: 169
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# 💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list favors free (i.e. no monetary cost) and truly open corpora (e.g. released under a [Creative Commons license](https://en.wikipedia.org/wiki/Creative_Commons_license) or a [Community Data License Agreement](https://en.wikipedia.org/wiki/Linux_Foundation#Community_Data_License_Agreement_%28CDLA%29)). Not every corpus listed meets those criteria, but all of them are accessible and usable for research and/or commercial use.

Feel free to propose additions to the list!

*There's a long backlog of corpora to be added in the [Issues](https://github.com/coqui-ai/open-speech-corpora/issues), and Pull Requests are very welcome :)*

## 📜 [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker |  | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Yesno | Hebrew | 6 mins | one male |  | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| LJ Speech Corpus | English | ~24 hours | [one female](https://librivox.org/reader/11049) |  | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: , reorganized:  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: , reorganized:  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: , reorganized:  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers |  | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers |  | mostly(?) [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker |  | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker |  | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |

## 📜 [CC-BY](https://creativecommons.org/licenses/by/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) |  | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours |  |  | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers |  | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers |  | [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) |
| African Speech Technology English-English Speech Corpus | English | ~21 hours |  |  | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours |  |  | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) |  | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) |  | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) |  | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) |  | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) |  | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) |  | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) |  | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) |  | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) |  | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) |  | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) |  | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers |  | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins |  |  | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male |  | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) |  | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers |  | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers |  | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours |  |  | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| VCTK | English | 44 hours | 109 speakers |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours |  |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | https://zenodo.org/record/4110812#.X9j0RmBOkYM | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours |  |  | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |

## 📜 [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours |  |  | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours |  |  | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech |  |  | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers |  | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers |  | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers |  | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers |  | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers |  | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers |  | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers |  | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker |  | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: ; male speaker:  | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by [SMC](https://blog.smc.org.in/malayalam-speech-corpus/) | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers |  | CC-BY-SA 4.0 |

## 📜 [CC-BY-ND](https://creativecommons.org/licenses/by-nd/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers |  | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers |  | CC-BY-ND |

## 📜 [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours |  |  | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request |  |  | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_stt/blob/master/LICENSE) |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males |  | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_tts/blob/master/LICENSE) |
| OVM – Otázky Václava Moravce | Czech | 35 hours |  |  | [CC-BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/) |

## 📜 [CC-BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours |  |  | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours |  |  | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |

## 📜 [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers |  | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (some audio) / [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) (most audio) / [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) |  | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) |  | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) |  | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers |  | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers |  | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair |  |  | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Czech Parliament Meetings | Czech | 88 hours |  |  | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) |  | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |

## 📜 [CDLA-Permissive](https://cdla.io/permissive-1-0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) |  | [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/) |

## 📜 [GNU General Public License](https://www.gnu.org/licenses/gpl.html)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers |  | GNU-GPL 3.0 |
| VoxForge | Russian |  |  |  | GNU-GPL 3.0 |
| VoxForge | German |  |  |  | GNU-GPL 3.0 |

## 📜 [Apache License](https://www.apache.org/licenses/LICENSE-2.0)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers |  | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers |  | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers |  | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) |  | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker |  | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker |  | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker |  | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker |  | Apache 2.0 |

## 📜 [MIT License](https://opensource.org/licenses/MIT)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof |  |  |  | MIT |

## 📜 [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Spanish Corpus | Spanish Spanish | 108 hours and 34 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes |  |  | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |

## 📜 [Custom License](https://en.wikipedia.org/wiki/Copyright)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers |  | [Fluent Speech Commands Public License](https://groups.google.com/a/fluent.ai/forum/#!msg/fluent-speech-commands/MXh_7Y-3QC8/9i2pHPW9AwAJ) |
| CMU Wilderness | 700 languages | Alignments distributed without audio or text; total: ~14,000 hours; per language: ~20 hours |  |  |  |
| CHiME-5 | English | 50 hours | 48 speakers |  | [CHiME-5 License](http://spandh.dcs.shef.ac.uk/chime_challenge/download.html) |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers |  | [NASA Media Usage Guidelines](https://www.nasa.gov/multimedia/guidelines/index.html) |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati |  |  |  | [Microsoft Speech Corpus (Indian Languages) License](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e) |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese |  |  |  | [Microsoft Research Data License Agreement](https://msrodr-api.azurewebsites.net//licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/file) |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) |  | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers |  | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz | [AN4](http://www.speech.cs.cmu.edu/databases/an4/LICENSE.html) |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) |  | [FT Speech License](https://ftspeech.dk/LICENSE.html) |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker |  | ["Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers |  | ["Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker |  | ["Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |