{"id":13748051,"url":"https://github.com/facebookresearch/voxpopuli","last_synced_at":"2025-05-09T10:32:14.583Z","repository":{"id":44632171,"uuid":"327985815","full_name":"facebookresearch/voxpopuli","owner":"facebookresearch","description":"A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation","archived":true,"fork":false,"pushed_at":"2023-04-02T02:52:59.000Z","size":88,"stargazers_count":526,"open_issues_count":14,"forks_count":56,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-02-23T00:14:17.241Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-01-08T18:47:36.000Z","updated_at":"2025-02-21T05:49:49.000Z","dependencies_parsed_at":"2023-01-23T04:01:11.666Z","dependency_job_id":"73b334b6-f7f1-4ebc-b9cd-f78b25e2481f","html_url":"https://github.com/facebookresearch/voxpopuli","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvoxpopuli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvoxpopuli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvoxpopuli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvoxpopuli/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/voxpopuli/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253234178,"owners_count":21875561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T07:00:32.681Z","updated_at":"2025-05-09T10:32:14.291Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["Audio Datasets","Python","语音识别与合成_其他","Datasets","Data","Popular Training Datasets"],"sub_categories":["网络服务_其他","Speech Recognition (STT) Datasets","Spoken language corpora","Mixed Tokenizers"],"readme":" VoxPopuli\n=====\n[https://aclanthology.org/2021.acl-long.80](https://aclanthology.org/2021.acl-long.80)\n\nA large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.\n\n# Overview\n\nVoxPopuli provides\n- 400K hours of unlabelled speech data for 23 languages\n- 1.8K hours of transcribed speech data for 16 languages\n- 17.3K hours of speech-to-speech interpretation data for 15x15 directions\n- 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)\n\nThe raw data is collected from 2009-2020 [European Parliament event recordings](https://multimedia.europarl.europa.eu/en/home).\nWe acknowledge the European Parliament for creating and sharing these materials.\n\n#### Detailed statistics\n\n\u003cdetails\u003e\u003csummary\u003eUnlabelled and transcribed 
data\u003c/summary\u003e\u003cp\u003e\n\n| Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M |\n| German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M |\n| French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M |\n| Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M |\n| Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M |\n| Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M |\n| Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M |\n| Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M |\n| Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M |\n| Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M |\n| Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M |\n| Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K |\n| Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M |\n| Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M |\n| Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M |\n| Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M |\n| Portuguese | Pt | 4.4K/17.5K | - | - | - | - |\n| Bulgarian | Bg | 4.3K/17.6K | - | - | - | - |\n| Greek | El | 4.4K/17.7K | - | - | - | - |\n| Latvian | Lv | 4.4K/13.1K | - | - | - | - |\n| Maltese | Mt | 4.4K/9.1K | - | - | - | - |\n| Swedish | Sv | 4.5K/16.3K | - | - | - | - |\n| Danish | Da | 4.3K/13.6K | - | - | - | - |\n| Total | | 100K/384K | 1791 | 4295 | 15M | 467M |\n\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eSpeech-to-speech interpretation data\u003c/summary\u003e\u003cp\u003e\n\n| Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K |\n| De | 187 | - | 196 | 204 | 214 | 217 | 
198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K |\n| Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K |\n| Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K |\n| Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 |\n| It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 |\n| Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 |\n| Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 |\n| Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 |\n| Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 |\n| Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 |\n| Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 |\n| Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 |\n| Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 |\n| Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 |\n| Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K |\n\n\u003c/p\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eAccented speech transcribed data\u003c/summary\u003e\u003cp\u003e\n\n| Accent | Code | Transcribed Hours | Transcribed Speakers |\n|:---:|:---:|:---:|:---:|\n| Dutch | en_nl | 3.52 | 45 |\n| German | en_de | 3.52 | 84 |\n| Czech | en_cs | 3.30 | 26 |\n| Polish | en_pl | 3.23 | 33 |\n| French | en_fr | 2.56 | 27 |\n| Hungarian | en_hu | 2.33 | 23 |\n| Finnish | en_fi | 2.18 | 20 |\n| Romanian | en_ro | 1.85 | 27 |\n| Slovak | en_sk | 1.46 | 17 |\n| Spanish | en_es | 1.42 | 18 |\n| Italian | en_it | 1.11 | 15 |\n| Estonian | en_et | 1.08 | 6 |\n| Lithuanian | en_lt | 0.65 | 7 |\n| Croatian | 
en_hr | 0.42 | 9 |\n| Slovene | en_sl | 0.25 | 7 |\n\n\u003c/p\u003e\u003c/details\u003e\n\n# What's New\n- __2022-02-01__: New labelled accented English speech data released.\n- __2022-01-15__: New [wav2vec 2.0 pre-trained models](https://github.com/facebookresearch/voxpopuli#wav2vec-20) released.\n- __2021-07-26__: New unlabelled data (additional 300K hours) released.\n- __2021-03-03__: VoxPopuli released.\n\n# Getting Data\nWe provide raw audios as well as scripts to segment and align them with transcription/interpretation. The output format\nis [Ogg Vorbis](https://en.wikipedia.org/wiki/Vorbis) (16000Hz, 16-bit, mono-channel),\nwhich is supported by common libraries such as `libsndfile` and `libsox` (they have Python frontends\nby [soundfile](https://github.com/bastibe/python-soundfile), [torchaudio](https://github.com/pytorch/audio), etc.).\n\nAs the first step, clone this repo for the processing scripts\n```bash\ngit clone https://github.com/facebookresearch/voxpopuli.git\n```\nand install required PyPI packages:\n```bash\npip install -r requirements.txt\n```\n\n\n### Unlabelled Data\nFirst, download raw audios via\n```bash\npython -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]\n```\nwhich saves audios to `${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg`.\n\n`SUBSET` specifies the data subset to download:\n\n|  --subset | # Languages | Hours | Years | Size |\n|:---:|:---:|:---:|:---:|:---:|\n| en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G |\n| en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G |\n| 10k | 23 | 10K | 2019-2020 | 170G |\n| 100k | 23 | 100K | 2009-2020 | 1.7T |\n| 400k | 23 | 400K | 2009-2020 | 6.4T |\n\nThen, segment these audios via\n```bash\npython -m voxpopuli.get_unlabelled_data --root 
[ROOT] --subset [SUBSET]\n```\nwhich outputs to `${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg`\n\n### Transcribed (ASR) Data\nFirst, download raw audios via\n```bash\npython -m voxpopuli.download_audios --root [ROOT] --subset asr\n```\nwhich saves audios to `${ROOT}/raw_audios/original/[year]/[recording_id].ogg`.\n\nThen, segment these audios and align them with transcripts via\n```bash\npython -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]\n```\nwhich outputs\n- audios `${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg`\n- per-split manifest (ID, transcript, speaker ID) `${ROOT}/transcribed_data/[language]/asr_[split].tsv`\n\n**Accented transcribed data**\nTo retrieve the transcribed accented speech data, follow the above steps with `--lang [LANGUAGE]_accented` (e.g. `--lang en_accented`).\nNote that the accented speech data is only composed of a test set for now.\n\n### Speech-to-Speech Interpretation Data\nFirst, follow the instructions above to set up ASR data (source audios and transcripts).\n\nThen, download target audios via\n```bash\npython -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]\n```\nwhich saves audios to `${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg`.\n\nFinally, segment these audios and match them with source ones via\n```bash\npython -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]\n```\nwhich outputs\n- target audios `${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg`\n- manifest (source ID, transcript, speaker ID, target ID) `${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv`\n\nWe also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments.\nTo use them instead of machine transcriptions in the alignments, add `--use-annotated-target` to the command line.\n\n### Language Modeling (LM) Data\nWe combine VoxPopuli 
transcripts and text data from [Europarl](https://www.statmt.org/europarl/) for LM training.\n\nDownload VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via\n```bash\npython -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE]\n```\nwhich outputs\n- sentences `${ROOT}/lm_data/[language]/sentences.txt`\n- vocabulary `${ROOT}/lm_data/[language]/vocabulary.txt`\n\nTo train an n-gram LM with [KenLM](https://github.com/kpu/kenlm), run\n```bash\n${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] \u003c [OUT_TEXT_FILE] \u003e ${n}gram_lm.arpa\n${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin\n```\n\n#  Pre-trained Models\n## wav2vec 2.0\nWe provide pre-trained wav2vec 2.0 models\n(implemented in [fairseq](https://github.com/pytorch/fairseq) and [wav2letter/flashlight](https://github.com/facebookresearch/flashlight))\nfor downstream speech tasks. Each language is covered by a monolingual _Base_ model and multilingual _Large_ models that\ncombine languages in the same family or all languages. 
See also [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)\nfor larger-scale (up to 2B) multilingual models trained on VoxPopuli (400K hours).\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eDownload\u003c/b\u003e\u003c/summary\u003e\u003cp\u003e\n\n|   Language(s)    |     Family     |  PT Hours  |                                                                             Base Model (95M)                                                                              |                                                                                      Large Model (317M)                                                                                       |\n|:----------------:|:--------------:|:----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|\n|    Es (V1/V2)    |    Romance     | 4.4K/21.4K |     fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es_v2.pt)      |        fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_es.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt)        |\n|    Fr (V1/V2)    |    Romance     | 4.5K/22.8K |     fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr_v2.pt)      |        fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_fr.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt)        |\n|    It (V1/V2)    |    Romance     | 4.6K/21.9K |    
 fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it_v2.pt)      |        fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_it.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt)        |\n|     Pt (V2)      |    Romance     |   17.5K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pt_v2.pt)                                             |                                              [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt)                                               |\n|     Ro (V2)      |    Romance     |   17.9K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_ro_v2.pt)                                             |                                              [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt)                                               |\n|    Nl (V1/V2)    | West Germanic  | 4.5K/19.0K |     fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl_v2.pt)      |  fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_nl.pt) / [V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt)  |\n|     En (V2)      | West Germanic  |   24.1K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_en_v2.pt)                                             |                                        [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt)                  
                       |\n|     De (V2)      | West Germanic  |   23.2K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_de_v2.pt)                                             |                                        [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt)                                         |\n|    Sv (V1/V2)    | North Germanic | 4.5K/16.3K |     fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv_v2.pt)      | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_sv.pt) / [V2 North Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt) |\n|     Da (V2)      | North Germanic |   13.6K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_da_v2.pt)                                             |                                       [fairseq V2 North Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt)                                        |\n|     Bg (V2)      |     Slavic     |   17.6K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_bg_v2.pt)                                             |                                                 [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Cs (V2)      |     Slavic     |   18.7K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_cs_v2.pt)                                             |                                                 [fairseq V2 
Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Hr (V2)      |     Slavic     |    8.1K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hr_v2.pt)                                             |                                                 [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Pl (V2)      |     Slavic     |   21.2K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pl_v2.pt)                                             |                                                 [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Sk (V2)      |     Slavic     |   12.1K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sk_v2.pt)                                             |                                                 [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Sl (V2)      |     Slavic     |   11.3K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sl_v2.pt)                                             |                                                 [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt)                                                 |\n|     Et (V2)      |     Uralic     |   10.6K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_et_v2.pt)            
                                 |                                                 [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt)                                                 |\n|     Fi (V2)      |     Uralic     |   14.2K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fi_v2.pt)                                             |                                                 [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt)                                                 |\n|     Hu (V2)      |     Uralic     |   17.7K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hu_v2.pt)                                             |                                                 [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt)                                                 |\n|     Lv (V2)      |     Baltic     |   13.1K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lv_v2.pt)                                             |                                                 [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt)                                                 |\n|     Lt (V2)      |     Baltic     |   14.4K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lt_v2.pt)                                             |                                                 [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt)                                                 |\n|     El (V2)      |     Greek      |   17.7K    |                                           
  [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_el_v2.pt)                                             |                                                      [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_el_v2.pt)                                                       |\n|     Mt (V2)      |    Semitic     |    9.1K    |                                             [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_mt_v2.pt)                                             |                                                      [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_mt_v2.pt)                                                       |\n| All 23 languages |       -        |    10K     |                                              [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_10k.pt)                                              |                                                       [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_10k.pt)                                                        |\n| All 23 languages |       -        |    100K    | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_100k.pt) / [wav2letter](https://dl.fbaipublicfiles.com/voxpopuli/vox_populi_100k_500iters.tar.gz) |                                                       [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_100k.pt)                                                       |\n\n\u003c/p\u003e\u003c/details\u003e\n\nIn [our paper](https://arxiv.org/pdf/2101.00390.pdf) (Section 4.3.1), we evaluated part of these models on the [Common Voice](https://commonvoice.mozilla.org/) corpus\nin the normal setting and the [few-shot phoneme recognition setting](https://github.com/facebookresearch/CPC_audio#cross-lingual-transfer).\n\n## Wav2letter C++ implementation\n\nA wav2letter implementation as well as a 
checkpoint pretrained on VoxPopuli 100k (base model) is also available in the [Wav2letter repository](https://github.com/flashlight/wav2letter/tree/master/recipes/joint_training_vox_populi).\n\nThe complete fine-tuned ASR baselines for this codebase should come soon.\nThe wav2letter implementation follows [this paper](https://arxiv.org/abs/2011.00093).\n\n## ASR and LM\nFor the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec2 models (Base 10K) as well as n-gram LMs (trained with [KenLM](https://github.com/kpu/kenlm)) and their lexicons.\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eDownload\u003c/b\u003e\u003c/summary\u003e\u003cp\u003e\n\n|  Language | ASR (fairseq) | LM (kenLM) | Lexicon |\n|:---:|:---:|:---:|:---:|\n| Cs | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_cs.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_cs.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_lm.lexicon) |\n| De | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_de.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_de.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_lm.lexicon) |\n| En | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_en.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_en.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_lm.lexicon) |\n| Es | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_es.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_es.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_lm.lexicon) |\n| Et | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_et.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_et.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_lm.lexicon) |\n| Fi | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fi.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fi.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_lm.lexicon) |\n| Fr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fr.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_lm.lexicon) |\n| Hr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hr.tar), [fine-tuned 
wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_lm.lexicon) |\n| Hu | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hu.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hu.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_lm.lexicon) |\n| It | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_it.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_it.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_lm.lexicon) |\n| Lt | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_lt.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_lt.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_lm.lexicon) |\n| Nl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_nl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_nl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_lm.lexicon) |\n| Pl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_pl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_pl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_lm.lexicon) |\n| Ro | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_ro.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_ro.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_lm.lexicon) |\n| Sk | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sk.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sk.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_lm.lexicon) |\n| Sl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_lm.lexicon) |\n\n\u003c/p\u003e\u003c/details\u003e\n \nWe also provide [CoVoST 2](https://github.com/facebookresearch/covost) +\n[EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ASR Transformer models that 
are self-trained on 3000h VoxPopuli\nunlabelled data.\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eDownload\u003c/b\u003e\u003c/summary\u003e\u003cp\u003e\n\n|  Language | CoVoST 2 Test (WER) | EuroParl-ST Test (WER) | Model (fairseq) |\n|:---:|:---:|:---:|:---:|\n| De | 17.3 | 21.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de.tar) |\n| Es | 13.2 | 15.3 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es.tar) |\n| Fr | 17.0 | 19.0 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr.tar) |\n \n\u003c/p\u003e\u003c/details\u003e\n\nPlease refer to the [S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use\nof Transformer model checkpoints.\n\n## Speech-to-Text Translation (ST)\nWe provide [CoVoST 2](https://github.com/facebookresearch/covost) +\n[EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ST Transformer models that are jointly trained with 400h VoxPopuli\nweakly labelled data.\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eDownload\u003c/b\u003e\u003c/summary\u003e\u003cp\u003e\n\n| Direction | CoVoST 2 Test (BLEU) | EuroParl-ST Test (BLEU) | Model (fairseq) |\n|:---:|:---:|:---:|:---:|\n| De-En | 23.4 | 24.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de-en.tar) |\n| Es-En | 29.7 | 28.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es-en.tar) |\n| Fr-En | 30.3 | 31.1 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr-en.tar) |\n\n\u003c/p\u003e\u003c/details\u003e\n \nPlease refer to the\n[S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use of these checkpoints.\n\n# License\n|  | License |\n|:---:|:---:|\n| VoxPopuli Data | 
[CC0](https://creativecommons.org/share-your-work/public-domain/cc0/) (see also European Parliament's [legal notice](https://www.europarl.europa.eu/legal-notice/en/) for the raw data) |\n| LM Data | (Please check out the [Europarl website](https://www.statmt.org/europarl/) for the Europarl portion) |\n| Pre-trained Models | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |\n| Code | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |\n\n# Contact\nChanghan Wang (changhan@fb.com), Morgane Rivière (mriviere@fb.com), Ann Lee (annl@fb.com)\n\n# Citation\n```\n@inproceedings{wang-etal-2021-voxpopuli,\n    title = \"{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation\",\n    author = \"Wang, Changhan  and\n      Riviere, Morgane  and\n      Lee, Ann  and\n      Wu, Anne  and\n      Talnikar, Chaitanya  and\n      Haziza, Daniel  and\n      Williamson, Mary  and\n      Pino, Juan  and\n      Dupoux, Emmanuel\",\n    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.acl-long.80\",\n    pages = \"993--1003\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fvoxpopuli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Fvoxpopuli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fvoxpopuli/lists"}