{"id":28504924,"url":"https://github.com/common-voice/cv-dataset","last_synced_at":"2025-07-06T09:30:31.991Z","repository":{"id":38086307,"uuid":"280222096","full_name":"common-voice/cv-dataset","owner":"common-voice","description":"Metadata and versioning details for the Common Voice dataset ","archived":false,"fork":false,"pushed_at":"2025-06-25T06:55:02.000Z","size":615,"stargazers_count":148,"open_issues_count":15,"forks_count":15,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-06-25T07:43:56.308Z","etag":null,"topics":["asr","dataset","open-data","open-datasets","speech-recognition","voice"],"latest_commit_sha":null,"homepage":"https://commonvoice.mozilla.org/datasets","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/common-voice.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-07-16T17:56:58.000Z","updated_at":"2025-06-25T06:55:07.000Z","dependencies_parsed_at":"2023-02-01T05:45:54.434Z","dependency_job_id":"73e577d6-d93f-4710-bc0e-3d2255984661","html_url":"https://github.com/common-voice/cv-dataset","commit_stats":{"total_commits":48,"total_committers":6,"mean_commits":8.0,"dds":0.5625,"last_synced_commit":"39e82378ceaea97879c9ff4fba158baae2eecb0e"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/common-voice/cv-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/common-voice%2Fcv-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/common-v
oice%2Fcv-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/common-voice%2Fcv-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/common-voice%2Fcv-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/common-voice","download_url":"https://codeload.github.com/common-voice/cv-dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/common-voice%2Fcv-dataset/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263877651,"owners_count":23523810,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","dataset","open-data","open-datasets","speech-recognition","voice"],"created_at":"2025-06-08T18:31:59.388Z","updated_at":"2025-07-06T09:30:31.979Z","avatar_url":"https://github.com/common-voice.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Common Voice Dataset\nThis repo contains release details and metadata for the Common Voice dataset. Please visit https://commonvoice.mozilla.org/datasets to download the full dataset.\n\n## About this repo\n\nThis repo contains [statistics for each dataset](datasets) we have released in JSON format, as well as a [changelog](CHANGELOG.md) with brief summaries of the release. The JSON structure may have changed slightly from release-to-release, so if you plan on doing any comparisons you may need to normalize them between versions. 
Currently, changelogs and statistics from datasets released in the last year are available, and we are working to backfill this information for older versions as well. \n\nAny demographic split (i.e. sex, age, accent) is applied to **the entire dataset**, not just the validated set. Unless otherwise indicated, durations are measured in milliseconds, and file sizes are measured in bytes.\n\nPlease only use this repo to provide feedback on **technical issues** with the dataset, such as file corruption, problems with the partitions, and so on. For broader, more qualitative discussions, please join us on [Discourse](https://discourse.mozilla.org/c/voice).\n\n## About the Dataset\n\nThis dataset features contributions from the Common Voice community on our [web platform](https://commonvoice.mozilla.org). New datasets are released approximately every six months.\n\nAll voice contributions are released as part of datasets, regardless of validation status. We only remove clips from datasets at the request of the user. The clips are bundled and uploaded to S3 using the [Common Voice Bundler tool](https://github.com/Common-Voice/common-voice-bundler/).\n\nEach downloaded `.tar.gz` file will have the following structure, where `[lang]` represents the [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for that language:\n\n```\n[lang].tar.gz/\n├── clips/\n│   ├── *.mp3 files\n|__ dev.tsv\n|__ invalidated.tsv\n|__ other.tsv\n|__ test.tsv\n|__ train.tsv\n|__ validated.tsv\n|__ reported.tsv (as of Corpus 5.0)\n```\n\nEach `.tsv` file contains a list of files, the annotation (original source sentence) for that clip, a hashed `client_id`, validation data, as well as any relevant demographics. 
If a language has fewer than 5 unique speakers, demographic data is removed to preserve privacy.\n\n* `validated` contains a list of all clips that have received two or more validations where `up_votes` \u003e `down_votes`\n* `invalidated` contains a list of all clips that have received two or more validations where `down_votes` \u003e `up_votes`, or clips that have received three or more validations where `down_votes` = `up_votes`\n* `other` contains a list of all clips that have not received sufficient validations to determine their status\n\nAs of Corpus 5.0, we are publishing a list of all of the sentences that have been flagged or reported by our contributors for each language, at the request of language communities that wish to use this data to do better quality control on their source sentences.\n\n## Fields\nEach row of a `.tsv` file represents a single audio clip, and contains the following information:\n\n* client_id - hashed UUID of a given user\n* path - relative path of the audio file\n* text - supposed transcription of the audio\n* up_votes - number of people who said audio matches the text\n* down_votes - number of people who said audio does not match the text\n* age - age of the speaker*\n* gender - gender of the speaker*\n* accent - accent of the speaker*\n* segment - if sentence belongs to a custom dataset segment, it will be listed here\n\n*For a full list of age, gender, and accent options, see the [demographics spec](https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts). These will only be reported if the speaker opted in to provide that information.\n\n## Use for machine-learning\n\nWe use the [Mozilla Corpora Creator](https://github.com/mozilla/CorporaCreator) tool to parse through metadata to generate [test, train, and dev](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) sets. 
The Corpora Creator eliminates duplicate clips and maximizes speaker diversity.\n\nEach test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set.\n\n## Dataset access\n\nWe're aware that downloading large files (\u003e 1-2 GB) over HTTP is not ideal, and we are working on improving our dataset access mechanisms to make it easier for researchers and developers to make use of our corpus. In the meantime, if you find that your download is being interrupted, we suggest using `curl` on the command line, so that you can resume interrupted downloads with the `-C` option. For more information on how to use `curl`, please see [the man page documentation](https://www.mit.edu/afs.new/sipb/user/ssen/src/curl-7.11.1/docs/curl.html).\n\n## Citation\n\nIf you use the data in a published academic work, we would appreciate it if you cite the following article:\n\n- Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) \"[Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)\". _Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)._ pp. 4211–4215\n\nThe BibTeX is:\n\n```\n@inproceedings{commonvoice:2020,\n  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. 
and Weber, G.},\n  title = {Common Voice: A Massively-Multilingual Speech Corpus},\n  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},\n  pages = {4211--4215},\n  year = 2020\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommon-voice%2Fcv-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommon-voice%2Fcv-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommon-voice%2Fcv-dataset/lists"}