{"id":19695829,"url":"https://github.com/wenet-e2e/wenetspeech","last_synced_at":"2025-04-05T02:07:04.802Z","repository":{"id":43872174,"uuid":"392220961","full_name":"wenet-e2e/WenetSpeech","owner":"wenet-e2e","description":"A 10000+ hours dataset for Chinese speech recognition","archived":false,"fork":false,"pushed_at":"2023-07-03T02:57:34.000Z","size":3865,"stargazers_count":523,"open_issues_count":7,"forks_count":50,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-29T01:05:38.399Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wenet-e2e.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-08-03T06:55:18.000Z","updated_at":"2025-03-28T10:16:42.000Z","dependencies_parsed_at":"2024-01-17T13:12:06.775Z","dependency_job_id":"0c3e1e6e-6c66-423f-a13f-e0d0a855762c","html_url":"https://github.com/wenet-e2e/WenetSpeech","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWenetSpeech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWenetSpeech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWenetSpeech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWenetSpeech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wenet-e2e","download_url":"https://codeload.github.com/wenet-e2e/WenetSpeec
h/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247276163,"owners_count":20912288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T19:31:04.505Z","updated_at":"2025-04-05T02:06:59.780Z","avatar_url":"https://github.com/wenet-e2e.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"## WenetSpeech\n\n[**Official website**](https://wenet-e2e.github.io/WenetSpeech/)\n| [**Paper**](https://arxiv.org/pdf/2110.03370.pdf)\n\nA 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition\n\n![WenetSpeech](res/wenetspeech.jpg)\n\n\n## Download\n\nPlease visit the [official website](https://wenet-e2e.github.io/WenetSpeech/),\nread the license, and follow the instructions to apply for the `PASSWORD` needed to download the data.\n\n``` bash\necho 'PASSWORD' \u003e SAFEBOX/password\n```\n\n### From Tencent Meeting (default)\n\nDownload WenetSpeech:\n\n``` bash\nbash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR\n```\n\n### From ModelScope\n\nInstall `modelscope` (depends on `torch`) before downloading:\n\n``` bash\nconda create -n modelscope python=3.7\nconda activate modelscope\npip install torch\npip install modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html\n```\n\nDownload [WenetSpeech](https://modelscope.cn/datasets/wenet/WenetSpeech) from ModelScope:\n\n``` bash\nsed -i 's/modelscope=false/modelscope=true/g' utils/download_wenetspeech.sh\nbash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR\n```\n\n## Discussion \u0026 
Communication\n\nPlease scan the QR code on the left to follow our official WeNet account.\nWe created a WeChat group for better discussion and quicker responses.\nPlease scan the personal QR code on the right; that person is responsible for inviting you to the chat group.\n\n| \u003cimg src=\"https://github.com/robin1001/qr/blob/master/wenet.jpeg\" width=\"250px\"\u003e | \u003cimg src=\"https://github.com/wenet-e2e/wenet-contributors/blob/main/wenetspeech/lvhang.jpg\" width=\"250px\"\u003e |\n| ---- | ---- |\n\n\n## Benchmark\n\n| Toolkit | Dev  | Test\\_Net | Test\\_Meeting | AIShell-1 |\n|---------|------|:---------:|:-------------:|:---------:|\n| Kaldi   | 9.07 |   12.83   |     24.72     |    5.41   |\n| ESPNet  | 9.70 |    8.90   |     15.90     |    3.90   |\n| WeNet   | 8.88 |    9.70   |     15.59     |    4.61   |\n\n## Description\n\n### Creation\n\nAll the data are collected from YouTube and Podcast. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are adopted to label each YouTube and Podcast recording, respectively. 
To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.\n\n\n### Categories\n\nIn summary, WenetSpeech groups all data into 3 categories, as the following table shows:\n\n| Set        | Hours | Confidence  | Usage                                 |\n|------------|-------|-------------|---------------------------------------|\n| High Label | 10005 | \u003e=0.95      | Supervised Training                   |\n| Weak Label | 2478  | [0.6, 0.95] | Semi-supervised or noise training     |\n| Unlabel    | 9952  | /           | Unsupervised training or Pre-training |\n| In Total   | 22435 | /           | All above                             |\n\n### High Label Data\n\nWe classify the High Label data into 10 groups according to domain, speaking style, and scenario.\n\n| Domain      | YouTube | Podcast | Total  |\n|-------------|---------|---------|--------|\n| audiobook   | 0       | 250.9   | 250.9  |\n| commentary  | 112.6   | 135.7   | 248.3  |\n| documentary | 386.7   | 90.5    | 477.2  |\n| drama       | 4338.2  | 0       | 4338.2 |\n| interview   | 324.2   | 614     | 938.2  |\n| news        | 0       | 868     | 868    |\n| reading     | 0       | 1110.2  | 1110.2 |\n| talk        | 204     | 90.7    | 294.7  |\n| variety     | 603.3   | 224.5   | 827.8  |\n| others      | 144     | 507.5   | 651.5  |\n| Total       | 6113    | 3892    | 10005  |\n\nAs shown in the following table, we provide 3 training subsets, namely `S`, `M`, and `L`, for building ASR systems at different data scales.\n\n| Training Subsets | Confidence  | Hours |\n|------------------|-------------|-------|\n| L                | [0.95, 1.0] | 10005 |\n| M                | 1.0         | 1000  |\n| S                | 1.0         | 100   |\n\n### Evaluation Sets\n\n| Evaluation Sets | Hours | Source       | Description                                                                             
|\n|-----------------|-------|--------------|-----------------------------------------------------------------------------------------|\n| DEV             | 20    | Internet     | Specially designed for some speech tools which require cross-validation set in training |\n| TEST\\_NET       | 23    | Internet     | Match test                                                                              |\n| TEST\\_MEETING   | 15    | Real meeting | Mismatch test which is a far-field, conversational, spontaneous, and meeting dataset   |\n\n## Contributors\n\n\n| \u003ca href=\"http://lxie.npu-aslp.org\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/nwpu.png\" width=\"250px\"\u003e\u003c/a\u003e | \u003ca href=\"https://www.chumenwenwen.com\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/chumenwenwen.png\" width=\"250px\"\u003e\u003c/a\u003e | \u003ca href=\"http://www.aishelltech.com\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/aishelltech.png\" width=\"250px\"\u003e\u003c/a\u003e |\n| ---- | ---- | ---- |\n\n|\u003ca href=\"\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/tencent.png\" width=\"250px\"\u003e\u003c/a\u003e | \u003ca href=\"\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/MindSpore.png\" width=\"250px\"\u003e\u003c/a\u003e | \u003ca href=\"\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/xian.png\" width=\"250px\"\u003e\u003c/a\u003e |\n| ---- | ---- | ---- |\n\n\n\n## ACKNOWLEDGEMENTS\n\n* WenetSpeech draws heavily on the work of [GigaSpeech](https://github.com/SpeechColab/GigaSpeech), and we thank Jiayu Du and Guoguo Chen for their suggestions 
on this work.\n* We thank Tencent Ethereal Audio Lab and Xi'an Future AI Innovation Center for providing hosting services for WenetSpeech. We also thank [MindSpore](https://www.mindspore.cn/), a new deep learning computing framework, for supporting this work.\n* Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenet-e2e%2Fwenetspeech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwenet-e2e%2Fwenetspeech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenet-e2e%2Fwenetspeech/lists"}