{"id":13754317,"url":"https://github.com/CLUEbenchmark/CLUECorpus2020","last_synced_at":"2025-05-09T22:31:51.703Z","repository":{"id":40530726,"uuid":"236152498","full_name":"CLUEbenchmark/CLUECorpus2020","owner":"CLUEbenchmark","description":"Large-scale Pre-training Corpus for Chinese 100G 中文预训练语料 ","archived":false,"fork":false,"pushed_at":"2022-10-17T03:51:05.000Z","size":315,"stargazers_count":938,"open_issues_count":10,"forks_count":81,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-02-22T12:29:52.667Z","etag":null,"topics":["albert","bert","chinese","chinese-corpus","corpus","datasets","nlp","pretrain","roberta"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2003.01355","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CLUEbenchmark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-25T09:58:34.000Z","updated_at":"2025-02-22T01:20:10.000Z","dependencies_parsed_at":"2023-01-19T22:15:50.022Z","dependency_job_id":null,"html_url":"https://github.com/CLUEbenchmark/CLUECorpus2020","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUECorpus2020","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUECorpus2020/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUECorpus2020/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUECorpus2020/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/CLUEbenchmark","download_url":"https://codeload.github.com/CLUEbenchmark/CLUECorpus2020/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335781,"owners_count":21892732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert","bert","chinese","chinese-corpus","corpus","datasets","nlp","pretrain","roberta"],"created_at":"2024-08-03T09:01:54.552Z","updated_at":"2025-05-09T22:31:46.689Z","avatar_url":"https://github.com/CLUEbenchmark.png","language":null,"funding_links":[],"categories":["NLP语料和数据集","开源数据集库"],"sub_categories":["大语言对话模型及数据","19. MiniMax大模型--MiniMax"],"readme":"# CLUECorpus2020\n\n## Corpus Overview\n\nCleaning the Chinese portion of \u003ca href='http://commoncrawl.org'\u003eCommon Crawl\u003c/a\u003e yielded 100 GB of high-quality Chinese pre-training corpus. The models produced in our experiments are available here: \u003ca href='https://github.com/CLUEbenchmark/CLUEPretrainedModels'\u003ehigh-quality Chinese pre-trained models: large, ultra-small, and similarity pre-trained models.\u003c/a\u003e \n\n**For more details, see our technical report: \u003ca href='https://arxiv.org/pdf/2003.01355'\u003ehttps://arxiv.org/pdf/2003.01355\u003c/a\u003e**\n\n![./pics/corpus.png](./pics/corpus.png)\n\n#### Data features:\n1. Can be used directly for pre-training, language modeling, or language generation tasks.\n2. 
Releases a small vocabulary dedicated to Simplified Chinese NLP tasks.\n\n## Vocabulary Overview\n\nStatistics for Google's original Chinese vocabulary and the small vocabulary we release:\n\n| Token Type | Google | CLUE |\n| :----:| :----: | :----: |\n| Simplified Chinese | 11378 | 5689 |\n| Traditional Chinese | 3264 | ✗ |\n| English | 3529 | 1320 |\n| Japanese | 573 | ✗ |\n| Korean | 84 | ✗ |\n| Emoji | 56 | ✗ |\n| Numbers | 1179 | 140 |\n| Special Tokens | 106 | 106 |\n| Other Tokens | 959 | 766 |\n| Total | 21128 | 8021 |\n\n## Experimental Results\n\nComparison of results on BERT-base using small datasets:\n\n| Model        | Vocab  | Data        | Steps | AFQMC  | TNEWS'  | IFLYTEK'  | CMNLI  |  AVG   |\n| :----:| :----: | :----: | :----: |:----: |:----: |:----: |:----: |:----: |\n| BERT-base    | Google | Wiki (1 GB) | 125K  | 69.93% | 54.77%  | 57.54%    | 75.64% | 64.47% |\n| BERT-base    | Google | C5 (1 GB)   | 125K  | 69.63% | 55.72%  | 58.87%    | 75.75% | 64.99% |\n| BERT-base    | CLUE   | C5 (1 GB)   | 125K  | 69.00% | 55.04%  | 59.07%    | 75.84% | 64.74% |\n| BERT-base mm | Google | C5 (1 GB)   | 125K  | 69.57% | 55.17%  | 59.69%    | 75.86% | 65.07% |\n| BERT-base    | Google | C5 (1 GB)   | 375K  | 69.85% | 55.97%  | 59.62%    | 76.41% | 65.46% |\n| BERT-base    | CLUE   | C5 (1 GB)   | 375K  | 69.93% | 56.38%  | 59.35%    | 76.58% | 65.56% |\n| BERT-base    | Google | C5 (3 GB)   | 375K  | 70.22% | 56.41%  | 59.58%    | 76.70% | 65.73% |\n| BERT-base    | CLUE   | C5 (3 GB)   | 375K  | 69.49% | 55.97%  | 60.12%    | 77.66% | 65.81% |\n\nFor more experimental results and analysis, see \u003ca href='https://github.com/CLUEbenchmark/CLUEPretrainedModels'\u003eCLUEPretrainedModels\u003c/a\u003e\n\n## Data Download\n\nHow to apply:\nEmail us a description of your research purpose and intended use of the corpus, your plan, and an introduction to your research institution and the applicant, together with a commitment not to share the corpus with third parties.\n\nEmail: CLUEbenchmark@163.com, with the subject line: CLUECorpus2020 200G语料库\n\n# CLUECorpusSmall (14G)\n\nSuitable for language modeling, pre-training, generative tasks, and similar uses. The corpus exceeds 14 GB and comprises nearly 4,000 well-defined txt files and 5 billion characters. Most of it comes from the \u003ca href=\"https://github.com/brightmart/nlp_chinese_corpus\"\u003enlp_chinese_corpus project\u003c/a\u003e\n\nThe corpus is currently processed into [pre-training format] and contains multiple folders; each folder holds many small files of no more than 4 MB each. The files follow the pre-training format: one sentence per line, with a blank line between documents.\n\nIt contains the following sub-corpora (14 GB in total):\n\n1. \u003ca 
href=\"https://pan.baidu.com/s/195M7H5w3N8shYlqCjVL0_Q\"\u003eNews corpus news2016zh_corpus\u003c/a\u003e: 8 GB of text, split into two parts (upper and lower), with 2,000 small files in total. Password: mzlk\n\n2. \u003ca href=\"https://pan.baidu.com/s/1Vk2PihMiZNmWvA2agPb1iA\"\u003eCommunity interaction corpus webText2019zh_corpus\u003c/a\u003e: 3 GB of text in more than 900 small files. Password: qvlq\n\n3. \u003ca href=\"https://pan.baidu.com/s/122sax9QujO8SUdV3jH5mTQ\"\u003eWikipedia corpus wiki2019zh_corpus\u003c/a\u003e: about 1.1 GB of text in about 300 small files. Password: xv7e\n\n4. \u003ca href=\"https://pan.baidu.com/s/18-ufaJJtf7ullzHMWXvhFw\"\u003eReview corpus comments2019zh_corpus\u003c/a\u003e: about 2.3 GB of text in 784 small files, including 547 Dianping review files and 227 Amazon review files. It merges several review datasets from \u003ca href=\"https://github.com/InsaneLife/ChineseNLPCorpus\"\u003eChineseNLPCorpus\u003c/a\u003e, which were cleaned, converted to the target format, and split into small files. Password: gc3m\n\n## Feedback and Support\n\nYou can open an issue or join the discussion group (QQ: 836811304),\n\nor send an email to CLUEbenchmark@163.com\n\nResearch supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)\n\n\n## Citation\n\n    @article{CLUECorpus2020,\n      title={CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model},\n      author={Liang Xu and Xuanwei Zhang and Qianqian Dong},\n      journal={ArXiv},\n      year={2020},\n      volume={abs/2003.01355}\n    }\n\n## Donations\n\nCLUE is an open-source organization dedicated to Chinese natural language processing. If you find our work helpful for your studies or business, we would appreciate your sponsorship, so that we can keep providing more useful open-source work. Let us work together to contribute to the development and progress of Chinese natural language processing.\n\n**Please note the donor's institution and name. Thank you very much!**\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth width=\"30%\"\u003eAlipay\u003c/th\u003e\n    \u003cth width=\"30%\"\u003eWeChat\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\u003c/tr\u003e\n  \u003ctr align=\"center\"\u003e\n    \u003ctd\u003e\u003cimg width=\"70%\" src=\"https://github.com/CLUEbenchmark/CLUECorpus2020/raw/master/pics/alipay.jpeg\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg width=\"48%\" src=\"https://github.com/CLUEbenchmark/CLUECorpus2020/raw/master/pics/wechat.jpeg\"\u003e\u003c/td\u003e\n  
\u003c/tr\u003e\n\u003c/table\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCLUEbenchmark%2FCLUECorpus2020","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCLUEbenchmark%2FCLUECorpus2020","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCLUEbenchmark%2FCLUECorpus2020/lists"}