{"id":13709713,"url":"https://github.com/ymcui/Chinese-RC-Datasets","last_synced_at":"2025-05-06T18:32:29.670Z","repository":{"id":109115540,"uuid":"177698186","full_name":"ymcui/Chinese-RC-Datasets","owner":"ymcui","description":"Collections of Chinese reading comprehension datasets","archived":false,"fork":false,"pushed_at":"2019-12-19T03:34:21.000Z","size":18,"stargazers_count":214,"open_issues_count":0,"forks_count":27,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-10-28T19:29:44.245Z","etag":null,"topics":["question-answering","reading-comprehension"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ymcui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-26T02:13:41.000Z","updated_at":"2024-10-09T09:41:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"4b06e134-62b6-4ec9-a47b-7b4d6763e412","html_url":"https://github.com/ymcui/Chinese-RC-Datasets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FChinese-RC-Datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FChinese-RC-Datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FChinese-RC-Datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymcui%2FChinese-RC-Datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/ymcui","download_url":"https://codeload.github.com/ymcui/Chinese-RC-Datasets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224521325,"owners_count":17325216,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["question-answering","reading-comprehension"],"created_at":"2024-08-02T23:00:44.380Z","updated_at":"2024-11-13T20:30:16.038Z","avatar_url":"https://github.com/ymcui.png","language":null,"funding_links":[],"categories":["Repository \u0026 Toolkit"],"sub_categories":[],"readme":"# Chinese Machine Reading Comprehension Datasets\n\n**Note that this repository will be updated irregularly.**\n\n**If you find this repository helpful, please press the star button. Moreover, if you would like to use or repost the content in this repository, please credit the original author and include a link to the source.**\n\n## Content\n\n| Section | Description |\n|-|-|\n| [Chinese Reading Comprehension Datasets](#Chinese-Reading-Comprehension-Datasets) | Descriptions of public Chinese RC datasets |\n| [State-of-the-art Systems](#State-of-the-art-Systems) | State-of-the-art systems and results |\n| [Chinese Reading Comprehension Evaluations and Competitions](#Chinese-Reading-Comprehension-Evaluations-and-Competitions) | Introductions to Chinese RC competitions |\n\n\n## Chinese Reading Comprehension Datasets\nHere I list several Chinese reading comprehension datasets that are PUBLICLY available (with an accompanying technical report or paper). If I missed something, feel free to let me know. 
Unless otherwise indicated, the datasets are in simplified Chinese.\n\n| Dataset  | Genre | Query Type | Answer Type |  Document # | Query # | Download |\n| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |\n| People Daily \u0026 Children's Fairy Tale [1] | news \u0026 tale | Cloze | word | 28K | 100K | [link](https://github.com/ymcui/Chinese-Cloze-RC) |\n| WebQA [2] | Web | User log | entity | - | 42K | [link](http://paddlepaddle.bj.bcebos.com/dataset/webqa/WebQA.v1.0.zip) |\n| CMRC 2017 [3] | news | Cloze \u0026 Query | word | - | 364K | [link](https://github.com/ymcui/cmrc2017) | \n| DuReader [4] | Web | User log | free form | 1M | 200K | [link](https://github.com/baidu/DuReader) |\n| CMRC 2018 [5] | Wiki | Query | Span | - | 18K | [link](https://github.com/ymcui/cmrc2018) |\n| DRCD [6]\u003csup\u003e(traditional Chinese)\u003c/sup\u003e | Wiki | Query | Span | - | 34K | [link](https://github.com/DRCSolutionService/DRCD) |\n| C^3 [7] | mixed | Query | choice | 14K | 24K | [link](https://github.com/nlpdata/c3) |\n| CMRC 2019 [8] | Story | cloze | Sentence | 1K | 100K | [link](https://github.com/ymcui/cmrc2019) |\n| ChID [9] | varies | cloze | idiom | 580K | 729K | [link](https://github.com/zhengcj1/ChID-Dataset) | \n\n\u003e [1] (Cui et al., 2016) Consensus Attention-based Neural Networks for Chinese Reading Comprehension. In COLING 2016. https://aclanthology.info/papers/C16-1167/c16-1167\n\n\u003e [2] (Li et al., 2016) Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. In arXiv. https://arxiv.org/abs/1607.06275\n\n\u003e [3] (Cui et al., 2018) Dataset for the First Evaluation on Chinese Machine Reading Comprehension. In LREC 2018. http://www.lrec-conf.org/proceedings/lrec2018/summaries/32.html\n\n\u003e [4] (He et al., 2018) DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In ACL 2018 MRQA Workshop. 
https://aclanthology.info/papers/W18-2605/w18-2605\n\n\u003e [5] (Cui et al., 2018) A Span-Extraction Dataset for Chinese Machine Reading Comprehension. In arXiv. https://arxiv.org/abs/1810.07366\n\n\u003e [6] (Shao et al., 2018) DRCD: a Chinese Machine Reading Comprehension Dataset. In arXiv. https://arxiv.org/abs/1806.00920\n\n\u003e [7] (Sun et al., 2019) Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension. https://arxiv.org/abs/1904.09679\n\n\u003e [8] (Cui et al., 2019) https://github.com/ymcui/cmrc2019\n\n\u003e [9] (Zheng et al., 2019) ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. https://aclweb.org/anthology/papers/P/P19/P19-1075/\n\n\n## State-of-the-art Systems\nHere I list several state-of-the-art systems (published / unpublished) for these datasets. I have very likely missed some, so feel free to report new entries on the `Issues` tab.\n\n### People Daily \u0026 Children's Fairy Tale\n| System  | PD-DEV | PD-TEST | CFT-TEST-AUTO | CFT-TEST-HUMAN | Note |\n| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | \n| [SAW Reader (Zhang et al., 2018)](https://arxiv.org/pdf/1806.09103.pdf) | 72.8 | 75.1 | - | 43.8 | - |\n| [CAW Reader (Zhang et al., 2018)](https://link.springer.com/chapter/10.1007/978-3-319-99495-6_3)| 69.4 | 70.5 | - | 39.7 | - |\n| [CAS Reader (Cui et al., 2016)](https://aclanthology.info/papers/C16-1167/c16-1167) | 65.2 | 68.1 | 41.3 | 35.0 | - |\n| [AS Reader (Cui et al., 2016)](https://aclanthology.info/papers/C16-1167/c16-1167) | 64.1 | 67.2 | 40.9 | 33.1 | - | \n\n\n### CMRC 2017\nLeaderboard: https://hfl-rc.github.io/cmrc2017/leaderboard/\n\n#### Cloze Track \n| System  | DEV | TEST | Note |\n| :------ | :-----: | :-----: | :-----: |\n| 6ESTATES PTE LTD (ensemble) | 81.85 | 81.90 | - |\n| SJTU BCMI-NLP (ensemble) | 78.35 | 80.67 | - | \n| YunSiChuangZhi (ensemble) | 79.20 | 80.27 | - | \n| [SAW Reader (Zhang et al., 2018)](https://arxiv.org/pdf/1806.09103.pdf) | 78.95 | 
78.80 | - |\n| [CAW Reader (Zhang et al., 2018)](https://link.springer.com/chapter/10.1007/978-3-319-99495-6_3) | 77.95 | 78.50 | - |\n| [Word + Char + BPE-FRQ (Zhang et al., 2018)](https://arxiv.org/pdf/1811.02364.pdf) | 79.05 | 78.83 | - |\n\n#### User Query Track\n| System  | DEV | TEST | Note |\n| :------ | :-----: | :-----: | :-----: |\n| ECNU (ensemble) | 90.45 | 69.53 | - |\n| SXU-3 (single model) | 47.80 | 49.07 | - | \n| ZZU (single model) | 31.10 | 32.53 | - | \n\n\n### DuReader\nLeaderboard: http://ai.baidu.com/broad/leaderboard?dataset=dureader\n\n| System  | ROUGE-L | BLEU-4 | Note |\n| :------ | :-----: | :-----: | :-----: |\n| AliReader | 63.48 | 61.54 | - |\n| NI-Reader (ensemble) | 63.38 | 59.23 | - |\n| mrc_try_mingyan (single model) | 62.20 | 59.72 | - |\n| [(Yan et al., 2018)](https://arxiv.org/pdf/1811.11374.pdf) | 50.71 | 49.39 | - |\n| [(Li et al., 2018)](http://zhaohuilee.com/files/82.pdf) | 44.95 | 42.68 | - |\n| [(Wang et al., 2018)](https://arxiv.org/pdf/1805.02220.pdf) | 44.18 | 40.97 | - |\n| [(Xu et al., 2018)](https://www.matec-conferences.org/articles/matecconf/pdf/2018/91/matecconf_eitce2018_02047.pdf) | 39.60 | 34.76 | - |  \n| [Match-LSTM (He et al., 2018)](https://aclanthology.info/papers/W18-2605/w18-2605) | 39.2 | 31.9 | - |\n| [BiDAF (He et al., 2018)](https://aclanthology.info/papers/W18-2605/w18-2605) | 39.0 | 31.8 | - |\n\n\n### CMRC 2018\nLeaderboard: https://hfl-rc.github.io/cmrc2018/open_challenge/\n\n| System  | DEV-EM | DEV-F1 | TEST-EM | TEST-F1 | CHALLENGE-EM | CHALLENGE-F1 | Note |\n| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |\n| P-Reader (single model) | 59.894 | 81.499 | 65.189 | 84.386 | 15.079 | 39.583 | - |\n| GM-Reader (ensemble) | 58.931 | 80.069 | 64.045 | 83.046 | 15.675 | 37.315 | - |\n| MCA-Reader (ensemble) | 66.698 | 85.538 | 71.175 | 88.090 | 15.476 | 37.104 | - | \n| Z-Reader (single model) | 79.776 | 92.696 | 74.178 | 88.145 | 13.889 | 37.422 | - |\n| 
[SRC-\u003eDS(±) (Yang et al., 2019)](https://arxiv.org/abs/1904.06652) | 49.2 | 65.4 | - | - | - | - | - |\n\n\u003e More detailed results can be found in [CMRC 2018 Overview](https://arxiv.org/abs/1810.07366).\n\u003e Note that some of the submissions also use the development set for training.\n\n### DRCD\n| System  | DEV-EM | DEV-F1 | TEST-EM | TEST-F1 | Note |\n| :------ | :-----: | :-----: | :-----: | :-----: | :-----: |\n| [SRC + DS(±) (Yang et al., 2019)](https://arxiv.org/abs/1904.06652) | 55.4 | 67.7 | - | - | - |\n| r-net (single model) | - | - | 29.1 | 44.4 | - |\n\n\n### C^3\n| System  | DEV-1A | TEST-1A | DEV-1B | TEST-1B | DEV-2A | TEST-2A | DEV-2B | TEST-2B | Note |\n| :------ | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |\n| [BERT_CN (Sun et al., 2019)](https://arxiv.org/abs/1904.09679) | 63.0 | 62.6 | 62.3 | 62.1 | 36.7 | 26.2 | 34.7 | 31.3 | - |\n\n\n## Chinese Reading Comprehension Evaluations and Competitions\nAlongside the release of these datasets, several Chinese reading comprehension evaluation workshops and competitions have further accelerated research on this topic.\n\n\u003e 1. [The First Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2017)](https://hfl-rc.github.io/cmrc2017/)  \nHost: [CIPS-CL](http://www.cips-cl.org), [Joint Laboratory of HIT and iFLYTEK Research (HFL)](https://hfl-rc.github.io), [iFLYTEK Co. Ltd](http://www.iflytek.com)  \nCompetition Type: Cloze-style RC, User Query RC\n\n\u003e 2. [The Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018)](https://hfl-rc.github.io/cmrc2018/)  \nHost: [CIPS-CL](http://www.cips-cl.org), [Joint Laboratory of HIT and iFLYTEK Research (HFL)](https://hfl-rc.github.io), [iFLYTEK Co. Ltd](http://www.iflytek.com)  \nCompetition Type: Span-Extraction RC  \n\n\u003e 3. 
[2018 NLP Challenge on Machine Reading Comprehension](http://mrc2018.cipsc.org.cn/)  \nHost: [CCF](https://www.ccf.org.cn), [CIPSC](http://www.cipsc.org.cn), [Baidu Inc.](http://home.baidu.com)  \nCompetition Type: Open-Domain RC  \n\n\u003e 4. [CIPS-SOGOU QA Competition](http://task.www.sogou.com/cips-sogou_qa/)  \nHost: [CIPSC](http://www.cipsc.org.cn), [SOGOU](http://www.sogou.com)  \nCompetition Type: Factoid QA, Non-Factoid QA  \n\n\u003e 5. [The Third Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2019)](https://hfl-rc.github.io/cmrc2019/)  \nHost: [CIPS-CL](http://www.cips-cl.org), [Joint Laboratory of HIT and iFLYTEK Research (HFL)](https://hfl-rc.github.io), [iFLYTEK Co. Ltd](http://www.iflytek.com)  \nCompetition Type: Sentence Cloze  \n\n\u003e 6. [2019 NLP Language and Intelligence Challenge](http://lic2019.ccf.org.cn)  \nHost: [CCF](https://www.ccf.org.cn), [CIPSC](http://www.cipsc.org.cn), [Baidu Inc.](http://home.baidu.com)  \nCompetition Type: Open-Domain RC  \n\n\u003e 7. [Chinese Idiom Understanding Contest](https://biendata.com/competition/idiom/)  \nHost: [CCF](https://www.ccf.org.cn), Tsinghua University  \nCompetition Type: Cloze Test\n\n\n## Contact\nIf you run into any problems, please leave a message in the `GitHub Issues`.\n\n\n## Disclaimer\nAny subjective comments in this repository represent only the views of the owner (ymcui) and do not represent the claims of any organization.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fymcui%2FChinese-RC-Datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fymcui%2FChinese-RC-Datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fymcui%2FChinese-RC-Datasets/lists"}