{"id":13754265,"url":"https://github.com/nlpdata/c3","last_synced_at":"2025-05-09T22:31:39.439Z","repository":{"id":175775251,"uuid":"228724677","full_name":"nlpdata/c3","owner":"nlpdata","description":"Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension","archived":false,"fork":false,"pushed_at":"2022-04-20T21:58:39.000Z","size":3172,"stargazers_count":164,"open_issues_count":0,"forks_count":23,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-11-16T07:33:16.400Z","etag":null,"topics":["dataset","dialogue","machine-reading-comprehension"],"latest_commit_sha":null,"homepage":"https://dataset.org/c3/","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlpdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-18T00:22:33.000Z","updated_at":"2024-11-12T11:38:23.000Z","dependencies_parsed_at":"2024-01-18T05:14:53.077Z","dependency_job_id":null,"html_url":"https://github.com/nlpdata/c3","commit_stats":null,"previous_names":["nlpdata/c3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpdata%2Fc3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpdata%2Fc3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpdata%2Fc3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpdata%2Fc3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlpdata","download_url":"https://codeload.github.com/nlpdata/c3/tar.g
z/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335719,"owners_count":21892718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","dialogue","machine-reading-comprehension"],"created_at":"2024-08-03T09:01:52.557Z","updated_at":"2025-05-09T22:31:36.709Z","avatar_url":"https://github.com/nlpdata.png","language":"Python","funding_links":[],"categories":["机器阅读理解","Python"],"sub_categories":["其他_文本生成、文本对话"],"readme":"C\u003csup\u003e3\u003c/sup\u003e\r\n=====\r\nOverview\r\n--------\r\nThis repository maintains **C\u003csup\u003e3\u003c/sup\u003e**, the first free-form multiple-**C**hoice **C**hinese machine reading **C**omprehension dataset.\r\n\r\n* Paper: https://arxiv.org/abs/1904.09679\r\n```\r\n@article{sun2019investigating,\r\n  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},\r\n  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},\r\n  journal={Transactions of the Association for Computational Linguistics},\r\n  year={2020},\r\n  url={https://arxiv.org/abs/1904.09679v3}\r\n}\r\n```\r\n\r\nFiles in this repository:\r\n\r\n* ```license.txt```: the license of C\u003csup\u003e3\u003c/sup\u003e.\r\n* ```data/c3-{m,d}-{train,dev,test}.json```: the dataset files, where m and d represent \"**m**ixed-genre\" and \"**d**ialogue\", respectively. 
The data format is as follows.\r\n```\r\n[\r\n  [\r\n    [\r\n      document 1\r\n    ],\r\n    [\r\n      {\r\n        \"question\": document 1 / question 1,\r\n        \"choice\": [\r\n          document 1 / question 1 / answer option 1,\r\n          document 1 / question 1 / answer option 2,\r\n          ...\r\n        ],\r\n        \"answer\": document 1 / question 1 / correct answer option\r\n      },\r\n      {\r\n        \"question\": document 1 / question 2,\r\n        \"choice\": [\r\n          document 1 / question 2 / answer option 1,\r\n          document 1 / question 2 / answer option 2,\r\n          ...\r\n        ],\r\n        \"answer\": document 1 / question 2 / correct answer option\r\n      },\r\n      ...\r\n    ],\r\n    document 1 / id\r\n  ],\r\n  [\r\n    [\r\n      document 2\r\n    ],\r\n    [\r\n      {\r\n        \"question\": document 2 / question 1,\r\n        \"choice\": [\r\n          document 2 / question 1 / answer option 1,\r\n          document 2 / question 1 / answer option 2,\r\n          ...\r\n        ],\r\n        \"answer\": document 2 / question 1 / correct answer option\r\n      },\r\n      {\r\n        \"question\": document 2 / question 2,\r\n        \"choice\": [\r\n          document 2 / question 2 / answer option 1,\r\n          document 2 / question 2 / answer option 2,\r\n          ...\r\n        ],\r\n        \"answer\": document 2 / question 2 / correct answer option\r\n      },\r\n      ...\r\n    ],\r\n    document 2 / id\r\n  ],\r\n  ...\r\n]\r\n```\r\n* ```annotation/c3-{m,d}-{dev,test}.txt```: question type annotations. Each file contains 150 annotated instances. 
We adopt the following abbreviations:\r\n\r\n\r\n\u003ctable\u003e\r\n  \u003ctr\u003e\r\n    \u003cth\u003e\u003c/th\u003e\r\n    \u003cth\u003eAbbreviation\u003c/th\u003e\r\n    \u003cth\u003eQuestion Type\u003c/th\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd rowspan=\"1\"\u003eMatching\u003c/td\u003e\r\n    \u003ctd\u003em\u003c/td\u003e\r\n    \u003ctd\u003eMatching\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd rowspan=\"10\"\u003ePrior knowledge\u003c/td\u003e\r\n    \u003ctd\u003el\u003c/td\u003e\r\n    \u003ctd\u003eLinguistic\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003es\u003c/td\u003e\r\n    \u003ctd\u003eDomain-specific\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-a\u003c/td\u003e\r\n    \u003ctd\u003eArithmetic\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-o\u003c/td\u003e\r\n    \u003ctd\u003eConnotation\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-e\u003c/td\u003e\r\n    \u003ctd\u003eCause-effect\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-i\u003c/td\u003e\r\n    \u003ctd\u003eImplication\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-p\u003c/td\u003e\r\n    \u003ctd\u003ePart-whole\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-d\u003c/td\u003e\r\n    \u003ctd\u003ePrecondition\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-h\u003c/td\u003e\r\n    \u003ctd\u003eScenario\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003ec-n\u003c/td\u003e\r\n    \u003ctd\u003eOther\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd rowspan=\"3\"\u003eSupporting Sentences\u003c/td\u003e\r\n    \u003ctd\u003e0\u003c/td\u003e\r\n    \u003ctd\u003eSingle Sentence\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n 
   \u003ctd\u003e1\u003c/td\u003e\r\n    \u003ctd\u003eMultiple sentences\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n  \u003ctr\u003e\r\n    \u003ctd\u003e2\u003c/td\u003e\r\n    \u003ctd\u003eIndependent\u003c/td\u003e\r\n  \u003c/tr\u003e\r\n\u003c/table\u003e\r\n\r\n\r\n* ```bert``` folder: code of Chinese BERT, BERT-wwm, and BERT-wwm-ext baselines. The code is derived from [this repository](https://github.com/nlpdata/mrc_bert_baseline). Below are detailed instructions on fine-tuning Chinese BERT on C\u003csup\u003e3\u003c/sup\u003e. \r\n  1. Download and unzip the pre-trained Chinese BERT from [here](https://github.com/google-research/bert), and set up the environment variable for BERT by ```export BERT_BASE_DIR=/PATH/TO/BERT/DIR```. \r\n  2. Copy the dataset folder ```data``` to ```bert/```.\r\n  3. In ```bert```, execute ```python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin```.\r\n  4. Execute ```python run_classifier.py --task_name c3 --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 2e-5 --num_train_epochs 8.0 --output_dir c3_finetuned --gradient_accumulation_steps 3```.\r\n  5. The resulting fine-tuned model, predictions, and evaluation results are stored in ```bert/c3_finetuned```.\r\n\r\n**Note**:\r\n  1. Fine-tuning Chinese BERT-wwm or BERT-wwm-ext follows the same steps except for downloading their pre-trained language models.\r\n  2. There is randomness in model training, so you may want to run multiple times to choose the best model based on development set performance. You may also want to set different seeds (specify ```--seed``` when executing ```run_classifier.py```).\r\n  3. 
Depending on your hardware (e.g., available GPU memory), you may need to adjust ```--gradient_accumulation_steps```.\r\n  4. The code has been tested with Python 3.6 and PyTorch 1.0.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpdata%2Fc3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlpdata%2Fc3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpdata%2Fc3/lists"}