{"id":13535063,"url":"https://github.com/sogou/SogouMRCToolkit","last_synced_at":"2025-04-02T00:32:24.204Z","repository":{"id":37676743,"uuid":"178673705","full_name":"sogou/SogouMRCToolkit","owner":"sogou","description":"This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.","archived":true,"fork":false,"pushed_at":"2020-12-17T02:45:58.000Z","size":242,"stargazers_count":745,"open_issues_count":15,"forks_count":165,"subscribers_count":38,"default_branch":"master","last_synced_at":"2024-08-11T16:09:18.428Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sogou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-31T10:35:42.000Z","updated_at":"2024-08-11T16:09:18.429Z","dependencies_parsed_at":"2022-09-13T02:52:16.808Z","dependency_job_id":null,"html_url":"https://github.com/sogou/SogouMRCToolkit","commit_stats":null,"previous_names":["sogou/smrctoolkit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sogou%2FSogouMRCToolkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sogou%2FSogouMRCToolkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sogou%2FSogouMRCToolkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sogou%2FSogouMRCToolkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sogou","download_url":"https://c
odeload.github.com/sogou/SogouMRCToolkit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222788514,"owners_count":17037777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T08:00:49.244Z","updated_at":"2024-11-02T23:30:28.773Z","avatar_url":"https://github.com/sogou.png","language":"Python","readme":"# Sogou Machine Reading Comprehension Toolkit\n## Introduction\nThe **Sogou Machine Reading Comprehension (SMRC)** toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.\n\n## Toolkit Architecture\n![architecture](./doc/architecture.png)\n\n## Installation\n```sh\n$ git clone https://github.com/sogou/SMRCToolkit.git\n$ cd SMRCToolkit\n$ pip install [-e] .\n```\nThe *-e* option makes your installation **editable**, i.e., it links the installed package to your source directory.\n\nThis repo was tested on Python 3 and TensorFlow 1.12.\n\n## Quick Start\nTo train a machine reading comprehension model, follow the steps below.\n\nFor SQuAD 1.0, you can download the dataset and pretrained embeddings with the following commands.\n```sh\n$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json\n$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\n$ wget https://nlp.stanford.edu/data/glove.840B.300d.zip # used in DrQA\n$ unzip glove.840B.300d.zip\n```\nPrepare the dataset reader and evaluator.\n```python\ntrain_file = data_folder + \"train-v1.1.json\"\ndev_file = data_folder + \"dev-v1.1.json\"\nreader = SquadReader()\ntrain_data = reader.read(train_file)\neval_data = reader.read(dev_file)\nevaluator = SquadEvaluator(dev_file)\n```\nBuild a vocabulary and load the pretrained embedding.\n```python\nvocab = Vocabulary(do_lowercase=False)\nvocab.build_vocab(train_data + eval_data, min_word_count=3, min_char_count=10)\nword_embedding = vocab.make_word_embedding(embedding_folder + \"glove.840B.300d.txt\")\n```\nUse the feature extractor, which is only necessary when using linguistic features.\n```python\nfeature_transformer = FeatureExtractor(features=['match_lemma', 'match_lower', 'pos', 'ner', 'context_tf'],\n    build_vocab_feature_names=set(['pos', 'ner']), word_counter=vocab.get_word_counter())\ntrain_data = feature_transformer.fit_transform(dataset=train_data)\neval_data = feature_transformer.transform(dataset=eval_data)\n```\nBuild batch generators for training and evaluation; the additional fields and feature vocabulary are necessary only when linguistic features are used.\n```python\ntrain_batch_generator = BatchGenerator(vocab, train_data, training=True, batch_size=32,\n    additional_fields=feature_transformer.features, feature_vocab=feature_transformer.vocab)\neval_batch_generator = BatchGenerator(vocab, eval_data, batch_size=32,\n    additional_fields=feature_transformer.features, feature_vocab=feature_transformer.vocab)\n```\nImport a built-in model, compile the training operation, and call functions such as `train_and_evaluate` for training and evaluation.\n```python\nmodel = DrQA(vocab, word_embedding, features=feature_transformer.features,\n    feature_vocab=feature_transformer.vocab)\nmodel.compile()\nmodel.train_and_evaluate(train_batch_generator, eval_batch_generator, evaluator, epochs=40, eposides=2)\n```\nComplete examples that run the built-in models on different datasets are provided in [examples](./examples/); you can check these for details. See also the [example of model saving and loading](./doc/model_save_load.md).\n\n## Modules\n1. `data`\n    - vocabulary.py: Vocabulary building and word/char index mapping\n    - batch_generator.py: Mapping words and tags to indices, padding variable-length features, transforming all of the features into tensors, and then batching them\n2. `dataset_reader`\n    - squad.py: Dataset reader and evaluator (from official code) for SQuAD 1.0\n    - squadv2.py: Dataset reader and evaluator (from official code) for SQuAD 2.0\n    - coqa.py: Dataset reader and evaluator (from official code) for CoQA\n    - cmrc.py: Dataset reader and evaluator (from official code) for CMRC\n3. `examples`\n    - Examples for running different models, where the specified data path must be provided to run the examples\n4. `model`\n    - Base class and subclasses of models, where any model should inherit the base class\n    - Built-in models such as BiDAF, DrQA, and FusionNet\n5. `nn`\n    - similarity_function.py: Similarity functions for attention, e.g., dot_product, trilinear, and symmetric_nolinear\n    - attention.py: Attention functions such as BiAttention, Trilinear, and Uni-attention\n    - ops: Common ops\n    - recurrent: Wrappers for LSTM and GRU\n    - layers: Layer base class and commonly used layers\n6. `utils`\n    - tokenizer.py: Tokenizers that can be used for both English and Chinese\n    - feature_extractor: Extracting linguistic features used in some papers, e.g., POS, NER, and lemma\n7. `libraries`\n    - BERT is included in this toolkit, using code from the [official source code](https://github.com/google-research/bert).\n\n## Custom Model and Dataset\n- Custom models can easily be added by following the [tutorial](./doc/build_custom_model.md).\n- A new dataset can easily be supported by implementing a custom dataset reader and evaluator.\n\n## Performance\n\n### F1/EM score on SQuAD 1.0 dev set\n| Model | toolkit implementation | original paper |\n| --- | --- | --- |\n| BiDAF | 77.3/67.7 | 77.3/67.7 |\n| BiDAF+ELMo | 81.0/72.1 | - |\n| IARNN-Word | 73.9/65.2 | - |\n| IARNN-hidden | 72.2/64.3 | - |\n| DrQA | 78.9/69.4 | 78.8/69.5 |\n| DrQA+ELMo | 83.1/74.4 | - |\n| R-Net | 79.3/70.8 | 79.5/71.1 |\n| BiDAF++ | 78.6/69.2 | -/- |\n| FusionNet | 81.0/72.0 | 82.5/74.1 |\n| QANet | 80.8/71.8 | 82.7/73.6 |\n| BERT-Base | 88.3/80.6 | 88.5/80.8 |\n\n### F1/EM score on SQuAD 2.0 dev set\n| Model | toolkit implementation | original paper |\n| --- | --- | --- |\n| BiDAF | 62.7/59.7 | 62.6/59.8 |\n| BiDAF++ | 64.3/61.8 | 64.8/61.9 |\n| BiDAF++ + ELMo | 67.6/64.8 | 67.6/65.1 |\n| BERT-Base | 75.9/73.0 | 75.1/72.0 |\n\n### F1 score on CoQA dev set\n| Model | toolkit implementation | original paper |\n| --- | --- | --- |\n| BiDAF++ | 71.7 | 69.2 |\n| BiDAF++ + ELMo | 74.5 | 69.2 |\n| BERT-Base | 78.6 | - |\n| BERT-Base+Answer Verification | 79.5 | - |\n\n## Contact Information\nFor help or issues using this toolkit, please submit a GitHub issue.\n\n## Citation\nIf you use this toolkit in your research, please cite it with the following BibTeX entry:\n```\n@ARTICLE{2019arXiv190311848W,\n       author = {{Wu}, Jindou and {Yang}, Yunlun and {Deng}, Chao and {Tang}, Hongyi and\n         {Wang}, Bingning and {Sun}, Haoze and {Yao}, Ting and {Zhang}, Qi},\n        title = \"{Sogou Machine Reading Comprehension Toolkit}\",\n      journal = {arXiv e-prints},\n     keywords = {Computer Science - Computation and Language},\n         year = \"2019\",\n        month = \"Mar\",\n          eid = {arXiv:1903.11848},\n        pages = {arXiv:1903.11848},\narchivePrefix = {arXiv},\n       eprint = {1903.11848},\n primaryClass = {cs.CL},\n       adsurl = {https://ui.adsabs.harvard.edu/\\#abs/2019arXiv190311848W},\n      adsnote = {Provided by the SAO/NASA Astrophysics Data System}\n}\n```\n## License\n[Apache-2.0](https://opensource.org/licenses/Apache-2.0)\n\n","funding_links":[],"categories":["BERT QA \u0026 RC task:","Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsogou%2FSogouMRCToolkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsogou%2FSogouMRCToolkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsogou%2FSogouMRCToolkit/lists"}