{"id":21392241,"url":"https://github.com/akariasai/extractive_rc_by_runtime_mt","last_synced_at":"2025-07-13T18:31:09.749Z","repository":{"id":82562163,"uuid":"146079322","full_name":"AkariAsai/extractive_rc_by_runtime_mt","owner":"AkariAsai","description":"Code and datasets of \"Multilingual Extractive Reading Comprehension by Runtime Machine Translation\"","archived":false,"fork":false,"pushed_at":"2019-01-02T03:01:20.000Z","size":980,"stargazers_count":40,"open_issues_count":2,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-06-04T09:59:50.762Z","etag":null,"topics":["multilingual","natural-language-processing","nlp","nmt","pytorch","question-answering","reading-comprehension","squad"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AkariAsai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-25T08:31:51.000Z","updated_at":"2024-12-28T23:41:53.000Z","dependencies_parsed_at":"2023-03-06T16:15:21.157Z","dependency_job_id":null,"html_url":"https://github.com/AkariAsai/extractive_rc_by_runtime_mt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AkariAsai/extractive_rc_by_runtime_mt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2Fextractive_rc_by_runtime_mt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2Fextractive_rc_by_runtime_mt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/AkariAsai%2Fextractive_rc_by_runtime_mt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2Fextractive_rc_by_runtime_mt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AkariAsai","download_url":"https://codeload.github.com/AkariAsai/extractive_rc_by_runtime_mt/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2Fextractive_rc_by_runtime_mt/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265186504,"owners_count":23724682,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multilingual","natural-language-processing","nlp","nmt","pytorch","question-answering","reading-comprehension","squad"],"created_at":"2024-11-22T13:39:41.959Z","updated_at":"2025-07-13T18:31:09.743Z","avatar_url":"https://github.com/AkariAsai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multilingual Extractive Reading Comprehension by Runtime Machine Translation\nWe introduce the first extractive RC systems for  non-English languages, without using language-specific RC training data, but instead by using an English RC model and an attention-based Neural Machine Translation (NMT) model [1].  \n\n![The Overview](https://github.com/AkariAsai/extractive_rc_by_runtime_mt/blob/master/overview.png)\n\n## Contents\n1. [Code](#code)\n2. [Datasets](#datasets)\n3. [Benchmarks](#benchmarks)\n4. [Reference](#reference)\n5. 
[Contact](#contact)\n\n## Code\nWe implemented our NMT and extractive RC models (BiDAF, BiDAF + Self Attention + ELMo) in [PyTorch](https://pytorch.org/).  \n\n### Installation\nThe installation steps are as follows:\n\n```bash\ngit clone https://github.com/AkariAsai/extractive_rc_by_runtime_mt.git\ncd extractive_rc_by_runtime_mt\npip install -r requirements.txt\n```\n\nFor preprocessing, we used [StanfordCoreNLP](https://stanfordnlp.github.io/CoreNLP/) for English, [mosesdecoder's scripts](https://github.com/moses-smt/mosesdecoder) for French, and [MeCab](http://taku910.github.io/mecab/) for Japanese.   \nIn our implementation, we call StanfordCoreNLP through [py-corenlp](https://github.com/smilli/py-corenlp), the Moses tokenizer through [its original Perl script](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl), and MeCab through [mecab-python3](https://pypi.org/project/mecab-python3/).  \n\n##### For English\nFor py-corenlp, first make sure you have the Stanford CoreNLP server running. See [these instructions](https://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started).\n\n##### For Japanese\nFor mecab-python3, we use the [Neologism dictionary for MeCab](https://github.com/neologd/mecab-ipadic-neologd/) instead of the default Ochasen.   \nPlease install the dictionary beforehand by following the instructions on the [official page](https://github.com/neologd/mecab-ipadic-neologd/).\n\n##### For French\nYou can install mosesdecoder's tokenizer by running the commands below. \n```sh\nwget https://github.com/moses-smt/mosesdecoder/archive/master.zip\nunzip master.zip # files are to be extracted under extractive_rc_by_runtime_mt/mosesdecoder-master/\nrm master.zip\n```\nTo simplify setup, we plan to replace the tokenization code with [mosestokenizer](https://pypi.org/project/mosestokenizer/), a Python wrapper for the Moses tokenizer, sometime soon. 
\n\n### Train\nFor details on training the RC and NMT models, see [README.md](https://github.com/AkariAsai/extractive_rc_by_runtime_mt/blob/master/rc/README.md) under the `rc` directory and [README.md](https://github.com/AkariAsai/extractive_rc_by_runtime_mt/blob/master/nmt/README.md) under the `nmt` directory.\n\n\n### Evaluate\nTo run the evaluation on the multilingual SQuAD datasets, you first need to train your own RC and NMT models or use the pre-trained models.  \nYou can download the pre-trained models from [Google Drive](https://drive.google.com/drive/folders/1mqz_L5B4uiQ7TPu8F8ytHsIFeMGPtlW1?usp=sharing).\n\nFor example, you can evaluate on French SQuAD using the pre-trained models by following the instructions below.\n\n1. Create a `params` directory directly under the repository root (`extractive_rc_by_runtime_mt/`). \n2. Download the archives `fren_nmt_params.tar.gz` and `rc_params.tar.gz` into the `extractive_rc_by_runtime_mt/params` directory, \nand extract the necessary files.\n\n```sh\ntar -xzf params/fren_nmt_params.tar.gz \u0026\u0026 rm -f params/fren_nmt_params.tar.gz\ntar -xzf params/rc_params.tar.gz \u0026\u0026 rm -f params/rc_params.tar.gz\n```\n\n3. Run the evaluation with the command below. 
\n\n```sh\ncd rc\npython main.py evaluate_mlqa ../params/rc_params \\\n--evaluation-data-file https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-dev-v1.1.json \\\n--unziped_archive_directory ../params/rc_params --elmo \\\n--trans_embedding_model ../params/fren_nmt_params/embedding.bin  \\\n--trans_encdec_model  ../params/fren_nmt_params/encdec.bin \\\n--trans_train_source ../params/fren_nmt_params/train_fr_lower_1.txt \\\n--trans_train_target ../params/fren_nmt_params/train_en_lower_1.txt \\\n-l Fr -v5 --beam\n```\nFor details of the command-line options, see [rc/evaluate_mlqa.py](https://github.com/AkariAsai/extractive_rc_by_runtime_mt/blob/master/rc/evaluate_mlqa.py).\n\n## Datasets\nWe provide two datasets: (1) multilingual SQuAD datasets (Japanese, French) and (2) {Japanese, French}-to-English bilingual corpora used to train our NMT models for the extractive RC system.\n\n#### Multilingual SQuAD Datasets\nThe Japanese and French datasets were created by manually translating the original [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (v1.1) development dataset into Japanese and French.  \nEach dataset contains 327 questions.  
\nMore details can be found in Section 3.3 (SQuAD Test Data for Japanese and French) in [1].\n\n| | Multilingual SQuAD Datasets       |\n| ------------- |:-------------:|\n| Japanese    | [japanese_squad.json](data/squad_japanese_test.json) |\n| French | [french_squad.json](data/squad_french_test.json) |\n\n\n#### {Japanese, French}-to-English bilingual Corpora\n##### Wikipedia-based {Japanese, French}-to-English bilingual corpora\nTo train the NMT model for specific language directions, we take advantage of constantly growing web resources to automatically construct parallel corpora, rather than assuming the availability of high-quality parallel corpora of the target domain.  \nWe constructed bilingual corpora from Wikipedia articles, using their inter-language links and [hunalign](https://github.com/danielvarga/hunalign), a sentence-level aligner.  \nMore details can be found in Supplementary Material Section A (Details of Wikipedia-based Bilingual Corpus Creation) in [1].  \n\nTo download the preprocessed Wikipedia-based {Japanese, French}-to-English bilingual corpora, please run the commands below.  \n\n```sh\ncd datasets\nwget http://www.hal.t.u-tokyo.ac.jp/~asai/datasets/extractive_rc_by_runtime_mt/wiki_corpus.tar.gz\ntar -xvf wiki_corpus.tar.gz \u0026\u0026 rm -f wiki_corpus.tar.gz\ncd ..\n```\n\nThe downloaded Wikipedia-based corpora will be placed under the `datasets/wiki_corpus` folder.  \nThe training data includes 1,000,000 pairs, and\nthe development data includes 2,000 pairs.\n\n##### Manually translated question sentences\nIn our experiments, we also found that adding a small number of manually translated question sentences could further improve the extractive RC performance.   \nHere, we also provide the translated question sentences we actually used to train our NMT models.  
\nDetails of how these small parallel question datasets were created can be found in Supplementary Material Section C (Details of Manually Translated SQuAD Dataset Questions Creation) in [1].\n\n| | question sentences        |\n| ------------- |:-------------:|\n| Japanese     | [questions.ja](data/questions_jaen.ja), [questions.en](data/questions_jaen.en) |\n| French  | [questions.fr](data/questions_fren.fr), [questions.en](data/questions_jaen.en) |\n\n\n## Benchmarks\nWe provide the results of our proposed method on multilingual SQuAD datasets.\n- Japanese\n\n| methods|F1          | EM  |\n| ------------- |:-------------:| :-----:|\n| Our Method| **52.19** | **37.00** |\n| back-translation baseline| 42.60|24.77|\n\n- French\n\n| methods |F1          | EM  |\n| ------------- |:-------------:| :-----:|\n| Our Method | **61.88** | **40.67** |\n| back-translation baseline | 44.02 | 23.54|\n\n\n\n## Reference\nPlease cite [1] if you find the resources in this repository useful.\n\n[1] Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. \"[Multilingual Extractive Reading Comprehension by Runtime Machine Translation](https://arxiv.org/abs/1809.03275)\".\n\n## Contact\nPlease direct any questions to [Akari Asai](https://akariasai.github.io/) at akari-asai@g.ecc.u-tokyo.ac.jp.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakariasai%2Fextractive_rc_by_runtime_mt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakariasai%2Fextractive_rc_by_runtime_mt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakariasai%2Fextractive_rc_by_runtime_mt/lists"}