{"id":24364555,"url":"https://github.com/ryi06/multiqa","last_synced_at":"2026-04-18T05:03:05.698Z","repository":{"id":90336334,"uuid":"127334375","full_name":"ryi06/MultiQA","owner":"ryi06","description":"DS-GA 1012 Course Project by Ren Yi and Dima Taji","archived":false,"fork":false,"pushed_at":"2021-10-08T15:36:10.000Z","size":7374,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-12-27T14:11:02.791Z","etag":null,"topics":["dependency-parsing","document-reader","drqa","quasar","question-answering","squad"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ryi06.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-29T18:53:20.000Z","updated_at":"2021-10-08T15:36:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"4bcc5399-5964-4ec9-8057-6b90e98b9028","html_url":"https://github.com/ryi06/MultiQA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ryi06/MultiQA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryi06%2FMultiQA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryi06%2FMultiQA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryi06%2FMultiQA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryi06%2FMultiQA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ryi06","
download_url":"https://codeload.github.com/ryi06/MultiQA/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryi06%2FMultiQA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31957158,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dependency-parsing","document-reader","drqa","quasar","question-answering","squad"],"created_at":"2025-01-18T23:54:15.501Z","updated_at":"2026-04-18T05:03:05.692Z","avatar_url":"https://github.com/ryi06.png","language":"Python","readme":"# DrQA\nThis is a PyTorch implementation of the DrQA system described in the ACL 2017 paper [Reading Wikipedia to Answer Open-Domain Questions](https://arxiv.org/abs/1704.00051).\n\n## Quick Links\n\n- [About](#machine-reading-at-scale)\n- [Demo](#quick-start-demo)\n- [Installation](#installing-drqa)\n- [Components](#drqa-components)\n\n## Machine Reading at Scale\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"70%\" src=\"img/drqa.png\" /\u003e\u003c/p\u003e\n\nDrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of \"machine reading at scale\" (MRS). 
In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).\n\nOur experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.\n\nNote that DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, **_DrQA can be straightforwardly applied to any collection of documents_**, as described in the retriever [README](scripts/retriever/README.md).\n\nThis repository includes code, data, and pre-trained models for processing and querying Wikipedia as described in the paper -- see [Trained Models and Data](#trained-models-and-data). We also list several different datasets for evaluation, see [QA Datasets](#qa-datasets). Note that this work is a refactored and more efficient version of the original code. Reproduction numbers are very similar but not exact.\n\n## Quick Start: Demo\n\n[Install](#installing-drqa) DrQA and [download](#trained-models-and-data) our models to start asking open-domain questions!\n\nRun `python scripts/pipeline/interactive.py` to drop into an interactive session. 
For each question, the top span and the Wikipedia paragraph it came from are returned.\n\n```\n\u003e\u003e\u003e process('What is question answering?')\n\nTop Predictions:\n+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+\n| Rank |                                                  Answer                                                  |        Doc         | Answer Score | Doc Score |\n+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+\n|  1   | a computer science discipline within the fields of information retrieval and natural language processing | Question answering |    1917.8    |   327.89  |\n+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+\n\nContexts:\n[ Doc = Question answering ]\nQuestion Answering (QA) is a computer science discipline within the fields of\ninformation retrieval and natural language processing (NLP), which is\nconcerned with building systems that automatically answer questions posed by\nhumans in a natural language.\n```\n\n```\n\u003e\u003e\u003e process('What is the answer to life, the universe, and everything?')\n\nTop Predictions:\n+------+--------+---------------------------------------------------+--------------+-----------+\n| Rank | Answer |                        Doc                        | Answer Score | Doc Score |\n+------+--------+---------------------------------------------------+--------------+-----------+\n|  1   |   42   | Phrases from The Hitchhiker's Guide to the Galaxy |    47242     |   141.26  |\n+------+--------+---------------------------------------------------+--------------+-----------+\n\nContexts:\n[ Doc = Phrases from The Hitchhiker's Guide to the Galaxy ]\nThe number 42 
and the phrase, \"Life, the universe, and everything\" have\nattained cult status on the Internet. \"Life, the universe, and everything\" is\na common name for the off-topic section of an Internet forum and the phrase is\ninvoked in similar ways to mean \"anything at all\". Many chatbots, when asked\nabout the meaning of life, will answer \"42\". Several online calculators are\nalso programmed with the Question. Google Calculator will give the result to\n\"the answer to life the universe and everything\" as 42, as will Wolfram's\nComputational Knowledge Engine. Similarly, DuckDuckGo also gives the result of\n\"the answer to the ultimate question of life, the universe and everything\" as\n42. In the online community Second Life, there is a section on a sim called\n\"42nd Life.\" It is devoted to this concept in the book series, and several\nattempts at recreating Milliways, the Restaurant at the End of the Universe, were made.\n```\n\n```\n\u003e\u003e\u003e process('Who was the winning pitcher in the 1956 World Series?')\n\nTop Predictions:\n+------+------------+------------------+--------------+-----------+\n| Rank |   Answer   |       Doc        | Answer Score | Doc Score |\n+------+------------+------------------+--------------+-----------+\n|  1   | Don Larsen | New York Yankees |  4.5059e+06  |   278.06  |\n+------+------------+------------------+--------------+-----------+\n\nContexts:\n[ Doc = New York Yankees ]\nIn 1954, the Yankees won over 100 games, but the Indians took the pennant with\nan AL record 111 wins; 1954 was famously referred to as \"The Year the Yankees\nLost the Pennant\". In , the Dodgers finally beat the Yankees in the World\nSeries, after five previous Series losses to them, but the Yankees came back\nstrong the next year. 
On October 8, 1956, in Game Five of the 1956 World\nSeries against the Dodgers, pitcher Don Larsen threw the only perfect game in\nWorld Series history, which remains the only perfect game in postseason play\nand was the only no-hitter of any kind to be pitched in postseason play until\nRoy Halladay pitched a no-hitter on October 6, 2010.\n```\n\nTry some of your own! Of course, DrQA might provide alternative facts, so enjoy the ride.\n\n## Installing DrQA\n\n_Setting up DrQA is easy!_\n\nDrQA requires Linux/OSX and Python 3.5 or higher. It also requires installing [PyTorch](http://pytorch.org/). Its other dependencies are listed in requirements.txt. CUDA is strongly recommended for speed, but not necessary.\n\nRun the following commands to clone the repository and install DrQA:\n\n```bash\ngit clone https://github.com/facebookresearch/DrQA.git\ncd DrQA; pip install -r requirements.txt; python setup.py develop\n```\n\nNote: requirements.txt includes a subset of all the possible required packages. Depending on what you want to run, you might need to install an extra package (e.g. spacy).\n\nIf you use the CoreNLPTokenizer or SpacyTokenizer you also need to download the Stanford CoreNLP jars and spaCy `en` model, respectively. 
If you use Stanford CoreNLP, have the jars in your java `CLASSPATH` environment variable, or set the path programmatically with:\n\n```python\nimport drqa.tokenizers\ndrqa.tokenizers.set_default('corenlp_classpath', '/your/corenlp/classpath/*')\n```\n\n**IMPORTANT: The default [tokenizer](#tokenizers) is CoreNLP so you will need that in your `CLASSPATH` to run the README examples.**\n\nEx: `export CLASSPATH=$CLASSPATH:/path/to/corenlp/download/*`.\n\nIf you do not already have a CoreNLP [download](https://stanfordnlp.github.io/CoreNLP/index.html#download) you can run:\n\n```bash\n./install_corenlp.sh\n```\n\nVerify that it runs:\n```python\nfrom drqa.tokenizers import CoreNLPTokenizer\ntok = CoreNLPTokenizer()\ntok.tokenize('hello world').words()  # Should complete immediately\n```\n\nFor convenience, the Document Reader, Retriever, and Pipeline modules will try to load default models if no model argument is given. See below for downloading these models.\n\n### Trained Models and Data\n\nTo download all provided trained models and data for Wikipedia question answering, run:\n\n```bash\n./download.sh\n```\n\n_Warning: this downloads a 7.5GB tarball (25GB untarred) and will take some time._\n\nThis stores the data in `data/` at the file paths specified in the various modules' defaults. 
This top-level directory can be modified by setting a `DRQA_DATA` environment variable to point to somewhere else.\n\nDefault directory structure (see [embeddings](scripts/reader/README.md#note-on-word-embeddings) for more info on additional downloads for training):\n```\nDrQA\n├── data (or $DRQA_DATA)\n    ├── datasets\n    │   ├── SQuAD-v1.1-\u003ctrain/dev\u003e.\u003ctxt/json\u003e\n    │   ├── WebQuestions-\u003ctrain/test\u003e.txt\n    │   ├── freebase-entities.txt\n    │   ├── CuratedTrec-\u003ctrain/test\u003e.txt\n    │   └── WikiMovies-\u003ctrain/test/entities\u003e.txt\n    ├── reader\n    │   ├── multitask.mdl\n    │   └── single.mdl\n    └── wikipedia\n        ├── docs.db\n        └── docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz\n```\n\nDefault model paths for the different modules can also be modified programmatically in the code, e.g.:\n\n```python\nimport drqa.reader\ndrqa.reader.set_default('model', '/path/to/model')\nreader = drqa.reader.Predictor()  # Default model loaded for prediction\n```\n\n#### Document Retriever\n\nTF-IDF model using Wikipedia (unigrams and bigrams, 2^24 bins, simple tokenization), evaluated on multiple datasets (test sets, dev set for SQuAD):\n\n| Model | SQuAD P@5 | CuratedTREC P@5 | WebQuestions P@5 | WikiMovies P@5 | Size |\n| :---: | :-------: | :-------------: | :--------------: | :------------: | :---: |\n| [TF-IDF model](https://s3.amazonaws.com/fair-data/drqa/docs-tfidf-ngram%3D2-hash%3D16777216-tokenizer%3Dsimple.npz.gz) | 78.0 | 87.6 | 75.0 | 69.8 | ~13GB |\n\n_P@5 here is defined as the % of questions for which the answer segment appears in one of the top 5 documents_.\n\n#### Document Reader\n\nModel trained only on SQuAD, evaluated in the SQuAD setting:\n\n| Model | SQuAD Dev EM | SQuAD Dev F1 | Size |\n| :---: | :-----------:| :----------: | :--: |\n| [Single model](https://s3.amazonaws.com/fair-data/drqa/single.mdl) | 69.4 | 78.9 | ~130MB |\n\nModel trained with distant supervision without 
NER/POS/lemma features, evaluated on multiple datasets (test sets, dev set for SQuAD) in the full Wikipedia setting:\n\n| Model | SQuAD EM | CuratedTREC EM | WebQuestions EM | WikiMovies EM | Size |\n| :---: | :------: | :------------: | :-------------: | :-----------: | :--: |\n| [Multitask model](https://s3.amazonaws.com/fair-data/drqa/multitask.mdl) | 29.5 | 27.2 | 18.5 | 36.9 | ~270MB |\n\n#### Wikipedia\n\nOur full-scale experiments were conducted on the 2016-12-21 dump of English Wikipedia. The dump was processed with the [WikiExtractor](https://github.com/attardi/wikiextractor) and filtered for internal disambiguation, list, index, and outline pages (pages that are typically just links). We store the documents in an sqlite database for which `drqa.retriever.DocDB` provides an interface.\n\n| Database | Num. Documents | Size |\n| :------: | :------------: | :-----------------: |\n| [Wikipedia](https://s3.amazonaws.com/fair-data/drqa/docs.db.gz) | 5,075,182 | ~13GB |\n\n#### QA Datasets\n\nThe datasets used for DrQA training and evaluation can be found here:\n\n- SQuAD: [train](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), [dev](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)\n- WebQuestions: [train](http://nlp.stanford.edu/static/software/sempre/release-emnlp2013/lib/data/webquestions/dataset_11/webquestions.examples.train.json.bz2), [test](http://nlp.stanford.edu/static/software/sempre/release-emnlp2013/lib/data/webquestions/dataset_11/webquestions.examples.test.json.bz2), [entities](https://s3.amazonaws.com/fair-data/drqa/freebase-entities.txt.gz)\n- WikiMovies: [train/test/entities](https://s3.amazonaws.com/fair-data/drqa/WikiMovies.tar.gz)\n(Rehosted in expected format from https://research.fb.com/downloads/babi/)\n- CuratedTrec: [train/test](https://s3.amazonaws.com/fair-data/drqa/CuratedTrec.tar.gz)\n(Rehosted in expected format from https://github.com/brmson/dataset-factoid-curated)\n\n##### Format A\n\nThe 
`retriever/eval.py`, `pipeline/eval.py`, and `distant/generate.py` scripts expect the datasets as a `.txt` file where each line is a JSON encoded QA pair, like so:\n\n```python\n'{\"question\": \"q1\", \"answer\": [\"a11\", ..., \"a1i\"]}'\n...\n'{\"question\": \"qN\", \"answer\": [\"aN1\", ..., \"aNi\"]}'\n```\n\nScripts to convert SQuAD and WebQuestions to this format are included in `scripts/convert`. This is automatically done in `download.sh`.\n\n##### Format B\n\nThe `reader` directory scripts expect the datasets as a `.json` file where the data is arranged like SQuAD:\n\n```\nfile.json\n├── \"data\"\n│   └── [i]\n│       ├── \"paragraphs\"\n│       │   └── [j]\n│       │       ├── \"context\": \"paragraph text\"\n│       │       └── \"qas\"\n│       │           └── [k]\n│       │               ├── \"answers\"\n│       │               │   └── [l]\n│       │               │       ├── \"answer_start\": N\n│       │               │       └── \"text\": \"answer\"\n│       │               ├── \"id\": \"\u003cuuid\u003e\"\n│       │               └── \"question\": \"paragraph question?\"\n│       └── \"title\": \"document id\"\n└── \"version\": 1.1\n```\n\n##### Entity lists\n\nSome datasets have (potentially large) candidate lists for selecting answers. For example, WikiMovies' answers are OMDb entries while WebQuestions is based on Freebase. If we have known candidates, we can impose that all predicted answers must be in this list by discarding any higher scoring spans that are not.\n\n## DrQA Components\n\n### Document Retriever\n\nDrQA is not tied to any specific type of retrieval system -- as long as it effectively narrows the search space and focuses on relevant documents.\n\nFollowing classical QA systems, we include an efficient (non-machine learning) document retrieval system based on sparse, TF-IDF weighted bag-of-word vectors. 
We use bags of hashed n-grams (here, unigrams and bigrams).\n\nTo see how to build your own such model on new documents, see the retriever [README](scripts/retriever/README.md).\n\nTo interactively query Wikipedia:\n\n```bash\npython scripts/retriever/interactive.py --model /path/to/model\n```\n\nIf `model` is left out our [default model](#document-retriever-1) will be used (assuming it was [downloaded](#installing-drqa)).\n\nTo evaluate the retriever accuracy (% match in top 5) on a dataset:\n\n```bash\npython scripts/retriever/eval.py /path/to/format/A/dataset.txt --model /path/to/model\n```\n\n### Document Reader\n\nDrQA's Document Reader is a multi-layer recurrent neural network machine comprehension model trained to do extractive question answering. That is, the model tries to find the answer to any question as a text span in one of the returned documents.\n\nThe Document Reader was inspired by, and primarily trained on, the [SQuAD](https://arxiv.org/abs/1606.05250) dataset. It can also be used standalone on such SQuAD-like tasks where a specific context is supplied with the question, the answer to which is contained in the context.\n\nTo see how to train the Document Reader on SQuAD, see the reader [README](scripts/reader/README.md).\n\nTo interactively ask questions about text with a trained model:\n\n```bash\npython scripts/reader/interactive.py --model /path/to/model\n```\n\nAgain, here `model` is optional; a [default model](#document-reader-1) will be used if it is left out.\n\nTo run model predictions on a dataset:\n\n```bash\npython scripts/reader/predict.py /path/to/format/B/dataset.json --model /path/to/model\n```\n\n### DrQA Pipeline\n\nThe full system is linked together in `drqa.pipeline.DrQA`.\n\nTo interactively ask questions using the full DrQA:\n\n```bash\npython scripts/pipeline/interactive.py\n```\n\nOptional arguments:\n```\n--reader-model    Path to trained Document Reader model.\n--retriever-model Path to Document Retriever model 
(tfidf).\n--doc-db          Path to Document DB.\n--tokenizer       String option specifying tokenizer type to use (e.g. 'corenlp').\n--candidate-file  List of candidates to restrict predictions to, one candidate per line.\n--no-cuda         Use CPU only.\n--gpu             Specify GPU device id to use.\n```\n\nTo run predictions on a dataset:\n\n```bash\npython scripts/pipeline/predict.py /path/to/format/A/dataset.txt\n```\n\nOptional arguments:\n```\n--out-dir             Directory to write prediction file to (\u003cdataset\u003e-\u003cmodel\u003e-pipeline.preds).\n--reader-model        Path to trained Document Reader model.\n--retriever-model     Path to Document Retriever model (tfidf).\n--doc-db              Path to Document DB.\n--embedding-file      Expand dictionary to use all pretrained embeddings in this file (e.g. all GloVe vectors to minimize UNKs at test time).\n--candidate-file      List of candidates to restrict predictions to, one candidate per line.\n--n-docs              Number of docs to retrieve per query.\n--top-n               Number of predictions to make per query.\n--tokenizer           String option specifying tokenizer type to use (e.g. 'corenlp').\n--no-cuda             Use CPU only.\n--gpu                 Specify GPU device id to use.\n--parallel            Use data parallel (split across GPU devices).\n--num-workers         Number of CPU processes (for tokenizing, etc).\n--batch-size          Document paragraph batching size (Reduce in case of GPU OOM).\n--predict-batch-size  Question batching size (Reduce in case of CPU OOM).\n```\n\n### Distant Supervision (DS)\n\nDrQA's performance improves significantly in the full Wikipedia setting when provided with distantly supervised data from additional datasets. Given question-answer pairs but no supporting context, we can use string matching heuristics to automatically associate paragraphs to these training examples.\n\n\u003eQuestion: What U.S. 
state’s motto is “Live free or Die”?\n\u003e\n\u003eAnswer: New Hampshire\n\u003e\n\u003eDS Document: Live Free or Die\n **“Live Free or Die”** is the official **motto** of the **U.S. state** of _**New Hampshire**_, adopted by the **state** in 1945. It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.\n\nThe `scripts/distant` directory contains code to generate and inspect such distantly supervised data. More information can be found in the distant supervision [README](scripts/distant/README.md).\n\n### Tokenizers\n\nWe provide a number of different tokenizer options for convenience. Each has its own pros/cons based on how many dependencies it requires, overhead for running it, speed, and performance. For our reported experiments we used CoreNLP (but results are all similar).\n\nAvailable tokenizers:\n- _CoreNLPTokenizer_: Uses [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (option: 'corenlp'). We used v3.7.0. 
Requires Java 8.\n- _SpacyTokenizer_: Uses [spaCy](https://spacy.io/) (option: 'spacy').\n- _RegexpTokenizer_: Custom regex-based PTB-style tokenizer (option: 'regexp').\n- _SimpleTokenizer_: Basic alpha-numeric/non-whitespace tokenizer (option: 'simple').\n\nSee the [list](drqa/tokenizers/__init__.py) of mappings between string option names and tokenizer classes.\n\n## Citation\n\nPlease cite the ACL paper if you use DrQA in your work:\n\n```\n@inproceedings{chen2017reading,\n  title={Reading {Wikipedia} to Answer Open-Domain Questions},\n  author={Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine},\n  booktitle={Association for Computational Linguistics (ACL)},\n  year={2017}\n}\n```\n\n## Connection with ParlAI\n\nThis implementation of the DrQA Document Reader is closely related to the one found in [ParlAI](https://github.com/facebookresearch/ParlAI). Here, however, the work is extended to interact with the Document Retriever in the open-domain setting. It is also somewhat more efficient to train and achieves slightly better performance given that the ParlAI API restrictions are lifted (e.g. with respect to preprocessing, answer spans, etc).\n\nWe plan to consolidate this model into the ParlAI interface as well, so that the reader can be interchangeably trained here or multitasked on many datasets with ParlAI.\n\n## License\nDrQA is BSD-licensed. We also provide an additional patent grant.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryi06%2Fmultiqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fryi06%2Fmultiqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryi06%2Fmultiqa/lists"}