{"id":13678380,"url":"https://github.com/facebookresearch/DPR","last_synced_at":"2025-04-29T13:30:33.719Z","repository":{"id":38281049,"uuid":"263491619","full_name":"facebookresearch/DPR","owner":"facebookresearch","description":"Dense Passage Retriever - is a set of tools and models for open domain Q\u0026A task.","archived":true,"fork":false,"pushed_at":"2023-04-06T07:36:18.000Z","size":407,"stargazers_count":1702,"open_issues_count":36,"forks_count":299,"subscribers_count":23,"default_branch":"main","last_synced_at":"2024-09-26T21:43:44.991Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-05-13T01:13:13.000Z","updated_at":"2024-09-26T04:35:04.000Z","dependencies_parsed_at":"2022-07-12T17:22:55.836Z","dependency_job_id":"a2ad2946-c7cd-4586-9a34-f079c0341242","html_url":"https://github.com/facebookresearch/DPR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FDPR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FDPR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FDPR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FDPR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download
_url":"https://codeload.github.com/facebookresearch/DPR/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224173815,"owners_count":17268180,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:52.976Z","updated_at":"2024-11-11T20:31:29.512Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["SDKs \u0026 Libraries","Python","Libraries and Tools","Retrieval Methods"],"sub_categories":["2023","Dense Retrieval"],"readme":"# Dense Passage Retrieval\n\nDense Passage Retrieval (`DPR`) - is a set of tools and models for state-of-the-art open-domain Q\u0026A research.\nIt is based on the following paper:\n\nVladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. 
[Dense Passage Retrieval for Open-Domain Question Answering.](https://arxiv.org/abs/2004.04906) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.\n\nIf you find this work useful, please cite the following paper:\n\n```\n@inproceedings{karpukhin-etal-2020-dense,\n    title = \"Dense Passage Retrieval for Open-Domain Question Answering\",\n    author = \"Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)\",\n    month = nov,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.emnlp-main.550\",\n    doi = \"10.18653/v1/2020.emnlp-main.550\",\n    pages = \"6769--6781\",\n}\n```\n\nIf you're interested in reproducing the experimental results in the paper based on our model checkpoints (i.e., you don't want to train the encoders from scratch), you might consider using the [Pyserini toolkit](https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md), which has the experiments nicely packaged via `pip`.\nTheir toolkit also reports higher BM25 and hybrid scores.\n\n## Features\n1. Dense retriever model based on a bi-encoder architecture.\n2. Extractive Q\u0026A reader\u0026ranker joint model inspired by [this](https://arxiv.org/abs/1911.03868) paper.\n3. Related data pre- and post-processing tools.\n4. Dense retriever component for inference-time logic based on a FAISS index.\n\n## New (March 2021) release\nThe DPR codebase is upgraded with a number of enhancements and new models.\nMajor changes:\n1. [Hydra](https://hydra.cc/)-based configuration for all the command line tools except the data loader (to be converted soon)\n2. 
Pluggable data processing layer to support custom datasets\n3. New retrieval model checkpoint with better performance.\n\n## New (March 2021) retrieval model\nA new bi-encoder model trained on the NQ dataset only is now provided: a new checkpoint, training data, retrieval results and Wikipedia embeddings.\nIts new training data is a version of the original DPR NQ train set in which hard negatives are mined from the DPR index itself, using the previous NQ checkpoint.\nA bi-encoder model is trained from scratch using this new training data combined with our original NQ training data. This training scheme gives a nice retrieval performance boost.\n\nNew vs old top-k document retrieval accuracy on the NQ test set (3610 questions).\n\n| Top-k passages | Original DPR NQ model | New DPR model |\n| ------------- |:-------------:| -----:|\n| 1 | 45.87 | 52.47 |\n| 5 | 68.14 | 72.24 |\n| 20 | 79.97 | 81.33 |\n| 100 | 85.87 | 87.29 |\n\nDownloadable resource names for the new model (see how to use the download_data script below):\n\nCheckpoint: checkpoint.retriever.single-adv-hn.nq.bert-base-encoder\n\nNew training data: data.retriever.nq-adv-hn-train\n\nRetriever results for NQ test set: data.retriever_results.nq.single-adv-hn.test\n\nWikipedia embeddings: data.retriever_results.nq.single-adv-hn.wikipedia_passages\n\n\n## Installation\n\nInstallation from the source. Python virtual environments or Conda environments are recommended.\n\n```bash\ngit clone git@github.com:facebookresearch/DPR.git\ncd DPR\npip install .\n```\n\nDPR is tested on Python 3.6+ and PyTorch 1.2.0+.\nDPR relies on third-party libraries for encoder code implementations.\nIt currently supports Huggingface (version \u003c=3.1.0) BERT, Pytext BERT and Fairseq RoBERTa encoder models.\nDue to the generality of the tokenization process, DPR uses Huggingface tokenizers as of now. 
So Huggingface is the only required dependency, Pytext \u0026 Fairseq are optional.\nInstall them separately if you want to use those encoders.\n\n\n## Resources \u0026 Data formats\nFirst, you need to prepare data for either retriever or reader training.\nEach of the DPR components has its own input/output data formats. \nYou can see format descriptions below.\nDPR provides NQ \u0026 Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our dpr/data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.\n\n```bash\npython data/download_data.py \\\n\t--resource {key from download_data.py's RESOURCES_MAP}  \\\n\t[optional --output_dir {your location}]\n```\nThe resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.\n\n## Retriever input data format\nThe default data format of the Retriever training data is JSON.\nIt contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.\n\n```\n[\n  {\n\t\"question\": \"....\",\n\t\"answers\": [\"...\", \"...\", \"...\"],\n\t\"positive_ctxs\": [{\n\t\t\"title\": \"...\",\n\t\t\"text\": \"....\"\n\t}],\n\t\"negative_ctxs\": [\"...\"],\n\t\"hard_negative_ctxs\": [\"...\"]\n  },\n  ...\n]\n```\n\nElements' structure  for negative_ctxs \u0026 hard_negative_ctxs is exactly the same as for positive_ctxs.\nThe preprocessed data available for downloading also contains some extra attributes which may be useful for model modifications (like bm25 scores per passage). Still, they are not currently in use by DPR.\n\nYou can download prepared NQ dataset used in the paper by using 'data.retriever.nq' key prefix. Only dev \u0026 train subsets are available in this format.\nWe also provide question \u0026 answers only CSV data files for all train/dev/test splits. 
Those are used for the model evaluation since our NQ preprocessing step loses part of the original sample set.\nUse 'data.retriever.qas.*' resource keys to get respective sets for evaluation.\n\n```bash\npython data/download_data.py \\\n\t--resource data.retriever \\\n\t[optional --output_dir {your location}]\n```\n\n## DPR data formats and custom processing \nOne can use their own data format and custom data parsing \u0026 loading logic by inheriting from DPR's Dataset classes in dpr/data/{biencoder|retriever|reader}_data.py files and implementing load_data() and __getitem__() methods. See [DPR hydra configuration](https://github.com/facebookresearch/DPR/blob/master/conf/README.md) instructions.\n\n\n## Retriever training\nRetriever training quality depends on its effective batch size. The setup reported in the paper used 8 x 32GB GPUs.\nIn order to start training on one machine:\n```bash\npython train_dense_encoder.py \\\ntrain_datasets=[list of train datasets, comma separated without spaces] \\\ndev_datasets=[list of dev datasets, comma separated without spaces] \\\ntrain=biencoder_local \\\noutput_dir={path to checkpoints dir}\n```\n\nExample for the NQ dataset:\n\n```bash\npython train_dense_encoder.py \\\ntrain_datasets=[nq_train] \\\ndev_datasets=[nq_dev] \\\ntrain=biencoder_local \\\noutput_dir={path to checkpoints dir}\n```\n\nDPR uses HuggingFace BERT-base as the encoder by default. Other ready options include Fairseq's RoBERTa and Pytext BERT models.\nOne can select them by either changing encoder configuration files (conf/encoder/hf_bert.yaml) or providing a new configuration file in conf/encoder dir and enabling it with encoder={new file name} command line parameter. \n\nNotes:\n- If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. 
Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').\n- Validation and checkpoint saving happen according to the train.eval_per_epoch parameter value.\n- There is no stop condition besides the specified number of epochs to train (train.num_train_epochs configuration parameter).\n- Every evaluation saves a model checkpoint.\n- The best checkpoint is logged in the train process output.\n- Regular NLL classification loss validation for bi-encoder training can be replaced with average rank evaluation. It aggregates passage and question vectors from the input data passage pools, computes a large similarity matrix for those representations and then averages the rank of the gold passage for each question. We found that this metric correlates better with the final retrieval performance than NLL classification loss. Note, however, that this average rank validation works differently in DistributedDataParallel vs DataParallel PyTorch modes. See the train.val_av_rank_* set of parameters to enable this mode and modify its settings.\n\nSee the section 'Best hyperparameter settings' below for an e2e example of our best setups.\n\n## Retriever inference\n\nGenerating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. 
You might want to use multiple available GPU servers by running the script on each of them independently and specifying their own shards.\n\n```bash\npython generate_dense_embeddings.py \\\n\tmodel_file={path to biencoder checkpoint} \\\n\tctx_src={name of the passages resource, set to dpr_wiki to use our original wikipedia split} \\\n\tshard_id={shard_num, 0-based} num_shards={total number of shards} \\\n\tout_file={result files location + name PREFIX}\t\n```\n\nThe ctx_src parameter takes the name of the resource or just the source name from the conf/ctx_sources/default_sources.yaml file.\n\nNote: you can use a much larger batch size here compared to training mode. For example, setting batch_size 128 for a 2 GPU (16gb) server should work fine.\nYou can download already generated wikipedia embeddings from our original model (trained on NQ dataset) using resource key 'data.retriever_results.nq.single.wikipedia_passages'. \nThe embeddings resource name for the new, better model is 'data.retriever_results.nq.single-adv-hn.wikipedia_passages'.\n\nWe generally use the following params on 50 2-gpu nodes: batch_size=128 shard_id=0 num_shards=50\n\n\n\n## Retriever validation against the entire set of documents:\n\n```bash\n\npython dense_retriever.py \\\n\tmodel_file={path to a checkpoint downloaded from our download_data.py as 'checkpoint.retriever.single.nq.bert-base-encoder'} \\\n\tqa_dataset={the name of the test source} \\\n\tctx_datatsets=[{list of passage sources' names, comma separated without spaces}] \\\n\tencoded_ctx_files=[{list of encoded document files glob expression, comma separated without spaces}] \\\n\tout_file={path to output json file with results} \n\t\n```\n\nFor example, if you generated embeddings for two passage sets as ~/myproject/embeddings_passages1/wiki_passages_* and ~/myproject/embeddings_passages2/wiki_passages_* files and want to evaluate on the NQ dataset:\n\n```bash\npython dense_retriever.py \\\n\tmodel_file={path to a checkpoint file} \\\n\tqa_dataset=nq_test 
\\\n\tctx_datatsets=[dpr_wiki] \\\n\tencoded_ctx_files=[\\\"~/myproject/embeddings_passages1/wiki_passages_*\\\",\\\"~/myproject/embeddings_passages2/wiki_passages_*\\\"] \\\n\tout_file={path to output json file with results} \n```\n\n\nThe tool writes retrieved results for subsequent reader model training into the specified out_file.\nIt is a JSON file with the following format:\n\n```\n[\n    {\n        \"question\": \"...\",\n        \"answers\": [\"...\", \"...\", ... ],\n        \"ctxs\": [\n            {\n                \"id\": \"...\", # passage id from database tsv file\n                \"title\": \"\",\n                \"text\": \"....\",\n                \"score\": \"...\",  # retriever score\n                \"has_answer\": true|false\n            },\n            ...\n        ]\n    },\n    ...\n]\n```\nResults are sorted by their similarity score, from most relevant to least relevant.\n\nBy default, dense_retriever uses an exhaustive search process, but you can opt in to use lossy index types.\nWe provide HNSW and HNSW_SQ index options.\nEnable them with the indexer=hnsw or indexer=hnsw_sq command line arguments.\nNote that using these indexes may be useless from the research point of view since their fast retrieval process comes at the cost of much longer indexing time and higher RAM usage.\nThe similarity score provided is the dot product for the default case of exhaustive search (indexer=flat) and L2 distance in a modified representation space in case of HNSW index.\n\n\n## Reader model training\n```bash\npython train_extractive_reader.py \\\n\tencoder.sequence_length=350 \\\n\ttrain_files={path to the retriever train set results file} \\\n\tdev_files={path to the retriever dev set results file}  \\\n\toutput_dir={path to output dir}\n```\nDefault hyperparameters are set for a single node with 8 gpus setup.\nModify them as needed in the conf/train/extractive_reader_default.yaml and conf/extractive_reader_train_cfg.yaml configuration files or override specific parameters from the command line.\nThe first run will 
preprocess train_files \u0026 dev_files and convert them into a serialized set of .pkl files in the same location, and will use them on all subsequent runs.\n\nNotes:\n- If you want to use pytext bert or fairseq roberta, you will need to download pre-trained weights and specify encoder.pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').\n- The reader training pipeline does model validation every train.eval_step batches.\n- Like the bi-encoder, it saves model checkpoints on every validation.\n- Like the bi-encoder, there is no stop condition besides the specified number of epochs to train.\n- Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select that based on dev set validation performance which is logged in the train process output.\n- Our current code only calculates the Exact Match metric.\n\n## Reader model inference\n\nIn order to make an inference, run `train_extractive_reader.py` without specifying `train_files`. Make sure to specify `model_file` with the path to the checkpoint, `passages_per_question_predict` with the number of passages per question (used when saving the prediction file), and `eval_top_docs` with a list of top passages threshold values from which to choose the question's answer span (to be printed as logs). 
The example command line is as follows.\n\n```bash\npython train_extractive_reader.py \\\n  prediction_results_file={path to a file to write the results to} \\\n  eval_top_docs=[10,20,40,50,80,100] \\\n  dev_files={path to the retriever results file to evaluate} \\\n  model_file={path to the reader checkpoint} \\\n  train.dev_batch_size=80 \\\n  passages_per_question_predict=100 \\\n  encoder.sequence_length=350\n```\n\n## Distributed training\nUse Pytorch's distributed training launcher tool:\n\n```bash\npython -m torch.distributed.launch \\\n\t--nproc_per_node={WORLD_SIZE}  {non distributed script name \u0026 parameters}\n```\nNote:\n- All batch size related parameters are specified per gpu in distributed mode (DistributedDataParallel) and for all available gpus in DataParallel (single node - multi gpu) mode.\n\n## Best hyperparameter settings\n\ne2e example with the best settings for NQ dataset.\n\n### 1. Download all retriever training and validation data:\n\n```bash\npython data/download_data.py --resource data.wikipedia_split.psgs_w100\npython data/download_data.py --resource data.retriever.nq\npython data/download_data.py --resource data.retriever.qas.nq\n```\n\n### 2. Biencoder(Retriever) training in the single set mode.\n\nWe used distributed training mode on a single 8 GPU x 32 GB server\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=8 \\\ntrain_dense_encoder.py \\\ntrain=biencoder_nq \\\ntrain_datasets=[nq_train] \\\ndev_datasets=[nq_dev] \\\noutput_dir={your output dir}\n```\n\nNew model training combines two NQ datasets:\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=8 \\\ntrain_dense_encoder.py \\\ntrain=biencoder_nq \\\ntrain_datasets=[nq_train,nq_train_hn1] \\\ndev_datasets=[nq_dev] \\\noutput_dir={your output dir}\n```\n\nThis takes about a day to complete the training for 40 epochs. 
It switches to Average Rank validation on epoch 30, and the average rank should be around 25 or less at the end.\nThe best checkpoint for bi-encoder is usually the last, but it should not be so different if you take any after epoch ~ 25.\n\n### 3. Generate embeddings for Wikipedia.\nJust use the instructions in the \"Retriever inference\" section above. It takes about 40 minutes to produce representation vectors for 21 million passages on 50 2-GPU servers.\n\n### 4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.\n\n```bash\n\npython dense_retriever.py \\\n\tmodel_file={path to the best checkpoint or use our provided checkpoints (Resource names like checkpoint.retriever.*)  } \\\n\tqa_dataset=nq_test \\\n\tctx_datatsets=[dpr_wiki] \\\n\tencoded_ctx_files=[\"{glob expression for generated embedding files}\"] \\\n\tout_file={path to the output file}\n```\n\nAdjust batch_size based on the available number of GPUs; 64-128 should work for a 2 GPU server.\n\n### 5. Reader training\nWe trained the reader model for large datasets using a single 8 GPU x 32 GB server. All the default parameters are already set to our best NQ settings.\nPlease also download the data.gold_passages_info.nq_train \u0026 data.gold_passages_info.nq_dev resources for the NQ dataset - they are used for special NQ-only heuristics when preprocessing the data for the NQ reader training. 
If you have already run reader training on NQ data without gold_passages_src \u0026 gold_passages_src_dev specified, please delete the corresponding .pkl files so that they will be re-generated.\n\n```bash\npython train_extractive_reader.py \\\n\tencoder.sequence_length=350 \\\n\ttrain_files={path to the retriever train set results file} \\\n\tdev_files={path to the retriever dev set results file}  \\\n\tgold_passages_src={path to data.gold_passages_info.nq_train file} \\\n\tgold_passages_src_dev={path to data.gold_passages_info.nq_dev file} \\\n\toutput_dir={path to output dir}\n```\n\nWe found that using the learning rate above works best with a static schedule, so one needs to stop training manually based on evaluation performance dynamics.\nOur best results were achieved at 16-18 training epochs or after ~60k model updates.\n\nWe provide all input and intermediate results for the e2e pipeline for the NQ dataset and most of the similar resources for Trivia.\n\n## Misc.\n- TREC validation requires regexp based matching. We support only retriever validation in the regexp mode. See --match parameter option.\n- WebQ validation requires entity normalization, which is not included as of now.\n\n## License\nDPR is CC-BY-NC 4.0 licensed as of now.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FDPR","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FDPR","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FDPR/lists"}