{"id":13677928,"url":"https://github.com/PaddlePaddle/RocketQA","last_synced_at":"2025-04-29T12:32:30.897Z","repository":{"id":37306492,"uuid":"403829741","full_name":"PaddlePaddle/RocketQA","owner":"PaddlePaddle","description":"🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models. ","archived":false,"fork":false,"pushed_at":"2023-12-19T08:08:35.000Z","size":2663,"stargazers_count":766,"open_issues_count":66,"forks_count":128,"subscribers_count":19,"default_branch":"main","last_synced_at":"2024-11-06T17:57:44.231Z","etag":null,"topics":["dense-retrieval","information-retrieval","nlp","question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PaddlePaddle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-09-07T03:36:20.000Z","updated_at":"2024-10-31T11:31:37.000Z","dependencies_parsed_at":"2024-01-14T15:38:22.908Z","dependency_job_id":null,"html_url":"https://github.com/PaddlePaddle/RocketQA","commit_stats":{"total_commits":47,"total_committers":12,"mean_commits":"3.9166666666666665","dds":0.7872340425531915,"last_synced_commit":"e2bfcfcfa902ac6cef7f0d359606a9da05b795ac"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FRocketQA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FRocketQA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/PaddlePaddle%2FRocketQA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FRocketQA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PaddlePaddle","download_url":"https://codeload.github.com/PaddlePaddle/RocketQA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224173346,"owners_count":17268100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dense-retrieval","information-retrieval","nlp","question-answering"],"created_at":"2024-08-02T13:00:48.587Z","updated_at":"2024-11-11T20:30:41.195Z","avatar_url":"https://github.com/PaddlePaddle.png","language":"Python","readme":"\u003cp align=center\u003e \u003cimg src=\"https://github.com/PaddlePaddle/RocketQA/blob/main/RocketQA_title.png\" /\u003e \u003c/p\u003e\n\n\u003cdiv align=center\u003e\n  \n![](https://img.shields.io/badge/license-Apache%202-blue) ![](https://img.shields.io/badge/version-v1.0-green) ![](https://img.shields.io/badge/JupyterNotebook-Try%20%F0%9F%9A%80RocketQA%20Now!-orange) ![](https://img.shields.io/badge/requirements-up%20to%20date-brightgreen) ![](https://img.shields.io/badge/size-1.68MB-blue)\n  \n \u003c/div\u003e\n\nIn recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutting edge technologies, this repository provides an easy-to-use toolkit for running and fine-tuning the state-of-the-art dense retrievers, namely **🚀RocketQA**. 
This toolkit has the following advantages:\n\n\n* ***State-of-the-art***: 🚀RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets. The toolkit will continue to be updated with the [latest models](https://github.com/PaddlePaddle/RocketQA#news).\n* ***First-Chinese-model***: 🚀RocketQA provides the first open-source Chinese dense retrieval model, which is trained on millions of manually annotated examples from [DuReader](https://github.com/baidu/DuReader).\n* ***Easy-to-use***: By integrating this toolkit with [JINA](https://jina.ai/), 🚀RocketQA can help developers build an end-to-end retrieval and question answering system with a few lines of code. \u003cimg src=\"https://github.com/PaddlePaddle/RocketQA/blob/main/RocketQA_flow.png\" alt=\"\" align=center /\u003e  \n\n## News\n* 🎉 Nov 27, 2022: Our survey paper on dense retrieval, [Dense Text Retrieval based on Pretrained Language Models: A Survey](https://arxiv.org/pdf/2211.14876.pdf), was made publicly available. \n* Oct 8, 2022: [DuReader\u003csub\u003eretrieval\u003c/sub\u003e](https://arxiv.org/abs/2203.10232) was accepted by EMNLP 2022. [[data]](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval); The latest version of DuReader\u003csub\u003eretrieval\u003c/sub\u003e contains cross-lingual retrieval benchmarks. Stay tuned!\n* Apr 29, 2022: A **training function** was added to the RocketQA toolkit. The baseline models of **DuReader\u003csub\u003eretrieval\u003c/sub\u003e** (both cross encoder and dual encoder) are also available among the RocketQA models.\n* Mar 30, 2022: We released **DuReader\u003csub\u003eretrieval\u003c/sub\u003e**, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. 
[[paper]](https://arxiv.org/abs/2203.10232) [[data]](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval); The baseline of the **DuReader\u003csub\u003eretrieval\u003c/sub\u003e** [leaderboard](https://aistudio.baidu.com/aistudio/competition/detail/157/0/introduction) was also released. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/DuReader-Retrieval-Baseline)\n* Dec 3, 2021: The RocketQA dense retrieval toolkit was released, including the first Chinese dense retrieval model trained on DuReader. \n* Aug 26, 2021: [RocketQA v2](https://arxiv.org/pdf/2110.07367.pdf) was accepted by EMNLP 2021. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/RocketQAv2_EMNLP2021)\n* May 5, 2021: [PAIR](https://aclanthology.org/2021.findings-acl.191.pdf) was accepted by ACL 2021. [[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/PAIR_ACL2021)\n* Mar 11, 2021: [RocketQA v1](https://arxiv.org/pdf/2010.08191.pdf) was accepted by NAACL 2021. 
[[code/model]](https://github.com/PaddlePaddle/RocketQA/tree/main/research/RocketQA_NAACL2021)\n\n\n## Installation\n\nWe provide two installation methods: ***Python Installation Package*** and ***Docker Environment***.\n\n\n### Install with Python Package\nFirst, install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html).\n```bash\n# GPU version:\n$ pip install paddlepaddle-gpu\n\n# CPU version:\n$ pip install paddlepaddle\n```\n\nSecond, install the rocketqa package (latest version: 1.1.0):\n```bash\n$ pip install rocketqa\n```\n\nNOTE: this toolkit MUST run on Python 3.6+ with [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) 2.0+.\n\n### Install with Docker\n\n```bash\ndocker pull rocketqa/rocketqa\n\ndocker run -it docker.io/rocketqa/rocketqa bash\n```\n\n## Getting Started\n\nBy following the examples below, you can build and run your own search engine with a few lines of code. We also provide a [Playground](https://aistudio.baidu.com/aistudio/projectdetail/3225255?contributionType=1) in a Jupyter Notebook. Try 🚀RocketQA straight away in your browser!\n\n### Running with JINA\n[JINA](https://jina.ai/) is a cloud-native neural search framework to build SOTA and scalable deep learning search applications in minutes. 
Here is a simple example of building a search engine based on JINA and RocketQA.\n\n```bash\ncd examples/jina_example\npip3 install -r requirements.txt\n\n# Generate vector representations and build a library for your Documents\n# JINA will automatically start a web service for you\npython3 app.py index toy_data/test.tsv\n\n# Try some questions related to the indexed Documents\npython3 app.py query_cli\n```\nSee the [JINA example](https://github.com/PaddlePaddle/RocketQA/tree/main/examples/jina_example) for more details.\n\n### Running with FAISS\nWe also provide a simple example built on [Faiss](https://github.com/facebookresearch/faiss).\n```bash\ncd examples/faiss_example/\npip3 install -r requirements.txt\n\n# Generate vector representations and build a library for your Documents\npython3 index.py zh ../data/dureader.para test_index\n\n# Start a web service on http://localhost:8888/rocketqa\npython3 rocketqa_service.py zh ../data/dureader.para test_index\n\n# Try some questions related to the indexed Documents\npython3 query.py\n```\n\n\n## API\nYou can also easily integrate 🚀RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.\n\n### Load model\n\n#### [`rocketqa.available_models()`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/rocketqa.py#L17)\n\nReturns the names of the available RocketQA models. For more about the available models, see the comments in the linked code.\n\n#### [`rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/rocketqa.py#L52)\n\nReturns the model specified by the input parameter. It can initialize both dual encoder and cross encoder. 
By setting the `model` parameter, you can load either a RocketQA model returned by \"available_models()\" or your own checkpoint.\n\n### Dual encoder\nThe dual encoder returned by \"load_model()\" supports the following functions:\n\n#### [`model.encode_query(query: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/dual_encoder.py#L151)\n\nGiven a list of queries, returns their representation vectors encoded by the model.\n\n#### [`model.encode_para(para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/dual_encoder.py#L179)\n\nGiven a list of paragraphs and their corresponding titles (optional), returns their representation vectors encoded by the model.\n\n#### [`model.matching(query: List[str], para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/dual_encoder.py#L212)\n\nGiven a list of queries and paragraphs (and titles), returns their matching scores (the dot product between the two representation vectors). \n\n#### [`model.train(train_set: str, epoch: int, save_model_path: str, args)`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/dual_encoder.py#L247)\n\nGiven the arguments `train_set`, `epoch` and `save_model_path`, you can train your own dual encoder model or finetune our models. Other settings like `save_steps` and `learning_rate` can also be set in `args`. 
Please refer to examples/example.py for details.\n\n### Cross encoder\nThe cross encoder returned by \"load_model()\" supports the following functions:\n\n#### [`model.matching(query: List[str], para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/cross_encoder.py#L156)\n\nGiven a list of queries and paragraphs (and titles), returns their matching scores (the probability that the paragraph is the right answer to the query).\n  \n#### [`model.train(train_set: str, epoch: int, save_model_path: str, args)`](https://github.com/PaddlePaddle/RocketQA/blob/1746b938d659c7f8d0b9f960e3199dcbd945adac/rocketqa/encoder/cross_encoder.py#L193)\n\nGiven the arguments `train_set`, `epoch` and `save_model_path`, you can train your own cross encoder model or finetune our models. Other settings like `save_steps` and `learning_rate` can also be set in `args`. Please refer to examples/example.py for details.\n\n\n### Examples\n\nFollowing the examples below, you can retrieve the vector representations of your documents and connect 🚀RocketQA to your own tasks.  \n\n####  Run RocketQA Model\nTo run a RocketQA model, set the `model` parameter of `load_model()` to a RocketQA model name returned by `available_models()`.\n\n```python\nimport rocketqa\n\nquery_list = [\"trigeminal definition\"]\npara_list = [\n    \"Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. 
ADVERTISEMENT.\"]\n\n# init dual encoder\ndual_encoder = rocketqa.load_model(model=\"v1_marco_de\", use_cuda=True, device_id=0, batch_size=16)\n\n# encode query \u0026 para\nq_embs = dual_encoder.encode_query(query=query_list)\np_embs = dual_encoder.encode_para(para=para_list)\n# compute dot product of query representation and para representation\ndot_products = dual_encoder.matching(query=query_list, para=para_list)\n```\n\n####  Train Your Own Model\nTo train your own models, use the `train()` function with your dataset and parameters. Training data contains 4 columns: query, title, para, label (0 or 1), separated by \"\\t\". For details about the parameters and dataset, please refer to './examples/example.py'.\n\n```python\nimport rocketqa\n\n# init cross encoder, and set device and batch_size\ncross_encoder = rocketqa.load_model(model=\"zh_dureader_ce\", use_cuda=True, device_id=0, batch_size=32)\n\n# finetune cross encoder based on \"zh_dureader_ce\"\ncross_encoder.train('./examples/data/cross.train.tsv', 2, 'ce_models', save_steps=1000, learning_rate=1e-5, log_folder='log_ce')\n\n```\n  \n#### Run Your Own Model\nTo run your own model, set the `model` parameter of `load_model()` to the path of a JSON config file.\n\n```python\nimport rocketqa\n\n# init cross encoder\ncross_encoder = rocketqa.load_model(model=\"./examples/ce_models/config.json\", use_cuda=True, device_id=0, batch_size=16)\n\n# compute relevance of query and para\n# (query_list and para_list are lists of strings, as in the first example)\nrelevance = cross_encoder.matching(query=query_list, para=para_list)\n```\n\nThe config is a JSON file like this:\n```\n{\n    \"model_type\": \"cross_encoder\",\n    \"max_seq_len\": 384,\n    \"model_conf_path\": \"zh_config.json\",\n    \"model_vocab_path\": \"zh_vocab.txt\",\n    \"model_checkpoint_path\": ${YOUR_MODEL},\n    \"for_cn\": true,\n    \"share_parameter\": 0\n}\n```\nThe `examples` folder provides more details.\n\n\n## Citations\n\nIf you find RocketQA v1 models helpful, feel free to cite our publication [RocketQA: An Optimized 
Training Approach to Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/pdf/2010.08191.pdf)\n```\n@inproceedings{rocketqa_v1,\n    title=\"RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering\",\n    author=\"Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang\",\n    year=\"2021\",\n    booktitle = \"In Proceedings of NAACL\"\n}\n```\n\nIf you find PAIR models helpful, feel free to cite our publication [PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval](https://aclanthology.org/2021.findings-acl.191.pdf)\n```\n@inproceedings{rocketqa_pair,\n    title=\"PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval\",\n    author=\"Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen\",\n    year=\"2021\",\n    booktitle = \"In Proceedings of ACL Findings\"\n}\n```\n\nIf you find RocketQA v2 models helpful, feel free to cite our publication [RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking](https://arxiv.org/pdf/2110.07367.pdf)\n\n```\n@inproceedings{rocketqa_v2,\n    title=\"RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking\",\n    author=\"Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen\",\n    year=\"2021\",\n    booktitle = \"In Proceedings of EMNLP\"\n}\n```\n\nIf you find DuReader\u003csub\u003eretrieval\u003c/sub\u003e dataset helpful, feel free to cite our publication [DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine](https://arxiv.org/pdf/2203.10232.pdf)\n\n```\n@inproceedings{DuReader_retrieval,\n    title=\"DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine\",\n    
author=\"Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu and Haifeng Wang\",\n    booktitle = \"In Proceedings of EMNLP\",\n    year=\"2022\"\n}\n```\n\nIf you find our survey useful for your work, please cite the following paper [Dense Text Retrieval based on Pretrained Language Models: A Survey](https://arxiv.org/pdf/2211.14876.pdf)\n\n```\n@article{DRSurvey,\n    title={Dense Text Retrieval based on Pretrained Language Models: A Survey},\n    author={Wayne Xin Zhao, Jing Liu, Ruiyang Ren, Ji-Rong Wen},\n    year={2022},\n    journal={arXiv preprint arXiv:2211.14876}\n}\n```\n\n## License\nThis repository is provided under the [Apache-2.0 license](https://github.com/PaddlePaddle/RocketQA/blob/main/LICENSE).\n\n\n## Contact Information\nFor help or issues using RocketQA, please submit a GitHub issue.\n\n\nFor other communication or cooperation, please contact Jing Liu (liujing46@baidu.com) or scan the following QR Code.\n\n\u003cimg src=\"https://github.com/PaddlePaddle/RocketQA/blob/main/BaiduNLP-QRCode.png\" width = \"300\" height = \"300\" alt=\"\" align=center /\u003e\n\n","funding_links":[],"categories":["机器阅读理解","Python"],"sub_categories":["其他_文本生成、文本对话"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPaddlePaddle%2FRocketQA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPaddlePaddle%2FRocketQA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPaddlePaddle%2FRocketQA/lists"}