{"id":15076511,"url":"https://github.com/NLPJCL/RAG-Retrieval","last_synced_at":"2025-09-25T23:31:41.553Z","repository":{"id":228076132,"uuid":"773093836","full_name":"NLPJCL/RAG-Retrieval","owner":"NLPJCL","description":"Unify Efficient Fine-tuning of  RAG Retrieval, including Embedding, ColBERT,Cross Encoder","archived":false,"fork":false,"pushed_at":"2024-06-19T17:23:21.000Z","size":2219,"stargazers_count":332,"open_issues_count":4,"forks_count":30,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-06-21T18:48:43.286Z","etag":null,"topics":["ai","llm","nlp","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NLPJCL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-16T18:19:04.000Z","updated_at":"2024-07-01T12:12:09.028Z","dependencies_parsed_at":"2024-03-19T17:27:21.613Z","dependency_job_id":"803abfd0-9ccc-4343-9b1e-5bee7b57d343","html_url":"https://github.com/NLPJCL/RAG-Retrieval","commit_stats":null,"previous_names":["nlpjcl/rag-retrieval"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NLPJCL%2FRAG-Retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NLPJCL%2FRAG-Retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NLPJCL%2FRAG-Retrieval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NLPJCL%2FRAG-Retrieval/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NLPJCL","download_url":"https://codeload.github.com/NLPJCL/RAG-Retrieval/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234268508,"owners_count":18805582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm","nlp","rag","retrieval-augmented-generation"],"created_at":"2024-09-25T03:58:11.288Z","updated_at":"2025-09-25T23:31:41.547Z","avatar_url":"https://github.com/NLPJCL.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eRAG-Retrieval\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/rag-retrieval/#description\"\u003e\n            \u003cimg alt=\"Build\" src=\"https://img.shields.io/pypi/v/rag-retrieval?color=brightgreen\"\u003e\n    \u003c/a\u003e\n\u003c!--     \u003ca href=\"https://www.pepy.tech/projects/rag-retrieval\"\u003e\n            \u003cimg alt=\"Build\" src=\"https://static.pepy.tech/personalized-badge/rag-retrieval?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=brightgreen\u0026left_text=downloads\"\u003e\n    \u003c/a\u003e --\u003e\n    \u003ca href=\"https://github.com/NLPJCL/RAG-Retrieval\"\u003e\n            \u003cimg alt=\"Build\" src=\"https://img.shields.io/badge/Contribution-Welcome-blue\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/NLPJCL/RAG-Retrieval/blob/master/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/badge/LICENSE-MIT-green\"\u003e\n    
\u003c/a\u003e\n\u003c/p\u003e\n\n[English](./README.md) | [中文](./README_zh.md)\n\nRAG-Retrieval offers end-to-end code for training, inference, and distillation of RAG retrieval models.\n- For training, **RAG-Retrieval supports fine-tuning of any open-source RAG retrieval model**, including embedding models (figure a; BERT-based, LLM-based), late-interaction models (figure d; ColBERT), and reranker models (figure c; BERT-based, LLM-based).\n- For inference, RAG-Retrieval focuses on rerankers and provides a lightweight Python library, [rag-retrieval](https://pypi.org/project/rag-retrieval/), **which offers a unified way to call different RAG ranking models.**\n- For distillation, **RAG-Retrieval supports distilling embedding and reranker models** from a larger model into a smaller one (e.g., a 0.5B LLM or BERT-base).\n\n![ColBERT](pictures/models.png)\n\n\n# Community\n\n[Join our WeChat group chat](https://www.notion.so/RAG-Retrieval-Roadmap-c817257e3e8a484b8850cac40a3fcf88)\n\n# News\n\n- 🔥 **22/05/2025**: RAG-Retrieval released Myopic Trap, an empirical study of positional bias across the full IR pipeline. We systematically evaluate a range of SOTA retrieval models, including BM25, dense embeddings, ColBERT-style models, and rerankers, on two carefully designed position-aware benchmarks: SQuAD-PosQ and FineWeb-PosQ. [Learn more](./examples/MyopicTrap/)\n\n- **29/12/2024**: RAG-Retrieval released the core training code (stage 3) of the Stella and Jasper embedding models. [Jasper and Stella: distillation of SOTA embedding models](https://arxiv.org/abs/2412.19048)\n\n- **21/10/2024**: RAG-Retrieval released two LLM-based methods for reranker tasks, as well as a method for distilling them into BERT. [Best Practices for LLM in Reranker Tasks? A Simple Experiment Report (with code)](https://zhuanlan.zhihu.com/p/987727357)\n\n- **05/06/2024**: Implemented MRL loss for the embedding model in RAG-Retrieval. 
[RAG-Retrieval: Making MRL Loss a Standard for Training Vector (Embedding) Models](https://zhuanlan.zhihu.com/p/701884479)\n\n- **02/06/2024**: RAG-Retrieval implements LLM preference-based supervised fine-tuning of the RAG retriever. [RAG-Retrieval Implements LLM Preference-Based Supervised Fine-Tuning of the RAG Retriever](https://zhuanlan.zhihu.com/p/701215443)\n\n- **05/05/2024**: Released a lightweight Python library for RAG-Retrieval. [RAG-Retrieval: Your RAG Application Deserves a better infer framework](https://zhuanlan.zhihu.com/p/692404995)\n\n- **18/03/2024**: Released RAG-Retrieval. [Introduction to RAG-Retrieval on Zhihu](https://zhuanlan.zhihu.com/p/683483778)\n\n\n\n# Features\n\n- **Simple yet Elegant**: Rejects complexity, with a simple and understandable code structure for easy modification.\n- **Supports end-to-end fine-tuning of RAG retrieval models**: Embedding models (BERT-based, LLM-based), late-interaction models (ColBERT), and reranker models (BERT-based, LLM-based).\n- **Supports fine-tuning of any open-source RAG retrieval model**: Compatible with most open-source embedding and reranker models, such as bge (bge-embedding, bge-m3, bge-reranker), bce (bce-embedding, bce-reranker), and gte (gte-embedding, gte-multilingual-reranker-base).\n- **Supports distilling larger models into smaller ones**: Enables the distillation of larger LLM-based reranker or embedding models into smaller ones (e.g., a 0.5B-parameter LLM or BERT-base).\n- **Advanced Algorithms**: For embedding models, supports the [MRL algorithm](https://arxiv.org/abs/2205.13147) to reduce the dimensionality of output vectors, as well as the [Stella distillation method](https://arxiv.org/abs/2412.19048).\n- **Multi-GPU training strategies**: Includes DeepSpeed and FSDP.\n\n\n# Quick Start\n\n## Installation\nFor training (all):\n```bash\nconda create -n rag-retrieval python=3.8 \u0026\u0026 conda activate rag-retrieval\n# To avoid incompatibility between the automatically installed torch and the local 
CUDA, it is recommended to manually install a compatible version of torch before proceeding to the next step.\npip install -r requirements.txt\n```\nFor prediction (reranker):\n```bash\n# To avoid incompatibility between the automatically installed torch and the local CUDA, it is recommended to manually install a compatible version of torch before proceeding to the next step.\npip install rag-retrieval\n```\n\n## Training\n\nFor each model type, go to the corresponding subdirectory, e.g. [embedding](https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding), and similarly for the others. Detailed procedures can be found in the README in each subdirectory.\n```bash\ncd ./rag_retrieval/train/embedding\nbash train_embedding.sh\n```\n\n## Inference\n\nRAG-Retrieval provides a lightweight Python library, [rag-retrieval](https://pypi.org/project/rag-retrieval/), which offers a unified interface for calling various RAG reranker models, with the following features:\n\n- Supports multiple ranking models: Compatible with common open-source ranking models (Cross-Encoder rerankers, decoder-only LLM rerankers).\n\n- Long-document friendly: Supports two different handling strategies for long documents (truncation to the maximum length, or splitting and taking the maximum score).\n\n- Easy to extend: To add a new ranking model, users only need to inherit from BaseReranker and implement the rank and compute_score functions.\n\n**For detailed usage and considerations of the rag-retrieval package, please refer to the [Tutorial](https://github.com/NLPJCL/RAG-Retrieval/blob/master/examples/Reranker_Tutorial.md)**\n\n\n\n# Experimental Results\n\n\n## Results of the reranker model on the MTEB Reranking task\n\n\n|      **Model**       |  **Model Size (GB)**  | **T2Reranking** | **MMarcoReranking** | **CMedQAv1** | **CMedQAv2** | **Avg** |\n|:-----------:|:----------:|:----------:|:-------------:|:--------------:|:---------------:| 
:---------------:|\n| bge-reranker-base | 1.11 | 67.28 | 35.46 | 81.27 | 84.10 | 67.03 |\n| bce-reranker-base_v1 | 1.11 | 70.25 | 34.13 | 79.64 | 81.31 | 66.33 |\n| rag-retrieval-reranker | 0.41 | 67.33 | 31.57 | 83.54 | 86.03 | 67.12 |\n\nHere, rag-retrieval-reranker was trained from the hfl/chinese-roberta-wwm-ext model with the RAG-Retrieval code, using the training data of the bge-reranker model.\n\n## Results of the ColBERT model on the MTEB Reranking task\n\n| **Model** | **Model Size (GB)** | **Dim** | **T2Reranking** | **MMarcoReranking** | **CMedQAv1** | **CMedQAv2** | **Avg** |\n|:-----------:|:----------:|:----------:|:----------:|:-------------:|:--------------:|:---------------:|:---------------:|\n| bge-m3-colbert | 2.24 | 1024 | 66.82 | 26.71 | 75.88 | 76.83 | 61.56 |\n| rag-retrieval-colbert | 0.41 | 1024 | 66.85 | 31.46 | 81.05 | 84.22 | 65.90 |\n\nHere, rag-retrieval-colbert was trained from the hfl/chinese-roberta-wwm-ext model with the RAG-Retrieval code, using the training data of the bge-reranker model.\n\n## Fine-tuning the open-source BGE series models on domain data\n\n| **Model** | **T2Reranking** | **Δ** |\n|:-----------:|:----------:|:----------:|\n| bge-v1.5-embedding | 66.49 |  |\n| bge-v1.5-embedding **finetune** | 67.15 | **+0.66** |\n| bge-m3-colbert | 66.82 |  |\n| bge-m3-colbert **finetune** | 67.22 | **+0.40** |\n| bge-reranker-base | 67.28 |  |\n| bge-reranker-base **finetune** | 67.57 | **+0.29** |\n\nRows marked **finetune** are the results of fine-tuning the corresponding open-source model with RAG-Retrieval, using the T2-Reranking training set as training data.\n\nIt is worth noting that the training set of the three open source models 
of bge already includes T2-Reranking, and that data is relatively general, so fine-tuning on it brings only a modest improvement. However, fine-tuning an open-source model on a vertical-domain dataset yields a much larger gain.\n\n\n# Citation\nIf you find this repository helpful, please cite our work:\n```bib\n@misc{zhang2025jasperstelladistillationsota,\n      title={Jasper and Stella: distillation of SOTA embedding models}, \n      author={Dun Zhang and Jiacheng Li and Ziyang Zeng and Fulong Wang},\n      year={2025},\n      eprint={2412.19048},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https://arxiv.org/abs/2412.19048}, \n}\n```\n\n# Acknowledgements\n\nDuring development, we borrowed from or built upon the implementations of the following projects. We sincerely appreciate these teams' contributions to open-source research and development.\n\n- [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)\n- [uniem](https://github.com/wangyuxinwhy/uniem)\n- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n- [rerankers](https://github.com/AnswerDotAI/rerankers)\n\n\n# Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=NovaSearch-Team/RAG-Retrieval\u0026type=Date)](https://star-history.com/#NovaSearch-Team/RAG-Retrieval\u0026Date)\n\n# License\nRAG-Retrieval is licensed under the [MIT License](https://github.com/NLPJCL/RAG-Retrieval/blob/master/LICENSE). \n\n\n\n","funding_links":[],"categories":["A01_文本生成_文本对话","NLP"],"sub_categories":["大语言对话模型及数据","3. Pretraining"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNLPJCL%2FRAG-Retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNLPJCL%2FRAG-Retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNLPJCL%2FRAG-Retrieval/lists"}