{"id":28653895,"url":"https://github.com/tiger-ai-lab/longrag","last_synced_at":"2025-06-13T07:08:03.211Z","repository":{"id":245710280,"uuid":"816992650","full_name":"TIGER-AI-Lab/LongRAG","owner":"TIGER-AI-Lab","description":"Official repo for \"LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs\".","archived":false,"fork":false,"pushed_at":"2024-08-25T20:33:27.000Z","size":1464,"stargazers_count":154,"open_issues_count":2,"forks_count":14,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-25T22:18:33.985Z","etag":null,"topics":["llm","rag"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/LongRAG/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-18T19:49:24.000Z","updated_at":"2024-08-25T20:33:31.000Z","dependencies_parsed_at":"2024-06-23T16:27:05.282Z","dependency_job_id":"f403dbd7-3089-44de-b56a-a7f0c3b55f88","html_url":"https://github.com/TIGER-AI-Lab/LongRAG","commit_stats":null,"previous_names":["tiger-ai-lab/longrag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/LongRAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FLongRAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FLongRAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FLongRAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/TIGER-AI-Lab%2FLongRAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/LongRAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FLongRAG/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","rag"],"created_at":"2025-06-13T07:08:02.481Z","updated_at":"2025-06-13T07:08:03.192Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **LongRAG** \nThis repo contains the code for \"LongRAG: Enhancing Retrieval-Augmented Generation\nwith Long-context LLMs\". 
\u003cspan style=\"color: red;\"\u003eWe are still in the process of polishing our repo.\u003c/span\u003e\n\n\u003ca target=\"_blank\" href=\"https://arxiv.org/abs/2406.15319\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Paper-red?style=flat\u0026logo=arxiv\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://github.com/TIGER-AI-Lab/LongRAG\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Code-green?style=flat\u0026logo=github\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://tiger-ai-lab.github.io/LongRAG/\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🌐%20Website-blue?style=flat\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://huggingface.co/datasets/TIGER-Lab/LongRAG\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Dataset-red?style=flat\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://x.com/WenhuChen/status/1805278871786340644\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Tweet-blue?style=flat\u0026logo=twitter\"\u003e\u003c/a\u003e\n\u003cbr\u003e\n\n\n## **Table of Contents**\n- [Introduction](#introduction)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Corpus Preparation (Optional)](#corpus-preparation-optional)\n- [Long Retriever](#long-retriever)\n- [Long Reader](#long-reader)\n- [License](#license)\n- [Citation](#citation)\n\n\n## **Introduction**\nIn the traditional RAG framework, the basic retrieval units are normally short. Such a design forces the \nretriever to search over a large corpus to find the \"needle\" unit. In contrast, the reader only needs \nto extract answers from the short retrieved units. Such an imbalanced design, with a heavy retriever and a light reader,\ncan lead to sub-optimal performance. We propose a new framework, LongRAG, consisting of a \n\"long retriever\" and a \"long reader\". 
Our framework uses a 4K-token retrieval unit, which is 30x longer than before. \nOur study offers insights into the future roadmap for combining RAG with long-context LLMs.\n\n## **Installation**\n\nClone this repository and install the required packages:\n```bash\ngit clone https://github.com/TIGER-AI-Lab/LongRAG.git\ncd LongRAG\npip install -r requirements.txt\n```\n\n## **Quick Start**\nPlease go to the \"Long Reader\" section and follow the instructions. This produces the final predictions for 100 examples. \nThe output will be similar to our sample files in the ``exp/`` directory.\n\n## **Corpus Preparation (Optional)**\nThis is an optional step. You can use our processed corpus directly. We have released two versions of the retrieval corpus, one for \nNQ and one for HotpotQA, on Hugging Face.\n```python\nfrom datasets import load_dataset\ncorpus_nq = load_dataset(\"TIGER-Lab/LongRAG\", \"nq_corpus\")\ncorpus_hotpotqa = load_dataset(\"TIGER-Lab/LongRAG\", \"hotpot_qa_corpus\")\n```\n\nIf you are interested in how we built the corpus, read on.\n\n***Wikipedia raw data cleaning:***\nWe first clean the raw Wikipedia data following the standard process, using [WikiExtractor](https://github.com/attardi/wikiextractor),\na widely used Python script that extracts and cleans text from a Wikipedia database backup dump. Please ensure you use the required \nPython environment. A sample script is:\n```bash\nsh scripts/extract_and_clean_wiki_dump.sh\n```\n\n***Preprocess Wikipedia data:***\nAfter cleaning the raw Wikipedia data, run the following script to gather more information.\n```bash\nsh scripts/process_wiki_page.sh\n```\n+ ``dir_path``: The directory path of the cleaned Wikipedia dump, which is the output of the previous step.\n+ ``output_path_dir``: The output directory will contain several pickle files, each representing a dictionary for the Wikipedia page. 
\n  + ``degree.pickle``: The key is the Wikipedia page title, and the value is the number of hyperlinks.\n  + ``abs_adj.pickle``: The key is the Wikipedia page title, and the value is the linked page in the abstract paragraph.\n  + ``full_adj.pickle``: The key is the Wikipedia page title, and the value is the linked page in the entire page.\n  + ``doc_size.pickle``: The key is the Wikipedia page title, and the value is the number of tokens on that page.\n  + ``doc_dict.pickle``: The key is the Wikipedia page title, and the value is the text of the page.\n+ ``corpus_title_path``: The key is used to filter the NQ dataset. In the original DPR paper, certain \nWikipedia pages, such as list pages and disambiguation pages, were removed, reducing the total number of \nWikipedia pages from 5 million to 3 million. For a fair comparison, we also chose to exclude these pages. \n(For HotpotQA, we did not remove any pages, so the number of Wikipedia pages remains at 5 million.) \nYou can download DPR's titles from this [link](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz).\n\nWe have provided the processed Wikipedia data in our [Hugging Face repo](https://huggingface.co/datasets/TIGER-Lab/LongRAG).\nPlease check out the ``nq_wiki`` and ``hotpot_qa_wiki`` subsets for more information. You can easily derive these \npickle files from these two datasets.\n\n\n***Retrieval Corpus:*** By grouping multiple related documents, we can construct long \nretrieval units with more than 4K tokens. This design could also significantly reduce \nthe corpus size (number of retrieval units in the corpus). Then, the retriever’s task \nbecomes much easier. 
Additionally, long retrieval units improve \ninformation completeness, which helps avoid ambiguity or confusion.\n\n```bash\nsh scripts/group_documents.sh\n```\n+ ``processed_wiki_dir``: The output directory of the above step.\n+ ``mode``: ``abs`` is for the HotpotQA corpus, ``full`` is for the NQ corpus.\n+ ``output_dir``: The output directory. It will contain several pickle files, each representing a \ndictionary for the retrieval corpus. The most important one is ``group_text.pickle``, which maps the corpus ID to the \ncorpus text. For more details, please refer to our released corpus on Hugging Face.\n\n\n## **Long Retriever**\nWe use the open-source dense retrieval toolkit [Tevatron](https://github.com/texttron/tevatron) for all our retrieval experiments. \nThe base embedding model is bge-large-en-v1.5. We have provided a sample script; make sure to update the parameters with your \nown local dataset path. Additionally, our script uses 4 GPUs to encode the corpus to save time; please adjust this for your own \nuse case.\n```bash\nsh scripts/run_retrieve_tevatron.sh\n```\n\n## **Long Reader**\nWe select Gemini-1.5-Pro and GPT-4o as our long readers given their strong ability\nto handle long-context input. (We also plan to test other LLMs capable of handling long contexts in the future.)\n\nThe input of the reader is a concatenation of all the long retrieval units from the long retriever.\nWe have provided the input file in our Hugging Face repo.\n\n```bash\nmkdir -p exp/\nsh scripts/run_eval_qa.sh\n```\n+ ``test_data_name``: Test set name, ``nq`` (NQ) or ``hotpot_qa`` (HotpotQA).\n+ ``test_data_split``: For each test set, there are three splits: ``full``, ``subset_1000``, ``subset_100``. 
We suggest starting with ``subset_100`` for a \nquick start or debugging and using ``subset_1000`` to obtain relatively stable results.\n+ ``output_file_path``: The output file; here it is placed in the ``exp/`` directory.\n+ ``reader_model``: The long-context reader model. Our code currently supports ``GPT-4o``, ``GPT-4-Turbo``, ``Gemini-1.5-Pro``, and ``Claude-3-Opus``.\nPlease note that you need to update the related API key and API configuration in the code. For example, if you are using the GPT-4 series, you need to \nconfigure the code in ``utils/gpt_inference.py``; if you are using the Gemini series, you need to configure the code in ``utils/gemini_inference.py``. \nWe will continue to support more models in the future.\n\nThe output file contains one test case per row. The ``short_ans`` field is our final prediction.\n\n```json\n{\n    \"query_id\": \"383\", \n    \"question\": \"how many episodes of touching evil are there\", \n    \"answers\": [\"16\"], \n    \"long_ans\": \"16 episodes.\", \n    \"short_ans\": \"16\", \n    \"is_exact_match\": 1, \n    \"is_substring_match\": 1, \n    \"is_retrieval\": 1\n}\n```\nWe have provided some sample output files in our `exp/` directory. 
For example, ``exp/nq_gpt4o_100.json`` contains\nthe result of running:\n```bash\npython eval/eval_qa.py \\\n  --test_data_name \"nq\" \\\n  --test_data_split \"subset_100\" \\\n  --output_file_path \"./exp/nq_gpt4o_100.json\" \\\n  --reader_model \"GPT-4o\"\n```\nThe top-1 retrieval accuracy is 88%, and the exact match rate is 64%.\n\n## **License**\nPlease check the license of each dataset we use in our work.\n\n| Dataset Name | License Type         |\n|--------------|----------------------|\n| NQ           | Apache License 2.0   |\n| HotpotQA     | CC BY-SA 4.0 License |\n\n\n## **Citation**\n\nPlease cite our paper if you find our project useful:\n\n```\n@article{jiang2024longrag,\n  title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},\n  author={Ziyan Jiang and Xueguang Ma and Wenhu Chen},\n  journal={arXiv preprint arXiv:2406.15319},\n  year={2024},\n  url={https://arxiv.org/abs/2406.15319}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Flongrag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Flongrag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Flongrag/lists"}