https://github.com/freedomintelligence/rag-instruct

RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Host: GitHub
URL: https://github.com/freedomintelligence/rag-instruct
Owner: FreedomIntelligence
License: apache-2.0
Created: 2024-12-31T07:19:25.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-13T02:57:29.000Z (about 1 year ago)
Last Synced: 2025-04-13T03:40:35.885Z (about 1 year ago)
Language: Python
Size: 12.4 MB
Stars: 137
Watchers: 11
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions












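📃 [Paper](https://arxiv.org/abs/2501.00353) ｜ 🤗 [RAG-Instruct-Llama3-3B](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-3B) ｜ 🤗 [RAG-Instruct-Llama3-8B](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-8B) ｜ 📚 [RAG-Instruct Dataset](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct)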



## ⚡ Introduction

Hello! Welcome to the repository for [RAG-Instruct](https://arxiv.org/abs/2501.00353)!







**RAG-Instruct** is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets based on any source corpus, leveraging the following approaches:

- **Five RAG paradigms**, which represent diverse query-document relationships to enhance model generalization across tasks.

- **Instruction simulation**, which enriches instruction diversity and quality by utilizing the strengths of existing instruction datasets.

Using this approach, we constructed a 40K instruction dataset from Wikipedia, covering a wide range of RAG scenarios and tasks. 

Fine-tuning with RAG-Instruct improves RAG performance on every benchmark in the table below for both base models. For example, Llama3.1-8B rises from 65.6 to 79.3 on 2WIKI and from 56.7 to 77.2 on Pub.

| Model | WQA (acc) | PQA (acc) | TQA (acc) | OBQA (EM) | Pub (EM) | ARC (EM) | 2WIKI (acc) | HotP (acc) | MSQ (acc) | CFQA (EM) | PubMed (EM) |
|-------|-----------|-----------|-----------|-----------|----------|----------|-------------|------------|-----------|-----------|-------------|
| Llama3.2-3B | 58.7 | 61.8 | 69.7 | 77.0 | 55.0 | 66.8 | 55.6 | 40.2 | 13.2 | 46.8 | 70.3 |
| Llama3.1-8B | 59.5 | 60.8 | 73.4 | 82.0 | 56.7 | 77.1 | 65.6 | 45.6 | 18.7 | 56.5 | 73.9 |
| Llama3.2-3B + **RAG-Instruct** | 65.3 | 64.0 | 77.0 | 81.2 | 66.4 | 73.0 | 72.9 | 52.7 | 25.0 | 50.3 | 72.6 |
| Llama3.1-8B + **RAG-Instruct** | 69.7 | 68.4 | 79.3 | 84.8 | 77.2 | 79.9 | 79.3 | 56.4 | 30.3 | 57.8 | 77.0 |

We have open-sourced our models, data, and code in this repository.

## 💻 Model

- **Model Access**

| Model Name | Base LLMs | Link |
| -------------------------- | ------------ | ---------------------------------------------------------------------------- |
| **RAG-Instruct-Llama3-3B** | LLaMA-3.2-3B | [HF Link](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-3B) |
| **RAG-Instruct-Llama3-8B** | LLaMA-3.1-8B | [HF Link](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-8B) |

- **Deploy**

RAG-Instruct models can be used just like `Llama-3.1-8B-Instruct`. You can deploy them with tools like [vllm](https://github.com/vllm-project/vllm) or [Sglang](https://github.com/sgl-project/sglang), or run inference directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FreedomIntelligence/RAG-Instruct-Llama3-8B", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3-8B")

# Example input
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...
### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""

# Tokenize and prepare input
messages = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
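The example input above follows a simple layout: a `### Paragraph:` header, numbered passages, then an `### Instruction:` header and the question. A minimal sketch of a helper that builds such a prompt from a list of retrieved passages is shown below. The function name `build_rag_prompt` is hypothetical (it is not part of this repository), and the exact whitespace between sections is an assumption, so match it to whatever your data uses.

```python
def build_rag_prompt(paragraphs, instruction):
    """Format passages and a question in the '### Paragraph / ### Instruction' layout.

    Hypothetical helper for illustration; the separator between sections is an assumption.
    """
    # Number passages starting from 1, as in the example input: [1], [2], ...
    numbered = [f"[{i}] {text.strip()}" for i, text in enumerate(paragraphs, start=1)]
    lines = ["### Paragraph:", *numbered, "### Instruction:", instruction.strip()]
    return "\n".join(lines) + "\n"


prompt = build_rag_prompt(
    ["structure is at risk from new development...", "as Customs and Excise stores..."],
    "Which organization is currently using a building in Woolwich that holds historical importance?",
)
```

The resulting string can be passed as the `content` of the user message in the inference snippet.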

## 📚 Data

We’ve open-sourced a 40K instruction dataset for RAG. Download it here:

| Data | Description | Link |
| ------------------------ | ----------------------------------------------- | ------------------------------------------------------------------------ |
| RAG-Instruct (Wikipedia) | Diverse RAG instruction data based on Wikipedia | [Link](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct) |

## 🛠️ Data Construction

We provide scripts to **synthesize a diverse RAG instruction dataset**.

**1. Download Source Documents.**  

We use preprocessed passage data from DPR and embeddings generated with [Contriever-MSMARCO](https://github.com/facebookresearch/contriever):

- Download the preprocessed passage data:

  ```bash
  cd retrieval_lm
  wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
  # Decompress the archive (step 4 reads psgs_w100.tsv)
  gunzip psgs_w100.tsv.gz
  ```

- Download the generated embeddings:

  ```bash
  wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
  # Unpack the archive (step 4 expects the files under wikipedia_embeddings/)
  tar -xf wikipedia_embeddings.tar
  ```

**2. Prepare Exemplar Datasets.**  

We utilize several high-quality datasets as exemplars, including [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [WizardLM-70K](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V70K), [Lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), and [SlimOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).

To ensure high-quality data, we filtered and sampled these datasets using GPT-4o to extract **knowledge-intensive data** (Q). Using the exemplar data (Q), we retrieve source documents to construct (D*). Specifically, we match the exemplar instructions or questions with source documents by ranking their relevance. For convenience, we provide a processed dataset containing source documents and exemplar data across five RAG scenarios [here](data_gen/examplar_data/data.json).
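Matching an exemplar with source documents by relevance can be pictured as scoring every candidate document against the exemplar and keeping the top-ranked ones. The sketch below is illustrative only: it assumes the exemplar and documents have already been embedded as vectors and uses a plain dot product as the relevance score. The function names and toy vectors are hypothetical and do not come from this repository.

```python
def dot(u, v):
    """Dot-product relevance score between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the ids of the `top_k` documents with the highest score for the query.

    `doc_vecs` maps a document id to its embedding (toy values in this sketch).
    """
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy 2-dimensional embeddings, for illustration only.
toy_docs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
best = rank_documents([1.0, 0.2], toy_docs, top_k=2)
```

In practice the vectors would come from a dense retriever, and the ranking would run over an index rather than a Python dictionary.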

**3. Synthesize Data with Prompts.**  

Using the retrieved documents (D*) and exemplar data (Q), we synthesize new data points with tailored prompts to create diverse and high-quality instruction-following datasets.

```bash
cd data_gen
python generate_data.py \
    --data_path examplar_data/data.json \
    --max_workers 16 \
    --save_dir ./output_data/RAG-Instruct.json
```
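The `--max_workers` flag suggests that data points are synthesized concurrently. How `generate_data.py` does this internally is not described here, so the following is only a sketch of one common pattern: a thread pool that applies a per-item function to every exemplar. Both `synthesize_all` and the `synthesize_one` callback are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_all(items, synthesize_one, max_workers=16):
    """Apply `synthesize_one` to every item concurrently, preserving input order.

    `synthesize_one` is a placeholder for whatever produces one data point,
    for example a function that sends a prompt to an LLM API.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as `items`.
        return list(pool.map(synthesize_one, items))


# Stand-in callback so the sketch runs without any API access.
results = synthesize_all(["q1", "q2", "q3"], lambda q: q.upper(), max_workers=2)
```

Threads fit this kind of workload when each call mostly waits on network I/O.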

**4. Run Retriever.**  

Before training, we run retrieval on the synthesized RAG-Instruct dataset. For each data entry, we ensure that the retrieved documents include all source documents (D*) and supplement them with enough unrelated documents (D-) to reach a total of 10 documents.

We reuse the DPR passage data and [Contriever](https://github.com/facebookresearch/contriever) embeddings from step 1. To retrieve noisy documents (D-), use the following command:

```bash
cd retrieval_lm
python passage_retrieval.py \
    --model_name_or_path facebook/contriever-msmarco \
    --passages psgs_w100.tsv \
    --passages_embeddings "wikipedia_embeddings/*" \
    --input_name RAG_INSTRUCT_DATA_PATH \
    --output_dir YOUR_OUTPUT_FILE \
    --n_docs 250
```

`RAG_INSTRUCT_DATA_PATH` is the final location of the synthesized `RAG-Instruct.json` file. The input file must be in `json` or `jsonl` format. Each instance should include either a `question` or `instruction` field, which will be used as the query during retrieval. 
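As a rough illustration of the input format, the sketch below reads a `json` or `jsonl` file and picks the query text from each instance. The helper names are hypothetical, and checking `question` before `instruction` is an assumption made for this example, not documented behavior of `passage_retrieval.py`.

```python
import json


def load_instances(path):
    """Read a .jsonl file (one JSON object per line) or a .json file (a list of objects)."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def get_query(instance):
    """Return the text used as the retrieval query for one instance.

    Assumption: `question` takes precedence when both fields are present.
    """
    for key in ("question", "instruction"):
        if key in instance:
            return instance[key]
    raise KeyError("instance needs a 'question' or 'instruction' field")
```

A quick pass with `get_query` over the whole file before launching retrieval catches entries that lack both fields.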

Next, we sample (D-) from the documents ranked beyond the top 200 (ranks 201 to 250 when `--n_docs 250` is used) and combine them with (D*) to obtain the final training data.
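The assembly of the 10 documents per entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's script: the function name `build_training_docs` is hypothetical, documents are represented by ids, and shuffling the final list is an assumption made here so that source documents do not always appear first.

```python
import random


def build_training_docs(source_docs, ranked_docs, total=10, skip_top=200, seed=0):
    """Combine all source documents (D*) with sampled unrelated documents (D-).

    `ranked_docs` is the retriever output ordered from most to least relevant.
    D- is drawn only from entries ranked beyond `skip_top`.
    """
    rng = random.Random(seed)
    source_set = set(source_docs)
    # Candidates for D-: ranked beyond the top `skip_top` and not already in D*.
    candidates = [doc for doc in ranked_docs[skip_top:] if doc not in source_set]
    n_noise = max(0, total - len(source_docs))
    if n_noise > len(candidates):
        raise ValueError("not enough low-ranked documents to sample D- from")
    docs = list(source_docs) + rng.sample(candidates, n_noise)
    rng.shuffle(docs)  # Assumption: mix D* and D- so position carries no signal.
    return docs


ranked = [f"p{i}" for i in range(250)]  # Toy ranking of 250 passage ids.
docs = build_training_docs(["p3", "p7"], ranked)
```

With two source documents, the call above adds eight passages drawn from positions 201 to 250 of the toy ranking.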

## 🚀 Training

**Fine-tuning with RAG-Instruct**

You can fine-tune your model on the `RAG-Instruct` dataset to improve its RAG capabilities. Use the following command:

```bash
accelerate launch --config_file ./configs/sft.yaml \
    --num_processes 8 \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard train_rag_sft.py \
    --experiment_name RAG-Instruct-training \
    --model_path meta-llama/Llama-3.1-8B-Instruct \
    --data_path FreedomIntelligence/RAG-Instruct \
    --max_seq_len 4096 \
    --learning_rate 5e-6 \
    --train_bsz_per_gpu 2 \
    --gradient_accumulation_steps 16 \
    --output_dir ./ckpts \
    --log_dir ./train_logs \
    --n_epochs 3 \
    --gradient_checkpointing
```
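When adapting the command to a different number of GPUs, it helps to keep the effective batch size in mind. The arithmetic below is a sketch that assumes each of the 8 processes drives one GPU in plain data parallelism and that the dataset holds roughly 40,000 examples (the "40K" figure is approximate). The function names are hypothetical.

```python
import math


def effective_batch_size(num_processes, bsz_per_gpu, grad_accum_steps):
    """Number of examples that contribute to one optimizer step."""
    return num_processes * bsz_per_gpu * grad_accum_steps


def optimizer_steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch, counting a final partial batch as one step."""
    return math.ceil(num_examples / batch_size)


batch = effective_batch_size(num_processes=8, bsz_per_gpu=2, grad_accum_steps=16)
steps = optimizer_steps_per_epoch(num_examples=40_000, batch_size=batch)
print(batch, steps)  # 256 157
```

For example, halving `--num_processes` to 4 while doubling `--gradient_accumulation_steps` to 32 keeps the effective batch size at 256 under the same assumptions.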

## 🧐 Evaluation

1. First, install [Sglang](https://github.com/sgl-project/sglang). Then deploy the model you want to test with the following command:

```bash
log_num=0
model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the model you are deploying
port=21${log_num}35 # Evaluates to 21035 when log_num=0
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path $model_name --port $port --mem-fraction-static 0.8 --dp 1 --tp 1 > sglang${log_num}.log 2>&1 &
```

2. Wait for the model to be deployed. After deployment, you can run the following code for evaluation. 

```bash
model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the model you are deploying
# $port is the variable set in step 1, so run this in the same shell session
python eval/eval_sglang.py --model_name $model_name --input_file eval/data/eval_data.json --port $port --max_new_tokens 500
```

Here, we provide an evaluation example using the PopQA dataset in the file `eval/data/eval_data.json`. For other evaluation datasets, first retrieve documents with the retriever (see the retrieval command in step 4 of Data Construction), then run the script above.

3. After completing the evaluation, run the following code to stop the Sglang service and release GPU memory.

```bash

bash evaluation/kill_sglang_server.sh

```

The evaluation code above can be used to test most models supported by Sglang.
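Step 2 asks you to wait until the model is deployed. One way to script that wait is to poll the server port until it accepts TCP connections, as in the sketch below. The helper `wait_for_port` is hypothetical and uses only the Python standard library. An open port is only a rough readiness signal, so checking the `sglang${log_num}.log` file remains the more reliable confirmation.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Poll until `host:port` accepts a TCP connection or `timeout` seconds pass.

    Returns True if a connection succeeded, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With `log_num=0` the launch command uses port 21035, so a call such as `wait_for_port("127.0.0.1", 21035)` would block until the server starts listening or ten minutes pass.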

## 📖 Citation

```
@misc{liu2024raginstructboostingllmsdiverse,
      title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},
      author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
      year={2024},
      eprint={2501.00353},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00353},
}
```