https://github.com/freedomintelligence/rag-instruct
RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
- Host: GitHub
- URL: https://github.com/freedomintelligence/rag-instruct
- Owner: FreedomIntelligence
- License: apache-2.0
- Created: 2024-12-31T07:19:25.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-13T02:57:29.000Z (about 1 year ago)
- Last Synced: 2025-04-13T03:40:35.885Z (about 1 year ago)
- Language: Python
- Size: 12.4 MB
- Stars: 137
- Watchers: 11
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
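📃 [Paper](https://arxiv.org/abs/2501.00353) | 🤗 [RAG-Instruct-Llama3-3B](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-3B) | 🤗 [RAG-Instruct-Llama3-8B](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-8B) | 📚 [RAG-Instruct Dataset](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct)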
## ⚡ Introduction
Hello! Welcome to the repository for [RAG-Instruct](https://arxiv.org/abs/2501.00353)!
**RAG-Instruct** is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets based on any source corpus, leveraging the following approaches:
- **Five RAG paradigms**, which represent diverse query-document relationships to enhance model generalization across tasks.
- **Instruction simulation**, which enriches instruction diversity and quality by utilizing the strengths of existing instruction datasets.
Using this approach, we constructed a 40K instruction dataset from Wikipedia, covering a wide range of RAG scenarios and tasks.
RAG-Instruct improves the RAG performance of both base models on all eleven benchmarks below. For example, with Llama3.1-8B it raises accuracy from 59.5 to 69.7 on WQA and from 18.7 to 30.3 on MSQ.
| Model | WQA (acc) | PQA (acc) | TQA (acc) | OBQA (EM) | Pub (EM) | ARC (EM) | 2WIKI (acc) | HotP (acc) | MSQ (acc) | CFQA (EM) | PubMed (EM) |
|--------------------------------|-----------|-----------|-----------|-----------|----------|----------|-------------|------------|-----------|-----------|-------------|
| Llama3.2-3B | 58.7 | 61.8 | 69.7 | 77.0 | 55.0 | 66.8 | 55.6 | 40.2 | 13.2 | 46.8 | 70.3 |
| Llama3.1-8B | 59.5 | 60.8 | 73.4 | 82.0 | 56.7 | 77.1 | 65.6 | 45.6 | 18.7 | 56.5 | 73.9 |
| Llama3.2-3B + **RAG-Instruct** | 65.3 | 64.0 | 77.0 | 81.2 | 66.4 | 73.0 | 72.9 | 52.7 | 25.0 | 50.3 | 72.6 |
| Llama3.1-8B + **RAG-Instruct** | 69.7 | 68.4 | 79.3 | 84.8 | 77.2 | 79.9 | 79.3 | 56.4 | 30.3 | 57.8 | 77.0 |
We open-sourced our models, data, and code here.
## 💻 Model
- **Model Access**
| Model Name | Base LLMs | Link |
| -------------------------- | ------------ | ---------------------------------------------------------------------------- |
| **RAG-Instruct-Llama3-3B** | LLaMA-3.2-3B | [HF Link](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-3B) |
| **RAG-Instruct-Llama3-8B** | LLaMA-3.1-8B | [HF Link](https://huggingface.co/FreedomIntelligence/RAG-Instruct-Llama3-8B) |
- **Deploy**
RAG-Instruct models can be used just like `Llama-3.1-8B-Instruct`. You can deploy them with tools like [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang), or run inference directly:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FreedomIntelligence/RAG-Instruct-Llama3-8B", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3-8B")
# Example input
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...
### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""
# Tokenize and prepare input
messages = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate output
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
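The example input above follows a simple layout: numbered passages under `### Paragraph:` followed by the question under `### Instruction:`. A minimal sketch of a helper that assembles this layout from a list of retrieved passages is shown below. The function name `build_rag_prompt` is hypothetical and not part of the repository, and the exact whitespace is assumed from the example above.

```python
def build_rag_prompt(documents, instruction):
    """Format passages and a question in the '### Paragraph / ### Instruction' layout.

    Hypothetical helper: the layout is copied from the README example,
    and passages are numbered from 1 in the order given.
    """
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"### Paragraph:\n{numbered}\n### Instruction:\n{instruction}\n"


if __name__ == "__main__":
    passages = [
        "structure is at risk from new development...",
        "as Customs and Excise stores...",
    ]
    question = "Which organization is currently using a building in Woolwich?"
    print(build_rag_prompt(passages, question))
```

The returned string can be passed as the `content` of the user message in the inference example.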
## 📚 Data
We’ve open-sourced a 40K instruction dataset for RAG. Download it here:
| Data | Description | Link |
| -------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| RAG-Instruct (Wikipedia) | Diverse RAG instruction data based on Wikipedia | [Link](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct) |
## 🛠️ Data Construction
We provide scripts to **synthesize a diverse RAG instruction dataset**.
**1. Download Source Documents.**
We use preprocessed passage data from DPR and embeddings generated with [Contriever-MSMARCO](https://github.com/facebookresearch/contriever):
- Download the preprocessed passage data:
```bash
cd retrieval_lm
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
# Decompress so that step 4 can read psgs_w100.tsv
gunzip psgs_w100.tsv.gz
```
- Download the generated embeddings:
```bash
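wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
# Extract the archive; step 4 reads the files under wikipedia_embeddings/
tar -xf wikipedia_embeddings.tar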
```
**2. Prepare Exemplar Datasets.**
We utilize several high-quality datasets as exemplars, including [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [WizardLM-70K](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V70K), [Lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), and [SlimOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
To ensure high-quality data, we filtered and sampled these datasets using GPT-4o to extract **knowledge-intensive data** (Q). Using the exemplar data (Q), we retrieve source documents to construct (D*). Specifically, we match the exemplar instructions or questions with source documents by ranking their relevance. For convenience, we provide a processed dataset containing source documents and exemplar data across five RAG scenarios [here](data_gen/examplar_data/data.json).
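The relevance ranking described above can be illustrated with a small sketch. It is not the repository's matching code: it assumes that the query and the documents are already encoded as dense vectors and that relevance is scored by dot product, as is common for Contriever-style retrievers. The function name `rank_documents` and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the ids of the top_k documents by dot-product score (highest first).

    doc_vecs maps a document id to its embedding (assumed precomputed).
    """
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, for illustration only.
    query = [1.0, 0.0]
    docs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.5]}
    print(rank_documents(query, docs, top_k=2))
```

In this toy case the two highest-scoring documents would be candidates for (D*) for the exemplar query.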
**3. Synthesize Data with Prompts.**
Using the retrieved documents (D*) and exemplar data (Q), we synthesize new data points with tailored prompts to create diverse and high-quality instruction-following datasets.
```bash
cd data_gen
python generate_data.py \
--data_path examplar_data/data.json \
--max_workers 16 \
--save_dir ./output_data/RAG-Instruct.json
```
**4. Run Retriever**
Before training, we need to perform retrieval on the synthesized RAG-Instruct dataset. For each data entry, we ensure that the retrieved documents include all source documents (D*) and supplement them with enough unrelated documents (D-) to reach a total of 10 documents.
We use preprocessed passage data from DPR and embeddings generated with [Contriever](https://github.com/facebookresearch/contriever). To retrieve noisy documents (D-), use the following command:
```bash
cd retrieval_lm
python passage_retrieval.py \
--model_name_or_path facebook/contriever-msmarco \
--passages psgs_w100.tsv \
--passages_embeddings "wikipedia_embeddings/*" \
--input_name RAG_INSTRUCT_DATA_PATH \
--output_dir YOUR_OUTPUT_FILE \
--n_docs 250
```
`RAG_INSTRUCT_DATA_PATH` is the final location of the synthesized `RAG-Instruct.json` file. The input file must be in `json` or `jsonl` format. Each instance should include either a `question` or `instruction` field, which will be used as the query during retrieval.
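The input requirements above can be checked before running retrieval. The sketch below is a hypothetical pre-flight check, not the logic inside `passage_retrieval.py`. In particular, preferring `question` over `instruction` when both are present is an assumption.

```python
import json


def load_instances(path):
    """Read a .jsonl file (one object per line) or a .json file (a list of objects)."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def get_query(instance):
    """Return the text used as the retrieval query.

    Assumption: 'question' takes precedence over 'instruction'.
    """
    for key in ("question", "instruction"):
        if key in instance:
            return instance[key]
    raise KeyError("instance needs a 'question' or 'instruction' field")


if __name__ == "__main__":
    sample = {"instruction": "Who designed the building?", "documents": []}
    print(get_query(sample))
```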
Next, we sample documents ranked beyond the top 200 as (D-) and get the final training data.
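A minimal sketch of this sampling step follows, under stated assumptions: each passage is a dict with an `id` field, `ranked_docs` is the retriever output ordered by rank, and the noise passages are drawn uniformly at random. The function name `build_document_set` and the field name are hypothetical. The repository's own script may differ.

```python
import random


def build_document_set(source_docs, ranked_docs, total=10, skip_top=200, seed=0):
    """Combine source documents (D*) with noise documents (D-).

    D- is sampled from passages ranked beyond the top `skip_top` results,
    excluding any passage that is already a source document.
    """
    source_ids = {doc["id"] for doc in source_docs}
    candidates = [d for d in ranked_docs[skip_top:] if d["id"] not in source_ids]
    n_noise = max(total - len(source_docs), 0)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    noise = rng.sample(candidates, n_noise)
    return list(source_docs) + noise


if __name__ == "__main__":
    ranked = [{"id": i} for i in range(250)]  # stands in for --n_docs 250
    sources = [{"id": 3}, {"id": 17}]
    docs = build_document_set(sources, ranked)
    print(len(docs), [d["id"] for d in docs])
```

With `--n_docs 250` and a cutoff of 200, up to 50 candidates remain for (D-), which is enough to fill the 10-document budget.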
## 🚀 Training
**Fine-tuning with RAG-Instruct**
You can fine-tune your model on the `RAG-Instruct` dataset to improve its RAG capabilities. Use the following command:
```bash
accelerate launch --config_file ./configs/sft.yaml \
--num_processes 8 \
--num_machines 1 \
--machine_rank 0 \
--deepspeed_multinode_launcher standard train_rag_sft.py \
--experiment_name RAG-Instruct-training \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--data_path FreedomIntelligence/RAG-Instruct \
--max_seq_len 4096 \
--learning_rate 5e-6 \
--train_bsz_per_gpu 2 \
--gradient_accumulation_steps 16 \
--output_dir ./ckpts \
--log_dir ./train_logs \
--n_epochs 3 \
--gradient_checkpointing
```
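The flags above imply an effective batch size, which is worth recomputing when you change the GPU count. The arithmetic below is a sketch that assumes one process per GPU and standard gradient accumulation. The 40,000 figure is the approximate dataset size stated in this README, so the step count is only an estimate.

```python
import math

num_processes = 8                  # --num_processes
train_bsz_per_gpu = 2              # --train_bsz_per_gpu
gradient_accumulation_steps = 16   # --gradient_accumulation_steps

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = num_processes * train_bsz_per_gpu * gradient_accumulation_steps

# Approximate optimizer steps per epoch for a ~40K-example dataset.
approx_dataset_size = 40_000
steps_per_epoch = math.ceil(approx_dataset_size / effective_batch_size)

print(effective_batch_size, steps_per_epoch)
```

To keep the same effective batch size on fewer GPUs, increase `--gradient_accumulation_steps` in proportion. For example, 4 processes with 32 accumulation steps also gives 256.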
## 🧐 Evaluation
1. You first need to install [Sglang](https://github.com/sgl-project/sglang). After installation, deploy the model you want to test using Sglang with the following command:
```bash
log_num=0
model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the model you are deploying
port=21${log_num}35
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path $model_name --port $port --mem-fraction-static 0.8 --dp 1 --tp 1 > sglang${log_num}.log 2>&1 &
```
2. Wait for the model to be deployed. After deployment, you can run the following code for evaluation.
```bash
model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the model you are deploying
python eval/eval_sglang.py --model_name $model_name --input_file eval/data/eval_data.json --port $port --max_new_tokens 500
```
Here, we provide an evaluation example using the PopQA dataset in the file `eval/data/eval_data.json`. For other evaluation datasets, first retrieve documents with the retriever (see the "Run Retriever" step in the Data Construction section), and then use the script above for evaluation.
3. After completing the evaluation, run the following code to stop the Sglang service and release GPU memory.
```bash
bash evaluation/kill_sglang_server.sh
```
The evaluation code above can be used to test most models supported by Sglang.
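Because step 1 launches the server in the background, step 2 has to wait until it is reachable. A minimal sketch of such a wait is given below, using only the standard library. The helper name `wait_for_port` is hypothetical, and an open port only shows that the server process is listening, so it is an approximation of "the model is deployed".

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Server not listening yet; retry after a short pause.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # With log_num=0, the launch command above uses port 21035.
    ready = wait_for_port("127.0.0.1", 21035, timeout=10.0)
    print("server reachable" if ready else "server not reachable yet")
```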
## 📖 Citation
```
@misc{liu2024raginstructboostingllmsdiverse,
title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},
author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
year={2024},
eprint={2501.00353},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.00353},
}
```