https://github.com/IAAR-Shanghai/SafeRAG
- Host: GitHub
- URL: https://github.com/IAAR-Shanghai/SafeRAG
- Owner: IAAR-Shanghai
- Created: 2025-01-27T13:38:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T03:50:27.000Z (about 1 year ago)
- Last Synced: 2025-04-06T20:36:15.322Z (about 1 year ago)
- Language: Python
- Size: 18.9 MB
- Stars: 26
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - IAAR-Shanghai/SafeRAG - A project developed by IAAR (Institute for Advanced Algorithms Research, Shanghai) to improve the security of retrieval-augmented generation (RAG) systems. It aims to address security risks in RAG systems such as prompt injection, adversarial examples, and information leakage. SafeRAG strengthens RAG systems against malicious attacks through a set of security strategies and defense mechanisms, and includes multiple security modules for detecting and mitigating different types of threats. Its core approach is to monitor and filter input queries and generated content to prevent malicious code execution and leakage of sensitive information. The project provides tools and a framework that developers can integrate into existing RAG systems. SafeRAG pays particular attention to the security of large language models (LLMs) and offers LLM-specific safeguards. Its goal is to build safer, more reliable RAG systems, improving user trust and data security. The code and documentation are open source to encourage community participation and contribution. It supports multiple RAG architectures and LLM models. Evaluation metrics include attack success rate and defense effectiveness. The project provides a valuable platform for research and practice in RAG security. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Models
Xun Liang1,*,
Simin Niu1,*,
Zhiyu Li2,†,
Sensen Zhang1, Hanyu Wang1, Feiyu Xiong2, Jason Zhaoxin Fan3,
Bo Tang2, Shichao Song1, Mengwei Wang1, Jiawei Yang1
1Renmin University of China,
2Institute for Advanced Algorithms Research, Shanghai,
3Beihang University
**🎯 Who Should Pay Attention to Our Work?**
- **Exploring attacks on RAG systems?** SafeRAG introduces a **Threat Framework** that executes **Noise, Conflict, Toxicity, and Denial-of-Service (DoS) attacks** at various stages of the **RAG Pipeline**, aiming to **bypass RAG security components as effectively as possible and exploit the pipeline's vulnerabilities**.
- **Developing robust and trustworthy RAG systems?** Our benchmark provides a new **Security Evaluation Framework** to test defenses and reveals systemic weaknesses in the **RAG Pipeline**.
- **Shaping RAG security policies?** SafeRAG provides **empirical evidence** of how **Data Injection** attacks can impact AI reliability.
> \[!Tip\]
> - Security evaluation datasets may become outdated or ineffective over time, diminishing their evaluation value. Therefore, until the attack tasks proposed in this paper are adequately addressed, we will continuously update the datasets to ensure their effectiveness.
> - Each attack task in the SafeRAG dataset contains only about 100 evaluation data points, but it covers a wide range of topics, ensuring comprehensive assessment. This lightweight design not only helps reduce the evaluation cost for users but also enables a fast and cost-effective testing experience.
## :loudspeaker: News
- **[2025/01]** We released SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Models.
## Overview
Abstract
Retrieval-Augmented Generation (RAG) seamlessly integrates advanced retrieval and generation techniques, making it particularly well-suited for high-stakes domains such as law, healthcare, and finance, where factual accuracy is paramount. This approach significantly enhances the professional applicability of large language models (LLMs).
But is RAG truly secure? **Clearly, attackers can manipulate the data flow at any stage of the RAG pipeline**—including **indexing, retrieval, and filtering**—by injecting malicious, low-quality, misleading, or incorrect texts into **knowledge bases, retrieved contexts, and filtered contexts**. These adversarial modifications **indirectly influence the LLM’s outputs**, potentially compromising its reliability.
**SafeRAG systematically evaluates the security vulnerabilities of RAG components from both retrieval and generation perspectives.** Experiments conducted on **14 mainstream RAG components** reveal that **most RAG systems fail to effectively defend against data injection attacks**. Attackers can **manipulate the data flow within the RAG pipeline**, deceiving the model into generating **low-quality, inaccurate, or misleading content**, and in some cases, even **inducing a denial-of-service (DoS) response**.
We summarize our primary contributions as follows:
- We reveal four attack tasks capable of bypassing the **retriever**, **filter**, and **generator**. For each attack task, we develop a lightweight RAG security evaluation dataset, primarily constructed by humans with LLM assistance.
- We propose an economical, efficient, and accurate RAG security evaluation framework that incorporates attack-specific metrics, which are highly consistent with human judgment.
- We introduce the first Chinese RAG security benchmark, SafeRAG, which analyzes the risks posed to the **retriever** and **generator** by the injection of **Noise**, **Conflict**, **Toxicity**, and **DoS** at various stages of the RAG pipeline.
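The three injection points named above (knowledge base, retrieved context, filtered context) can be sketched as hooks in a generic RAG pipeline. This is an illustrative sketch only; the function names and `stage` values are assumptions for exposition, not SafeRAG's actual API.

```python
def rag_pipeline(query, knowledge_base, retrieve, filter_ctx, generate,
                 attack=None, stage="off"):
    """Run a toy RAG pipeline, optionally injecting attack texts at one stage.

    `stage` mirrors the paper's three injection points: "indexing" poisons
    the knowledge base, "retrieval" tampers with the retrieved contexts,
    and "filtering" tampers with the contexts handed to the generator.
    """
    if stage == "indexing":          # poison the knowledge base itself
        knowledge_base = attack(knowledge_base)
    retrieved = retrieve(query, knowledge_base)
    if stage == "retrieval":         # tamper with the retrieved contexts
        retrieved = attack(retrieved)
    filtered = filter_ctx(retrieved)
    if stage == "filtering":         # tamper with the filtered contexts
        filtered = attack(filtered)
    return generate(query, filtered)
```

The later the injection point, the fewer defenses stand between the attack text and the generator, which matches the attack-effectiveness ranking reported in the Results section.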
## Quick Start
- Install dependency packages
```bash
pip install -r requirements.txt
```
- Start the milvus-lite service (vector database)
```bash
milvus-server
```
- Download the bge-base-zh-v1.5 model to the /path/to/your/bge-base-zh-v1.5/ directory
- Modify config.py according to your need.
- Run quick_start_nctd.py
```bash
python quick_start_nctd.py \
--retriever_name 'bm25' \
--retrieve_top_k 6 \
--filter_module 'off' \
--model_name 'gpt-3.5-turbo' \
--quest_eval_model 'deepseek-chat' \
--attack_task 'SN' \
--attack_module 'indexing' \
--attack_intensity 0.5 \
--shuffle True \
--bert_score_eval \
--quest_eval \
--num_threads 5 \
--show_progress_bar True
```
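To evaluate a component against all four attack tasks (SN, ICC, SA, WDoS, per the dataset layout below), the command above can be swept in a loop. This is a dry-run sketch that only prints each command; remove the leading `echo` to actually launch the evaluations, and add back any of the other flags you need.

```shell
# Dry-run sweep over the four SafeRAG attack tasks; prints one
# quick_start_nctd.py invocation per task instead of running it.
for task in SN ICC SA WDoS; do
  echo python quick_start_nctd.py \
    --retriever_name 'bm25' \
    --attack_task "$task" \
    --attack_module 'indexing' \
    --attack_intensity 0.5
done
```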
> \[!Tip\]
> - You can modify the RAG components to be evaluated, attack tasks, and other parameters based on your specific evaluation needs.
## Results
The default retrieval window for the silver noise task is set to top K = 6, with a default attack injection ratio of 3/6. For other tasks, the default retrieval window is top K = 2, and the attack injection ratio is fixed at 1/2.
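Concretely, the injection ratio maps a retrieval window of size top K to a split between attack and clean contexts. The helper below is a minimal sketch of that arithmetic, assuming `attack_intensity` (the CLI flag above) is the injected fraction; the rounding choice is an assumption, not SafeRAG's exact implementation.

```python
def injected_counts(retrieve_top_k: int, attack_intensity: float) -> tuple[int, int]:
    """Return (num_attack_contexts, num_clean_contexts) in one retrieval window."""
    n_attack = round(retrieve_top_k * attack_intensity)
    return n_attack, retrieve_top_k - n_attack

# Default silver-noise setting: top K = 6, injection ratio 3/6.
print(injected_counts(6, 0.5))  # -> (3, 3)
# Default setting for the other tasks: top K = 2, injection ratio 1/2.
print(injected_counts(2, 0.5))  # -> (1, 1)
```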
We evaluate the security of 14 different types of RAG components against injected attack texts at different RAG stages (**indexing**, retrieval, and generation), including: (1) retrievers (**DPR**, BM25, Hybrid, Hybrid-Rerank); (2) filters (OFF, **filter NLI**, compressor SKR); and (3) generators (**DeepSeek**, GPT-3.5-turbo, GPT-4, GPT-4o, Qwen 7B, Qwen 14B, Baichuan 13B, ChatGLM 6B).
The bold values represent the default settings. Additionally, we adopt a unified sentence-chunking strategy to segment the knowledge base during indexing. The embedding model used is bge-base-zh-v1.5, and the reranker is bge-reranker-base.
### Results on Noise
We inject different noise ratios into the text accessible in the RAG pipeline, including the **knowledge base**, **retrieved context**, and **filtered context**.
> - Regardless of the stage at which noise injection is performed, the F1 (avg) decreases as the noise ratio increases, indicating a decline in the diversity of generated responses.
> - Different retrievers exhibit varying degrees of noise resistance. The overall ranking of retrievers' robustness against noise attacks is Hybrid-Rerank > Hybrid > BM25 > DPR. This suggests that hybrid retrievers and rerankers are more inclined to retrieve diverse golden contexts rather than homogeneous attack contexts.
> - When the noise ratio increases, the retrieval accuracy (RA) for noise injected into the retrieved or filtered context is significantly higher than that for noise injected into the knowledge base. This is because noise injected into the knowledge base has approximately a 50% chance of not being retrieved.
> - The compressor SKR lacks sufficient security. Although it attempts to merge redundant information in silver noise, it severely compresses the detailed information necessary to answer questions within the retrieved context, leading to a decrease in F1 (avg).
### Results on Conflict, Toxicity, and DoS
> - After injecting different types of attacks into the texts accessible at any stage of the RAG pipeline, both F1 (avg) and the attack failure rate (AFR) decline across all three tasks. Specifically, conflict attacks make it difficult for the RAG to determine which information is true, potentially leading to the use of fabricated facts from the attack context, resulting in a drop in metrics. Toxicity attacks cause the RAG to misinterpret disguised authoritative statements as factual, leading to the automatic propagation of soft ads in generated responses, which also contributes to the metric decline. DoS attacks, on the other hand, make the RAG more likely to refuse to answer, even when relevant evidence is retrieved, further reducing the performance metrics. Overall, the ranking of attack effectiveness across different stages is: filtered context > retrieved context > knowledge base.
> - Different retrievers exhibit varying vulnerabilities to different types of attacks. For instance, Hybrid-Rerank is more susceptible to conflict attacks, while DPR is more prone to DoS attacks. The vulnerability levels of retrievers under toxicity attacks are generally consistent.
> - Across different attack tasks, the changes in RA remain largely consistent regardless of the retriever used.
> - In conflict tasks, using the compressor SKR is less secure as it compresses conflict details, leading to a decline in F1 (avg). In toxicity and DoS tasks, the filter NLI is generally ineffective, with its AFR close to that of disabling the filter. However, in toxicity and DoS tasks, the SKR compressor proves to be secure as it effectively compresses soft ads and warning content.
## TODOs
- [ ] feat: add SafeRAG PyPI package.
- [ ] feat: release SafeRAG dataset on Hugging Face.
- [ ] docs: extend dataset.
# Project Structure
```bash
├── configs # This folder comprises scripts used to initialize the loading parameters of the large language models (LLMs) in RAG systems.
│ └── config.py # Before running the project, users need to fill in their own key or local model path information to the corresponding location.
├── embeddings # The embedding model used to build vector databases.
│ └──base.py
├── knowledge_base # Path to the knowledge bases.
│   ├──SN # The knowledge base used for silver noise.
│   | ├──add_SN # Add attacks to the clean knowledge base.
│   | └──db.txt # A clean knowledge base.
│   ├──ICC # The knowledge base used for inter-context conflict.
│   | ├──add_ICC # Add attacks to the clean knowledge base.
│   | └──db.txt # A clean knowledge base.
│   ├──SA # The knowledge base used for soft ad.
│   | ├──add_SA # Add attacks to the clean knowledge base.
│   | └──db.txt # A clean knowledge base.
│   └──WDoS # The knowledge base used for White DoS.
│     ├──add_WDoS # Add attacks to the clean knowledge base.
│     └──db.txt # A clean knowledge base.
├── llms # This folder contains scripts used to load the LLMs.
│ ├── api_model.py # Call GPT-series models.
│ ├── local_model.py # Call a locally deployed model.
│ └── remote_model.py # Call the model deployed remotely and encapsulated into an API.
├── metric # The evaluation metric we used in the experiments.
│ ├── common.py # bleu, rouge, bertScore.
│   └── quest_eval.py # Multiple-choice QuestEval. Note that using this metric requires calling an LLM such as GPT to answer questions, or modifying the code to deploy your own question-answering model.
├── datasets # This folder contains scripts used to load the dataset.
├── output # The evaluation results will be retained here.
├── prompts # The prompts we used in the experiments.
├── retrievers # The retrievers used in the RAG system.
└── tasks # The evaluation attack tasks.
```