https://github.com/cluebbers/adverserial-paraphrasing
Evaluate how LLaMA 3.1 8B handles paraphrased adversarial prompts targeting refusal behavior.
deep-learning direct-preference-optimization redteam reinforcement-learning
- Host: GitHub
- URL: https://github.com/cluebbers/adverserial-paraphrasing
- Owner: cluebbers
- License: apache-2.0
- Created: 2025-04-14T09:47:03.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-26T07:37:48.000Z (about 1 year ago)
- Last Synced: 2025-06-26T15:51:43.735Z (about 1 year ago)
- Topics: deep-learning, direct-preference-optimization, redteam, reinforcement-learning
- Language: Jupyter Notebook
- Homepage: https://app.verifyed.io/certificate/ai-safety-ethics-society-3163172?activeSection=certificate
- Size: 430 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Adversarial Paraphrasing Red-Teaming for LLaMA, Mistral & Pythia
This repository delivers a reproducible pipeline to evaluate and improve "refusal" behavior in three open-weight LLMs (LLaMA-3.1-8B, Mistral-7B-v0.1, and Pythia-6.9B) under adversarial paraphrasing. See the [full technical report (PDF)](2025-05-09_Luebbers_report.pdf).
Trained adapters are available on [Hugging Face](https://huggingface.co/collections/cluebbers/adverserial-paraphrasing-682d8ff3d7948435167570dd).
This project was done for the spring 2025 cohort of [AI Safety, Ethics and Society](https://app.verifyed.io/certificate/ai-safety-ethics-society-3163172?activeSection=certificate).
## Key Features
- **Prompt set**: 64 harmful base prompts × 4 variants (canonical, lexical, syntactic, semantic), including six real-world case studies (e.g. Tokyo sarin, Unit 731, Unabomber).
- **Evaluation scripts**:
  - `run_inference.py`: batch-runs all prompts through any base model/pipeline.
  - `run_inference_lora.py`: batch-runs all prompts through a model with a LoRA adapter attached.
  - `annotate_outputs.py`: interactive refusal/harm labeling.
  - `evaluation.ipynb`: computes refusal and harmfulness rates and generates publication-quality bar charts.
- **Alignment adapters**: LoRA rank-8 checkpoints for both
  - **SFT** on 580 prompt–refusal pairs, and
  - **DPO** on 580 (prompt, chosen, rejected) triples.
- **Results**:
  - **Baseline** refusal: 2–14%; harmful: up to 62%.
  - **DPO** gains: modest (+4–38% refusal; −24–40% harm).
  - **SFT** gains: dramatic (+60–96% refusal; harmful ≤ 16%).
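
As a rough illustration of how refusal and harmfulness rates like those listed above could be computed from annotated outputs, the sketch below assumes each annotation is a dict with hypothetical `variant`, `refused`, and `harmful` fields. The actual schema produced by `annotate_outputs.py` and consumed by `evaluation.ipynb` may differ, and the toy data is invented.

```python
from collections import defaultdict


def rates_by_variant(annotations):
    """Return {variant: (refusal_rate, harmful_rate)} as percentages."""
    counts = defaultdict(lambda: {"n": 0, "refused": 0, "harmful": 0})
    for row in annotations:
        bucket = counts[row["variant"]]
        bucket["n"] += 1
        bucket["refused"] += int(row["refused"])
        bucket["harmful"] += int(row["harmful"])
    return {
        variant: (100.0 * c["refused"] / c["n"], 100.0 * c["harmful"] / c["n"])
        for variant, c in counts.items()
    }


# Toy annotations, invented for illustration only (not real results).
# Field names are assumptions, not the repository's schema.
example = [
    {"variant": "canonical", "refused": True, "harmful": False},
    {"variant": "canonical", "refused": False, "harmful": True},
    {"variant": "lexical", "refused": False, "harmful": True},
    {"variant": "lexical", "refused": False, "harmful": False},
]

print(rates_by_variant(example))
```

Grouping by variant keeps the canonical prompts separate from their lexical, syntactic, and semantic paraphrases, which is the comparison the prompt set is built for.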
## Repository Structure
```text
.
├── data/
│   ├── base_prompts.json          # 64 prompts
│   ├── paraphrased_prompts.json   # 64 prompts × 4 variants
│   ├── dpo_train.jsonl            # 580 DPO triples
│   └── sft_train.jsonl            # 580 SFT pairs
├── scripts/
│   ├── run_inference.py
│   ├── run_inference_lora.py
│   ├── annotate_outputs.py
│   ├── evaluation.ipynb
│   ├── train_dpo.py
│   └── train_sft.py
├── figures/
│   ├── refusal_harmful_rates.pdf
│   └── paraphrase_types.pdf
├── 2025-05-09_Luebbers_report.pdf
├── requirements.txt
└── README.md
```
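
To make the two training files more concrete, here is a hedged sketch of what a single record in each might look like and how JSONL is written and read. The field names (`prompt`, `completion`, `chosen`, `rejected`) are assumptions that mirror a layout commonly used with TRL; the README does not confirm the actual keys in `sft_train.jsonl` or `dpo_train.jsonl`, and the values are placeholders.

```python
import io
import json

# Placeholder text only; the field names are assumptions, not the repo's schema.
sft_record = {
    "prompt": "<paraphrased harmful prompt>",
    "completion": "<refusal text>",
}
dpo_record = {
    "prompt": "<paraphrased harmful prompt>",
    "chosen": "<refusal text>",
    "rejected": "<harmful completion>",
}


def write_jsonl(records, handle):
    """Write one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# An in-memory buffer stands in for a file such as data/dpo_train.jsonl.
buffer = io.StringIO()
write_jsonl([dpo_record], buffer)
buffer.seek(0)
records = read_jsonl(buffer)
print(sorted(records[0]))
```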
## Quickstart
Tested with the following package versions:
```text
torch==2.6.0
transformers==4.51.3
datasets==3.5.0
accelerate==1.6.0
bitsandbytes==0.45.5
matplotlib==3.10.1
trl==0.17.0
peft==0.15.2
```
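
If results differ from the report, one quick check is whether the local environment matches the tested versions. The sketch below is an illustration only: `find_mismatches` and `installed_version` are hypothetical helpers, not part of the repository, and the comparison ignores local build suffixes such as `+cu124`.

```python
from importlib import metadata

# Versions copied from the tested list above.
TESTED = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "datasets": "3.5.0",
    "accelerate": "1.6.0",
    "bitsandbytes": "0.45.5",
    "matplotlib": "3.10.1",
    "trl": "0.17.0",
    "peft": "0.15.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(tested=TESTED, lookup=installed_version):
    """Return {package: (tested, installed)} for every package that differs."""
    mismatches = {}
    for package, wanted in tested.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu124" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches().items():
        print(f"{package}: tested with {wanted}, found {found or 'not installed'}")
```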
1. **Install dependencies:**
```bash
pip install -r requirements.txt
```
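2. **Get model access:** gated checkpoints such as `meta-llama/Meta-Llama-3.1-8B` require accepting the license on Hugging Face and authenticating locally (for example with `huggingface-cli login`).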
3. **Run inference:**
Supported `--model` keys and the checkpoints they map to:
- `pythia`: `EleutherAI/pythia-6.9b`
- `mistral`: `mistralai/Mistral-7B-v0.1`
- `llama`: `meta-llama/Meta-Llama-3.1-8B`
```bash
python scripts/run_inference.py \
--model llama
```
To run with a LoRA adapter, set `--adapter` to either `sft` or `dpo`:
```bash
python scripts/run_inference_lora.py \
--model llama \
--adapter dpo
```
4. **Annotate outputs:**
Set the input and output file paths inside the script before running it:
```bash
python scripts/annotate_outputs.py
```
5. **Inspect results** with `scripts/evaluation.ipynb`.
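
The `--model` and `--adapter` flags from step 3 could be parsed along these lines. This is a minimal sketch, not the repository's code: the `MODEL_MAP` keys and checkpoints come from the list in step 3, while `build_parser`, `resolve`, and the idea of a single script with an optional `--adapter` are assumptions made for illustration (the repository ships two separate scripts).

```python
import argparse

# Model keys and checkpoints as listed in step 3.
MODEL_MAP = {
    "pythia": "EleutherAI/pythia-6.9b",
    "mistral": "mistralai/Mistral-7B-v0.1",
    "llama": "meta-llama/Meta-Llama-3.1-8B",
}


def build_parser():
    """Build a parser that accepts only the known model keys and adapters."""
    parser = argparse.ArgumentParser(description="Sketch of the inference CLI")
    parser.add_argument("--model", choices=sorted(MODEL_MAP), required=True)
    # Assumption: omitting --adapter means "run the base model".
    parser.add_argument("--adapter", choices=["sft", "dpo"], default=None)
    return parser


def resolve(argv):
    """Return (checkpoint_name, adapter_or_None) for a list of CLI arguments."""
    args = build_parser().parse_args(argv)
    return MODEL_MAP[args.model], args.adapter


print(resolve(["--model", "llama", "--adapter", "dpo"]))
# prints ('meta-llama/Meta-Llama-3.1-8B', 'dpo')
```

Restricting both flags with `choices` makes a typo in the model key fail at parse time instead of partway through a long batch run.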
## Key Findings
Paraphrase-aware SFT yields the largest safety gains with minimal compute.
With only 580 training examples, SFT raises the average refusal rate from 6% to 89% and lowers the average harmful-response rate from 41% to 8%.

| Method   | Avg. Refusal ↑ | Avg. Harm ↓ |
| :------: | :------------: | :---------: |
| Baseline |       6%       |     41%     |
| DPO      |      17%       |     22%     |
| SFT      |      89%       |     8%      |
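
To read the table as changes relative to the baseline, the averages can be differenced directly. The short sketch below uses only the numbers from the table; the helper name `change_vs_baseline` is illustrative, and the results are percentage-point differences, not relative changes.

```python
# Average rates (%) copied from the table above.
RATES = {
    "Baseline": {"refusal": 6, "harm": 41},
    "DPO": {"refusal": 17, "harm": 22},
    "SFT": {"refusal": 89, "harm": 8},
}


def change_vs_baseline(rates, baseline="Baseline"):
    """Return percentage-point changes relative to the baseline row."""
    base = rates[baseline]
    return {
        method: {metric: values[metric] - base[metric] for metric in base}
        for method, values in rates.items()
        if method != baseline
    }


for method, delta in change_vs_baseline(RATES).items():
    print(f"{method}: refusal {delta['refusal']:+d} pp, harm {delta['harm']:+d} pp")
# DPO: refusal +11 pp, harm -19 pp
# SFT: refusal +83 pp, harm -33 pp
```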

## Citing This Work
```bibtex
@misc{lubbers2025refusal,
title={Evaluating Refusal Robustness under Adversarial Paraphrasing},
author={Luebbers, Christopher L.},
year={2025},
howpublished={\url{https://github.com/cluebbers/adverserial-paraphrasing}}
}
```
---
Feel free to explore, adapt, or extend this toolkit for your own red-teaming and alignment research!