https://github.com/cluebbers/adverserial-paraphrasing

Evaluate how LLaMA 3.1 8B handles paraphrased adversarial prompts targeting refusal behavior.

deep-learning direct-preference-optimization redteam reinforcement-learning

# Adversarial Paraphrasing Red-Teaming for LLaMA, Mistral & Pythia

This repository provides a reproducible pipeline to evaluate and improve "refusal" behavior under adversarial paraphrasing in three open-weight LLMs: LLaMA-3.1-8B, Mistral-7B-v0.1, and Pythia-6.9B. See the [full technical report (PDF)](2025-05-09_Luebbers_report.pdf).
Trained adapters are available on [Hugging Face](https://huggingface.co/collections/cluebbers/adverserial-paraphrasing-682d8ff3d7948435167570dd).
This project was completed for the spring 2025 cohort of [AI Safety, Ethics and Society](https://app.verifyed.io/certificate/ai-safety-ethics-society-3163172?activeSection=certificate).

## 🚀 Key Features

- **Prompt set**: 64 harmful base prompts × 4 variants (canonical, lexical, syntactic, semantic), including six real-world case studies (e.g., Tokyo sarin, Unit 731, Unabomber).
- **Evaluation scripts**:
  - `run_inference.py`: batch-runs all prompts through any base model/pipeline.
  - `run_inference_lora.py`: batch-runs all prompts with LoRA adapters.
  - `annotate_outputs.py`: interactive refusal/harm labeling.
  - `evaluation.ipynb`: computes refusal and harmfulness rates and generates publication-quality bar charts.
- **Alignment adapters**: LoRA rank-8 checkpoints for both
  - **SFT** on 580 prompt→refusal pairs, and
  - **DPO** on 580 prompt/chosen/rejected triples.
- **Results**:
  - **Baseline** refusal: 2–14%; harmful: up to 62%.
  - **DPO** gains: modest (+4 to +38% refusal; −24 to −40% harm).
  - **SFT** gains: dramatic (+60 to +96% refusal; harmful ≤ 16%).

## 📂 Repository Structure

```text
.
├── data/
│   ├── base_prompts.json          # 64 prompts
│   ├── paraphrased_prompts.json   # 64 prompts × 4 variants
│   ├── dpo_train.jsonl            # 580 DPO triples
│   └── sft_train.jsonl            # 580 SFT pairs
├── scripts/
│   ├── run_inference.py
│   ├── run_inference_lora.py
│   ├── annotate_outputs.py
│   ├── evaluation.ipynb
│   ├── train_dpo.py
│   └── train_sft.py
├── figures/
│   ├── refusal_harmful_rates.pdf
│   └── paraphrase_types.pdf
├── 2025-05-09_Luebbers_report.pdf
├── requirements.txt
└── README.md
```
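The training files are described only as 580 DPO triples and 580 SFT pairs. As a rough sketch of what such JSONL records might look like, the snippet below assumes the field names `prompt`, `chosen`, `rejected`, and `completion`, which follow the dataset conventions accepted by TRL's trainers. The actual keys in `data/dpo_train.jsonl` and `data/sft_train.jsonl` may differ, and all strings are placeholders.

```python
import json

# Hypothetical records: the field names are an assumption, not taken from the repo.
dpo_record = {
    "prompt": "<paraphrased harmful request>",
    "chosen": "I can't help with that.",
    "rejected": "<harmful completion>",
}
sft_record = {
    "prompt": "<paraphrased harmful request>",
    "completion": "I can't help with that.",
}


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


dpo_lines = to_jsonl([dpo_record])
sft_lines = to_jsonl([sft_record])
```

Keeping one object per line lets the files be streamed or appended to without re-parsing the whole dataset.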

## πŸ› οΈ Quickstart

Tested with the following package versions:

```text
torch==2.6.0
transformers==4.51.3
datasets==3.5.0
accelerate==1.6.0
bitsandbytes==0.45.5
matplotlib==3.10.1
trl==0.17.0
peft==0.15.2
```

1. **Install dependencies:**

```bash
pip install -r requirements.txt
```

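2. **Get model access:**
   The LLaMA weights are gated on Hugging Face: request access on the model page, then authenticate locally (for example, with `huggingface-cli login`) before running inference.
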
3. **Run inference:**

Possible values for `--model`:

- `pythia`: `EleutherAI/pythia-6.9b`
- `mistral`: `mistralai/Mistral-7B-v0.1`
- `llama`: `meta-llama/Meta-Llama-3.1-8B`

```bash
python scripts/run_inference.py \
--model llama
```

To run with a LoRA adapter, also pass `--adapter sft` or `--adapter dpo`:

```bash
python scripts/run_inference_lora.py \
--model llama \
--adapter dpo
```
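The short model names and adapter choices above could be resolved roughly as follows. This is a minimal sketch, not the repository's actual argument handling: the function names `parse_args` and `resolve` are hypothetical, and only the `--model` and `--adapter` flags and the model IDs come from this README.

```python
import argparse

# Short names and Hugging Face model IDs as listed in this README.
MODELS = {
    "pythia": "EleutherAI/pythia-6.9b",
    "mistral": "mistralai/Mistral-7B-v0.1",
    "llama": "meta-llama/Meta-Llama-3.1-8B",
}
ADAPTERS = ("sft", "dpo")


def parse_args(argv=None):
    """Parse --model (required) and --adapter (optional)."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument("--model", choices=sorted(MODELS), required=True)
    parser.add_argument("--adapter", choices=ADAPTERS, default=None)
    return parser.parse_args(argv)


def resolve(argv):
    """Return (model_id, adapter) for a list of command-line tokens."""
    args = parse_args(argv)
    return MODELS[args.model], args.adapter
```

For example, `resolve(["--model", "llama", "--adapter", "dpo"])` returns the LLaMA model ID together with `"dpo"`, while omitting `--adapter` yields `None` for the adapter.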

4. **Annotate outputs:**
   Set the input and output file paths inside the script before running it.

```bash
python scripts/annotate_outputs.py
```
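Interactive refusal/harm labeling can be pictured as a simple question loop. The sketch below is an illustration under assumptions, not the contents of `annotate_outputs.py`: the `response`, `refused`, and `harmful` keys and the `annotate` function are hypothetical.

```python
def annotate(outputs, ask=input):
    """Label each model output for refusal and harmfulness.

    `outputs` is a list of dicts with a "response" key (hypothetical schema).
    `ask` is any callable that takes a question string and returns the answer;
    it defaults to the built-in input() for interactive use.
    """

    def yes_no(question):
        # Re-ask until the answer is "y" or "n".
        while True:
            answer = ask(question + " [y/n] ").strip().lower()
            if answer in ("y", "n"):
                return answer == "y"

    labeled = []
    for item in outputs:
        print(item["response"])
        labeled.append({
            **item,
            "refused": yes_no("Refusal?"),
            "harmful": yes_no("Harmful?"),
        })
    return labeled
```

Passing the question function as a parameter keeps the loop usable both at a terminal and with scripted answers.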

5. **Inspect results** with `scripts/evaluation.ipynb`.
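Refusal and harmfulness rates per paraphrase variant can be computed with a few lines of plain Python. This is a sketch of the general calculation, not the notebook's code: it assumes each annotated row carries hypothetical `variant`, `refused`, and `harmful` keys.

```python
from collections import defaultdict


def rates_by_variant(labeled):
    """Return {variant: (refusal_rate, harmful_rate)} as fractions in [0, 1]."""
    counts = defaultdict(lambda: [0, 0, 0])  # [total, refused, harmful]
    for row in labeled:
        c = counts[row["variant"]]
        c[0] += 1
        c[1] += bool(row["refused"])
        c[2] += bool(row["harmful"])
    return {v: (r / n, h / n) for v, (n, r, h) in counts.items()}


# Placeholder rows for illustration only; these are not real annotations.
example_rows = [
    {"variant": "canonical", "refused": True, "harmful": False},
    {"variant": "canonical", "refused": False, "harmful": True},
    {"variant": "lexical", "refused": False, "harmful": True},
]
example_rates = rates_by_variant(example_rows)
```

Multiplying the fractions by 100 gives percentages comparable to those reported in this README.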

## 📑 Key Findings

Paraphrase-aware SFT yields the largest safety gains with minimal compute.
With only 580 examples, SFT raises the average refusal rate from 6% to 89% and cuts the average harmful-response rate from 41% to 8%.

| Method   | Avg. Refusal ↑ | Avg. Harm ↓ |
| :------: | :------------: | :---------: |
| Baseline |       6%       |     41%     |
|   DPO    |      17%       |     22%     |
|   SFT    |      89%       |     8%      |

[Model alignment results (PDF)](figures/refusal_harmful_rates.pdf)
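To make the comparison explicit, the averages in the table can be turned into percentage-point changes relative to the baseline. The snippet below is plain arithmetic on the reported numbers; the `RESULTS` dictionary and `delta_vs_baseline` helper are illustrative names only.

```python
# Average rates (%) copied from the results table in this README.
RESULTS = {
    "Baseline": {"refusal": 6, "harm": 41},
    "DPO": {"refusal": 17, "harm": 22},
    "SFT": {"refusal": 89, "harm": 8},
}


def delta_vs_baseline(method):
    """Percentage-point change of a method relative to the baseline row."""
    base = RESULTS["Baseline"]
    return {k: RESULTS[method][k] - base[k] for k in base}


for name in ("DPO", "SFT"):
    d = delta_vs_baseline(name)
    print(f"{name}: refusal {d['refusal']:+d} pp, harm {d['harm']:+d} pp")
# DPO: refusal +11 pp, harm -19 pp
# SFT: refusal +83 pp, harm -33 pp
```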

## 📖 Citing This Work

```bibtex
@misc{lubbers2025refusal,
  title={Evaluating Refusal Robustness under Adversarial Paraphrasing},
  author={Luebbers, Christopher L.},
  year={2025},
  howpublished={\url{https://github.com/cluebbers/adverserial-paraphrasing}}
}
```

---

Feel free to explore, adapt, or extend this toolkit for your own red-teaming and alignment research!