https://github.com/Yu-Fangxu/COLD-Attack

[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
https://github.com/Yu-Fangxu/COLD-Attack

Last synced: 5 months ago
JSON representation

[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Host: GitHub
URL: https://github.com/Yu-Fangxu/COLD-Attack
Owner: Yu-Fangxu
Created: 2024-02-08T15:33:36.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-09-15T02:13:29.000Z (8 months ago)
Last Synced: 2024-09-15T08:05:23.259Z (8 months ago)
Language: Python
Homepage:
Size: 7.13 MB
Stars: 83
Watchers: 4
Forks: 15
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-MLSecOps - COLD-Attack

README

        # COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

We study the **controllable** jailbreaks on large language models (LLMs). Specifically, we focus on how to enforce control on LLM attacks. In this work, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the [Energy-based Constrained Decoding with Langevin Dynamics (COLD)](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3e25d1aff47964c8409fd5c8dc0438d7-Abstract-Conference.html), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios including:

1. Fluent suffix attacks (standard attack setting which append the adversarial prompt to the original malicious user query).

2. Paraphrase attack with and without sentiment steering (revising a user query adversarially with minimal paraphrasing).

3. Attack with left-right-coherence (inserting stealthy attacks in context with left-right-coherence).

More details can be found in our paper:

Xingang Guo*, Fangxu Yu*, Huan Zhang, Lianhui Qin, Bin Hu, "[COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability](https://arxiv.org/abs/2402.08679)" (* Equal contribution)

## COLD-Attack

![plot](./imgs/COLD_attack_diagram.png)

As illustrated in the above diagram, our COLD-Attack framework includes three main steps: 

1. **Energy function formulation**: specify energy functions properly to capture the attack constraints such as fluency, stealthiness, sentiment, and left-right-coherence.

2. **Langevin dynamics sampling**: run Langevin dynamics recursively for $N$ steps to obtain a good energy-based model governing the adversarial attack logits $\tilde{\mathbf{y}}^N$. 

3. **Decoding process**: leverage an LLM-guided decoding process to covert the continuous logit $\tilde{\mathbf{y}}^N$ into discrete text attacks $\mathbf{y}$. 

## Selected Examples 

Here are some examples that generated by COLD-Attack:

![plot](./imgs/selected_samples_2.png)

## Jailbreaking Performance

We evaluate the performance of COLD-Attack on four popular white-box LLMs: Vicuna-7b-v1.5 (Vicuna), Llama-2-7b-Chat-hf (Llama2), Guanaco-7b (Guanaco), and Mistral-7b-Instruct-v0.2 (Mistral). In addition, we use the following three main evaluation metrics:

1. Attack Successful Rate (**ASR**): the percentage of instructions that elicit corresponding harmful outputs using sub-string matching method.

2. GPT-4 based ASR (**ASR-G**): We develop a prompt template and utilize GPT-4 to assess whether a response accurately fulfills the malicious instruction. Based on our observations, ASR-G has shown higher correlation with human annotations, providing a more reliable measure of attack effectiveness.

3. Perplexity (**PPL**): We use PPL to evaluate the fluency of the generated prompts and use Vicuna-7b to do the PPL calculation.

To ensure the generated adversarial prompts meet specific criteria, we apply controls over various features, including sentiment and vocabulary. We evaluate how well these controls work using a metric called **Succ**, which represents the percentage of samples that successfully adhere to our set requirements. Additionally, a range of NLP-related evaluation metrics including **BERTScore**, **BLEU**, and **ROUGE** are applied to evaluate the quality of generated controllable attacks.

### Fluent suffix attack

| Models   | ASR ↑  | ASR-G ↑ | PPL ↓ |

|----------|--------|---------|-------|

| Vicuna   | 100.0  | 86.00   | 32.96 |

| Guanaco  | 96.00  | 84.00   | 30.55 |

| Mistral  | 92.00  | 90.00   | 26.24 |

| Llama2   | 92.00  | 66.00   | 24.83 |

### Paraphrase attack 

| Models   | ASR ↑  | ASR-G ↑ | PPL ↓ | BLEU ↑ | ROUGE ↑ | BERTScore ↑ |

|----------|--------|---------|-------|--------|---------|--------------|

| Vicuna   | 96.00  | 80.00   | 31.11 | 0.52   | 0.57    | 0.72         |

| Guanaco  | 98.00  | 78.00   | 29.23 | 0.47   | 0.55    | 0.74         |

| Mistral  | 98.00  | 90.00   | 37.21 | 0.41   | 0.55    | 0.72         |

| Llama2   | 86.00  | 74.00   | 39.26 | 0.60   | 0.54    | 0.71         |

### Left-right-coherence control

| Models                  | ASR ↑ | ASR-G ↑ | Succ ↑ | PPL ↓ |

|-------------------------|-------|---------|--------|-------|

| **Sentiment Constraint**|       |         |        |       |

| Vicuna                  | 90.00 | 96.00   | 84.00  | 66.48 |

| Guanaco                 | 96.00 | 94.00   | 82.00  | 74.05 |

| Mistral                 | 92.00 | 96.00   | 92.00  | 67.61 |

| Llama2                  | 80.00 | 88.00   | 64.00  | 59.53 |

| **Lexical Constraint**  |       |         |        |       |

| Vicuna                  | 92.00 | 100.00  | 82.00  | 76.69 |

| Guanaco                 | 92.00 | 96.00   | 82.00  | 99.03 |

| Mistral                 | 94.00 | 84.00   | 92.00  | 96.06 |

| Llama2                  | 88.00 | 86.00   | 68.00  | 68.23 |

| **Format Constraint**   |       |         |        |       |

| Vicuna                  | 92.00 | 94.00   | 88.00  | 67.63 |

| Guanaco                 | 92.00 | 94.00   | 72.00  | 72.97 |

| Mistral                 | 94.00 | 86.00   | 84.00  | 44.56 |

| Llama2                  | 80.00 | 86.00   | 72.00  | 57.70 |

| **Style Constraint**    |       |         |        |       |

| Vicuna                  | 94.00 | 96.00   | 80.00  | 81.54 |

| Guanaco                 | 94.00 | 92.00   | 70.00  | 75.25 |

| Mistral                 | 92.00 | 90.00   | 86.00  | 54.50 |

| Llama2                  | 80.00 | 80.00   | 68.00  | 58.93 |

Please see more detaieled evaluation results and discussions in our paper. 

## Code

**1) Download this GitHub**

```

git clone https://github.com/Yu-Fangxu/COLD-Attack.git

```

**2) Setup Environment**

We recommend conda for setting up a reproducible experiment environment.

We include `environment.yaml` for creating a working environment:

```bash

conda env create -f environment.yaml -n cold-attack

```

You will then need to setup NLTK and hugging face:

```bash

conda activate cold-attack

python3 -c "import nltk; nltk.download('stopwords', 'averaged_perceptron_tagger', 'punkt'); "

```

To run the Llama-2 model, you will need to [request access](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

at Hugging face and setup account login:

```bash

huggingface-cli login --token [Your Hugging face token]

```

**3) Run Command for COLD-Attack**

* Fluent suffix attack

```

bash attack.sh "suffix"

```

* Paraphrase attack

```

bash attack.sh "paraphrase"

```

* Left-right-coherence control

```

bash attack.sh "control"

```


 **If you find our repository helpful to your research, please consider citing:** 


```

@article{guo2024cold,

  title={Cold-attack: Jailbreaking llms with stealthiness and controllability},

  author={Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin},

  journal={arXiv preprint arXiv:2402.08679},

  year={2024}

}

```

```

@article{qin2022cold,

  title={Cold decoding: Energy-based constrained text generation with langevin dynamics},

  author={Qin, Lianhui and Welleck, Sean and Khashabi, Daniel and Choi, Yejin},

  journal={Advances in Neural Information Processing Systems},

  volume={35},

  pages={9538--9551},

  year={2022}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/Yu-Fangxu/COLD-Attack

Awesome Lists containing this project

README