# Language Models Resist Alignment: Evidence From Data Compression

[🏠 Homepage](https://pku-lm-resist-alignment.github.io/) | [🤗 Code](https://github.com/PKU-Alignment/llms-resist-alignment) | [👍 Models](https://huggingface.co/collections/PKU-Alignment/language-model-resist-alignment-683aa526612e76702e7651ae)

## Abstract
Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.

### Table of Contents

- [Abstract](#abstract)
- [Language Models Resist Alignment](#language-models-resist-alignment)
  - [Takeaways](#takeaways)
- [Main Theorem](#main-theorem)
  - [The *Elasticity* of Language Models](#the-elasticity-of-language-models)
  - [*Elasticity* and Hooke's Law](#elasticity-and-hookes-law)
- [Experiment Results](#experiment-results)
  - [Setting I: Existence of Language Models' *Resistance*](#setting-i-existence-of-language-models-resistance)
  - [Setting II: Existence of Rebound](#setting-ii-existence-of-rebound)
  - [Setting III: Internal Factor of *Rebound*](#setting-iii-internal-factor-of-rebound)
    - [Rebound Increases with Model Size](#rebound-increases-with-model-size)
    - [Rebound Increases with Pre-training Data Volume](#rebound-increases-with-pre-training-data-volume)
- [Tutorial For Reproducing Experiment Results](#tutorial-for-reproducing-experiment-results)
  - [Installation](#installation)
  - [Training](#training)
- [Acknowledgment](#acknowledgment)
- [License](#license)

## Language Models Resist Alignment

LLMs have shown remarkable capabilities. However, due to the inevitable biases and harmful content present in training datasets, LLMs often exhibit behaviors that deviate from human intentions, a phenomenon we refer to as *model misalignment*. Training-based alignment methods, including supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and other derivatives, are the dominant approaches for aligning models. These methods aim to optimize model behavior by rejecting harmful distributions, ensuring LLMs remain consistent with human intentions and values.



However, these alignment methods do not truly penetrate the model representations but merely perform *superficial alignment*. Recent studies have shown that highly safety-aligned models can become unsafe again with minimal fine-tuning. Furthermore, fine-tuning aligned LLMs on non-malicious datasets may also weaken models' safety mechanisms.

Why is alignment so fragile?

In this work, we make the first exploration of the possible mechanism behind the counterintuitive phenomenon: the existence of an alignment resistance mechanism in language models. This mechanism may limit the alignment process of LLMs to superficial adjustments. It could allow the reversal or revocation of alignment through a series of technical measures, a concept we refer to as *inverse alignment*. **What drives language models to resist alignment?** **How does this mechanism lead to *inverse alignment*?**

### Takeaways

- **(Phenomenon)** We uncover that language models exhibit *elasticity*, as illustrated in the main figure and theorem. It encompasses **resistance**: pre-trained models tend to retain their original distribution; and **rebound**: the deeper the alignment of a model, the faster it returns to the pre-trained distribution under inverse fine-tuning. Moreover, the model's change in compression rates across different datasets is inversely proportional to their sizes, which is analogous to the deformation behavior of a series of springs.

- **(Mechanism)** We systematically model the training and alignment process of language models through compression theory. We elaborate on the compression protocol of language models to explore their training and alignment processes, laying a foundation for subsequent research on *elasticity*.

- **(Validation)** We experimentally observe consistent **resistance** and **rebound** phenomena across various LLMs. This highlights the universality of *elasticity* and the need for systematic approaches to achieve robust and deep alignment.

## Main Theorem

### The *Elasticity* of Language Models



The main theorem shows that as the perturbation increases, the normalized compression rates of the model on the two datasets move in opposite directions (one decreases while the other increases), and the rate of change is strongly correlated with the size of each dataset. Rather than changing proportionally across different datasets, the language model seems to *prefer* the dataset with a larger volume, leading to biased model behavior after the perturbation.

### *Elasticity* and Hooke's Law

The inverse proportionality result in the main theorem provides a potential invariant in the model training and alignment process: after perturbation, the rate of change in the compression rates of different datasets is inversely proportional to their sizes, with the absolute value of the product being a constant. This constant characterizes the impact of the perturbation on the model and indirectly describes the model's resistance to perturbations, or its *elasticity*.

The *elasticity* of the model can be intuitively analogized to a series system of springs. Consider two massless springs connected in series, with spring constants $k_1$ and $k_2$, respectively. When the entire system is deformed by an external force $F$, the system reaches a stable state in which the elastic force exerted by each spring equals $F$. According to Hooke's Law, the elongations $\Delta l_1$ and $\Delta l_2$ of the springs are inversely proportional to their spring constants. Thus, in this system, we have:

$$
k_1 \, \Delta l_1 = k_2 \, \Delta l_2 = F.
$$

In the language model setting, integrating the result of the main theorem shows that the change in the model's normalized compression rate on each dataset $D_i$, which is equivalent to the change in the KL divergence between the model's distribution and that dataset's distribution, is inversely proportional to the dataset size $|D_i|$. Here, we only consider the absolute value of this change, denoted $|\Delta\gamma_i|$. Analogous to the series spring model, the *elasticity* of LLMs satisfies:

$$
|D_1| \, |\Delta\gamma_1| = |D_2| \, |\Delta\gamma_2| = \text{const},
$$

where $|\Delta\gamma_i|$ corresponds to the elongation $\Delta l_i$ in the spring model, while the dataset size $|D_i|$ corresponds to the spring constant $k_i$, thus leading to the *elasticity* of LLMs.
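
To make the analogy concrete, here is a minimal numeric sketch in Python. All numbers (dataset sizes and the perturbation constant) are made up for illustration and are not values from the paper; the point is only that, under the invariant above, the compression rate on the larger dataset barely moves while the one on the smaller dataset shifts dramatically.

```python
# Illustrative sketch of the elasticity invariant |D_i| * |delta_gamma_i| ~ const.
# All numbers below are made up for illustration; they are not values from the paper.

def compression_rate_change(dataset_size: float, perturbation_constant: float) -> float:
    """Predicted |delta_gamma| for one dataset, treating the dataset size as a
    spring constant k and the perturbation as a force F: elongation = F / k."""
    return perturbation_constant / dataset_size

pretrain_tokens = 3.0e12   # large "spring constant": pre-training data volume
alignment_tokens = 5.0e7   # small "spring constant": alignment data volume
F = 1.0e6                  # "force" exerted by a further fine-tuning perturbation

print(compression_rate_change(pretrain_tokens, F))    # ~3.3e-07: barely moves
print(compression_rate_change(alignment_tokens, F))   # ~2.0e-02: ~60,000x larger shift
```

In other words, a perturbation that leaves the pre-training distribution essentially untouched can largely undo alignment, mirroring the claim that fine-tuning disproportionately undermines alignment relative to pre-training.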

✨ ***For more details, please see our [paper](https://arxiv.org/pdf/2406.06144) and [website](https://pku-lm-resist-alignment.github.io/)!*** 🚀

## Experiment Results
In the previous sections, we proved that LLMs have *elasticity*. This section analyzes two specific phenomena arising from it:

- **Resistance for Pre-Trained Models:** Models tend to maintain the original distribution and resist alignment
- **Rebound for Post-Trained Models:** Fine-tuning in the opposite direction of post-training (*e.g.*, safe *vs* unsafe) causes post-trained models to return quickly to the pre-training distribution

### Setting I: Existence of Language Models' *Resistance*

We verify the existence of *resistance* by arguing that *forward alignment* is harder than *inverse alignment* for pre-trained models. Specifically, we first perform one epoch of SFT on a pre-trained LLM with parameters $\theta_0$, saving intermediate slices $\theta_1, \theta_2, \ldots, \theta_k$. Subsequently, without loss of generality, we collect the responses of slices $\theta_i$ and $\theta_j$ (where $i < j$) on hold-out prompts, forming datasets $D_i$ and $D_j$. As shown in Figure 1, we define *forward alignment* (**_Path A_**) as the process of training $\theta_i$ on $D_j$, and *inverse alignment* (**_Path B_**) as the process of training $\theta_j$ on $D_i$.
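
The protocol can be summarized with the following Python-style sketch. The helper functions `run_sft`, `generate_responses`, and `final_training_loss` are hypothetical placeholders standing in for this repository's training and evaluation scripts, not actual APIs of the codebase.

```python
# Sketch of Setting I (resistance): comparing forward vs. inverse alignment.
# `run_sft`, `generate_responses`, and `final_training_loss` are hypothetical
# placeholders, not functions provided by this repository.

def run_sft(model, dataset, epochs: int = 1):
    """Fine-tune `model` on `dataset` and return the saved intermediate slices (placeholder)."""
    raise NotImplementedError

def generate_responses(model, prompts):
    """Collect `model`'s responses on hold-out prompts as a dataset (placeholder)."""
    raise NotImplementedError

def final_training_loss(model, dataset) -> float:
    """Train `model` on `dataset` and return the final training loss (placeholder)."""
    raise NotImplementedError

def reproduce_setting_one(theta_0, sft_dataset, prompts):
    slices = run_sft(theta_0, sft_dataset, epochs=1)   # [theta_1, ..., theta_k]
    theta_i, theta_j = slices[0], slices[-1]           # any pair with i < j works
    D_i = generate_responses(theta_i, prompts)
    D_j = generate_responses(theta_j, prompts)
    loss_forward = final_training_loss(theta_i, D_j)   # Path A: forward alignment
    loss_inverse = final_training_loss(theta_j, D_i)   # Path B: inverse alignment
    # Resistance predicts: loss_inverse < loss_forward.
    return loss_forward, loss_inverse
```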



As shown in the experimental results table, the training loss of *inverse alignment* consistently remains lower than that of *forward alignment*, regardless of which slice pair is selected. This observation holds across all models and datasets in our experiments, demonstrating that *inverse alignment* is easier than *forward alignment* and validating the existence of *resistance*.



### Setting II: Existence of Rebound

We verify the existence of *rebound* by demonstrating that for post-trained models, the more *positive* the post-trained models' performance becomes, the more *negative* it turns after *inverse alignment*. We validate this on tasks involving two opposing characteristics (*e.g.*, safe and unsafe). We first train slices based on a pre-trained model using positive data (*e.g.*, safe data) of various volumes. Then we perform inverse fine-tuning on these slices using negative data (*i.e.*, unsafe data).
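
A compact sketch of this protocol follows. The helpers `fine_tune` and `evaluate` are hypothetical placeholders for this repository's training and evaluation scripts, and the data volumes are illustrative rather than the ones used in the paper.

```python
# Sketch of Setting II (rebound): positive post-training followed by increasing
# amounts of inverse (negative) fine-tuning. `fine_tune` and `evaluate` are
# hypothetical placeholders, and the data volumes are illustrative only.

def fine_tune(model, dataset):
    """Fine-tune `model` on `dataset` and return the resulting model (placeholder)."""
    raise NotImplementedError

def evaluate(model, metric: str) -> float:
    """Score `model` on the given metric, e.g. safety (placeholder)."""
    raise NotImplementedError

def reproduce_setting_two(pretrained_model, safe_data, unsafe_data):
    positive_volumes = [1_000, 5_000, 10_000]      # sizes of positive (safe) subsets
    negative_volumes = [0, 200, 400, 800, 1_600]   # sizes of negative (unsafe) subsets

    rebound_curves = {}
    for n_pos in positive_volumes:
        # Post-train a slice on n_pos positive examples ...
        positive_slice = fine_tune(pretrained_model, safe_data[:n_pos])
        # ... then inverse fine-tune it on increasing amounts of negative data
        # and track how its performance decays.
        rebound_curves[n_pos] = [
            evaluate(fine_tune(positive_slice, unsafe_data[:n_neg]), metric="safety")
            for n_neg in negative_volumes
        ]
    # Rebound predicts: the larger n_pos, the steeper the initial performance drop,
    # before every curve flattens near the pre-training distribution.
    return rebound_curves
```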



For models fine-tuned with a larger amount of positive sample data, performance drops more quickly under only a small amount of negative-sample fine-tuning. Subsequently, the performance decline slows down and tends to stabilize. This result also confirms the previous conclusion: the initial rapid decline in the model's performance is due to **rebound**, as the model is still far from the pre-trained distribution, while the later stabilization is due to **resistance**, as the model is already close to the pre-trained distribution.

To assess the generalizability of the **rebound** phenomenon, we perform additional ablation studies focusing on alignment algorithms, evaluation metrics, and fine-tuning directions. The results consistently validate the presence of the **rebound** phenomenon across language models.



### Setting III: Internal Factor of *Rebound*

All models trained in this experiment can be found at [👍 Models](https://huggingface.co/collections/PKU-Alignment/language-model-resist-alignment-683aa526612e76702e7651ae)

#### Rebound Increases with Model Size

To investigate how the rebound phenomenon varies with model size, we conduct experiments on Qwen models with parameter scales of 0.5B, 4B, and 7B. The results show that as the model parameter size increases, the initial performance decline under negative-data fine-tuning is faster, while the subsequent decline is slower. This indicates that rebound against both positive and negative data grows with parameter size, further suggesting a positive correlation between model *elasticity* and parameter scale.



#### Rebound Increases with Pre-training Data Volume

To verify that rebound increases with the growth of pre-training data, we use pre-training checkpoints released by TinyLlama at different data volumes (2.0T, 2.5T, and 3.0T tokens) and conduct the same experiment. When the pre-training data volume increases, the initial performance decline under negative-data fine-tuning is faster, while the subsequent decline is slower. This demonstrates that larger pre-training data volumes reinforce the rebound of LLMs, consistent with the inference drawn from the main theorem.



## Tutorial For Reproducing Experiment Results
### Installation
Clone the source code from GitHub:

```bash
git clone https://github.com/PKU-Alignment/llms-resist-alignment.git
```

**Native Runner:** Set up a conda environment using [`conda`](https://github.com/conda/conda) / [`mamba`](https://github.com/mamba-org/mamba):

```bash
conda env create --file conda-recipe.yaml # or `mamba env create --file conda-recipe.yaml`
```

### Training

Follow the instructions in section [Installation](#installation) to set up the training environment properly.

```bash
conda activate resist-alignment
export WANDB_API_KEY="..." # your W&B API key here
```

Supervised Fine-Tuning (SFT)

```bash
bash scripts/sft-imdb.sh \
    --train_datasets <your-dataset-name> \
    --model_name_or_path <your-model-name-or-path> \
    --output_dir output/sft
```

NOTE: You may need to update some of the parameters in the script according to your machine setup, such as the number of GPUs for training, the training batch size, etc.

## Acknowledgment

This repository benefits from [Llama2](https://llama.meta.com/llama2/), [TinyLlama](https://github.com/jzhang38/TinyLlama), [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [DeepSpeed](https://github.com/microsoft/DeepSpeed), [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/HEAD/applications/DeepSpeed-Chat), and [Safe-RLHF](https://github.com/PKU-Alignment/safe-rlhf).

Thanks for their wonderful works and their efforts to further promote LLM research.

This work is supported and funded by Peking University.







## License

This repo is released under Apache License 2.0.