{"id":14964584,"url":"https://github.com/pku-alignment/llms-resist-alignment","last_synced_at":"2025-10-01T01:30:59.909Z","repository":{"id":243821928,"uuid":"812578782","full_name":"PKU-Alignment/llms-resist-alignment","owner":"PKU-Alignment","description":"Repo for paper \"Language Models Resist Alignment\"","archived":false,"fork":false,"pushed_at":"2024-06-09T16:16:30.000Z","size":2904,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-29T08:03:47.837Z","etag":null,"topics":["ai-safety","alignment","alpaca","llama","llama2","llama3","llm","llms","rlhf","safe","safe-rlhf","vicuna"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PKU-Alignment.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-09T09:51:58.000Z","updated_at":"2024-10-29T06:59:52.000Z","dependencies_parsed_at":"2024-06-11T11:13:30.610Z","dependency_job_id":"e0e232d7-3430-4efd-802b-28acfbf99b3d","html_url":"https://github.com/PKU-Alignment/llms-resist-alignment","commit_stats":null,"previous_names":["pku-alignment/llms-resist-alignment"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fllms-resist-alignment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fllms-resist-alignment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fllms-resist-alignmen
t/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fllms-resist-alignment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PKU-Alignment","download_url":"https://codeload.github.com/PKU-Alignment/llms-resist-alignment/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230788232,"owners_count":18280301,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-safety","alignment","alpaca","llama","llama2","llama3","llm","llms","rlhf","safe","safe-rlhf","vicuna"],"created_at":"2024-09-24T13:33:27.665Z","updated_at":"2025-10-01T01:30:59.903Z","avatar_url":"https://github.com/PKU-Alignment.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eLanguage Models Resist Alignment: \u003cbr\u003e Evidence From Data Compression\u003c/h1\u003e\n\n\n[🏠 Homepage](https://pku-lm-resist-alignment.github.io/) | [🤗 Code](https://github.com/PKU-Alignment/llms-resist-alignment) | [👍 Models](https://huggingface.co/collections/PKU-Alignment/language-model-resist-alignment-683aa526612e76702e7651ae)\n\n\n## Abstract \nLarge language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. 
Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.\n\n\n### Table of Contents \u003c!-- omit in toc --\u003e\n\n- [Abstract](#abstract)\n- [Language Models Resist Alignment](#language-models-resist-alignment)\n  - [Takeaways](#takeaways)\n- [Main Theorem](#main-theorem)\n  - [The *Elasticity* of Language Models](#the-elasticity-of-language-models)\n  - [*Elasticity* and the Hooke's Law.](#elasticity-and-the-hookes-law)\n- [Experiment Results](#experiment-results)\n  - [Setting I: Existence of Language Models' *Resistance*](#setting-i-existence-of-language-models-resistance)\n  - [Setting II: Existence of Rebound](#setting-ii-existence-of-rebound)\n  - [Setting III: Internal Factor of *Rebound*](#setting-iii-internal-factor-of-rebound)\n    - [Rebound Increases with Model Size](#rebound-increases-with-model-size)\n    - [Rebound Increases with Pre-training Data Volume](#rebound-increases-with-pre-training-data-volume)\n- [Tutorial For 
Reproducing Experiment Results](#tutorial-for-reproducing-experiment-results)\n  - [Installation](#installation)\n  - [Training](#training)\n- [Acknowledgment](#acknowledgment)\n- [License](#license)\n\n\n## Language Models Resist Alignment\n\nLLMs have shown remarkable capabilities. However, due to the inevitable biases and harmful content present in training datasets, LLMs often exhibit behaviors that deviate from human intentions, a phenomenon we refer to as *model misalignment*. Training-based alignment methods, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and other derivatives, are the dominant approaches for aligning models. These methods aim to optimize model behavior by rejecting harmful distributions, ensuring LLMs remain consistent with human intentions and values.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/figure1.png\" width=\"90%\"/\u003e\n\u003c/div\u003e\n\n\nHowever, these alignment methods do not truly penetrate the model representations but merely perform *superficial alignment*. Recent studies have shown that highly safety-aligned models can become unsafe again with minimal fine-tuning. Furthermore, fine-tuning aligned LLMs on non-malicious datasets may also weaken models' safety mechanisms. \n\n\n\u003ch3 align=\"center\"\u003eWhy is alignment so fragile?  \u003c/h3\u003e\n\nIn this work, we make the first exploration of the possible mechanism behind this counterintuitive phenomenon: the existence of an alignment resistance mechanism in language models. This mechanism may limit the alignment process of LLMs to superficial adjustments. It could allow the reversal or revocation of alignment through a series of technical measures, a concept we refer to as *inverse alignment*. 
**What drives language models to resist alignment?** **How does this mechanism lead to *inverse alignment*?**\n\n### Takeaways\n\n- **(Phenomenon)** We uncover that language models exhibit *elasticity*, as illustrated in the main figure and theorem. It encompasses two components: **resistance**, whereby pre-trained models tend to retain their original distribution, and **rebound**, whereby the more deeply a model is aligned, the faster it returns to the pre-trained distribution under reverse fine-tuning. Moreover, the model's change in compression rate across different datasets is inversely proportional to their sizes, which is analogous to the deformation of springs connected in series.\n\n- **(Mechanism)** We systematically model the training and alignment process of language models through compression theory. We elaborate on the compression protocol of language models to explore their training and alignment processes, laying a foundation for subsequent research on *elasticity*.\n\n- **(Validation)** We experimentally observe consistent **resistance** and **rebound** phenomena across various LLMs. This highlights the universality of *elasticity* and the need for systematic approaches to achieve robust and deep alignment.\n\n\n\n## Main Theorem\n\n### The *Elasticity* of Language Models\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/main-theorem.png\" width=\"90%\"/\u003e\n\u003c/div\u003e\n\n\nThe main theorem shows that as the perturbation increases, the normalized compression rate of the model decreases for \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\mathcal{D}_1\" /\u003e and increases for \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\mathcal{D}_2\" /\u003e, and the rate of change is strongly correlated with the size of the datasets. 
Unlike the proportional changes in compression rates across different datasets, the language model seems to *prefer* the dataset with a larger volume, leading to biased model behavior after the perturbation.\n\n### *Elasticity* and the Hooke's Law.\n\nThe inverse proportionality result in the main theorem provides a potential invariant in the model training and alignment process: after perturbation, the rate of change in the compression rates of different datasets is inversely proportional to their sizes, with the absolute value of the product being a constant. This constant characterizes the impact of the perturbation on the model and indirectly describes the model's resistance to perturbations, or its *elasticity*.\n\n\nThe *elasticity* of the model can be intuitively compared to a system of springs in series. Consider two massless springs in series, with spring constants \u003cimg src=\"https://latex.codecogs.com/svg.latex?k_1\" /\u003e and \u003cimg src=\"https://latex.codecogs.com/svg.latex?k_2\" /\u003e, respectively. When the entire system undergoes deformation due to an external force \u003cimg src=\"https://latex.codecogs.com/svg.latex?F\" /\u003e, the system reaches a stable state, and the elastic force exerted by each spring is equal to \u003cimg src=\"https://latex.codecogs.com/svg.latex?F\" /\u003e. According to Hooke's Law, the elongations \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20l_1\" /\u003e and \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20l_2\" /\u003e of the two springs are each inversely proportional to the corresponding spring constant. 
Thus, in this system, we have:\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/equation1.png\" width=\"20%\"/\u003e\n\u003c/div\u003e\n\nIn the language model setting, after integrating the main theorem with respect to \u003cimg src=\"https://latex.codecogs.com/svg.latex?l\" /\u003e, we find that \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta\\gamma_{p_{\\theta}}^{\\mathcal{D}_i/\\mathcal{D}}\" /\u003e across different datasets, which is equivalent to the change in the KL divergence \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20D_{\\text{KL}}(\\mathcal{P}_{p_{\\theta}}||\\mathcal{P}_{\\mathcal{D}_{i}})\" /\u003e between the model's distribution and the distributions of the individual datasets, is inversely proportional to the dataset size \u003cimg src=\"https://latex.codecogs.com/svg.latex?|\\mathcal{D}_i|\" /\u003e. Here, we only consider the absolute value of \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20D_{\\text{KL}}\" /\u003e. Analogous to the series spring model, the *elasticity* \u003cimg src=\"https://latex.codecogs.com/svg.latex?F\" /\u003e in LLMs satisfies:\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/equation2.png\" width=\"20%\"/\u003e\n\u003c/div\u003e\n\nwhere \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20D_{\\text{KL}}\" /\u003e corresponds to \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\Delta%20l\" /\u003e in the spring model, while \u003cimg src=\"https://latex.codecogs.com/svg.latex?|\\mathcal{D}|\" /\u003e corresponds to the spring constant \u003cimg src=\"https://latex.codecogs.com/svg.latex?k\" /\u003e, thus leading to the *elasticity* of LLMs.\n\n\n\n\n✨ ***For more details, please see our [paper](https://arxiv.org/pdf/2406.06144) and [website](https://pku-lm-resist-alignment.github.io/)!*** 🚀\n\n## Experiment Results\nIn the previous sections, we proved that LLMs exhibit *elasticity*. 
This section analyzes two specific phenomena that arise from it:\n\n- **Resistance for Pre-Trained Models:** Models tend to maintain the original distribution and resist alignment\n- **Rebound for Post-Trained Models:** Fine-tuning in the opposite direction of post-training (*e.g.*, safe *vs* unsafe) causes post-trained models to return quickly to the pre-training distribution\n\n\n### Setting I: Existence of Language Models' *Resistance*\n\nWe verify the existence of *resistance* by showing that *forward alignment* is harder than *inverse alignment* for pre-trained models. Specifically, we first perform one epoch of SFT on a pre-trained LLM with parameters \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_0\" /\u003e, saving the slices \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\{\\theta_1,%20\\theta_2,%20\\ldots,%20\\theta_n\\}\" /\u003e. Subsequently, without loss of generality, we collect the responses of slices \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_{k}\" /\u003e and \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_{l}\" /\u003e (where \u003cimg src=\"https://latex.codecogs.com/svg.latex?k%20\u003c%20l\" /\u003e) on held-out prompts, forming datasets \u003cimg src=\"https://latex.codecogs.com/svg.latex?D_{k}\" /\u003e and \u003cimg src=\"https://latex.codecogs.com/svg.latex?D_{l}\" /\u003e. 
As shown in Figure 1, we define *forward alignment* (**_Path A_**) as the process of training \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_{k}\" /\u003e on \u003cimg src=\"https://latex.codecogs.com/svg.latex?D_{l}\" /\u003e, and *inverse alignment* (**_Path B_**) as the process of training \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_{l}\" /\u003e on \u003cimg src=\"https://latex.codecogs.com/svg.latex?D_{k}\" /\u003e.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/setting1.png\" width=\"75%\"/\u003e\n\u003c/div\u003e\n\nAs shown in the experimental results table, the training loss of *inverse alignment* consistently remains lower than that of *forward alignment*, regardless of which slice pair is selected. This observation holds across all models and datasets in our experiments, demonstrating that *inverse alignment* is easier than *forward alignment* and validating the existence of *resistance*.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"results/exp_setting1.png\" width=\"95%\"/\u003e\n\u003c/div\u003e\n\n\n\n### Setting II: Existence of Rebound\n\nWe verify the existence of *rebound* by demonstrating that for post-trained models, the more *positive* the post-trained models' performance becomes, the more *negative* it turns after *inverse alignment*. We validate this on tasks involving two opposing characteristics (*e.g.*, safe and unsafe). We first train slices \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\{\\theta_1,%20\\theta_2,%20...,%20\\theta_n\\}\" /\u003e based on a pre-trained model \u003cimg src=\"https://latex.codecogs.com/svg.latex?\\theta_0\" /\u003e using positive data (*e.g.*, safe) of various volumes. 
Then we perform inverse fine-tuning on these slices using negative data (*i.e.*, unsafe).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/setting2.png\" width=\"75%\"/\u003e\n\u003c/div\u003e\n\nFor models fine-tuned with a larger amount of positive sample data, performance drops more quickly after only a small amount of negative sample fine-tuning. Subsequently, the performance decline slows down and tends to stabilize. This result also confirms the previous conclusion: the initial rapid decline of the model's performance is due to **rebound**, as the model is far from the pre-trained distribution; while the later stabilization is due to **resistance**, as the model is already close to the pre-trained distribution. \n\nTo assess the generalizability of the **rebound** phenomenon, we perform additional ablation studies focusing on alignment algorithms, evaluation metrics, and fine-tuning directions. The results consistently validate the presence of the **rebound** phenomenon across language models.\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"results/exp_setting2.png\" width=\"95%\"/\u003e\n\u003c/div\u003e\n\n### Setting III: Internal Factor of *Rebound*\n\nAll models trained in this experiment can be found at [👍 Models](https://huggingface.co/collections/PKU-Alignment/language-model-resist-alignment-683aa526612e76702e7651ae)\n\n\n\n#### Rebound Increases with Model Size\n\nTo investigate how the rebound phenomenon varies with model size, we conducted experiments on Qwen models with parameter scales of 0.5B, 4B, and 7B. The experimental results show that as the model parameter size increases, the initial performance decline due to negative data fine-tuning is faster, while the subsequent decline is slower. 
This indicates that as the parameter size increases, rebound in response to both positive and negative data increases, further suggesting a positive correlation between model *elasticity* and parameter scale.\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"results/exp_setting31.png\" width=\"95%\"/\u003e\n\u003c/div\u003e\n\n#### Rebound Increases with Pre-training Data Volume\n\nTo verify that rebound increases with the growth of pre-training data, we use the pre-training slices (2.0T, 2.5T, and 3.0T) released by TinyLlama and conduct the same experiment. When the pre-training data volume increases, the initial performance decline due to negative data fine-tuning is faster, while the subsequent decline is slower. This demonstrates that larger pre-training data volumes reinforce the rebound of LLMs, which is consistent with the inference proposed in the main theorem.\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"results/exp_setting32.png\" width=\"95%\"/\u003e\n\u003c/div\u003e\n\n## Tutorial For Reproducing Experiment Results\n### Installation\nClone the source code from GitHub:\n\n```bash\ngit clone https://github.com/PKU-Alignment/llms-resist-alignment.git\n```\n\n**Native Runner:** Set up a conda environment using [`conda`](https://github.com/conda/conda) / [`mamba`](https://github.com/mamba-org/mamba):\n\n```bash\nconda env create --file conda-recipe.yaml  # or `mamba env create --file conda-recipe.yaml`\n```\n\n### Training\n\nFollow the instructions in section [Installation](#installation) to set up the training environment properly.\n\n```bash\nconda activate resist-alignment\nexport WANDB_API_KEY=\"...\"  # your W\u0026B API key here\n```\n\nSupervised Fine-Tuning (SFT)\n\n```bash\nbash scripts/sft-imdb.sh \\\n    --train_datasets \u003cyour-dataset\u003e \\\n    --model_name_or_path \u003cyour-model-name-or-checkpoint-path\u003e \\\n    --output_dir output/sft\n```\n\nNOTE: You may need to update some of the parameters in the script 
according to your machine setup, such as the number of GPUs for training, the training batch size, etc. \n\n\n## Acknowledgment\n\nThis repository benefits from [Llama2](https://llama.meta.com/llama2/), [TinyLlama](https://github.com/jzhang38/TinyLlama), [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [DeepSpeed](https://github.com/microsoft/DeepSpeed), [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/HEAD/applications/DeepSpeed-Chat), and [Safe-RLHF](https://github.com/PKU-Alignment/safe-rlhf).\n\n\n\nWe thank the authors for their wonderful work and their efforts to further promote LLM research.\n\nThis work is supported and funded by Peking University.\n\n\u003ctable width=\"50%\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n  \u003ctr align=\"center\" valign=\"middle\"\u003e\n    \u003ctd width=\"40%\"\u003e\n      \u003ca href=\"https://www.ai.pku.edu.cn/\"\u003e\n        \u003cimg src=\"logo/pku-ai.png\" width=\"100%\"/\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n## License\n\nThis repo is released under the Apache License 2.0.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpku-alignment%2Fllms-resist-alignment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpku-alignment%2Fllms-resist-alignment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpku-alignment%2Fllms-resist-alignment/lists"}