Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/XuandongZhao/weak-to-strong
Weak-to-Strong Jailbreaking on Large Language Models
https://github.com/XuandongZhao/weak-to-strong
Last synced: 2 months ago
JSON representation
Weak-to-Strong Jailbreaking on Large Language Models
- Host: GitHub
- URL: https://github.com/XuandongZhao/weak-to-strong
- Owner: XuandongZhao
- License: mit
- Created: 2024-01-28T19:48:07.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-02-21T19:56:48.000Z (9 months ago)
- Last Synced: 2024-02-21T20:49:18.258Z (9 months ago)
- Language: Python
- Size: 675 KB
- Stars: 24
- Watchers: 3
- Forks: 5
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-LLMSecOps - Weak-to-Strong Generalization - to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision | ![GitHub stars](https://img.shields.io/github/stars/XuandongZhao/weak-to-strong?style=social) | (PoC)
README
# Weak-to-Strong Jailbreaking on Large Language Models
[arXiv page](https://arxiv.org/abs/2401.17256)
[Huggingface page](https://huggingface.co/papers/2401.17256)
## Introduction
Although significant efforts have been dedicated to aligning large language models (LLMs), red-teaming reports suggest that these carefully aligned LLMs could still be jailbroken through adversarial prompts, tuning, or decoding. Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates us to propose the weak-to-strong jailbreaking attack, where adversaries can utilize smaller unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly larger aligned LLMs (e.g., 70B). To jailbreak, one only needs to additionally decode two smaller LLMs once, which involves minimal computation and latency compared to decoding the larger LLMs.
You can see the following figure for a brief illustration of our attack.
![img](./fig/pipeline.png)We summarize different jailbreaking methods' strengths and weaknesses in the following table.
## Structure
- `data/`: Contains the data used for the experiments.
- `run.py`: Contains the scripts used to run the experiments.
- `generate.py`: Contains the scripts used to generate the results.
- `eval_asr.py`: Contains the scripts used to evaluate the attack success rate.
- `eval_gpt.py`: Contains the scripts used to evaluate the GPT4 scores.
- `eval_harm.py`: Contains the scripts used to evaluate the Harm scores.For getting the unsafe small model, please refer to this repo: https://github.com/BeyonderXX/ShadowAlignment
## Running the experiments
```bash
python run.py --beta 1.50 --batch_size 16 --output_file "[OUTPUT FILE NAME]" --att_file "./data/advbench.txt'
```
Need to confige the bad model path in `run.py` firstly.## Evaluating the results
Find the examples in `eval_asr.py`, `eval_gpt.py`, and `eval_harm.py` to evaluate the results.
## Citation
If you find the code useful, please cite the following paper:```
@article{zhao2024weak,
title={Weak-to-Strong Jailbreaking on Large Language Models},
author={Zhao, Xuandong and Yang, Xianjun and Pang, Tianyu and Du, Chao and Li, Lei and Wang, Yu-Xiang and Wang, William Yang},
journal={arXiv preprint arXiv:2401.17256},
year={2024}
}
```