Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/robustnlp/cipherchat
A framework to evaluate the generalization capability of safety alignment for LLMs
- Host: GitHub
- URL: https://github.com/robustnlp/cipherchat
- Owner: RobustNLP
- License: mit
- Created: 2023-08-10T05:55:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-24T14:17:15.000Z (4 months ago)
- Last Synced: 2024-07-30T21:42:09.003Z (3 months ago)
- Topics: alignment, chatgpt, gpt-4-0613, jailbreak, large-language-models, llm, security
- Language: Python
- Homepage:
- Size: 16.4 MB
- Stars: 551
- Watchers: 9
- Forks: 61
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ChatGPT-repositories - CipherChat - A framework to evaluate the generalization capability of safety alignment for LLMs (NLP)
README
CipherChat 🔐
CipherChat is a novel framework to systematically examine the generalizability of safety alignment to non-natural languages, namely ciphers.
If you have any questions, please feel free to email the first author: [Youliang Yuan](https://github.com/YouliangYuan).
## 👉 Paper
For more details, please refer to our [ICLR 2024 paper](https://openreview.net/forum?id=MbfAK4s61A).
LOVE💗 and Peace🌊
RESEARCH USE ONLY✅ NO MISUSE❌
## Our results
We provide our results (query-response pairs) in `experimental_results`; these files can be loaded with `torch.load()`, which returns a list whose first element is the config and whose remaining elements are the query-response pairs.
```python
import torch

# Each results file is a list: the first element is the run config,
# the rest are the query-response pairs.
result_data = torch.load(filename)
config = result_data[0]
pairs = result_data[1:]
```

## 🛠️ Usage
✨An example run:
```bash
python3 main.py \
--model_name gpt-4-0613 \
--data_path data/data_en_zh.dict \
--encode_method caesar \
--instruction_type Crimes_And_Illegal_Activities \
--demonstration_toxicity toxic \
--language en
```
## 🔧 Argument Specification
1. `--model_name`: The name of the model to evaluate.
2. `--data_path`: Select the data to run.
3. `--encode_method`: Select the cipher to use.
4. `--instruction_type`: Select the domain of the data.
5. `--demonstration_toxicity`: Select toxic or safe demonstrations.
6. `--language`: Select the language of the data.
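Since each run is a plain CLI invocation, batch experiments can be scripted. The sketch below is a hedged illustration that reuses only the flag values shown in the example run above; any additional `--instruction_type` values would need to match the repository's actual options.

```python
import subprocess

# Hedged sketch: sweep a list of instruction domains through main.py.
# Only Crimes_And_Illegal_Activities is confirmed by the example above;
# add further domain names only if the repository defines them.
domains = ["Crimes_And_Illegal_Activities"]

for domain in domains:
    subprocess.run(
        [
            "python3", "main.py",
            "--model_name", "gpt-4-0613",
            "--data_path", "data/data_en_zh.dict",
            "--encode_method", "caesar",
            "--instruction_type", domain,
            "--demonstration_toxicity", "toxic",
            "--language", "en",
        ],
        check=True,  # abort the sweep if any run exits non-zero
    )
```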
## 💡Framework
Our approach presumes that, since human feedback and safety alignment are expressed in natural language, a human-unreadable cipher can potentially bypass safety alignment. Intuitively, we first teach the LLM to comprehend the cipher by designating it as a cipher expert and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into the cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the model. Finally, we employ a rule-based decrypter to convert the model output from cipher format back into natural language.
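To make the enciphering and deciphering steps concrete, here is a minimal sketch of a Caesar cipher; the shift value of 3 is an assumption, and the repository's `--encode_method caesar` implementation may differ in detail.

```python
# Minimal Caesar cipher sketch (shift of 3 assumed; the repo's
# implementation may differ).
SHIFT = 3

def caesar_encipher(text: str, shift: int = SHIFT) -> str:
    """Shift alphabetic characters forward; leave everything else unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decipher(text: str, shift: int = SHIFT) -> str:
    """Invert the encipher step to recover natural-language text."""
    return caesar_encipher(text, -shift)

# Round trip: the query is enciphered before being sent to the LLM, and
# the model's ciphered reply is decoded by the rule-based decrypter.
assert caesar_decipher(caesar_encipher("Hello, world")) == "Hello, world"
```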
## 📃Results
The query-response pairs from our experiments are stored as lists in the `experimental_results` folder and can be loaded with `torch.load()`, as shown above.
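For instance, a quick sanity check over one results file might look like the following sketch; the file path is hypothetical, and the exact structure of each pair record is an assumption beyond what is described above.

```python
import torch

# Hypothetical path; files in experimental_results/ follow the
# [config, pair, pair, ...] layout described above.
result_data = torch.load("experimental_results/example.pt")
config, pairs = result_data[0], result_data[1:]

print(f"config: {config}")
print(f"loaded {len(pairs)} query-response pairs")
```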
### 🌰Case Study
### 🫠Ablation Study
### 🦙Other Models
[![Star History Chart](https://api.star-history.com/svg?repos=RobustNLP/CipherChat&type=Date)](https://star-history.com/#RobustNLP/CipherChat&Date)
Community Discussion:
- Twitter: [AIDB](https://twitter.com/ai_database/status/1691655307892830417), [Jiao Wenxiang](https://twitter.com/WenxiangJiao/status/1691363450604457984)

## Citation
If you find our paper and tool interesting and useful, please feel free to give us a star and cite us via:
```bibtex
@inproceedings{
yuan2024cipherchat,
title={{GPT}-4 Is Too Smart To Be Safe: Stealthy Chat with {LLM}s via Cipher},
author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=MbfAK4s61A}
}
```