Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/RobustNLP/CipherChat
A framework to evaluate the generalization capability of safety alignment for LLMs
https://github.com/RobustNLP/CipherChat
alignment chatgpt gpt-4-0613 jailbreak large-language-models llm security
Last synced: 3 months ago
JSON representation
A framework to evaluate the generalization capability of safety alignment for LLMs
- Host: GitHub
- URL: https://github.com/RobustNLP/CipherChat
- Owner: RobustNLP
- License: mit
- Created: 2023-08-10T05:55:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-24T14:17:15.000Z (6 months ago)
- Last Synced: 2024-08-02T01:24:19.195Z (5 months ago)
- Topics: alignment, chatgpt, gpt-4-0613, jailbreak, large-language-models, llm, security
- Language: Python
- Homepage:
- Size: 16.4 MB
- Stars: 551
- Watchers: 9
- Forks: 61
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-chatgpt - RobustNLP/CipherChat - CipherChat is a Python framework for evaluating the generalization capability of safety alignment for Language Models (LLMs). (SDK, Libraries, Frameworks / Python)
- Awesome-LLMSecOps - CipherChat
README
CipherChat 🔐
A novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages – ciphers.
If you have any questions, please feel free to email the first author: [Youliang Yuan](https://github.com/YouliangYuan).
## 👉 Paper
For more details, please refer to our paper [ICLR 2024](https://openreview.net/forum?id=MbfAK4s61A).
LOVE💗 and Peace🌊
RESEARCH USE ONLY✅ NO MISUSE❌
## Our results
We provide our results (query-response pairs) in `experimental_results`, these files can be loaded by `torch.load()`. Then, you can get a list: the first element is the config and the rest of the elements are the query-response pairs.
```
result_data = torch.load(filename)
config = result_data[0]
pairs = result_data[1:]
```## 🛠️ Usage
✨An example run:
```
python3 main.py \
--model_name gpt-4-0613 \
--data_path data/data_en_zh.dict \
--encode_method caesar \
--instruction_type Crimes_And_Illegal_Activities \
--demonstration_toxicity toxic \
--language en
```
## 🔧 Argument Specification
1. `--model_name`: The name of the model to evaluate.2. `--data_path`: Select the data to run.
3. `--encode_method`: Select the cipher to use.
4. `--instruction_type`: Select the domain of data.
5. `--demonstration_toxicity`: Select the toxic or safe demonstrations.
6. `--language`: Select the language of the data.
## 💡Framework
Our approach presumes that since human feedback and safety alignments are presented in natural language, using a human-unreadable cipher can potentially bypass the safety alignments effectively. Intuitively, we first teach the LLM to comprehend the cipher clearly by designating the LLM as a cipher expert, and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into a cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLMs. We finally employ a rule-based decrypter to convert the model output from a cipher format into the natural language form.
## 📃Results
The query-responses pairs in our experiments are all stored in the form of a list in the "experimental_results" folder, and torch.load() can be used to load data.
### 🌰Case Study
### 🫠Ablation Study
### 🦙Other Models
[![Star History Chart](https://api.star-history.com/svg?repos=RobustNLP/CipherChat&type=Date)](https://star-history.com/#RobustNLP/CipherChat&Date)
Community Discussion:
- Twitter: [AIDB](https://twitter.com/ai_database/status/1691655307892830417), [Jiao Wenxiang](https://twitter.com/WenxiangJiao/status/1691363450604457984)## Citation
If you find our paper&tool interesting and useful, please feel free to give us a star and cite us through:
```bibtex
@inproceedings{
yuan2024cipherchat,
title={{GPT}-4 Is Too Smart To Be Safe: Stealthy Chat with {LLM}s via Cipher},
author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=MbfAK4s61A}
}```