# ☔️ RAIN: Your Language Models Can Align Themselves without Finetuning
[![arXiv](https://img.shields.io/badge/arXiv-paper-b31b1b.svg)](https://arxiv.org/abs/2309.07124) [![License](https://img.shields.io/badge/License-BSD_2--Clause-orange.svg)](https://opensource.org/licenses/BSD-2-Clause) [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/SafeAILab/RAIN/issues) [![Contributions welcome](https://img.shields.io/badge/Contributions-welcome-brightgreen.svg?style=flat)](https://github.com/SafeAILab/RAIN/pulls)

## Introduction
**RAIN** is an innovative inference method that integrates self-evaluation and rewind mechanisms, enabling frozen large language models to directly produce responses consistent with human preferences without requiring additional alignment data or model fine-tuning. It thereby offers an effective approach to AI safety.
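The mechanism can be pictured as a generate–evaluate–rewind loop. Below is a minimal illustrative sketch of that loop, not the repository's implementation: `propose_continuation` and `self_evaluate` are hypothetical stand-ins for the frozen model's generation step and for the model judging its own partial output, and the actual method performs a more elaborate search over token sets (see the paper and code).

```python
import random

# Illustrative sketch only (NOT the RAIN implementation). The two functions
# below are hypothetical stand-ins for the frozen LLM's generation step and
# for the model scoring its own partial output via an evaluation prompt.

def propose_continuation(prefix: str) -> str:
    """Stand-in: the frozen model proposes the next chunk of the response."""
    candidates = [
        " Sure, here is how to do that:",
        " I can't help with that, but here is a safe alternative:",
        " Let me answer carefully:",
    ]
    return random.choice(candidates)

def self_evaluate(prefix: str, chunk: str) -> float:
    """Stand-in: the same frozen model judges its own partial output;
    higher means better aligned with human preferences."""
    return 0.0 if chunk.startswith(" Sure, here is how") else 1.0

def rain_decode(prompt: str, max_chunks: int = 3, max_rewinds: int = 8,
                threshold: float = 0.5) -> str:
    response = ""
    for _ in range(max_chunks):
        for _ in range(max_rewinds):
            chunk = propose_continuation(prompt + response)    # forward step
            if self_evaluate(prompt + response, chunk) >= threshold:
                response += chunk                              # accept the chunk
                break
            # otherwise rewind: discard the chunk and resample from the prefix
    return response

print(rain_decode("User: How do I pick a strong password?\nAssistant:"))
```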
## Main Results
### HH dataset
The following figure displays the experimental results on the [Anthropic’s Helpful and Harmless (HH) dataset](https://arxiv.org/abs/2204.05862), showing helpfulness vs. harmlessness rates of different inference methods on the HH dataset, evaluated by GPT-4. **Left:** [LLaMA](https://arxiv.org/abs/2302.13971) (7B, 13B, 30B, 65B). **Right:** [LLaMA-2](https://arxiv.org/abs/2307.09288) (7B, 13B, 70B).
### AdvBench dataset
The following figure displays the experimental results on [AdvBench](https://arxiv.org/abs/2307.15043) under the [Greedy Coordinate Gradient (GCG) attack](https://arxiv.org/abs/2307.15043). White-box attacks optimize a specific attack suffix for each model by leveraging that model's gradient, while transfer attacks use Vicuna 7B and 13B to optimize a universal attack suffix from the combination of the two models' gradients and then employ it to attack other models.
### TruthfulQA dataset
The following figure displays the experimental results on the [TruthfulQA dataset](https://arxiv.org/abs/2109.07958) with [LLaMA-2-chat 13B](https://arxiv.org/abs/2307.09288). We fine-tune two GPT-3 models through OpenAI's fine-tuning service to separately assess whether the model's responses are truthful and informative.
### Time efficiency
Curious about the time overhead compared to vanilla inference? Here it is! Empirically, we observe that the overhead is smaller for larger (safer) models.
## Setup & Installation
```bash
conda env create -f rain.yaml
```

## Running
### HH dataset
```bash
cd HH
python allocation.py --nump p
```

The parameter `nump` specifies the number of processes. If you run on a machine with 8 GPUs and set `nump=4`, each process will use 2 GPUs.
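As an illustration of that arithmetic (an assumption about an even split; the actual GPU assignment is handled inside `allocation.py`):

```python
# Illustration only: splitting 8 visible GPUs evenly across nump processes.
# With nump=4, each process gets 2 GPUs; treat this as a sketch of the
# intended arithmetic, not the logic used by allocation.py.
num_gpus = 8
nump = 4
per_process = num_gpus // nump
groups = [list(range(i * per_process, (i + 1) * per_process)) for i in range(nump)]
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```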
### AdvBench
```bash
cd adv
```

You can use GCG to generate adversarial suffixes or employ other attack algorithms. Save the attack results as `yourdata.json` with the following format:
```json
[
{
"goal": "instruction or question",
"controls": "Adversarial suffix"
},
]
```
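For example, such a file could be written with a few lines of Python (the entries below are placeholders copied from the schema above, not real attack results):

```python
import json

# Placeholder records in the expected schema: "goal" is the original harmful
# instruction and "controls" is the adversarial suffix produced by the attack.
records = [
    {"goal": "instruction or question", "controls": "Adversarial suffix"},
]

with open("yourdata.json", "w") as f:
    json.dump(records, f, indent=4)
```

With `yourdata.json` in place, run: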
```bash
python allocation.py --dataset yourdata.json --nump p
```

### TruthfulQA dataset
```bash
cd truth
python allocation.py --nump p
```

## Reference
For technical details and full experimental results, please check [the paper](https://browse.arxiv.org/pdf/2309.07124.pdf).
```
@inproceedings{li2024rain,
  author    = {Yuhui Li and Fangyun Wei and Jinjing Zhao and Chao Zhang and Hongyang Zhang},
  title     = {RAIN: Your Language Models Can Align Themselves without Finetuning},
  booktitle = {International Conference on Learning Representations},
  year      = {2024}
}
```

## Contact
Please contact Yuhui Li at [email protected] if you have any questions about the code. If you find this repository useful, please consider giving it a ⭐.