{"id":13541201,"url":"https://github.com/SafeAILab/RAIN","last_synced_at":"2025-04-02T08:30:59.269Z","repository":{"id":198985069,"uuid":"701923649","full_name":"SafeAILab/RAIN","owner":"SafeAILab","description":"[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning","archived":false,"fork":false,"pushed_at":"2024-05-23T08:09:41.000Z","size":300,"stargazers_count":81,"open_issues_count":4,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-03T06:33:09.434Z","etag":null,"topics":["ai-safety","alignment","large-language-models"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2309.07124","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SafeAILab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-08T01:12:23.000Z","updated_at":"2024-10-16T06:51:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"bf6f78c1-3010-46fa-aa8e-cec8236841aa","html_url":"https://github.com/SafeAILab/RAIN","commit_stats":null,"previous_names":["safeailab/rain"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SafeAILab%2FRAIN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SafeAILab%2FRAIN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SafeAILab%2FRAIN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SafeAILab%2FRAIN/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/SafeAILab","download_url":"https://codeload.github.com/SafeAILab/RAIN/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246781937,"owners_count":20832937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-safety","alignment","large-language-models"],"created_at":"2024-08-01T10:00:41.305Z","updated_at":"2025-04-02T08:30:58.870Z","avatar_url":"https://github.com/SafeAILab.png","language":"Python","readme":"# ☔️ RAIN: Your Language Models Can Align Themselves without Finetuning\n[![arXiv](https://img.shields.io/badge/arXiv-paper-b31b1b.svg)](https://arxiv.org/abs/2309.07124) [![License](https://img.shields.io/badge/License-BSD_2--Clause-orange.svg)](https://opensource.org/licenses/BSD-2-Clause) [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/SafeAILab/RAIN/issues) [![Contributions welcome](https://img.shields.io/badge/Contributions-welcome-brightgreen.svg?style=flat)](https://github.com/SafeAILab/RAIN/pulls)\n\n## Introduction\n\n**RAIN** is an inference method that integrates self-evaluation and rewind mechanisms, enabling frozen large language models to directly produce responses consistent with human preferences, without requiring additional alignment data or model fine-tuning. It thereby offers a training-free approach to AI safety.\n\n## Main Results\n\n### HH dataset\n\nThe following figure displays the experimental results on [Anthropic’s Helpful and Harmless (HH) dataset](https://arxiv.org/abs/2204.05862), showing helpfulness vs. 
harmlessness rates of different inference methods, evaluated by GPT-4. **Left:** [LLaMA](https://arxiv.org/abs/2302.13971) (7B, 13B, 30B, 65B). **Right:** [LLaMA-2](https://arxiv.org/abs/2307.09288) (7B, 13B, 70B).\n\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"./figs/hh.png\" alt=\"Results\" style=\"zoom:33%;\" width=\"80%\" height=\"80%\" /\u003e\n\u003c/div\u003e\n\n### AdvBench dataset\nThe following figure displays the experimental results on [AdvBench](https://arxiv.org/abs/2307.15043) under the [Greedy Coordinate Gradient (GCG) attack](https://arxiv.org/abs/2307.15043). White-box attacks optimize specific attack suffixes by leveraging the gradient of each model, while transfer attacks use Vicuna 7B and 13B to optimize a universal attack suffix using a combination of the two models’ gradients and subsequently employ it to attack other models.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"./figs/adv.png\" alt=\"Results\" style=\"zoom:33%;\" width=\"80%\" height=\"80%\" /\u003e\n\u003c/div\u003e\n\n### TruthfulQA dataset\nThe following figure displays the experimental results on the [TruthfulQA dataset](https://arxiv.org/abs/2109.07958) with [LLaMA-2-chat 13B](https://arxiv.org/abs/2307.09288). We fine-tune two GPT-3 models via OpenAI’s fine-tuning service to separately assess whether the model’s responses are truthful and informative.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"./figs/truth.png\" alt=\"Results\" style=\"zoom:33%;\" width=\"40%\" height=\"40%\" /\u003e\n\u003c/div\u003e\n\n### Time efficiency\nCurious about the time overhead relative to vanilla inference? Here it is! 
Empirically, we observe that the overhead is smaller for larger (safer) models.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"./figs/time.png\" alt=\"Results\" style=\"zoom:33%;\" width=\"60%\" height=\"60%\" /\u003e\n\u003c/div\u003e\n\n## Setup \u0026 Installation\n\n```bash\nconda env create -f rain.yaml\n```\n\n## Running\n\n### HH dataset\n\n```bash\ncd HH\npython allocation.py --nump p\n```\n\nThe parameter \"nump\" sets the number of processes. If you run on a machine with 8 GPUs and set nump=4, each process will use 2 GPUs.\n\n### AdvBench\n\n```bash\ncd adv\n```\n\nYou can use GCG to generate adversarial suffixes or employ other attack algorithms. Save the attack results as \"yourdata.json\" in the following format:\n\n```json\n[\n    {\n        \"goal\": \"instruction or question\",\n        \"controls\": \"adversarial suffix\"\n    }\n]\n```\n\n```bash\npython allocation.py --dataset yourdata.json --nump p\n```\n\n### TruthfulQA dataset\n\n```bash\ncd truth\npython allocation.py --nump p\n```\n\n## Reference\nFor technical details and full experimental results, please check [the paper](https://browse.arxiv.org/pdf/2309.07124.pdf).\n```\n@inproceedings{li2024rain,\n\tauthor = {Yuhui Li and Fangyun Wei and Jinjing Zhao and Chao Zhang and Hongyang Zhang},\n\ttitle = {RAIN: Your Language Models Can Align Themselves without Finetuning},\n\tbooktitle = {International Conference on Learning Representations},\n\tyear = {2024}\n}\n```\n\n## Contact\nPlease contact Yuhui Li at yuhui.li@stu.pku.edu.cn if you have any questions about the code. If you find this repository useful, please consider giving it a ⭐.\n","funding_links":[],"categories":["Papers"],"sub_categories":["2023"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSafeAILab%2FRAIN","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSafeAILab%2FRAIN","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSafeAILab%2FRAIN/lists"}