https://github.com/poloclub/llm-landscape
NeurIPS'24 - LLM Safety Landscape
https://github.com/poloclub/llm-landscape
llm llm-landscape llm-safety llm-safety-landscape safety-basin
Last synced: 5 months ago
JSON representation
NeurIPS'24 - LLM Safety Landscape
- Host: GitHub
- URL: https://github.com/poloclub/llm-landscape
- Owner: poloclub
- License: mit
- Created: 2024-10-12T16:18:49.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-25T20:26:16.000Z (8 months ago)
- Last Synced: 2025-02-25T21:29:23.683Z (8 months ago)
- Topics: llm, llm-landscape, llm-safety, llm-safety-landscape, safety-basin
- Language: Python
- Homepage: https://shengyun-peng.github.io/papers/llm-safety-landscape
- Size: 812 KB
- Stars: 18
- Watchers: 1
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [NeurIPS'24]
[](https://arxiv.org/abs/2405.17374)
You can visualize the safety and capability landscapes of your own LLM!
- Plot the **safety basin** of your own model: if you make small, random tweaks to the model's weights, it stays as safe as the original model within a certain range. However, when these tweaks get large enough, there’s a tipping point where the model’s safety suddenly breaks down.
- Harmful finetuning attacks (HFA) compromise safety by dragging the model away from the safety basin.
- This safety landscape also shows that the system prompt plays a huge role in keeping the model safe, and that this protection extends to slightly tweaked versions of the model within the safety basin.
- When we test the model’s safety with jailbreaking prompts, we see that these prompts are very sensitive to even small changes in the model's weights.
![]()
## Research Paper
[**Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models**](https://arxiv.org/abs/2405.17374)ShengYun Peng1,
Pin-Yu Chen2,
Matthew Hull1,
Duen Horng Chau1
1Georgia Tech,
2IBM ResearchIn *NeurIPS 2024*.
## Quick Start
You can plot the 1D and 2D LLM landscapes and compute the VISAGE score for your own models. We are using Llama2-7b-chat as an example. Please modify the yaml file under `/config` for customized experiments.### Setup
```bash
make .done_venv
```### Compute direction
```bash
make direction
```It consume ~27G on a single A100 GPU. The computed direction is stored at `experiments/advbench/1D_random/llama2/dirs1.pt`.
### Visualize landscape and compute VISAGE score
```bash
make landscape
```Change `NGPU` in Makefile to the number of devices on your hardware.
Change `batch_size` at `config/dataset/default.yaml` to avoid CUDA OOM.
Model generations are saved at `experiments/advbench/1D_random/llama2/output.jsonl`.
The landscape visualization is saved at `experiments/advbench/1D_random/llama2/1D_random_llama2_landscape.png`.
## Citation
```bibtex
@article{peng2024navigating,
title={Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models},
author={Peng, ShengYun and Chen, Pin-Yu and Hull, Matthew and Chau, Duen Horng},
journal={arXiv preprint arXiv:2405.17374},
year={2024}
}
```## Contact
If you have any questions, feel free to [open an issue](https://github.com/poloclub/llm-landscape/issues/new) or contact [Anthony Peng](https://shengyun-peng.github.io/).