https://github.com/poloclub/llm-landscape

NeurIPS'24 - LLM Safety Landscape

Host: GitHub
URL: https://github.com/poloclub/llm-landscape
Owner: poloclub
License: MIT
Created: 2024-10-12T16:18:49.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-02-25T20:26:16.000Z (8 months ago)
Last Synced: 2025-02-25T21:29:23.683Z (8 months ago)
Topics: llm, llm-landscape, llm-safety, llm-safety-landscape, safety-basin
Language: Python
Homepage: https://shengyun-peng.github.io/papers/llm-safety-landscape
Size: 812 KB
Stars: 18
Watchers: 1
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [NeurIPS'24]

[![arxiv badge](https://img.shields.io/badge/arXiv-2405.17374-red)](https://arxiv.org/abs/2405.17374)

You can visualize the safety and capability landscapes of your own LLM!

- Plot the **safety basin** of your own model: if you make small, random tweaks to the model's weights, it stays as safe as the original model within a certain range. However, when these tweaks get large enough, there’s a tipping point where the model’s safety suddenly breaks down.
- Harmful finetuning attacks (HFA) compromise safety by dragging the model away from the safety basin.
- This safety landscape also shows that the system prompt plays a huge role in keeping the model safe, and that this protection extends to slightly tweaked versions of the model within the safety basin.
- When we test the model’s safety with jailbreaking prompts, we see that these prompts are very sensitive to even small changes in the model's weights.
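The basin behavior described above can be illustrated with a toy 1D scan. This is a minimal sketch, not the repository's implementation: `toy_safety` is a hypothetical stand-in for a real safety metric, the "model" is a short list of floats, and the radius of 1.25 is an arbitrary value chosen for illustration.

```python
import random

def perturb(weights, direction, alpha):
    """Return a copy of the weights shifted by alpha along a fixed direction."""
    return [w + alpha * d for w, d in zip(weights, direction)]

def toy_safety(weights, reference, radius=1.25):
    """Hypothetical safety metric: 1.0 while the weights stay within
    `radius` of the reference model, 0.0 once they leave that region."""
    dist = sum((w - r) ** 2 for w, r in zip(weights, reference)) ** 0.5
    return 1.0 if dist <= radius else 0.0

def scan_1d(weights, direction, alphas, metric):
    """Evaluate the metric at each perturbation strength alpha."""
    return [(a, metric(perturb(weights, direction, a))) for a in alphas]

random.seed(0)
theta = [random.gauss(0.0, 1.0) for _ in range(8)]   # toy "model weights"
raw = [random.gauss(0.0, 1.0) for _ in range(8)]
norm = sum(x * x for x in raw) ** 0.5
direction = [x / norm for x in raw]                  # unit-length random direction
alphas = [i * 0.5 for i in range(-4, 5)]             # -2.0, -1.5, ..., 2.0

curve = scan_1d(theta, direction, alphas, lambda w: toy_safety(w, theta))
for alpha, safety in curve:
    print(f"alpha={alpha:+.1f}  safety={safety}")
```

In this toy setting the metric stays flat near `alpha = 0` and drops abruptly once the perturbation is large enough, which is the basin shape the first bullet describes. A real scan would replace `toy_safety` with an evaluation of the perturbed LLM on a safety benchmark.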

Demo

## Research Paper
[**Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models**](https://arxiv.org/abs/2405.17374)

ShengYun Peng¹,
Pin-Yu Chen²,
Matthew Hull¹,
Duen Horng Chau¹

¹Georgia Tech,
²IBM Research

In *NeurIPS 2024*.

## Quick Start
You can plot the 1D and 2D LLM landscapes and compute the VISAGE score for your own models. The steps below use Llama2-7b-chat as an example. To run customized experiments, edit the YAML configuration under `config/`.

### Setup
```bash
make .done_venv
```

### Compute direction
```bash
make direction
```

This step consumes ~27 GB of memory on a single A100 GPU. The computed direction is stored at `experiments/advbench/1D_random/llama2/dirs1.pt`.
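For intuition, a random direction can be sampled and rescaled so that it is comparable in magnitude to the model weights. The sketch below assumes a per-layer rescaling, where each layer of the direction is given the same L2 norm as the matching weight layer. That choice is an assumption made for illustration, and the normalization used by `make direction` may differ. The layer names and values are hypothetical.

```python
import math
import random

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def random_direction(layers, seed=0):
    """Sample a Gaussian direction and rescale each layer so its norm
    matches the norm of the corresponding weight layer (assumed scheme)."""
    rng = random.Random(seed)
    direction = {}
    for name, weights in layers.items():
        raw = [rng.gauss(0.0, 1.0) for _ in weights]
        scale = l2_norm(weights) / (l2_norm(raw) + 1e-10)  # avoid division by zero
        direction[name] = [scale * r for r in raw]
    return direction

# Hypothetical toy "model": two layers stored as flat lists.
layers = {
    "attn.weight": [0.3, -0.4, 1.2, 0.0],
    "mlp.weight": [2.0, -1.0, 0.5],
}
direction = random_direction(layers)
for name in layers:
    print(name, round(l2_norm(layers[name]), 4), round(l2_norm(direction[name]), 4))
```

Matching norms layer by layer keeps a given perturbation strength meaningful across layers whose weights have very different scales.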

### Visualize landscape and compute VISAGE score
```bash
make landscape
```

Set `NGPU` in the `Makefile` to the number of GPUs available on your machine.

Reduce `batch_size` in `config/dataset/default.yaml` if you run into CUDA out-of-memory (OOM) errors.

Model generations are saved at `experiments/advbench/1D_random/llama2/output.jsonl`.

The landscape visualization is saved at `experiments/advbench/1D_random/llama2/1D_random_llama2_landscape.png`.
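As a rough illustration of how a landscape can be summarized in one number, the sketch below averages a safety margin over the sampled perturbations. It assumes safety is measured as attack success rate (ASR, in percent, lower is safer) and that the margin is `s_max - ASR` with `s_max = 100`. These are assumptions for illustration only; see the paper for the exact VISAGE definition. The ASR values are made up, and `average_safety_margin` is a hypothetical helper, not a function from this repository.

```python
def average_safety_margin(asr_values, s_max=100.0):
    """Mean of (s_max - ASR) over all sampled perturbations.
    Higher values mean the model stays safe over a wider region."""
    if not asr_values:
        raise ValueError("asr_values must not be empty")
    return sum(s_max - asr for asr in asr_values) / len(asr_values)

# Made-up ASR values (percent) along a 1D scan, for illustration only:
# low ASR near the original weights, high ASR far from them.
asr_curve = [100.0, 80.0, 0.0, 0.0, 0.0, 60.0, 100.0]
score = average_safety_margin(asr_curve)
print(f"average safety margin: {score:.2f}")
```

Under these assumptions, a wider or deeper basin yields a larger score, so the number can be compared across models scanned with the same settings.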

## Citation
```bibtex
@article{peng2024navigating,
title={Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models},
author={Peng, ShengYun and Chen, Pin-Yu and Hull, Matthew and Chau, Duen Horng},
journal={arXiv preprint arXiv:2405.17374},
year={2024}
}
```

## Contact
If you have any questions, feel free to [open an issue](https://github.com/poloclub/llm-landscape/issues/new) or contact [Anthony Peng](https://shengyun-peng.github.io/).
