https://github.com/aengusl/latent-adversarial-training
- Host: GitHub
- URL: https://github.com/aengusl/latent-adversarial-training
- Owner: aengusl
- License: MIT
- Created: 2024-06-18T13:24:23.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-09-29T02:41:05.000Z (8 months ago)
- Last Synced: 2024-10-29T04:34:26.467Z (7 months ago)
- Language: Jupyter Notebook
- Size: 3.24 MB
- Stars: 26
- Watchers: 1
- Forks: 8
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-llm-unlearning
README
# Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri,*
Aidan Ewart,*
Phillip Guo,*
Aengus Lynch,*
Cindy Wu,*
Vivek Hebbar,*
Henry Sleight,
Asa Cooper Stickland,
Ethan Perez,
Dylan Hadfield-Menell,
Stephen Casper

See our models on the [Hugging Face Hub](https://huggingface.co/LLM-LAT).
Read the paper on arXiv: [Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs](https://arxiv.org/abs/2407.15549).
Chat with our robust refusal model ([https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct](https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct)) at [https://www.abhayesian.com/lat-chat](https://www.abhayesian.com/lat-chat).
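To query the released checkpoint locally instead of through the chat UI, something like the following standard `transformers` usage should work. This is a sketch, not the repo's own inference code: the prompt and generation settings are illustrative, and loading an 8B model with `device_map="auto"` assumes `accelerate` is installed.

```python
# Minimal sketch of chatting with the robust refusal model locally.
# Prompt and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "LLM-LAT/robust-llama3-8b-instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")  # needs accelerate

messages = [{"role": "user", "content": "How do I hotwire a car?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```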
```
@article{sheshadri2024targeted,
title={Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs},
author={Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen},
journal={arXiv preprint arXiv:2407.15549},
year={2024}
}
```

See also preliminary work: [Defending Against Unforeseen Failure Modes with Latent Adversarial Training](https://arxiv.org/abs/2403.05030).
## This repository
This repository contains code for implementing latent adversarial attacks
and latent adversarial training (LAT) in LLMs.
To perform targeted latent adversarial training (LAT) in LLMs, we perturb the latent activations in an LLM's residual stream to elicit specific failure modes from the model. Then, we fine-tune LLMs on the target task under these perturbations. We use this approach to improve robustness to jailbreaks, remove backdoors without access to the trigger, and unlearn undesirable knowledge.
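As a rough illustration of that attacker/defender loop, here is a minimal PyTorch/`transformers` sketch. It is not the repo's training code: the model, attack layer, step sizes, norm constraint, and toy prompt/targets are all placeholder assumptions. The inner loop optimizes a residual-stream perturbation toward a harmful completion (eliciting the failure mode); the outer step then fine-tunes the model on the desired behavior while that perturbation is applied.

```python
# Illustrative sketch of targeted LAT; all hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Tell me how to pick a lock."    # stand-in request
bad = " Sure, here is how:"               # failure mode the attacker elicits
good = " Sorry, I can't help with that."  # behavior the defender reinforces

n_prompt = len(tok(prompt).input_ids)

def encode(completion):
    # Loss is computed on the completion tokens only
    # (boundary tokenization is approximate in this sketch).
    ids = tok(prompt + completion, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :n_prompt] = -100
    return ids, labels

bad_ids, bad_labels = encode(bad)
good_ids, good_labels = encode(good)

# Perturb the residual stream at one block via a forward hook.
delta = torch.zeros(1, n_prompt, model.config.n_embd, requires_grad=True)

def add_delta(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, :n_prompt] = hidden[:, :n_prompt] + delta  # perturb prompt positions
    return (hidden,) + output[1:]

handle = model.transformer.h[5].register_forward_hook(add_delta)  # placeholder layer

# Inner loop (attacker): steer the latents toward the harmful completion.
for _ in range(8):
    loss = model(bad_ids, labels=bad_labels).loss
    (grad,) = torch.autograd.grad(loss, delta)
    with torch.no_grad():
        delta -= 0.05 * grad       # descend: elicit the failure mode
        delta.clamp_(-0.5, 0.5)    # crude norm constraint

# Outer step (defender): fine-tune on the desired behavior while the
# adversarial perturbation is still applied.
delta = delta.detach()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
opt.zero_grad()
model(good_ids, labels=good_labels).loss.backward()
opt.step()
handle.remove()
```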
## Setup
After you clone and navigate to the repository:
```bash
pip install -r requirements.txt
bash install_tasks_from_github.sh
```

## Ready to go with the notebooks
Find notebooks for latent-space attacks, jailbreak robustness, backdoor removal, Harry Potter unlearning, and WMDP unlearning in the `/notebooks` folder.