https://github.com/aengusl/latent-adversarial-training
- Host: GitHub
- URL: https://github.com/aengusl/latent-adversarial-training
- Owner: aengusl
- License: MIT
- Created: 2024-06-18T13:24:23.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-09-29T02:41:05.000Z (8 months ago)
- Last Synced: 2024-10-29T04:34:26.467Z (7 months ago)
- Language: Jupyter Notebook
- Size: 3.24 MB
- Stars: 26
- Watchers: 1
- Forks: 8
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-llm-unlearning
README
# Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri,*
Aidan Ewart,*
Phillip Guo,*
Aengus Lynch,*
Cindy Wu,*
Vivek Hebbar,*
Henry Sleight,
Asa Cooper Stickland,
Ethan Perez,
Dylan Hadfield-Menell,
Stephen Casper

See our models on the [Hugging Face Hub](https://huggingface.co/LLM-LAT).
Read the paper on arXiv: [Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs](https://arxiv.org/abs/2407.15549).
Chat with our robust refusal model ([https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct](https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct)) at [https://www.abhayesian.com/lat-chat](https://www.abhayesian.com/lat-chat).
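To query the released checkpoint locally instead of through the chat UI, something like the following standard `transformers` usage should work. This is a sketch, not the repo's own inference code: the prompt and generation settings are illustrative, and loading an 8B model with `device_map="auto"` assumes `accelerate` is installed.

```python
# Minimal sketch of chatting with the robust refusal model locally.
# Prompt and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "LLM-LAT/robust-llama3-8b-instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")  # needs accelerate

messages = [{"role": "user", "content": "How do I hotwire a car?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```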
```
@article{sheshadri2024targeted,
title={Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs},
author={Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen},
journal={arXiv preprint arXiv:2407.15549},
year={2024}
}
```

See also preliminary work: [Defending Against Unforeseen Failure Modes with Latent Adversarial Training](https://arxiv.org/abs/2403.05030).
## This repository
This repository contains code for implementing latent adversarial attacks
and latent adversarial training (LAT) in LLMs.
To perform targeted latent adversarial training (LAT) in LLMs, we perturb the latent activations in an LLM's residual stream to elicit specific failure modes from the model. Then, we fine-tune LLMs on the target task under these perturbations. We use this approach to improve robustness to jailbreaks, remove backdoors without access to the trigger, and unlearn undesirable knowledge.
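As a rough illustration of that attacker/defender loop, here is a minimal PyTorch/`transformers` sketch. It is not the repo's training code: the model, attack layer, step sizes, norm constraint, and toy prompt/targets are all placeholder assumptions. The inner loop optimizes a residual-stream perturbation toward a harmful completion (eliciting the failure mode); the outer step then fine-tunes the model on the desired behavior while that perturbation is applied.

```python
# Illustrative sketch of targeted LAT; all hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Tell me how to pick a lock."    # stand-in request
bad = " Sure, here is how:"               # failure mode the attacker elicits
good = " Sorry, I can't help with that."  # behavior the defender reinforces

n_prompt = len(tok(prompt).input_ids)

def encode(completion):
    # Loss is computed on the completion tokens only
    # (boundary tokenization is approximate in this sketch).
    ids = tok(prompt + completion, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :n_prompt] = -100
    return ids, labels

bad_ids, bad_labels = encode(bad)
good_ids, good_labels = encode(good)

# Perturb the residual stream at one block via a forward hook.
delta = torch.zeros(1, n_prompt, model.config.n_embd, requires_grad=True)

def add_delta(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, :n_prompt] = hidden[:, :n_prompt] + delta  # perturb prompt positions
    return (hidden,) + output[1:]

handle = model.transformer.h[5].register_forward_hook(add_delta)  # placeholder layer

# Inner loop (attacker): steer the latents toward the harmful completion.
for _ in range(8):
    loss = model(bad_ids, labels=bad_labels).loss
    (grad,) = torch.autograd.grad(loss, delta)
    with torch.no_grad():
        delta -= 0.05 * grad       # descend: elicit the failure mode
        delta.clamp_(-0.5, 0.5)    # crude norm constraint

# Outer step (defender): fine-tune on the desired behavior while the
# adversarial perturbation is still applied.
delta = delta.detach()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
opt.zero_grad()
model(good_ids, labels=good_labels).loss.backward()
opt.step()
handle.remove()
```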
## Setup
After you clone and navigate to the repository:
```bash
pip install -r requirements.txt
bash install_tasks_from_github.sh
```

## Ready to go with the notebooks
Find notebooks for latent-space attacks, jailbreak robustness, backdoor removal, Harry Potter unlearning, and WMDP unlearning in the `/notebooks` folder.