https://github.com/lennart-finke/classifier-interp
Training Sparse Autoencoders on Prompt-Guard
- Host: GitHub
- URL: https://github.com/lennart-finke/classifier-interp
- Owner: lennart-finke
- Created: 2025-08-15T13:40:32.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-10-05T01:31:29.000Z (8 months ago)
- Last Synced: 2025-10-05T03:28:47.147Z (8 months ago)
- Topics: ai-safety, jailbreak, sae, sparse-autoencoders
- Language: HTML
- Homepage: https://finke.dev/promptguard
- Size: 3.79 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Training Sparse Autoencoders on Prompt-Guard
[Caution] This repository is designed to process jailbreak prompts, which often contain malicious, unsafe, or inappropriate text. It does not host any such prompts itself.
We train sparse autoencoders (SAEs) on [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M), using the great [`dictionary_learning` package](https://github.com/saprmarks/dictionary_learning). The same methodology can be applied to any Hugging Face-compatible classifier.
A guide on how to use this repo is provided in `reproducing.MD`.
Model weights are included in this repository via Git LFS.
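For readers new to SAEs, the following is a minimal pure-Python sketch of the general idea: an activation vector from the classifier is encoded into a wider, mostly-zero feature vector and then decoded back, with a loss that trades off reconstruction error against sparsity. It is an illustration under stated assumptions, not this repository's training code and not the `dictionary_learning` API. The class name `TinySAE`, the parameters `d_act`, `d_dict`, and `l1_coeff`, and the choice of an L1 sparsity penalty are all hypothetical; see `reproducing.MD` for the actual setup.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinySAE:
    """Toy sparse autoencoder (hypothetical, for illustration only).

    features = ReLU(W_enc @ (x - b_dec) + b_enc)
    x_hat    = W_dec @ features + b_dec
    """

    def __init__(self, d_act, d_dict, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real SAE would learn these by gradient descent.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_act)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_act)]
        self.b_dec = [0.0] * d_act

    def encode(self, x):
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]  # ReLU keeps features non-negative

    def decode(self, features):
        return [y + b for y, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Return (total loss, features, reconstruction) for one activation vector."""
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(features)  # equals the L1 norm because features >= 0
        return mse + l1_coeff * l1, features, x_hat


# Stand-in for one hidden-state vector taken from the classifier (assumed size 8).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]

sae = TinySAE(d_act=8, d_dict=32)
total, features, x_hat = sae.loss(x)
active = sum(1 for f in features if f > 0)
print(f"loss={total:.4f}, active features={active}/{len(features)}")
```

In practice the activation vectors would be collected from a chosen layer of the classifier over a dataset of prompts, and the dictionary size would be several times larger than the activation dimension.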
### Preprint
The preprint is available [here](https://github.com/lennart-finke/classifier-interp/blob/main/paper.pdf?raw=true).
### Contributing
Feel free to propose changes, open pull requests, or raise issues.
### Thanks
This project was carried out as coursework at ETH under the supervision of Prof. Dr. Elliott Ash and David Zollikofer. Many thanks also to Samuel Marks, Adam Karvonen, and Aaron Mueller for writing the `dictionary_learning` package.
### Citation
If you'd like to cite this work, we recommend the following entry:
```tex
@misc{finke2025training,
  title={Autoencoders for a Harmfulness Text Classifier},
  url={https://github.com/lennart-finke/classifier-interp},
  author={Finke, Lennart and Zollikofer, David},
  year={2025}
}
```