https://github.com/lennart-finke/classifier-interp

Training Sparse Autoencoders on Prompt-Guard

ai-safety jailbreak sae sparse-autoencoders


Host: GitHub
URL: https://github.com/lennart-finke/classifier-interp
Owner: lennart-finke
Created: 2025-08-15T13:40:32.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-10-05T01:31:29.000Z (8 months ago)
Last Synced: 2025-10-05T03:28:47.147Z (8 months ago)
Topics: ai-safety, jailbreak, sae, sparse-autoencoders
Language: HTML
Homepage: https://finke.dev/promptguard
Size: 3.79 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Training Sparse Autoencoders on Prompt-Guard
[Caution] This repository is designed to process jailbreak prompts but does not host them. Such prompts often contain malicious, unsafe, or inappropriate text.

We train sparse autoencoders (SAEs) on [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M), using the great [`dictionary_learning` package](https://github.com/saprmarks/dictionary_learning). The same methodology can be applied to any Hugging Face-compatible classifier.
A guide on how to use this repo is provided in `reproducing.MD`.
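
For readers new to SAEs, the objective they optimize can be sketched in a few lines. This is a dependency-free illustration of the common ReLU formulation (encode, decode, reconstruction error plus an L1 sparsity penalty). It is not the `dictionary_learning` API and not necessarily the variant trained in this repo. The function names, toy weights, and `l1_coeff` value are all assumptions chosen for illustration.

```python
# Minimal sketch of a standard ReLU sparse autoencoder on ONE activation vector.
# Assumption: the common formulation
#   f     = ReLU(W_enc (x - b_dec) + b_enc)
#   x_hat = W_dec f + b_dec
#   loss  = ||x - x_hat||^2 + l1_coeff * ||f||_1
# All names here are hypothetical and do not come from the repository.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity

# Toy example: a 2-dimensional "activation" and 3 dictionary features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # shape (n_features, d_model)
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # shape (d_model, n_features)
b_dec = [0.0, 0.0]

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, l1_coeff=0.5)
print(f, x_hat, loss)  # [1.0, 0.0, 0.0] [1.0, 0.0] 4.5
```

In an actual run, `x` would be an activation vector read from a layer of the classifier, and the weights would be learned by gradient descent over many such vectors rather than set by hand.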

Model weights are included in this repository via Git LFS.

### Preprint
The preprint is available [here](https://github.com/lennart-finke/classifier-interp/blob/main/paper.pdf?raw=true).

### Contributing
Feel free to propose changes, open pull requests, or raise issues.

### Thanks
This project was conducted as coursework at ETH, with supervision from Prof. Dr. Elliott Ash and David Zollikofer. Many thanks also to Samuel Marks, Adam Karvonen, and Aaron Mueller for writing the dictionary learning package.

### Citation
If you'd like to cite this work, we recommend:
```tex
@misc{finke2025training,
  title={Autoencoders for a Harmfulness Text Classifier},
  url={https://github.com/lennart-finke/classifier-interp},
  author={Finke, Lennart and Zollikofer, David},
  year={2025}
}
```
