https://github.com/jackhhao/llm-warden

A simple jailbreak detection tool for safeguarding LLMs.

Host: GitHub
URL: https://github.com/jackhhao/llm-warden
Owner: jackhhao
License: mit
Created: 2023-09-30T03:51:00.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-09-30T04:34:46.000Z (about 2 years ago)
Last Synced: 2025-01-06T00:36:06.077Z (11 months ago)
Language: Python
Size: 896 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome_ai_agents - Llm-Warden - A simple jailbreak detection tool for safeguarding LLMs. (Building / Tools)

README

# LLM Warden

A simple jailbreak detection tool for safeguarding LLMs. Available as a fine-tuned model on HuggingFace at [jackhhao/jailbreak-classifier](https://huggingface.co/jackhhao/jailbreak-classifier).

## Description
Jailbreaking is the practice of crafting prompts that bypass an LLM's standard safety and moderation controls. A successful jailbreak can lead to unrestricted output and dangerous downstream attacks. This tool helps detect and defend against such prompts proactively.

## Getting Started

### Dependencies

* Python 3

### Installation

To install, run `pip install -r requirements.txt`.

## Usage

There are three options available to start using this model:
1. Use the HuggingFace inference pipeline
2. Use the Cohere API
3. Train and run the model locally

### Using the inference pipeline
Simply run the following snippet:
```python
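from transformers import pipeline

# Load the fine-tuned classifier from the HuggingFace Hub (downloaded on first use).
pipe = pipeline("text-classification", model="jackhhao/jailbreak-classifier")

# Returns a list with one {"label": ..., "score": ...} dict per input.
print(pipe("is this a jailbreak?"))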
```

### Using Cohere
1. Obtain a trial API key from [the Cohere dashboard](https://dashboard.cohere.com/api-keys).
2. Create a `.env` file containing the API key (an example file is provided).
3. Go to `cohere_client.py` and replace the classifier input with your own examples.
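
For step 2, the key needs to reach the script without being hard-coded. The following is a minimal, standard-library-only sketch of reading it from a `.env` file. The variable name `COHERE_API_KEY` and the helpers `load_env_file` and `get_api_key` are assumptions for illustration. The repository's example `.env` file and `cohere_client.py` may use a different name or a library such as python-dotenv.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(path: str = ".env", name: str = "COHERE_API_KEY") -> str:
    """Prefer an already-exported variable, then fall back to the .env file.

    `name` is a hypothetical variable name; adjust it to match your .env file.
    """
    key = os.environ.get(name) or load_env_file(path).get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set in the environment or in {path}")
    return key
```

Keep the `.env` file out of version control so the key is not committed by accident.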

### Running locally
1. Run `train.py` (uses the data under `data/`).
2. Run `classify.py`, replacing the classifier input with your own examples if desired.

## Roadmap
* Create CLI tool for easy input + prediction
* Build Streamlit app to classify prompts via UI (& switch between models)
* Add moderation score / toxicity as additional model feature

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

Jack Hao -

## Acknowledgments

Thanks to the Cohere team for providing such an easy-to-use & powerful API!

And shout-out to the HuggingFace team for hosting a great platform for open-source datasets & models :)
