https://github.com/benderscript/promptinjectionbench

Prompt Injection Attacks against GPT-4, Gemini, Azure, Azure with Jailbreak
https://github.com/benderscript/promptinjectionbench

attack azure gemini gpt-4 injection openai prompt

Last synced: 4 months ago
JSON representation

Prompt Injection Attacks against GPT-4, Gemini, Azure, Azure with Jailbreak

Host: GitHub
URL: https://github.com/benderscript/promptinjectionbench
Owner: BenderScript
License: apache-2.0
Created: 2024-01-09T17:29:13.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-13T09:52:44.000Z (over 1 year ago)
Last Synced: 2024-09-19T23:29:30.505Z (10 months ago)
Topics: attack, azure, gemini, gpt-4, injection, openai, prompt
Language: Python
Homepage:
Size: 34.4 MB
Stars: 13
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Prompt Injection Benchmarking

Welcome to the ultimate repository for prompt injection benchmarking. Whether you've been eagerly awaiting this or will soon realize its importance, this is the one to watch.

## 1. Analysing OpenAI, Azure OpenAI GPT-4 and Gemini Pro Jailbreak Detection

This repository includes Python code to analyze the Hugging Face Jailbreak dataset against models such as OpenAI’s GPT-4 and Gemini Pro. The code automates the process of sending prompts from the dataset to these models and tabulating the results

## 2. Updated (as of October 2024) Results for Gemini-1.5 Pro, OpenAI GPT-4o and AzureOpenAI GPT-4o

It's time for updated benchmarking results! With the release of these new models, including Gemini-1.5 Pro, OpenAI GPT-4o, and Azure OpenAI GPT-4o, we’ve seen significant improvements. However, the most groundbreaking advancement from a benchmarking perspective is the shift towards Structured Outputs. This change eliminates the need for pattern matching or convoluted prompt engineering tricks to extract a simple "True"/"False" answer. Instead, by passing a Pydantic class JSON schema, we now obtain more reliable and trustworthy results.

### 2.1 Latest Runs

OpenAI has definitely demonstrated significant improvement detecting prompt injection. Unfortunately the same cannot be said about Gemini.

**Total Prompts:** 139

| Run | Date | OpenAI (GPT-4o) | Gemini (Gemini-1.5 Pro) | Azure (GPT-4o) |
|-------|------------|-----------------|-------------------------|----------------|
| run-1 | 2024-10-03 | 136 | 51 | 136 |
| run-2 | 2024-10-03 | 136 | 49 | 139 |
| run-3 | 2024-10-03 | 135 | 50 | 139 |
| run-4 | 2024-10-04 | 136 | 48 | 139 |
| run-5 | 2024-10-07 | 135 | 49 | 139 |

**Total Prompts:** 139

#### 2.1.1 Gemini: no-content moderation

There’s an important discussion to be had regarding what defines a jailbreak attack and whether content moderation filters should be applied in such cases. To address this, in these runs I completly disabled Gemini content moderation but it still has a hard time flagging prompt injection. I noticed that there are
quite a few blocks categorized as reason "other". Reading Github bugs and discussions it seems this is Gemini's equivalent of "HTTP Internal Server Error", therefore I should not count these as prompt injection detection. Example, Gemini was blocking "what is your name" or "thats cool" with block reason "other".

**Total Prompts:** 139

| Run | Date | OpenAI (GPT-4o) | Gemini (Gemini-1.5 Pro) | Azure (GPT-4o) |
|-------|------------|-----------------|-------------------------|----------------|
| run-1 | 2024-10-05 | 139 | 48 | 139 |
| run-2 | 2024-10-05 | 136 | 49 | 139 |
| run-3 | 2024-10-07 | 135 | 49 | 139 |

#### 2.1.2 Gemini: Block-Other enabled

Although I consider counting block:other as an unfair advatange because the result of prompt detection is still 'false', Gemini still has trouble keeping up with other models.

**Total Prompts:** 139

### 2.2 Anthropic?

As of this writing they do not support structured outputs, just JSON mode, therefore if I have extra cycles will include a model that has support for it.

## 3. Improvements in version 0.2

### Code into proper packages

```bash
project-root/
├── llms/ # Directory for LLM-related functionalities and code
├── models/ # Directory for model definitions and pre-trained models
├── static/ # Static files (e.g., CSS, JavaScript, images)
└── server.py # Main server script
```

- llms: Classes used to interact with models. Send prompts, receive and parse responses
- models: Pydantic models related to application state, LLM state, amongst others, *not* LLM models
- static: UI related

### Simplification

Code is more robust to errors even tough it is much smaller

### Models all around

Created Pydantic classes to store LLM state, APP state and prompts

### Async Everywhere

Everything was converted to async, no more synchronous calls.

### Saving Results

Results are saved in file results.txt and models used, prompts, whether injection was detected.

### REST Interface

Introduced a REST Interface to start-analysis and control whether content moderations should be disabled.

### Docker container Support

Create an official image and pushed to DockerHub. Provided scripts to also build the image locally.

## 4. What is a Prompt Jailbreak attack?

A 'jailbreak' occurs when a language model is manipulated to generate harmful, inappropriate, or otherwise restricted content. This can include hate speech, misinformation, or any other type of undesirable output.

A quick story: Once, I jailbroke my iPhone to sideload apps. It worked—until I rebooted it, turning my iPhone into a very expensive paperweight.

In the world of LLMs, the term 'jailbreak' is used loosely. Some attacks are silly and insignificant, while others pose serious challenges to the model’s safety mechanisms.

## 5. Excited to See (past) the Results?

The table below summarizes the number of injection attack prompts from the dataset and the models' detection rates.

### Previous Runs

**Total Prompts:** 139

| Prompts | GPT-4 | Gemini-1.0 | Azure OpenAI | Azure OpenAI w/ Jailbreak Detection |
|---------------|-------|------------|--------------|-------------------------------------|
| Detected | 133 | 53 | | 136 |
| Not Attack | 6 | 4 | | 3 |
| Missed Attack | 0 | 82 | | 0 |

### Previous runs

| Prompts | GPT-4 | Gemini-1.0 | Azure OpenAI| Azure OpenAI w/ Jailbreak Detection |
|---------------|-------|------------| ------------|-------------------------------------|
| 139 | | | | |
| Detected | 131 | 49 | | 134 |
| Not Attack | TBD | TBD | | |
| Missed Attack | TBD | TBD | | |

many more but...no space

### Important Details

- Prompts can be blocked by built-in or explicit safety filters. Therefore the code becomes nuanced and to gather proper results you need to catch certain exceptions (see code)

## 6. Running the server

### 6.1 Create a `.env`

Irrespective if you are running as a docker container or through `python server.py` you need to create a .env file that contains your OpenAI API, Azure and Google Keys.
If you do not have all these keys the code will skip that LLM, no need to worry.

```env
OPENAI_API_KEY=your key>
OPENAI_MODEL_NAME=gpt-4o
#
GOOGLE_API_KEY=
GOOGLE_MODEL_NAME=gemini-1.5-pro
# Azure
AZURE_OPENAI_API_KEY=your key>
AZURE_OPENAI_MODEL_NAME=gpt-4
AZURE_OPENAI_ENDPOINT=your endpoint>
AZURE_OPENAI_API_VERSION=
AZURE_OPENAI_DEPLOYMENT=
```

### 6.2 Docker

The easiest way is to use the pre-built image.

```bash
docker run -d -e IN_DOCKER=true --env-file ./.env --name prompt-bench-container -p 9123:9123 brisacoder/prompt-bench:latest
```

then start the analysis with:

```bash
curl -X POST http://localhost:9123/start-analysis \
-H "Content-Type: application/json" \
-d '{
"analysis": "start",
"gemini": {
"flag_safety_block_as_injection": true
}
}'
```

```powershell
curl -X POST http://localhost:9123/start-analysis `
-H "Content-Type: application/json" `
-d '{"analysis": "start", "gemini": {"flag_safety_block_as_injection": true}}'
```

## 6.3 Results

The server will store everything in the file `results.txt`. This includes LLMs, models used, any parameters passed to the REST interface and the individual prompt detection.

### 6.4 Running the Server natively

To run the server manually, install all dependencies.

```bash
pip3 install -r requirements.txt
```

Run:

```bash
python3 server.py
```

## 7.Running the Analysis (Legacy)

Updates soon

If Everything goes well, you should see the following page at

![Landing page](images/landing1.png)

This script loads the dataset, iterates through prompts, sends them to ChatGPT-4, and detects potential injection attacks in the generated responses.

## Testing

See the demo below where the App checks a prompt with a malicious URL and injection.

![Demo](images/prompt_bench_demo.gif)

## Skipping "Benign" Prompts

In the interest of time, the code skips prompts labeled as "benign." This helps focus the analysis on potentially harmful prompts where injection attacks might occur.

## License

This code is provided under the [Apache License 2.0](LICENSE). If you liked this code, cite it and let me know.

---

For more information about OpenAI's GPT-4 model and the Hugging Face Jailbreak dataset, please refer to the official documentation and sources:

- [OpenAI GPT-4](https://openai.com/gpt-4)
- [Hugging Face Jailbreak Dataset](https://huggingface.co/datasets/jackhhao/jailbreak-classification)
- [Azure Jailbreak Risk Detection](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection)
- [Gemini API](https://ai.google.dev/tutorials/python_quickstart)
- [Gemini SAfety Settings](https://ai.google.dev/gemini-api/docs/safety-settings)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/benderscript/promptinjectionbench

Awesome Lists containing this project

README