Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/benderscript/promptinjectionbench
Prompt Injection Attacks against GPT-4, Gemini, Azure, Azure with Jailbreak
https://github.com/benderscript/promptinjectionbench
attack azure gemini gpt-4 injection openai prompt
Last synced: 3 months ago
JSON representation
Prompt Injection Attacks against GPT-4, Gemini, Azure, Azure with Jailbreak
- Host: GitHub
- URL: https://github.com/benderscript/promptinjectionbench
- Owner: BenderScript
- License: apache-2.0
- Created: 2024-01-09T17:29:13.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-13T09:52:44.000Z (11 months ago)
- Last Synced: 2024-09-19T23:29:30.505Z (4 months ago)
- Topics: attack, azure, gemini, gpt-4, injection, openai, prompt
- Language: Python
- Homepage:
- Size: 34.4 MB
- Stars: 13
- Watchers: 3
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Prompt Injection Benchmarking
Welcome to the ultimate repository for prompt injection benchmarking. Whether you've been eagerly awaiting this or will soon realize its importance, this is the one to watch.
## 1. Analysing OpenAI, Azure OpenAI GPT-4 and Gemini Pro Jailbreak Detection
This repository includes Python code to analyze the Hugging Face Jailbreak dataset against models such as OpenAI’s GPT-4 and Gemini Pro. The code automates the process of sending prompts from the dataset to these models and tabulating the results
## 2. Updated (as of October 2024) Results for Gemini-1.5 Pro, OpenAI GPT-4o and AzureOpenAI GPT-4o
It's time for updated benchmarking results! With the release of these new models, including Gemini-1.5 Pro, OpenAI GPT-4o, and Azure OpenAI GPT-4o, we’ve seen significant improvements. However, the most groundbreaking advancement from a benchmarking perspective is the shift towards Structured Outputs. This change eliminates the need for pattern matching or convoluted prompt engineering tricks to extract a simple "True"/"False" answer. Instead, by passing a Pydantic class JSON schema, we now obtain more reliable and trustworthy results.
### 2.1 Latest Runs
OpenAI has definitely demonstrated significant improvement detecting prompt injection. Unfortunately the same cannot be said about Gemini.
**Total Prompts:** 139
| Run | Date | OpenAI (GPT-4o) | Gemini (Gemini-1.5 Pro) | Azure (GPT-4o) |
|-------|------------|-----------------|-------------------------|----------------|
| run-1 | 2024-10-03 | 136 | 51 | 136 |
| run-2 | 2024-10-03 | 136 | 49 | 139 |
| run-3 | 2024-10-03 | 135 | 50 | 139 |
| run-4 | 2024-10-04 | 136 | 48 | 139 |
| run-5 | 2024-10-07 | 135 | 49 | 139 |**Total Prompts:** 139
#### 2.1.1 Gemini: no-content moderation
There’s an important discussion to be had regarding what defines a jailbreak attack and whether content moderation filters should be applied in such cases. To address this, in these runs I completly disabled Gemini content moderation but it still has a hard time flagging prompt injection. I noticed that there are
quite a few blocks categorized as reason "other". Reading Github bugs and discussions it seems this is Gemini's equivalent of "HTTP Internal Server Error", therefore I should not count these as prompt injection detection. Example, Gemini was blocking "what is your name" or "thats cool" with block reason "other".**Total Prompts:** 139
| Run | Date | OpenAI (GPT-4o) | Gemini (Gemini-1.5 Pro) | Azure (GPT-4o) |
|-------|------------|-----------------|-------------------------|----------------|
| run-1 | 2024-10-05 | 139 | 48 | 139 |
| run-2 | 2024-10-05 | 136 | 49 | 139 |
| run-3 | 2024-10-07 | 135 | 49 | 139 |#### 2.1.2 Gemini: Block-Other enabled
Although I consider counting block:other as an unfair advatange because the result of prompt detection is still 'false', Gemini still has trouble keeping up with other models.
**Total Prompts:** 139
| Run | Date | OpenAI (GPT-4o) | Gemini (Gemini-1.5 Pro) | Azure (GPT-4o) |
|-------|------------|-----------------|-------------------------|----------------|
| run-1 | 2024-10-05 | 136 | 113 | 139 |
| run-2 | 2024-10-07 | 136 | 113 | 139 |### 2.2 Anthropic?
As of this writing they do not support structured outputs, just JSON mode, therefore if I have extra cycles will include a model that has support for it.
## 3. Improvements in version 0.2
### Code into proper packages
```bash
project-root/
├── llms/ # Directory for LLM-related functionalities and code
├── models/ # Directory for model definitions and pre-trained models
├── static/ # Static files (e.g., CSS, JavaScript, images)
└── server.py # Main server script
```- llms: Classes used to interact with models. Send prompts, receive and parse responses
- models: Pydantic models related to application state, LLM state, amongst others, *not* LLM models
- static: UI related### Simplification
Code is more robust to errors even tough it is much smaller
### Models all around
Created Pydantic classes to store LLM state, APP state and prompts
### Async Everywhere
Everything was converted to async, no more synchronous calls.
### Saving Results
Results are saved in file results.txt and models used, prompts, whether injection was detected.
### REST Interface
Introduced a REST Interface to start-analysis and control whether content moderations should be disabled.
### Docker container Support
Create an official image and pushed to DockerHub. Provided scripts to also build the image locally.
## 4. What is a Prompt Jailbreak attack?
A 'jailbreak' occurs when a language model is manipulated to generate harmful, inappropriate, or otherwise restricted content. This can include hate speech, misinformation, or any other type of undesirable output.
A quick story: Once, I jailbroke my iPhone to sideload apps. It worked—until I rebooted it, turning my iPhone into a very expensive paperweight.
In the world of LLMs, the term 'jailbreak' is used loosely. Some attacks are silly and insignificant, while others pose serious challenges to the model’s safety mechanisms.
## 5. Excited to See (past) the Results?
The table below summarizes the number of injection attack prompts from the dataset and the models' detection rates.
### Previous Runs
**Total Prompts:** 139
| Prompts | GPT-4 | Gemini-1.0 | Azure OpenAI | Azure OpenAI w/ Jailbreak Detection |
|---------------|-------|------------|--------------|-------------------------------------|
| Detected | 133 | 53 | | 136 |
| Not Attack | 6 | 4 | | 3 |
| Missed Attack | 0 | 82 | | 0 |### Previous runs
| Prompts | GPT-4 | Gemini-1.0 | Azure OpenAI| Azure OpenAI w/ Jailbreak Detection |
|---------------|-------|------------| ------------|-------------------------------------|
| 139 | | | | |
| Detected | 131 | 49 | | 134 |
| Not Attack | TBD | TBD | | |
| Missed Attack | TBD | TBD | | |many more but...no space
### Important Details
- Prompts can be blocked by built-in or explicit safety filters. Therefore the code becomes nuanced and to gather proper results you need to catch certain exceptions (see code)
## 6. Running the server
### 6.1 Create a `.env`
Irrespective if you are running as a docker container or through `python server.py` you need to create a .env file that contains your OpenAI API, Azure and Google Keys.
If you do not have all these keys the code will skip that LLM, no need to worry.```env
OPENAI_API_KEY=your key>
OPENAI_MODEL_NAME=gpt-4o
#
GOOGLE_API_KEY=
GOOGLE_MODEL_NAME=gemini-1.5-pro
# Azure
AZURE_OPENAI_API_KEY=your key>
AZURE_OPENAI_MODEL_NAME=gpt-4
AZURE_OPENAI_ENDPOINT=your endpoint>
AZURE_OPENAI_API_VERSION=
AZURE_OPENAI_DEPLOYMENT=
```### 6.2 Docker
The easiest way is to use the pre-built image.
```bash
docker run -d -e IN_DOCKER=true --env-file ./.env --name prompt-bench-container -p 9123:9123 brisacoder/prompt-bench:latest
```then start the analysis with:
```bash
curl -X POST http://localhost:9123/start-analysis \
-H "Content-Type: application/json" \
-d '{
"analysis": "start",
"gemini": {
"flag_safety_block_as_injection": true
}
}'
``````powershell
curl -X POST http://localhost:9123/start-analysis `
-H "Content-Type: application/json" `
-d '{"analysis": "start", "gemini": {"flag_safety_block_as_injection": true}}'
```## 6.3 Results
The server will store everything in the file `results.txt`. This includes LLMs, models used, any parameters passed to the REST interface and the individual prompt detection.
### 6.4 Running the Server natively
To run the server manually, install all dependencies.
```bash
pip3 install -r requirements.txt
```Run:
```bash
python3 server.py
```## 7.Running the Analysis (Legacy)
Updates soon
If Everything goes well, you should see the following page at
![Landing page](images/landing1.png)
This script loads the dataset, iterates through prompts, sends them to ChatGPT-4, and detects potential injection attacks in the generated responses.
## Testing
See the demo below where the App checks a prompt with a malicious URL and injection.
![Demo](images/prompt_bench_demo.gif)
## Skipping "Benign" Prompts
In the interest of time, the code skips prompts labeled as "benign." This helps focus the analysis on potentially harmful prompts where injection attacks might occur.
## License
This code is provided under the [Apache License 2.0](LICENSE). If you liked this code, cite it and let me know.
---
For more information about OpenAI's GPT-4 model and the Hugging Face Jailbreak dataset, please refer to the official documentation and sources:
- [OpenAI GPT-4](https://openai.com/gpt-4)
- [Hugging Face Jailbreak Dataset](https://huggingface.co/datasets/jackhhao/jailbreak-classification)
- [Azure Jailbreak Risk Detection](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection)
- [Gemini API](https://ai.google.dev/tutorials/python_quickstart)
- [Gemini SAfety Settings](https://ai.google.dev/gemini-api/docs/safety-settings)