https://github.com/agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
https://github.com/agencyenterprise/PromptInject
adversarial-attacks agi agi-alignment ai-alignment ai-safety chain-of-thought gpt-3 language-models large-language-models machine-learning ml-safety prompt-engineering
Last synced: 29 days ago
JSON representation
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
- Host: GitHub
- URL: https://github.com/agencyenterprise/PromptInject
- Owner: agencyenterprise
- License: mit
- Created: 2022-10-25T11:42:12.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-26T14:55:14.000Z (about 1 year ago)
- Last Synced: 2024-05-11T13:20:52.389Z (12 months ago)
- Topics: adversarial-attacks, agi, agi-alignment, ai-alignment, ai-safety, chain-of-thought, gpt-3, language-models, large-language-models, machine-learning, ml-safety, prompt-engineering
- Language: Python
- Homepage:
- Size: 222 KB
- Stars: 274
- Watchers: 9
- Forks: 27
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-MLSecOps - PromptInject
- Awesome-Prompt-Engineering - [Github
README
# PromptInject
[**Paper: Ignore Previous Prompt: Attack Techniques For Language Models**](https://arxiv.org/abs/2211.09527)
## Abstract
> Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PROMPTINJECT, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3’s stochastic nature, creating long-tail risks.

Figure 1: Diagram showing how adversarial user input can derail model instructions. In both attacks,
the attacker aims to change the goal of the original prompt. In *goal hijacking*, the new goal is to print
a specific target string, which may contain malicious instructions, while in *prompt leaking*, the new
goal is to print the application prompt. Application Prompt (gray box) shows the original prompt,
where `{user_input}` is substituted by the user input. In this example, a user would normally input
a phrase to be corrected by the application (blue boxes). *Goal Hijacking* and *Prompt Leaking* (orange
boxes) show malicious user inputs (left) for both attacks and the respective model outputs (right)
when the attack is successful.## Install
Run:
pip install git+https://github.com/agencyenterprise/PromptInject
## Usage
See [notebooks/Example.ipynb](notebooks/Example.ipynb) for an example.
## Cite
Bibtex:
@misc{ignore_previous_prompt,
doi = {10.48550/ARXIV.2211.09527},
url = {https://arxiv.org/abs/2211.09527},
author = {Perez, Fábio and Ribeiro, Ian},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Ignore Previous Prompt: Attack Techniques For Language Models},
publisher = {arXiv},
year = {2022}
}## Contributing
We appreciate any additional request and/or contribution to `PromptInject`. The [issues](/issues) tracker is used to keep a list of features and bugs to be worked on. Please see our [contributing documentation](/CONTRIBUTING.md) for some tips on getting started.