Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sinanw/llm-security-prompt-injection
This project investigates the security of large language models by performing binary classification of a set of input prompts to discover malicious prompts. Several approaches have been analyzed using classical ML algorithms, a pre-trained LLM model, and a fine-tuned LLM model.
cybersecurity llm-prompting llm-security prompt-injection transformers-models
- Host: GitHub
- URL: https://github.com/sinanw/llm-security-prompt-injection
- Owner: sinanw
- License: mit
- Created: 2023-11-21T23:05:13.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-18T23:50:31.000Z (12 months ago)
- Last Synced: 2024-08-08T05:07:05.357Z (4 months ago)
- Topics: cybersecurity, llm-prompting, llm-security, prompt-injection, transformers-models
- Language: Jupyter Notebook
- Homepage:
- Size: 2.75 MB
- Stars: 25
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_gpt_super_prompting - sinanw/llm-security-prompt-injection - Security-focused prompt injection repository. (🔐 Secure Prompting / Hall Of Fame:)
- Awesome-LLMSecOps - llm-security-prompt-injection - Prompt injection classification using classical ML algorithms, a pre-trained LLM, and a fine-tuned LLM model. (DATA)
README
# Security of Large Language Models (LLM) - Prompt Injection Classification
In this project, we investigate the security of large language models in terms of [prompt injection attacks](https://www.techopedia.com/definition/prompt-injection-attack). Primarily, we perform binary classification of a dataset of input prompts in order to discover malicious prompts that represent injections.

*In short: prompt injections aim at manipulating the LLM using crafted input prompts to steer the model into ignoring previous instructions and, thus, performing unintended actions.*
To do so, we analyzed several AI-driven mechanisms for this classification task; in particular, we examined 1) classical ML algorithms, 2) a pre-trained LLM model, and 3) a fine-tuned LLM model.
## Data Set (Deepset Prompt Injection Dataset)
The dataset used in this demo is the [Prompt Injection Dataset](https://huggingface.co/datasets/deepset/prompt-injections) provided by [deepset](https://www.deepset.ai/), an AI company specializing in tools for building NLP-driven applications using LLMs.
- The dataset contains hundreds of samples of both normal prompts and manipulated prompts labeled as injections.
- It contains prompts mainly in English, along with some prompts translated into other languages, primarily German.
- The original dataset is already split into training and holdout subsets. We maintained this split across all experiments in order to compare results against a unified testing benchmark (loading and inspecting these splits is sketched below).

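The following minimal sketch shows one way to load the dataset and inspect its predefined split; the `text` and `label` column names are assumptions about the dataset as published on HuggingFace, not code taken from the notebooks.

```python
# Illustrative sketch: load the deepset prompt-injections dataset and inspect its splits.
from datasets import load_dataset

dataset = load_dataset("deepset/prompt-injections")
print(dataset)              # shows the train/test splits and their sizes
print(dataset["train"][0])  # a single sample, e.g. {'text': ..., 'label': 0 or 1}
```
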
### METHOD 1 - Classification Using Traditional ML

> Corresponding notebook: [ml-classification.ipynb](https://github.com/sinanw/llm-security-prompt-injection/blob/main/notebooks/1-ml-classification.ipynb)

Analysis steps:
1. Loading the dataset from the HuggingFace library and exploring it.
2. Tokenizing prompt texts and generating embeddings using the [multilingual BERT (Bidirectional Encoder Representations from Transformers)](https://huggingface.co/bert-base-multilingual-uncased) LLM model (see the sketch after this list).
3. Training the following ML algorithms on the downstream prompt classification task:
- [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
4. Analyzing and comparing the performance of classification models.
5. Investigating incorrect predictions of the best-performing model.

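A minimal sketch of this pipeline is shown below. It is not the notebook's exact code: the mean-pooled mBERT embeddings, the `text`/`label` column names, and the classifier settings are assumptions, and only one of the four classifiers (logistic regression) is shown.

```python
# Illustrative sketch of Method 1: multilingual BERT embeddings + a classical classifier.
import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

dataset = load_dataset("deepset/prompt-injections")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModel.from_pretrained("bert-base-multilingual-uncased")
model.eval()

def embed(texts, batch_size=16):
    """Mean-pool mBERT's last hidden state into one fixed-size vector per prompt."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              max_length=128, return_tensors="pt")
            hidden = model(**batch).last_hidden_state      # (B, T, 768)
            mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
            vectors.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(vectors)

X_train, X_test = embed(dataset["train"]["text"]), embed(dataset["test"]["text"])
y_train, y_test = dataset["train"]["label"], dataset["test"]["label"]

# Any of the four classifiers can be swapped in here; logistic regression is shown.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```
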
#### Results:

|                        | Accuracy | Precision | Recall   | F1 Score |
|----------------------|----------|-----------|----------|----------|
| Naive Bayes | 88.79% | 87.30% | 91.67% | 89.43% |
| Logistic Regression | 96.55% | 100.00% | 93.33% | 96.55% |
| Support Vector Machine | 95.69% | 100.00% | 91.67% | 95.65% |
| Random Forest          | 89.66%   | 100.00%   | 80.00%   | 88.89%   |

### METHOD 2 - Classification Using a Pre-trained LLM (XLM-RoBERTa)
> Corresponding notebook: [llm-classification-pretrained.ipynb](https://github.com/sinanw/llm-security-prompt-injection/blob/main/notebooks/2-llm-classification-pretrained.ipynb)

Analysis steps:
1. Loading the dataset from the HuggingFace library.
2. Loading the pre-trained [XLM-RoBERTa](https://huggingface.co/xlm-roberta-large) model (the multilingual version of RoBERTa, itself an enhanced version of BERT) from the HuggingFace library.
3. Using the HuggingFace [zero-shot classification](https://huggingface.co/tasks/zero-shot-classification) pipeline and XLM-RoBERTa to perform prompt classification on the testing dataset, without fine-tuning (see the sketch after this list).
4. Analyzing classification results and model performance.

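A minimal sketch of this zero-shot setup is shown below, assuming the HuggingFace `zero-shot-classification` pipeline; the candidate label wording and the label mapping are illustrative assumptions rather than the notebook's exact choices. Note that the base `xlm-roberta-large` checkpoint is not fine-tuned for natural language inference, so transformers attaches a freshly initialized classification head and emits a warning, which is consistent with the near-chance scores below.

```python
# Illustrative sketch of Method 2: zero-shot classification with pre-trained XLM-RoBERTa.
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

dataset = load_dataset("deepset/prompt-injections")
classifier = pipeline("zero-shot-classification", model="xlm-roberta-large")

candidate_labels = ["legitimate", "injection"]  # assumed label wording
predictions = []
for prompt in dataset["test"]["text"]:
    result = classifier(prompt, candidate_labels)
    # result["labels"] is sorted by score; assume label 1 in the dataset marks an injection
    predictions.append(1 if result["labels"][0] == "injection" else 0)

print("Accuracy:", accuracy_score(dataset["test"]["label"], predictions))
```
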
#### Results:

|              | Accuracy | Precision | Recall   | F1 Score |
|--------------|----------|-----------|----------|----------|
| Testing Data | 55.17%   | 55.13%    | 71.67%   | 62.32%   |

### METHOD 3 - Classification Using a Fine-tuned LLM (XLM-RoBERTa)
> Corresponding notebook: [llm-classification-finetuned.ipynb](https://github.com/sinanw/llm-security-prompt-injection/blob/main/notebooks/3-llm-classification-finetuned.ipynb)

Analysis steps:
1. Loading the dataset from the HuggingFace library.
2. Loading the pre-trained [XLM-RoBERTa](https://huggingface.co/xlm-roberta-large) model (the multilingual version of RoBERTa, itself an enhanced version of BERT) from the HuggingFace library.
3. Fine-tuning XLM-RoBERTa on the training dataset to perform prompt classification (see the sketch after this list).
4. Evaluating accuracy on the testing dataset after each of the 5 fine-tuning epochs.
5. Analyzing the final model's performance and comparing it with the previous experiments.

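A minimal sketch of the fine-tuning setup is shown below, using the standard `Trainer` API. The hyperparameters (sequence length, batch size, learning rate) and the output path are assumptions rather than the notebook's actual values; evaluating after every epoch mirrors the per-epoch results reported below.

```python
# Illustrative sketch of Method 3: fine-tuning XLM-RoBERTa for binary sequence classification.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("deepset/prompt-injections")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": accuracy_score(labels, np.argmax(logits, axis=-1))}

args = TrainingArguments(
    output_dir="xlmr-prompt-injection",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
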
#### Results:

| Epoch | Accuracy | Precision | Recall   | F1 Score |
|-------|----------|-----------|----------|---------|
| 1 | 62.93% | 100.00% | 28.33% | 44.16% |
| 2 | 91.38% | 100.00% | 83.33% | 90.91% |
| 3 | 93.10% | 100.00% | 86.67% | 92.86% |
| 4 | 96.55% | 100.00% | 93.33% | 96.55% |
| 5 | 97.41% | 100.00% | 95.00% | 97.44% |