Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/dapurv5/awesome-red-teaming-llms

Repository accompanying the paper https://arxiv.org/abs/2407.14937
https://github.com/dapurv5/awesome-red-teaming-llms

List: awesome-red-teaming-llms

adversarial-attacks ai-safety ai-security awesome awesome-list llm-safety llm-security red-teaming

Last synced: 4 days ago
JSON representation

Repository accompanying the paper https://arxiv.org/abs/2407.14937

Awesome Lists containing this project

README

        

# Awesome Red-Teaming LLMs [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
> A comprehensive guide to understanding Attacks, Defenses and Red-Teaming for Large Language Models (LLMs).


Red-Teaming LLMs


[![Twitter Thread](https://img.shields.io/badge/Thread-000000?style=for-the-badge&logo=X&logoColor=white)](https://twitter.com/verma_apurv5/status/1815751139729519011)
[![arXiv](https://img.shields.io/badge/arXiv-2404.09562-b31b1b?style=for-the-badge&logo=arXiv&logoColor=white)](https://arxiv.org/pdf/2407.14937)

## Contents
- [Attacks](#attacks)
- [Jailbreak Attack](#jailbreak-attack)
- [Direct Attack](#direct-attack)
- [Infusion Attack](#infusion-attack)
- [Inference Attack](#inference-attack)
- [Training Time Attack](#training-time-attack)
- [Defenses](#defenses)
- [Other Surveys](#other-surveys)
- [Red-Teaming](#red-teaming)

## Attacks
![Taxonomy](taxonomy.png)

### Jailbreak Attack

| Title | Link |
|-------|------|
| "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | [Link](https://www.semanticscholar.org/paper/%22Do-Anything-Now%22%3A-Characterizing-and-Evaluating-on-Shen-Chen/1104d766527dead44a40532e8a89444d9cef5c65) |
| Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | [Link](https://www.semanticscholar.org/paper/Jailbreaking-ChatGPT-via-Prompt-Engineering%3A-An-Liu-Deng/fc50a6202e2f675604543c1ae4ef22ec74f61ad5) |
| ChatGPT Jailbreak Reddit | [Link](https://www.reddit.com/r/ChatGPTJailbreak/) |
| Anomalous Tokens | [Link](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation) |
| Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition | [Link](https://aclanthology.org/2023.emnlp-main.302/) |
| Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks | [Link](https://www.semanticscholar.org/paper/Exploiting-Programmatic-Behavior-of-LLMs%3A-Dual-Use-Kang-Li/0cf694b8f85ab2e11d45595de211a15cfbadcd22) |
| Jailbroken: How Does LLM Safety Training Fail? | [Link](https://www.semanticscholar.org/paper/Jailbroken%3A-How-Does-LLM-Safety-Training-Fail-Wei-Haghtalab/929305892d4ddae575a0fc23227a8139f7681632) |
| Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | [Link](https://arxiv.org/pdf/2305.14965) |
| Adversarial Prompting in LLMs | [Link](https://www.promptingguide.ai/risks/adversarial) |
| Exploiting Novel GPT-4 APIs | [Link](https://www.semanticscholar.org/paper/Exploiting-Novel-GPT-4-APIs-Pelrine-Taufeeque/ac2a659cc6f0e3635a9c1351c9963b47817205fb) |
| Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content | [Link](https://www.semanticscholar.org/paper/Large-Language-Models-are-Vulnerable-to-Attacks-for-Bianchi-Zou/05e0c57f912cec9597021855bac28306c97e36fd) |
| Using Hallucinations to Bypass GPT4's Filter | [Link](https://www.semanticscholar.org/paper/Using-Hallucinations-to-Bypass-GPT4's-Filter-Lemkin/d6ecc2c23e257108f49ef756147acadbb4ca5678) |

### Direct Attack

#### Automated Attacks
| Title | Link |
|-------|------|
| A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | [Link](https://www.semanticscholar.org/paper/A-Wolf-in-Sheep's-Clothing%3A-Generalized-Nested-can-Ding-Kuang/c4ff1be5c254b60b96b7455eefcc4ec9583f82ed) |
| FLIRT: Feedback Loop In-context Red Teaming | [Link](https://arxiv.org/pdf/2308.04265) |
| Jailbreaking Black Box Large Language Models in Twenty Queries | [Link](https://www.semanticscholar.org/paper/Jailbreaking-Black-Box-Large-Language-Models-in-Chao-Robey/4637f79ddfaf923ce569996ffa5b6cda1996faa1) |
| Red Teaming Language Models with Language Models | [Link](https://arxiv.org/abs/2202.03286) |
| Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | [Link](https://www.semanticscholar.org/paper/Tree-of-Attacks%3A-Jailbreaking-Black-Box-LLMs-Mehrotra-Zampetakis/14e8cf5a5e6a7b35e618b08f5cf06f572b3a54e0) |
| Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | [Link](https://www.semanticscholar.org/paper/Leveraging-the-Context-through-Multi-Round-for-Cheng-Georgopoulos/1b95053af03b5a06809a4967c6cf5ca137bbcde4) |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | [Link](https://www.semanticscholar.org/paper/Cognitive-Overload%3A-Jailbreaking-Large-Language-Xu-Wang/54c9a97637822c9e1956b1ec70b0c9a0f2338d2c) |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | [Link](https://www.semanticscholar.org/paper/GPT-4-Is-Too-Smart-To-Be-Safe%3A-Stealthy-Chat-with-Yuan-Jiao/897940fb5dd4d739b88c4659c4565d05f48d06b8) |
| GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | [Link](https://www.semanticscholar.org/paper/GPTFUZZER%3A-Red-Teaming-Large-Language-Models-with-Yu-Lin/d4177489596748e43aa571f59556097f2cc4c8be) |
| AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | [Link](https://aclanthology.org/2023.emnlp-industry.37/) |
| Prompt Injection attack against LLM-integrated Applications | [Link](https://arxiv.org/pdf/2306.05499) |
| Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | [Link](https://www.semanticscholar.org/paper/Making-Them-Ask-and-Answer%3A-Jailbreaking-Large-in-Liu-Zhang/f6fa682b62c7981402336ca57da1196ccbf3fc54) |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | [Link](https://www.semanticscholar.org/paper/MART%3A-Improving-LLM-Safety-with-Multi-round-Ge-Zhou/709af143f78bc62413c50ea1a7ee75b0702c4f59) |
| Query-Efficient Black-Box Red Teaming via Bayesian Optimization | [Link](https://www.semanticscholar.org/paper/Query-Efficient-Black-Box-Red-Teaming-via-Bayesian-Lee-Lee/cc3bfea86ed457079363598ae38af11dd3b00e47) |
| Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | [Link](https://www.semanticscholar.org/paper/Make-Them-Spill-the-Beans!-Coercive-Knowledge-from-Zhang-Shen/89641466373aa9ce2976e3f384b0791a7bd0931c) |
| CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | [Link](https://www.semanticscholar.org/paper/CodeChameleon%3A-Personalized-Encryption-Framework-Lv-Wang/72f51c3ef967f7905e3194296cf6fd8337b1a437) |

##### Transferable Attacks
| Title | Link |
|-------|------|
| Universal and Transferable Adversarial Attacks on Aligned Language Models | [Link](https://www.semanticscholar.org/paper/Universal-and-Transferable-Adversarial-Attacks-on-Zou-Wang/47030369e97cc44d4b2e3cf1be85da0fd134904a) |
| ACG: Accelerated Coordinate Gradient | [Link](https://blog.haizelabs.com/posts/acg/) |
| PAL: Proxy-Guided Black-Box Attack on Large Language Models | [Link](https://www.semanticscholar.org/paper/PAL%3A-Proxy-Guided-Black-Box-Attack-on-Large-Models-Sitawarin-Mu/ef5da08aad746173b7ddf589068c6abf00205fea) |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | [Link](https://arxiv.org/abs/2310.15140) |
| Open Sesame! Universal Black Box Jailbreaking of Large Language Models | [Link](https://www.semanticscholar.org/paper/Open-Sesame!-Universal-Black-Box-Jailbreaking-of-Lapid-Langberg/f846c0c59608f0a8ff18f4c52adba87bf49dc229) |
| Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | [Link](https://www.semanticscholar.org/paper/Jailbreaking-Leading-Safety-Aligned-LLMs-with-Andriushchenko-Croce/88d5634a52645f6b05a03536be1f26a2b9bba232) |
| MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots | [Link](https://www.semanticscholar.org/paper/MASTERKEY%3A-Automated-Jailbreaking-of-Large-Language-Deng-Liu/6987c95f7054d2653178ac93df52aa3c0b99fcf5) |
| AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | [Link](https://arxiv.org/abs/2310.04451) |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | [Link](https://www.semanticscholar.org/paper/AmpleGCG%3A-Learning-a-Universal-and-Transferable-of-Liao-Sun/4ad33969188555b8303b375e18f5c117a68387c6) |
| Universal Adversarial Triggers Are Not Universal | [Link](https://arxiv.org/pdf/2404.16020) |

#### Inversion Attacks

##### Data Inversion
| Title | Link |
|-------|------|
| Scalable Extraction of Training Data from (Production) Language Models | [Link](https://www.semanticscholar.org/paper/Scalable-Extraction-of-Training-Data-from-Language-Nasr-Carlini/fc7ee1828030a818f52518022a39f6a3ada60222) |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | [Link](https://www.semanticscholar.org/paper/Explore%2C-Establish%2C-Exploit%3A-Red-Teaming-Language-Casper-Lin/1db819afb3604c4bfd1e5a0cb2ee9ab9dec52642) |
| Extracting Training Data from Large Language Models | [Link](https://arxiv.org/abs/2012.07805) |
| Bag of Tricks for Training Data Extraction from Language Models | [Link](https://www.semanticscholar.org/paper/Bag-of-Tricks-for-Training-Data-Extraction-from-Yu-Pang/0abd29e67b1e52e922660c315bcdeedd9b1eab7e) |
| Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning | [Link](https://www.semanticscholar.org/paper/Controlling-the-Extraction-of-Memorized-Data-from-Ozdayi-Peris/e4bbcf6c84bfcdbeafecf75f2b0b98eaa1020e63) |
| Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | [Link](https://www.semanticscholar.org/paper/Practical-Membership-Inference-Attacks-against-via-Fu-Wang/6bf34b4a1937ca5ae692594eda880ff671b8ee57) |
| Membership Inference Attacks against Language Models via Neighbourhood Comparison | [Link](https://aclanthology.org/2023.findings-acl.719/) |

##### Model Inversion
| Title | Link |
|-------|------|
| A Methodology for Formalizing Model-Inversion Attacks | [Link](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7536387) |
| Sok: Model inversion attack landscape: Taxonomy, challenges, and future roadmap. | [Link](https://ieeexplore.ieee.org/document/10221914) |
| Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures | [Link](https://dl.acm.org/doi/pdf/10.1145/2810103.2813677) |
| Model Leeching: An Extraction Attack Targeting LLMs | [Link](https://www.semanticscholar.org/paper/Model-Leeching%3A-An-Extraction-Attack-Targeting-LLMs-Birch-Hackett/ab2066233ea2da540f44118d989d66db5687752a) |
| Killing One Bird with Two Stones: Model Extraction and Attribute Inference Attacks against BERT-based APIs | [Link](https://www.semanticscholar.org/paper/Killing-One-Bird-with-Two-Stones%3A-Model-Extraction-Chen-He/373936d00c4a357579c4d375de0ce439e4e54d5f) |
| Model Extraction and Adversarial Transferability, Your BERT is Vulnerable! | [Link](https://www.semanticscholar.org/paper/Model-Extraction-and-Adversarial-Transferability%2C-He-Lyu/16a8e329c06b4c6f61762da7fa77a84bf3e12dca) |

###### Prompt Inversion
| Title | Link |
|-------|------|
| Language Model Inversion | [Link](https://arxiv.org/abs/2311.13647) |
| Effective Prompt Extraction from Language Models | [Link](https://www.semanticscholar.org/paper/Effective-Prompt-Extraction-from-Language-Models-Zhang-Ippolito/b9df0d4631f9fab1432c152765e243ae4cd667f4) |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | [Link](https://arxiv.org/abs/2311.09127) |

##### Embedding Inversion
| Title | Link |
|-------|------|
| Text Embedding Inversion Security for Multilingual Language Models | [Link](https://www.semanticscholar.org/paper/Text-Embedding-Inversion-Security-for-Multilingual-Chen-Lent/3ff5bc7c832da386211f6231058994b67ae6d600) |
| Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence | [Link](https://aclanthology.org/2023.findings-acl.881/) |
| Text Embeddings Reveal (Almost) As Much As Text | [Link](https://www.semanticscholar.org/paper/Text-Embeddings-Reveal-(Almost)-As-Much-As-Text-Morris-Kuleshov/d4c4f46b63e4812f0268d99b6528aa6a0c404377) |

#### Side Channel Attacks
| Title | Link |
|-------|------|
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | [Link](https://www.semanticscholar.org/paper/Catastrophic-Jailbreak-of-Open-source-LLMs-via-Huang-Gupta/ac27dd71af3ee93e1129482ceececbae7dd0d0e8) |
| What Was Your Prompt? A Remote Keylogging Attack on AI Assistants | [Link](https://www.semanticscholar.org/paper/What-Was-Your-Prompt-A-Remote-Keylogging-Attack-on-Weiss-Ayzenshteyn/bb5393126610ab89983b29d8934b45f67a16241d) |
| Privacy Side Channels in Machine Learning Systems | [Link](https://www.semanticscholar.org/paper/Privacy-Side-Channels-in-Machine-Learning-Systems-Debenedetti-Severi/d43af65e38afebd68797683c6e01d02bb1ba7963) |
| Stealing Part of a Production Language Model | [Link](https://www.semanticscholar.org/paper/Stealing-Part-of-a-Production-Language-Model-Carlini-Paleka/b232f468de0b1d4ff1c2dfe5dbb03ec093160c48) |
| Logits of API-Protected LLMs Leak Proprietary Information | [Link](https://www.semanticscholar.org/paper/Logits-of-API-Protected-LLMs-Leak-Proprietary-Finlayson-Ren/5f2b88d1c0d98f3f2973221657ca5237a185cc37) |

### Infusion Attack
| Title | Link |
|-------|------|
| Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | [Link](https://www.semanticscholar.org/paper/Not-What-You've-Signed-Up-For%3A-Compromising-with-Greshake-Abdelnabi/705e49afd92130f2bc1e0d4d0b1f6cb14e88803f) |
| Adversarial Demonstration Attacks on Large Language Models | [Link](https://www.semanticscholar.org/paper/Adversarial-Demonstration-Attacks-on-Large-Language-Wang-Liu/1abfc211793c683972ded8d3268475e3ee7a88b0) |
| Poisoning Web-Scale Training Datasets is Practical | [Link](https://arxiv.org/abs/2302.10149) |
| Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks | [Link](https://www.semanticscholar.org/paper/Universal-Vulnerabilities-in-Large-Language-Models%3A-Zhao-Jia/ce09d7a0bf35ee6a2d857c472efd8d480b9fa122) |
| BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | [Link](https://www.semanticscholar.org/paper/BadChain%3A-Backdoor-Chain-of-Thought-Prompting-for-Xiang-Jiang/f8d7b0245480646abd257bac60ee37804981a0d7) |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | [Link](https://www.semanticscholar.org/paper/Jailbreak-and-Guard-Aligned-Language-Models-with-Wei-Wang/6b135e922a0c673aeb0b05c5aeecdb6c794791c6) |
| Many-shot jailbreaking | [Link](https://www.anthropic.com/research/many-shot-jailbreaking) |

### Inference Attack

#### Latent Space Attack
| Title | Link |
|-------|------|
| Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment | [Link](https://www.semanticscholar.org/paper/Backdoor-Activation-Attack%3A-Attack-Large-Language-Wang-Shu/d030be820dd5e4739461f246ce248fba2df33f0a) |
| Test-Time Backdoor Attacks on Multimodal Large Language Models | [Link](https://www.semanticscholar.org/paper/Test-Time-Backdoor-Attacks-on-Multimodal-Large-Lu-Pang/9f12a20f62238f5206520e52e83e2ccd1da17f03) |
| Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering | [Link](https://www.semanticscholar.org/paper/Open-the-Pandora's-Box-of-LLMs%3A-Jailbreaking-LLMs-Li-Zheng/b843fd79f0ddfd1a3e5ff3bd182715429e28aa35) |
| Weak-to-Strong Jailbreaking on Large Language Models | [Link](https://www.semanticscholar.org/paper/Weak-to-Strong-Jailbreaking-on-Large-Language-Zhao-Yang/88d59e31575f5b3dd88a2c2033b55f628c2adbc9) |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | [Link](https://www.semanticscholar.org/paper/Soft-Prompt-Threats%3A-Attacking-Safety-Alignment-and-Schwinn-Dobre/9a7187386eb6ea93d41d7e2baa88accedc702fc2) |

#### Decoding Attack
| Title | Link |
|-------|------|
| Fast Adversarial Attacks on Language Models In One GPU Minute | [Link](https://www.semanticscholar.org/paper/Fast-Adversarial-Attacks-on-Language-Models-In-One-Sadasivan-Saha/e519699816d358783f41d4bd50fd3465d9fa51bd) |

#### Tokenizer Attack
| Title | Link |
|-------|------|
| Training-free Lexical Backdoor Attacks on Language Models | [Link](https://www.semanticscholar.org/paper/Training-free-Lexical-Backdoor-Attacks-on-Language-Huang-Zhuo/5d896fb2f0da16060f22ed43e582464605237f28) |

### Training Time Attack

#### Backdoor Attack

##### Preference Tuning Stage
| Title | Link |
|-------|------|
| Universal Jailbreak Backdoors from Poisoned Human Feedback | [Link](https://www.semanticscholar.org/paper/Universal-Jailbreak-Backdoors-from-Poisoned-Human-Rando-Tram%C3%A8r/90de1938a64d117d61b9e7149d2981df49b81433) |
| Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data | [Link](https://www.semanticscholar.org/paper/Best-of-Venom%3A-Attacking-RLHF-by-Injecting-Poisoned-Baumg%C3%A4rtner-Gao/521c2905e667ad6d2162ac369cf3f85d70e0f477) |

##### Instruction Tuning Stage
| Title | Link |
|-------|------|
| Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models | [Link](https://www.semanticscholar.org/paper/Instructions-as-Backdoors%3A-Backdoor-Vulnerabilities-Xu-Ma/82fe948f18ca0138d035f553286c5e4b712dbdbe) |
| On the Exploitability of Instruction Tuning | [Link](https://www.semanticscholar.org/paper/On-the-Exploitability-of-Instruction-Tuning-Shu-Wang/f5fa0b3c2ecbf17ba922932432bed46a1447ed23) |
| Poisoning Language Models During Instruction Tuning | [Link](https://www.semanticscholar.org/paper/Poisoning-Language-Models-During-Instruction-Tuning-Wan-Wallace/13e0f0bf9d6868d6825e13d8f9f25ee04285cd29) |
| Learning to Poison Large Language Models During Instruction Tuning | [Link](https://www.semanticscholar.org/paper/Learning-to-Poison-Large-Language-Models-During-Qiang-Zhou/44dc41803f49f7511f674ecb091d7a5c69fd5db2) |
| Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection | [Link](https://www.semanticscholar.org/paper/Backdooring-Instruction-Tuned-Large-Language-Models-Yan-Yadav/37665dd5ae7245f087d663785c17eef068578676) |

##### Adapters and Model Weights
| Title | Link |
|-------|------|
| The Philosopher's Stone: Trojaning Plugins of Large Language Models | [Link](https://www.semanticscholar.org/paper/The-Philosopher's-Stone%3A-Trojaning-Plugins-of-Large-Dong-Xue/ea12b4bff088bb3829e7277e516842e552a63be4) |
| Privacy Backdoors: Stealing Data with Corrupted Pretrained Models | [Link](https://www.semanticscholar.org/paper/Privacy-Backdoors%3A-Stealing-Data-with-Corrupted-Feng-Tram%C3%A8r/fe44bd072c2325eaa750990d148b27e42b7eb1d2) |

#### Alignment Erasure
| Title | Link |
|-------|------|
| Removing RLHF Protections in GPT-4 via Fine-Tuning | [Link](https://www.semanticscholar.org/paper/Removing-RLHF-Protections-in-GPT-4-via-Fine-Tuning-Zhan-Fang/ccae9fcb1f344e56a3f7cb05a4b49a6e658f9dd2) |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | [Link](https://www.semanticscholar.org/paper/Shadow-Alignment%3A-The-Ease-of-Subverting-Language-Yang-Wang/84b7c486c56bd3880cb8eb01de9ae90ba3ebdaed) |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | [Link](https://arxiv.org/abs/2310.03693) |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | [Link](https://www.semanticscholar.org/paper/LoRA-Fine-tuning-Efficiently-Undoes-Safety-Training-Lermen-Rogers-Smith/d1b5151231a790c7a60f620e21860593dae9a1c5) |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | [Link](https://www.semanticscholar.org/paper/Stealthy-and-Persistent-Unalignment-on-Large-Models-Cao-Cao/b88535ea9368753c91967bb7e997c06b1ac6aaec) |
| Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | [Link](https://arxiv.org/abs/2310.14303) |
| Large Language Model Unlearning | [Link](https://arxiv.org/pdf/2310.10683) |

#### Gradient-Based Attacks
| Title | Link |
|-------|------|
| Gradient-Based Language Model Red Teaming | [Link](https://www.semanticscholar.org/paper/Gradient-Based-Language-Model-Red-Teaming-Wichers-Denison/409e0616a0fc02dd0ee8d5ae061944a98e9bd5a9) |
| Red Teaming Language Models with Language Models | [Link](https://aclanthology.org/2022.emnlp-main.225/) |
| Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | [Link](https://www.semanticscholar.org/paper/Rapid-Optimization-for-Jailbreaking-LLMs-via-and-Shen-Cheng/f75f401f046d508753d6b207f3f19414f489bd08) |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | [Link](https://www.semanticscholar.org/paper/AutoDAN%3A-Interpretable-Gradient-Based-Adversarial-Zhu-Zhang/1227c2fcb8437441b7d72a29a4bc9eef1f5275d2) |
| RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning | [Link](https://aclanthology.org/2022.emnlp-main.222/) |
| Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks | [Link](https://www.semanticscholar.org/paper/Neural-Exec%3A-Learning-(and-Learning-from)-Execution-Pasquini-Strohmeier/2f6de8291c9a803faa7f7a33c74f4a2a3debd83b) |
| COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | [Link](https://www.semanticscholar.org/paper/COLD-Attack%3A-Jailbreaking-LLMs-with-Stealthiness-Guo-Yu/b7ef6182f617ef3e7cc9682f562f794115a4c62c) |
| Automatically Auditing Large Language Models via Discrete Optimization | [Link](https://www.semanticscholar.org/paper/Automatically-Auditing-Large-Language-Models-via-Jones-Dragan/2f94f03fdac62d05f0f416b7b3855d1f597afee9) |
| Automatic and Universal Prompt Injection Attacks against Large Language Models | [Link](https://www.semanticscholar.org/paper/Automatic-and-Universal-Prompt-Injection-Attacks-Liu-Yu/0a6a350653369dc92fde4cf9992951534ed1f169) |
| Unveiling the Implicit Toxicity in Large Language Models | [Link](https://aclanthology.org/2023.emnlp-main.84/) |
| Hijacking Large Language Models via Adversarial In-Context Learning | [Link](https://www.semanticscholar.org/paper/Hijacking-Large-Language-Models-via-Adversarial-Qiang-Zhou/6d68b5c1eaf03aba857476a9825acf3e48edd840) |
| Boosting Jailbreak Attack with Momentum | [Link](https://www.semanticscholar.org/paper/Boosting-Jailbreak-Attack-with-Momentum-Zhang-Wei/9f2ea0e770e154bb00b2276596afd148a7facfe8) |

## Defenses

| Study | Category | Short Description | Free | Extrinsic |
|-------|----------|-------------------|------|-----------|
| [OpenAI Moderation Endpoint](https://platform.openai.com/docs/guides/moderation/overview) | Guardrail | OpenAI Moderations Endpoint | ❌ | ✅ |
| [A New Generation of Perspective API: Efficient Multilingual Character-level Transformers](https://api.semanticscholar.org/CorpusID:247058801) | Guardrail | Perspective API's Toxicity API | ❌ | ✅ |
| [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://api.semanticscholar.org/CorpusID:266174345) | Guardrail | Llama Guard | ✅ | ✅ |
| [Guardrails AI: Adding guardrails to large language models.](https://github.com/guardrails-ai/guardrails) | Guardrail | Guardrails AI Validators | ✅ | ✅ |
| [NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails](https://api.semanticscholar.org/CorpusID:264146531) | Guardrail | NVIDIA Nemo Guardrail | ✅ | ✅ |
| [RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content](https://api.semanticscholar.org/CorpusID:268536710) | Guardrail | RigorLLM (Safe Suffix + Prompt Augmentation + Aggregation) | ✅ | ✅ |
| [Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield](https://api.semanticscholar.org/CorpusID:264833136) | Guardrail | Adversarial Prompt Shield Classifier | ✅ | ✅ |
| [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://api.semanticscholar.org/CorpusID:270737916) | Guardrail | WildGuard | ✅ | ✅ |
| [SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks](https://api.semanticscholar.org/CorpusID:263671542) | Prompting | SmoothLLM (Prompt Augmentation + Aggregation) | ✅ | ✅ |
| [Defending ChatGPT against jailbreak attack via self-reminders](https://api.semanticscholar.org/CorpusID:266289038) | Prompting | Self-Reminder | ✅ | ✅ |
| [Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender](https://api.semanticscholar.org/CorpusID:266977251) | Prompting | Intention Analysis Prompting | ✅ | ✅ |
| [Defending LLMs against Jailbreaking Attacks via Backtranslation](https://api.semanticscholar.org/CorpusID:268032484) | Prompting | Backtranslation | ✅ | ✅ |
| [Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks](https://api.semanticscholar.org/CorpusID:267320750) | Prompting | Safe Suffix | ✅ | ✅ |
| [Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning](https://api.semanticscholar.org/CorpusID:267617138) | Prompting | Safe Prefix | ✅ | ✅ |
| [Jailbreaker in Jail: Moving Target Defense for Large Language Models](https://api.semanticscholar.org/CorpusID:263620259) | Prompting | Prompt Augmentation + Auxiliary model | ✅ | ✅ |
| [Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing](https://api.semanticscholar.org/CorpusID:267938320) | Prompting | Prompt Augmentation + Aggregation | ✅ | ✅ |
| [Round Trip Translation Defence against Large Language Model Jailbreaking Attacks](https://api.semanticscholar.org/CorpusID:267770468) | Prompting | Prompt Paraphrasing | ✅ | ✅ |
| [Detecting Language Model Attacks with Perplexity](https://api.semanticscholar.org/CorpusID:261245172) | Prompting | Perplexity Based Defense | ✅ | ✅ |
| [Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming](https://api.semanticscholar.org/CorpusID:269929930) | Prompting | Rewrites input prompt to safe prompt using a sentinel model | ✅ | ✅ |
| [Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks](https://api.semanticscholar.org/CorpusID:270123437) | Prompting | Safe Suffix/Prefix (Requires access to log-probabilities) | ✅ | ✅ |
| [Protecting Your LLMs with Information Bottleneck](https://api.semanticscholar.org/CorpusID:269293591) | Prompting | Information Bottleneck Protector | ✅ | ✅ |
| [Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications](https://api.semanticscholar.org/CorpusID:266999840) | Prompting/Fine-Tuning | Introduces 'Signed-Prompt' for authorizing sensitive instructions from approved users | ✅ | ✅ |
| [SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding](https://api.semanticscholar.org/CorpusID:267658033) | Decoding | Safety Aware Decoding | ✅ | ✅ |
| [Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning](https://api.semanticscholar.org/CorpusID:267060803) | Model Pruning | Uses WANDA Pruning | ✅ | ❌ |
| [A safety realignment framework via subspace-oriented model fusion for large language models](https://api.semanticscholar.org/CorpusID:269773206) | Model Merging | Subspace-oriented model fusion | ✅ | ❌ |
| [Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge](https://api.semanticscholar.org/CorpusID:268091274) | Model Merging | Model Merging to prevent backdoor attacks | ✅ | ❌ |
| [Steering Without Side Effects: Improving Post-Deployment Control of Language Models](https://api.semanticscholar.org/CorpusID:270703306) | Activation Editing | KL-then-steer to decrease side-effects of steering vectors | ✅ | ❌ |
| [Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation](https://api.semanticscholar.org/CorpusID:263835408) | Alignment | Generation Aware Alignment | ✅ | ❌ |
| [Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing](https://api.semanticscholar.org/CorpusID:270067915) | Alignment | Layer-specific editing | ✅ | ❌ |
| [Safety Alignment Should Be Made More Than Just a Few Tokens Deep](https://api.semanticscholar.org/CorpusID:270371778) | Alignment | Regularized fine-tuning objective for deep safety alignment | ✅ | ❌ |
| [Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization](https://api.semanticscholar.org/CorpusID:265212812) | Alignment | Goal Prioritization during training and inference stage | ✅ | ❌ |
| [AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://api.semanticscholar.org/CorpusID:269009460) | Alignment | Instruction tuning on AEGIS safety dataset | ✅ | ❌ |
| [Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack](https://www.semanticscholar.org/paper/Vaccine%3A-Perturbation-aware-Alignment-for-Large-Huang-Hu/abdb6e912fe86a60600b438b0e36b502f6412b24) | Alignment | Adding perturbation to embeddings in alignment phase | ✅ | ❌ |
| [Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning](https://www.semanticscholar.org/paper/Lazy-Safety-Alignment-for-Large-Language-Models-Huang-Hu/22151f8690cd06dba9e934cda6121e26dd9a8e7f) | Alignment | Bi-state optimization with constrained drift | ✅ | ❌ |
| [Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning](https://www.semanticscholar.org/paper/Antidote%3A-Post-fine-tuning-Safety-Alignment-for-Huang-Bhattacharya/be4156b6c5b804af6a20e5f723e521df6981b6fc) | Alignment | Removes harmful parameters | ✅ | ❌ |
| [Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation](https://www.semanticscholar.org/reader/e002b6ce5283d88ce0c21afbca27b3aea091e78f)| Alignment | Auxiliary loss to attenuate harmful perturbation | ✅ | ❌ |
| [The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions](https://api.semanticscholar.org/CorpusID:269294048) | Fine-Tuning | Training with Instruction Hierarchy | ✅ | ❌ |
| [Immunization against harmful fine-tuning attacks](https://api.semanticscholar.org/CorpusID:268032044) | Fine-Tuning | Immunization Conditions to prevent against harmful fine-tuning | ✅ | ❌ |
| [Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment](https://api.semanticscholar.org/CorpusID:267897454) | Fine-Tuning | Backdoor Enhanced Safety Alignment to prevent against harmful fine-tuning | ✅ | ❌ |
| [Representation noising effectively prevents harmful fine-tuning on LLMs](https://api.semanticscholar.org/CorpusID:269982864) | Fine-Tuning | Representation Noising to prevent against harmful fine-tuning | ✅ | ❌ |
| [Differentially Private Fine-tuning of Language Models](https://api.semanticscholar.org/CorpusID:238743879) | Fine-Tuning | Differentially Private fine-tuning | ✅ | ❌ |
| [Large Language Models Can Be Good Privacy Protection Learners](https://api.semanticscholar.org/CorpusID:263620236) | Fine-Tuning | Privacy Protection Language Models | ✅ | ❌ |
| [Defending Against Unforeseen Failure Modes with Latent Adversarial Training](https://api.semanticscholar.org/CorpusID:268297448) | Fine-Tuning | Latent Adversarial Training | ✅ | ❌ |
| [From Shortcuts to Triggers: Backdoor Defense with Denoised PoE](https://api.semanticscholar.org/CorpusID:258866191) | Fine-Tuning | Denoised Product-of-Experts for protecting against various kinds of backdoor triggers | ✅ | ❌ |
| [Detoxifying Large Language Models via Knowledge Editing](https://api.semanticscholar.org/CorpusID:268553537) | Fine-Tuning | Detoxifying by Knowledge Editing of Toxic Layers | ✅ | ❌ |
| [GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis](https://api.semanticscholar.org/CorpusID:267770418) | Inspection | Safety-critical parameter gradients analysis | ✅ | ❌ |
| [Certifying LLM Safety against Adversarial Prompting](https://doi.org/10.48550/arXiv.2309.02705) | Certification | Erase-and-check framework | ✅ | ✅ |
| [PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models](https://api.semanticscholar.org/CorpusID:267626957) | Certification | Isolate-then-Aggregate to protect against PoisonedRAGAttack | ✅ | ✅ |
| [Quantitative Certification of Bias in Large Language Models](https://api.semanticscholar.org/CorpusID:270094829) | Certification | Bias Certification of LLMs | ✅ | ✅ |
| [garak: A Framework for Security Probing Large Language Models](https://api.semanticscholar.org/CorpusID:270559825) | Model Auditing | Garak LLM Vulnerability Scanner | ✅ | ✅ |
| [giskard: The Evaluation & Testing framework for LLMs & ML models](https://github.com/Giskard-AI/giskard) | Model Auditing | Evaluate Performance, Bias issues in AI applications | ✅ | ✅ |

---

## Other Surveys
| Title | Link |
|-------|------|
| SoK: Prompt Hacking of Large Language Models | [Link](https://www.semanticscholar.org/paper/SoK%3A-Prompt-Hacking-of-Large-Language-Models-Rababah-Wu/9259d06eeaae42b05ad22ba76f0a1cbb216ad63a) |

## Red-Teaming
| Title | Link |
|-------|------|
| Red-Teaming for Generative AI: Silver Bullet or Security Theater? | [Link](https://ojs.aaai.org/index.php/AIES/article/view/31647) |

If you like our work, please consider citing. If you would like to add your work to our taxonomy please open a pull request.

#### BibTex
```bibtex

@article{verma2024operationalizing,
title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
journal={arXiv preprint arXiv:2407.14937},
year={2024}
}
```