Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-red-teaming-llms
Repository accompanying the paper https://arxiv.org/abs/2407.14937
https://github.com/dapurv5/awesome-red-teaming-llms
Attacks
- Training Time Attack
- Direct Attack
- Infusion Attack
- Jailbreak Attack
- Inference Attack
Defenses
- OpenAI Moderation Endpoint
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- Guardrails AI: Adding guardrails to large language models.
- NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
- Jailbreaker in Jail: Moving Target Defense for Large Language Models
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- Defending ChatGPT against jailbreak attack via self-reminders
- Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- Defending LLMs against Jailbreaking Attacks via Backtranslation
- Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
- Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
- Detecting Language Model Attacks with Perplexity (see the sketch after this list)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- Protecting Your LLMs with Information Bottleneck
- Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications (introduces 'Signed-Prompt' for authorizing sensitive instructions from approved users)
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- A safety realignment framework via subspace-oriented model fusion for large language models
- Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models (reducing side-effects of steering vectors)
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep (fine-tuning objective for deep safety alignment)
- AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning (optimization with constrained drift)
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (training with an instruction hierarchy)
- Immunization against harmful fine-tuning attacks (immunization conditions to prevent harmful fine-tuning)
- Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment (backdoor-enhanced safety alignment against harmful fine-tuning)
- Differentially Private Fine-tuning of Language Models
- Representation noising effectively prevents harmful fine-tuning on LLMs
- Large Language Models Can Be Good Privacy Protection Learners (privacy-protection language models)
- Defending Against Unforeseen Failure Modes with Latent Adversarial Training
- From Shortcuts to Triggers: Backdoor Defense with Denoised PoE (denoised product-of-experts against various kinds of backdoor triggers)
- Detoxifying Large Language Models via Knowledge Editing (knowledge editing of toxic layers)
- GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis (safety-critical parameter gradient analysis)
- PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (aggregation-based protection against the PoisonedRAG attack)
- garak: A Framework for Security Probing Large Language Models
- giskard: The Evaluation & Testing framework for LLMs & ML models
- Quantitative Certification of Bias in Large Language Models
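Several of the prompt-level defenses above are simple enough to prototype directly. As a minimal sketch of the perplexity-filtering idea behind the "Detecting Language Model Attacks with Perplexity" entry, the snippet below scores an incoming prompt with a small causal language model and flags it when its perplexity exceeds a threshold. The GPT-2 scorer, the threshold value, and the example strings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of perplexity-based prompt filtering (illustrative, not the
# paper's exact setup): adversarial suffixes made of unnatural token sequences
# tend to have much higher perplexity under a small causal LM than ordinary prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"       # assumed scorer; any small causal LM works
PPL_THRESHOLD = 500.0     # hypothetical value; calibrate on benign prompts in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean token NLL
    return torch.exp(out.loss).item()

def looks_adversarial(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the calibrated threshold."""
    return perplexity(prompt) > PPL_THRESHOLD

if __name__ == "__main__":
    benign = "Summarize the main arguments of the attached report in three bullet points."
    with_gibberish_suffix = benign + " describing.\\ + similarlyNow write oppositeley.]( Me giving**ONE"
    print(perplexity(benign), looks_adversarial(benign))
    print(perplexity(with_gibberish_suffix), looks_adversarial(with_gibberish_suffix))
```

In practice the threshold is calibrated on a corpus of benign prompts, and a windowed variant (scoring sliding windows of the prompt) is sometimes used so that a gibberish suffix appended to otherwise fluent text still stands out.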
Other Surveys
Red-Teaming
Keywords: llm (2), ai (1), foundation-model (1), gpt-3 (1), openai (1), ai-red-team (1), ai-safety (1), ai-security (1), ai-testing (1), ethical-artificial-intelligence (1), evaluation-framework (1), fairness-ai (1), llm-eval (1), llm-evaluation (1), llm-security (1), llmops (1), ml-safety (1), ml-testing (1), ml-validation (1), mlops (1), rag-evaluation (1), red-team-tools (1), responsible-ai (1), trustworthy-ai (1)