Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-red-teaming-llms
Repository accompanying the paper https://arxiv.org/abs/2407.14937
https://github.com/dapurv5/awesome-red-teaming-llms
Attacks
- Training Time Attack
- Direct Attack
- Infusion Attack
- Jailbreak Attack
- Inference Attack
Defenses
- OpenAI Moderation Endpoint
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- Guardrails AI: Adding guardrails to large language models.
- NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
- Jailbreaker in Jail: Moving Target Defense for Large Language Models
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- Defending ChatGPT against jailbreak attack via self-reminders
- Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- Defending LLMs against Jailbreaking Attacks via Backtranslation
- Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
- Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
- Detecting Language Model Attacks with Perplexity (see the sketch after this list)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- Protecting Your LLMs with Information Bottleneck
- Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications (introduces 'Signed-Prompt' for authorizing sensitive instructions from approved users)
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- A safety realignment framework via subspace-oriented model fusion for large language models
- Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models (reducing side-effects of steering vectors)
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep (fine-tuning objective for deep safety alignment)
- AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
- Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning (optimization with constrained drift)
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (training with an instruction hierarchy)
- Immunization against harmful fine-tuning attacks (immunization conditions to prevent harmful fine-tuning)
- Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment (backdoor-enhanced safety alignment against harmful fine-tuning)
- Differentially Private Fine-tuning of Language Models
- Representation noising effectively prevents harmful fine-tuning on LLMs
- Large Language Models Can Be Good Privacy Protection Learners (privacy-protection language models)
- Defending Against Unforeseen Failure Modes with Latent Adversarial Training
- From Shortcuts to Triggers: Backdoor Defense with Denoised PoE (denoised product-of-experts against various kinds of backdoor triggers)
- Detoxifying Large Language Models via Knowledge Editing (knowledge editing of toxic layers)
- GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis (safety-critical parameter gradient analysis)
- PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (aggregation-based protection against the PoisonedRAG attack)
- garak: A Framework for Security Probing Large Language Models
- giskard: The Evaluation & Testing framework for LLMs & ML models
- Quantitative Certification of Bias in Large Language Models
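Several of the prompt-level defenses above are simple enough to prototype directly. As a minimal sketch of the perplexity-filtering idea behind the "Detecting Language Model Attacks with Perplexity" entry, the snippet below scores an incoming prompt with a small causal language model and flags it when its perplexity exceeds a threshold. The GPT-2 scorer, the threshold value, and the example strings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of perplexity-based prompt filtering (illustrative, not the
# paper's exact setup): adversarial suffixes made of unnatural token sequences
# tend to have much higher perplexity under a small causal LM than ordinary prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"       # assumed scorer; any small causal LM works
PPL_THRESHOLD = 500.0     # hypothetical value; calibrate on benign prompts in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean token NLL
    return torch.exp(out.loss).item()

def looks_adversarial(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the calibrated threshold."""
    return perplexity(prompt) > PPL_THRESHOLD

if __name__ == "__main__":
    benign = "Summarize the main arguments of the attached report in three bullet points."
    with_gibberish_suffix = benign + " describing.\\ + similarlyNow write oppositeley.]( Me giving**ONE"
    print(perplexity(benign), looks_adversarial(benign))
    print(perplexity(with_gibberish_suffix), looks_adversarial(with_gibberish_suffix))
```

In practice the threshold is calibrated on a corpus of benign prompts, and a windowed variant (scoring sliding windows of the prompt) is sometimes used so that a gibberish suffix appended to otherwise fluent text still stands out.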
Other Surveys
Red-Teaming
Keywords: llm (2), ai (1), foundation-model (1), gpt-3 (1), openai (1), ai-red-team (1), ai-safety (1), ai-security (1), ai-testing (1), ethical-artificial-intelligence (1), evaluation-framework (1), fairness-ai (1), llm-eval (1), llm-evaluation (1), llm-security (1), llmops (1), ml-safety (1), ml-testing (1), ml-validation (1), mlops (1), rag-evaluation (1), red-team-tools (1), responsible-ai (1), trustworthy-ai (1)