https://github.com/ydyjya/Awesome-LLM-Safety

A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the safety implications, challenges, and advancements surrounding these powerful models.

# 🛡️Awesome LLM-Safety🛡️[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)




English | [中文](README_cn.md)

## 🤗Introduction

**Welcome to our Awesome-llm-safety repository!** 🥰🥰🥰

**🔥 News**

- 2024.05: Updated the NAACL 2024 papers collection. Thanks to @[zhrli324](https://github.com/zhrli324) and @[feqHe](https://github.com/feqHe)!

**🧑‍💻 Our Work**

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety).
We also include relevant talks, tutorials, conferences, news, and articles.
The repository is updated regularly so that you have current information at your fingertips.

> If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under each subcategory to which it pertains 🤩!

**✔️ Perfect for the Majority**
- For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details.
Classic or influential papers retained in the README provide beginner-friendly navigation through interesting directions in the field.
- For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge.
Within each subtopic, we keep adding the latest content and continuously backfill earlier work.
Our thorough compilation and careful selection save you time.

**🧭 How to Use This Guide**
- Quick Start: In the README, you can find a curated selection of resources sorted by date, along with links to further information.
- In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more.
Each item, be it an article or a piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

**💼 How to Contribute**

If you have completed an insightful piece of work or carefully compiled conference papers, we would love to add it to the repository.
- For **individual papers**, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
- If you have **compiled a collection of papers for a conference**, you are welcome to submit a pull request directly.
We would greatly appreciate your contribution.
Please note that these pull requests need to be consistent with our existing format.
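
As a rough illustration of what "consistent with our existing format" could mean for a paper entry, the sketch below builds one Markdown table row in the `Date | Institute | Publication | Paper` layout used by the paper tables in this README. The helper name `format_paper_row` is hypothetical and not part of this repository; treat it as a convenience sketch rather than an official tool.

```python
def format_paper_row(date: str, institute: str, publication: str,
                     title: str, url: str) -> str:
    """Return one Markdown table row: Date | Institute | Publication | Paper.

    Hypothetical helper for illustration only; it mirrors the column order
    of the paper tables in this README.
    """
    # Escape pipe characters so a title cannot split the row into extra columns.
    safe_title = title.replace("|", "\\|")
    return f"| {date} | {institute} | {publication} | [{safe_title}]({url}) |"


# Example using an entry that already appears in this README.
row = format_paper_row(
    "23.07",
    "CMU",
    "arxiv",
    "Universal and Transferable Adversarial Attacks on Aligned Language Models",
    "https://arxiv.org/abs/2307.15043",
)
print(row)
```

The printed row can be pasted under the matching table in a subtopic file; dates follow the `YY.MM` convention used throughout the tables.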

**📜Advertisement**

🌱 If you would like more people to read your recent insightful work, please contact me via [email]([email protected]).
I can offer you a promotional spot here for up to one month.

**Let’s start the LLM Safety tutorial!**

---

## 🚀Table of Contents

- [🛡️Awesome LLM-Safety🛡️](#️awesome-llm-safety️)
  - [🤗Introduction](#introduction)
  - [🚀Table of Contents](#table-of-contents)
  - [🤔AI Safety & Security Discussions](#ai-safety--security-discussions)
  - [🔐Security & Discussion](#security--discussion)
    - [📑Papers](#papers)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks)
    - [Other](#other)
  - [🔏Privacy](#privacy)
    - [📑Papers](#papers-1)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks-1)
    - [Other](#other-1)
  - [📰Truthfulness & Misinformation](#truthfulness--misinformation)
    - [📑Papers](#papers-2)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks-2)
    - [Other](#other-2)
  - [😈JailBreak & Attacks](#jailbreak--attacks)
    - [📑Papers](#papers-3)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks-3)
    - [Other](#other-3)
  - [🛡️Defenses & Mitigation](#️defenses--mitigation)
    - [📑Papers](#papers-4)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks-4)
    - [Other](#other-4)
  - [💯Datasets & Benchmark](#datasets--benchmark)
    - [📑Papers](#papers-5)
    - [📖Tutorials, Articles, Presentations and Talks](#tutorials-articles-presentations-and-talks-5)
    - [📚Resource📚](#resource)
    - [Other](#other-5)
  - [🧑‍🏫 Scholars 👩‍🏫](#-scholars-)
  - [🧑‍🎓Author](#author)

---

## 🤔AI Safety & Security Discussions
| Date | Link | Authors | Publication |
|:---------:|:--------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------:|
| 2024/5/20 | [Managing extreme AI risks amid rapid progress](https://www.science.org/doi/abs/10.1126/science.adn0117) | Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann | Science |

---

## 🔐Security & Discussion

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:--------------------:|:-----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| 20.10 | Facebook AI Research | arxiv | [Recipes for Safety in Open-domain Chatbots](https://arxiv.org/abs/2010.07079) |
| 22.03 | OpenAI | NeurIPS2022 | [Training language models to follow instructions with human feedback](https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html) |
| 23.07 | UC Berkeley | NeurIPS2023 | [Jailbroken: How Does LLM Safety Training Fail?](https://arxiv.org/abs/2307.02483) |
| 23.12 | OpenAI | OpenAI | [Practices for Governing Agentic AI Systems](https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:----------------------:|:--------------------:|:-------------------------------------------------------------------------------------:|
| 22.02 | Toxicity Detection API | Perspective API | [link](https://www.perspectiveapi.com/)<br>[paper](https://arxiv.org/abs/2202.11176) |
| 23.07 | Repository | Awesome LLM Security | [link](https://github.com/corca-ai/awesome-llm-security) |
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |
| 24.01 | Tutorials | Awesome-LM-SSP | [link](https://github.com/ThuCCSLab/Awesome-LM-SSP) |

### Other

👉[Latest&Comprehensive Security Paper](./subtopic/Security&Discussion.md)

---
## 🔏Privacy

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:---------------:|:-----------:|:----------------------------------------------------------------------------------------------------------------------------:|
| 19.12 | Microsoft | CCS2020 | [Analyzing Information Leakage of Updates to Natural Language Models](https://dl.acm.org/doi/abs/10.1145/3372297.3417880) |
| 21.07 | Google Research | ACL2022 | [Deduplicating Training Data Makes Language Models Better](https://aclanthology.org/2022.acl-long.577/) |
| 21.10 | Stanford | ICLR2022 | [Large language models can be strong differentially private learners](https://openreview.net/forum?id=bVuP3ltATMz) |
| 22.02 | Google Research | ICLR2023 | [Quantifying Memorization Across Neural Language Models](https://openreview.net/forum?id=TatRHT_1cK) |
| 22.02 | UNC Chapel Hill | ICML2022 | [Deduplicating Training Data Mitigates Privacy Risks in Language Models](https://proceedings.mlr.press/v162/kandpal22a.html) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:---------:|:------------------:|:----------------------------------------------------:|
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |
| 24.01 | Tutorials | Awesome-LM-SSP | [link](https://github.com/ThuCCSLab/Awesome-LM-SSP) |

### Other

👉[Latest&Comprehensive Privacy Paper](./subtopic/Privacy.md)

---
## 📰Truthfulness & Misinformation

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:------------------------------:|:-----------:|:--------------------------------------------------------------------------------------------------------------------------------------------:|
| 21.09 | University of Oxford | ACL2022 | [TruthfulQA: Measuring How Models Mimic Human Falsehoods](https://arxiv.org/abs/2109.07958) |
| 23.11 | Harbin Institute of Technology | arxiv | [A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions](https://arxiv.org/abs/2311.05232) |
| 23.11 | Arizona State University | arxiv | [Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey](https://arxiv.org/abs/2311.07914) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:----------:|:------------------------:|:-----------------------------------------------------------------:|
| 23.07 | Repository | llm-hallucination-survey | [link](https://github.com/HillZhang1999/llm-hallucination-survey) |
| 23.10 | Repository | LLM-Factuality-Survey | [link](https://github.com/wangcunxiang/LLM-Factuality-Survey) |
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |

### Other

👉[Latest&Comprehensive Truthfulness&Misinformation Paper](./subtopic/Truthfulness&Misinformation.md)

---
## 😈JailBreak & Attacks

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:-------------------------------:|:----------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------:|
| 20.12 | Google | USENIX Security 2021 | [Extracting Training Data from Large Language Models](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting) |
| 22.11 | AE Studio | NeurIPS2022(ML Safety Workshop) | [Ignore Previous Prompt: Attack Techniques For Language Models](https://arxiv.org/abs/2211.09527) |
| 23.06 | Google | arxiv | [Are aligned neural networks adversarially aligned?](https://arxiv.org/abs/2306.15447) |
| 23.07 | CMU | arxiv | [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043) |
| 23.10 | University of Pennsylvania | arxiv | [Jailbreaking Black Box Large Language Models in Twenty Queries](https://arxiv.org/abs/2310.08419) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:------------------:|:---------------------------------------------------------------------------------:|:--------------------------------------------------------------------------:|
| 23.01 | Community | Reddit/ChatGPTJailbreak | [link](https://www.reddit.com/r/ChatGPTJailbreak) |
| 23.02 | Resource&Tutorials | Jailbreak Chat | [link](https://www.jailbreakchat.com/) |
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |
| 23.10 | Article | Adversarial Attacks on LLMs(Author: Lilian Weng) | [link](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/) |
| 23.11 | Video | [1hr Talk] Intro to Large Language Models<br>From 45:45(Author: Andrej Karpathy) | [link](https://www.youtube.com/watch?v=zjkBMFhNj_g) |
| 24.09 | Repo | awesome_LLM-harmful-fine-tuning-papers | [link](https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers) |

### Other

👉[Latest&Comprehensive JailBreak & Attacks Paper](./subtopic/Jailbreaks&Attack.md)

---
## 🛡️Defenses & Mitigation

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:---------------:|:-----------:|:-----------------------------------------------------------------------------------------------------------------------------:|
| 21.07 | Google Research | ACL2022 | [Deduplicating Training Data Makes Language Models Better](https://aclanthology.org/2022.acl-long.577/) |
| 22.04 | Anthropic | arxiv | [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:----------:|:---------------------:|:-------------------------------------------------------------:|
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |

### Other

👉[Latest&Comprehensive Defenses Paper](./subtopic/Defense&Mitigation)

---
## 💯Datasets & Benchmark

### 📑Papers
| Date | Institute | Publication | Paper |
|:-----:|:------------------------:|:-------------------:|:----------------------------------------------------------------------------------------------------------------------------------------:|
| 20.09 | University of Washington | EMNLP2020(findings) | [RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models](https://arxiv.org/abs/2009.11462) |
| 21.09 | University of Oxford | ACL2022 | [TruthfulQA: Measuring How Models Mimic Human Falsehoods](https://arxiv.org/abs/2109.07958) |
| 22.03 | MIT | ACL2022 | [ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection](https://arxiv.org/abs/2203.09509) |

### 📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|:-----:|:---------:|:------------------:|:----------------------------------------------------:|
| 23.10 | Tutorials | Awesome-LLM-Safety | [link](https://github.com/ydyjya/Awesome-LLM-Safety) |

### 📚Resource📚
- Toxicity - [RealToxicityPrompts dataset](https://toxicdegeneration.allenai.org/)
- Truthfulness - [TruthfulQA dataset](https://github.com/sylinrl/TruthfulQA)

### Other
👉[Latest&Comprehensive Datasets & Benchmark Paper](./subtopic/Datasets&Benchmark.md)

---
## 🧑‍🎓Author

**🤗If you have any questions, please contact our authors!🤗**

✉️: [ydyjya](https://github.com/ydyjya) ➡️ [email protected]

💬: **LLM Safety Discussion**

[Wechat Group](./resource/wechat.png) | [My Wechat](./resource/wechat.png)

---

[![Star History Chart](https://api.star-history.com/svg?repos=ydyjya/Awesome-LLM-Safety&type=Date)](https://star-history.com/#ydyjya/Awesome-LLM-Safety&Date)

**[⬆ Back to ToC](#table-of-contents)**